Issue
I am trying to count the number of line matches in a very LARGE file and store them in variables using only Bash shell commands.
Currently, I am scanning the large file twice, using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
My hope is to scan the large file only once. Also, I have never used awk and would rather not use it!
The |tee option is new to me. It seems that passing the results into two separate grep statements may mean that the large file only has to be scanned once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'second example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
Solution
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact, tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value; it's just that the value is almost immediately lost. Why? Because the variable is assigned in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process, and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
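You can watch the value vanish with a stripped-down version of your tee attempt (a minimal sketch against the same test.txt):
FIRST=''
cat test.txt | tee >(FIRST=$(grep 'first example' | wc -l)) > /dev/null
echo "$FIRST"    ## prints an empty line: FIRST was assigned inside the >(...) subshell and lost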
As a rule of thumb, you can write variable=$(foo | bar | baz), where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz, where it's on the inside. It won't work and you'll be sad.
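To see the rule in action, here's a quick sketch you can paste straight into a shell:
count=$(printf 'a\nb\nc\n' | wc -l)    ## outside: the assignment runs in the parent shell
echo "$count"    ## 3
unset count
printf 'a\nb\nc\n' | count=$(wc -l)    ## inside: the assignment runs in a pipeline subshell
echo "$count"    ## prints an empty line; the parent never saw the assignment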
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
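Incidentally, grep can count matching lines on its own with -c, which spares you the wc process (a small variation on the two-liner above):
first=$(grep -c 'first example' test.txt)
second=$(grep -c 'second example' test.txt)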
One grep
You could improve on this by using a single grep call that searches for both strings. That works if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do    ## IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
    ...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
    [[ $line == *'first example'* ]] && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
Answered By - John Kugelman