Issue
I am trying to count the number of line matches in a very LARGE file and store them in variables using only Bash shell commands.
Currently, I am scanning the large file twice, using a separate grep statement each time, like so:
$ cat test.txt
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example' | wc --lines ; ) ; ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ; ## second run
and I end up with this:
$ echo $FIRST
3
$ echo $SECOND
2
My hope is to scan the large file only once. Also, I have never used awk and would rather not use it!
The |tee option is new to me. It seems that passing the results into two separate grep statements may mean that the large file only has to be scanned once.
Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.
I have tried multiple ways using something like these below:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST=$( grep 'first example' | wc --lines ;);) \
>(SECOND=$(grep 'second example' | wc --lines ;);) \
>/dev/null ;
and using read:
FIRST=''; SECOND='';
cat test.txt \
|tee >(grep 'first example' | wc --lines | (read FIRST); ); \
>(grep 'second example' | wc --lines | (read SECOND); ); \
> /dev/null ;
cat test.txt \
| tee <( read FIRST < <(grep 'first example' | wc --lines )) \
<( read SECOND < <(grep 'second example' | wc --lines )) \
> /dev/null ;
and with curly brackets:
FIRST=''; SECOND='';
cat test.txt \
|tee >(FIRST={$( grep 'first example' | wc --lines ;)} ) \
>(SECOND={$(grep 'second example' | wc --lines ;)} ) \
>/dev/null ;
but none of these allow me to save the line count into variables FIRST and SECOND.
Is this even possible to do?
Solution
tee isn't saving any work. Each grep is still going to do a full scan of the file. Either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact, tee actually just adds a fourth program that loops over the whole file.
The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value; it's just that the value is almost immediately lost. Why? Because the variable is assigned in a subshell, not the parent shell.
Every command in a | pipeline executes in a different process, and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.
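You can watch the value vanish with a stripped-down version of your tee attempt (a minimal sketch against the same test.txt):
FIRST=''
cat test.txt | tee >(FIRST=$(grep 'first example' | wc -l)) > /dev/null
echo "$FIRST"    ## prints an empty line: FIRST was assigned inside the >(...) subshell and lost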
As a rule of thumb, you can write variable=$(foo | bar | baz), where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz, where it's on the inside. It won't work and you'll be sad.
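To see the rule in action, here's a quick sketch you can paste straight into a shell:
count=$(printf 'a\nb\nc\n' | wc -l)    ## outside: the assignment runs in the parent shell
echo "$count"    ## 3
unset count
printf 'a\nb\nc\n' | count=$(wc -l)    ## inside: the assignment runs in a pipeline subshell
echo "$count"    ## prints an empty line; the parent never saw the assignment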
But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.
Two greps
Getting rid of cat yields:
first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)
This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.
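Incidentally, grep can count matching lines on its own with -c, which spares you the wc process (a small variation on the two-liner above):
first=$(grep -c 'first example' test.txt)
second=$(grep -c 'second example' test.txt)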
One grep
You could improve on this by using a single grep call that searches for both strings. That works if you don't actually need the individual counts but just want the total:
total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)
Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:
matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)
Pure bash
You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.
First, let's start with a while read loop to process the file line by line:
while IFS= read -r line; do    ## IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
    ...
done < test.txt
You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:
first=0
second=0
while IFS= read -r line; do
    [[ $line == *'first example'* ]] && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < test.txt
echo "$first" ## should display 3
echo "$second" ## should display 2
Another language
If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.
Answered By - John Kugelman