Thursday, October 28, 2021

[SOLVED] Is it more efficient to grep twice or use a regular expression once?

Issue

I'm trying to parse a couple of 2gb+ files and want to grep on a couple of levels.

Say I want to fetch only the lines that contain both "foo" and "bar".

I could do grep foo file.log | grep bar, but my concern is that running grep twice will be expensive.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?


Solution

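Note that a plain alternation such as grep -E '(foo|bar)' finds lines containing 'foo' OR 'bar'. The pattern from your question, grep -E '(foo.*bar|bar.*foo)', does match lines containing both, in either order.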

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be more efficient than your grep | grep construct because:

  • Both sed and awk read the file themselves in a single process, so there is no second process and no pipe overhead
  • The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', so only lines containing 'foo' are tested against the /bar/ regex
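
The short-circuiting mentioned above can be spelled out explicitly. This is only a sketch on a made-up three-line input (the variable names and sample lines are not from the question); it shows that the compact awk program and a longer form that skips a line as soon as /foo/ fails select the same line.

```shell
# Hypothetical three-line input, used only for illustration.
input=$(printf '%s\n' 'foo bar' 'foo only' 'bar only')

# Compact form from the answer: /bar/ is only tried when /foo/ matched.
short=$(printf '%s\n' "$input" | awk '/foo/ && /bar/')

# Same logic written out: drop the line as soon as /foo/ fails,
# then print it only if /bar/ matches as well.
explicit=$(printf '%s\n' "$input" | awk '!/foo/ { next } /bar/ { print }')

printf '%s\n' "$short"
printf '%s\n' "$explicit"
```

Both printf calls should print the single line "foo bar".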

However, I haven't benchmarked them. YMMV :)
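
A quick way to check things for yourself, as a rough sketch: build a tiny sample file (sample.log and its contents are made up here, not taken from the question) and confirm that the pipeline, the single regex, sed, and awk all select the same lines before timing anything on the real data.

```shell
# Hypothetical sample file; the name and contents are assumptions.
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > sample.log

# Each approach should select the same two lines.
out_pipe=$(grep foo sample.log | grep bar)
out_regex=$(grep -E '(foo.*bar|bar.*foo)' sample.log)
out_sed=$(sed '/foo/!d;/bar/!d' sample.log)
out_awk=$(awk '/foo/ && /bar/' sample.log)

# Show what was selected; the other three variables should hold the same text.
printf '%s\n' "$out_awk"
```

To compare speed on the real file, prefix each command with time and discard the output, for example: time awk '/foo/ && /bar/' file.log > /dev/null. Run each variant more than once, since the first run may be dominated by disk reads rather than by matching.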



Answered By - pepoluan