Issue
I am currently working on a big file that could look like that:
8820 368926 0.5219 360106
8820 554838 0.5772 546018
103715 388299 0.6983 284584
103715 422030 0.6981 318315
103715 574002 0.5203 470287
334933 335275 0.6090 342
334933 335334 1.0000 401
334933 335929 0.5571 996
334933 336392 0.7012 1459
I would like, for each value in column 1 (so here, for 8820, 103715, etc) to keep all the rows that follow, as long as my values in column 3 do not fall beyond some threshold. If for example I fix the threshold at 0.6, I would like to have as output
103715 388299 0.6983 284584
103715 422030 0.6981 318315
334933 335275 0.6090 342
334933 335334 1.0000 401
For 8820, the first value in column 3 is already below the threshold, so I don't want those rows at all. For 103715, I keep the first two rows, and forget about the rest. And same for 334933.
I know how to keep the lines as long as the value is above some threshold using awk, e.g.
awk '{if ($3>0.6) print}' file
I also know how to select some rows based on a specific pattern, e.g.
awk '{$1 ~ /^8820/}' file
but here the pattern would be changing all the time (each new value on the first column). So I'm a bit confused...
Any idea? Awk is not mandatory, as long as the result is obtained rather quickly (we are talking about millions of rows, but I'm using a cluster).
Solution
awk '
p1 != $1 { ok=1; p1=$1 }
$3 < threshold { ok=0 }
ok;
' threshold=0.6 file
- reset state on each new group
- clear flag if below threshold
- print, if flag still set
Answered By - jhnc Answer Checked By - Terry (WPSolving Volunteer)