Saturday, January 27, 2024

[SOLVED] Subset the data based on one column then keep some rows based on another column

Issue

I am currently working on a big file that could look like that:

8820    368926  0.5219  360106
8820    554838  0.5772  546018
103715  388299  0.6983  284584
103715  422030  0.6981  318315
103715  574002  0.5203  470287
334933  335275  0.6090  342
334933  335334  1.0000  401
334933  335929  0.5571  996
334933  336392  0.7012  1459

I would like, for each value in column 1 (so here, for 8820, 103715, etc) to keep all the rows that follow, as long as my values in column 3 do not fall beyond some threshold. If for example I fix the threshold at 0.6, I would like to have as output

103715  388299  0.6983  284584
103715  422030  0.6981  318315
334933  335275  0.6090  342
334933  335334  1.0000  401

For 8820, the first value in column 3 is already below the threshold, so I don't want those rows at all. For 103715, I keep the first two rows, and forget about the rest. And same for 334933.

I know how to keep the lines as long as the value is above some threshold using awk, e.g.

awk '{if ($3>0.6) print}' file

I also know how to select some rows based on a specific pattern, e.g.

awk '{$1 ~ /^8820/}' file

but here the pattern would be changing all the time (each new value on the first column). So I'm a bit confused...

Any idea? Awk is not mandatory, as long as the result is obtained rather quickly (we are talking about millions of rows, but I'm using a cluster).


Solution

awk '
    p1 != $1 { ok=1; p1=$1 }
    $3 < threshold { ok=0 }  
    ok;
' threshold=0.6 file
  • reset state on each new group
  • clear flag if below threshold
  • print, if flag still set


Answered By - jhnc
Answer Checked By - Terry (WPSolving Volunteer)