Issue
I have an input file containing one string per line
input.txt (25k lines in reality)
one
two
three
Then I have a directory with many files (50 files in reality)
2022-04-01.csv
stuff;one;more_stuff
stuff;one;more_stuff
2022-04-02.csv
stuff;one;more_stuff
stuff;two;more_stuff
2022-04-03.csv
stuff;two;more_stuff
stuff;three;more_stuff
stuff;three;more_stuff
I need to extract the earliest date on which each pattern appears. So the output in this case would be
one:2022-04-01.csv
two:2022-04-02.csv
three:2022-04-03.csv
I can use grep -l one *.csv
to get me a unique list of files the pattern appears in, but not for multiple patterns and not the single earliest date. If I could just get a list of files each pattern occurs in then I could manually extract the earliest date, I think, but I'm sure there must be a one-liner to do it all?
Solution
Using any awk:
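awk '
    BEGIN { FS=";"; OFS=":" }
    # First file (input.txt): store each value as an array index.
    NR==FNR {
        vals[$0]
        next
    }
    # CSV files: report the first file in which each value appears in field 2.
    $2 in vals {
        print $2, FILENAME
        delete vals[$2]
    }
' input.txt *.csv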
one:2022-04-01.csv
two:2022-04-02.csv
three:2022-04-03.csv
The NR==FNR{...} block stores all of the values from input.txt as indices of the array vals[], which I'm using as a hash table. The other block executes for every line read from the CSVs and tests if the current 2nd field from that line exists as an index in vals[] (i.e. does a hash lookup) and, if so, prints that value and the current file name, then removes that index from vals[] so no later occurrence of that same value can match.
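To see the lookup-then-delete idea in isolation, here is a minimal sketch. The input stream and the "first match:" label are made up for illustration and are not part of the question's data:

```shell
# Hypothetical five-line stream; only "one" and "two" are wanted values.
out=$(printf '%s\n' one two one three two | awk '
    BEGIN { vals["one"]; vals["two"] }   # seed the hash table
    $0 in vals {                         # hash lookup on the whole line
        print "first match:", $0
        delete vals[$0]                  # later repeats can no longer match
    }
')
printf '%s\n' "$out"
# first match: one
# first match: two
```

Each value is reported once, at its first occurrence, because its index is gone by the time it shows up again.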
This only works because your CSV files are named in such a way (YYYY-MM-DD.csv) that your shell's expansion of *.csv passes them to awk in the correct date order.
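If you can't rely on that ordering, one possible variation (a sketch, not something the question requires) is to remember the smallest file name seen for each value and print everything at the end. It assumes the file names still compare correctly as strings, as YYYY-MM-DD.csv names do; the arrays first[] and order[] are names I made up here. The sample data is recreated in a scratch directory and the CSVs are deliberately passed in reverse order:

```shell
# Recreate the sample data from the question in a temporary directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' one two three > input.txt
printf '%s\n' 'stuff;one;more_stuff' 'stuff;one;more_stuff' > 2022-04-01.csv
printf '%s\n' 'stuff;one;more_stuff' 'stuff;two;more_stuff' > 2022-04-02.csv
printf '%s\n' 'stuff;two;more_stuff' 'stuff;three;more_stuff' 'stuff;three;more_stuff' > 2022-04-03.csv

result=$(awk '
    BEGIN { FS=";"; OFS=":" }
    NR==FNR {
        if ( !($0 in vals) ) order[++n] = $0   # remember input.txt order
        vals[$0]
        next
    }
    # Keep the file name that sorts lowest as a string for each wanted value.
    ($2 in vals) && ( !($2 in first) || (FILENAME "") < first[$2] ) {
        first[$2] = FILENAME
    }
    END {
        for (i = 1; i <= n; i++)
            if ( order[i] in first )
                print order[i], first[order[i]]
    }
' input.txt 2022-04-03.csv 2022-04-02.csv 2022-04-01.csv)
printf '%s\n' "$result"
# one:2022-04-01.csv
# two:2022-04-02.csv
# three:2022-04-03.csv
```

The trade-off is that this version has to read every CSV to the end and can't stop at the first hit, so it will be slower than the script above when the files do arrive in date order.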
If and only if it's guaranteed that every value from input.txt will always appear in at least one of the CSVs then this would probably make the execution a bit faster most of the time, as suggested by @RenaudPacalet:
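awk '
    BEGIN { FS=";"; OFS=":" }
    # First file (input.txt): store each value and count the stored lines.
    NR==FNR {
        vals[$0]
        numVals++
        next
    }
    $2 in vals {
        print $2, FILENAME
        delete vals[$2]
        # Stop reading as soon as every value has been reported.
        if ( --numVals == 0 ) {
            exit
        }
    }
' input.txt *.csv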
Answered By - Ed Morton Answer Checked By - Timothy Miller (WPSolving Admin)