Issue
I have been handed two files from a preprocessing pipeline.
#fileA.csv
87687,"institute Polytechnic, Brazil"
342424,"university of India, India"
24343,"univefrsity columbia, Bogata, Colombia"
82739, "Hero univetsity, greece"
....
<3million lines>
#fileB.csv
342424
82739
...
<some 2 million entries>
I want to filter fileA.csv using fileB.csv: that is, I want to keep only the lines of fileA.csv whose first column (an ID) appears in fileB.csv. In other words, for each row in fileA.csv, if the first-column entry is not present in fileB.csv, delete the line.
I am not exactly sure how to go about this in bash (which I'd prefer) rather than writing it in Python (for each row, check whether the first entry is in the list of IDs and filter accordingly).
In the above trivial example, the output would just be:
#result.csv
342424,"university of India, India"
82739, "Hero univetsity, greece"
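In Python, I'd do roughly this:
with open("fileB.csv") as f:
    fileB_ids = {line.strip() for line in f}  # a set gives fast membership tests
with open("fileA.csv") as src, open("result.csv", "w") as dst:
    for line in src:
        # the first column is everything before the first comma
        if line.split(",", 1)[0] in fileB_ids:
            dst.write(line)  # keep only rows whose ID is listed in fileB.csv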
UPDATE
I have tried the solutions posted, but there seems to be some inconsistency (maybe it is my fault!):
fileB.csv has 29206428 lines
fileA.csv has 32128236 lines.
I was hoping the result file would contain 29206428 lines, but instead it has 30932039 lines. This seems logically impossible given the rule (for each row in fileA.csv, if the first-column entry is not present in fileB.csv, delete the line) :D and I wonder what is going on.
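One thing that might explain the count, assuming IDs can repeat in fileA.csv: the rule keeps every line whose ID is listed in fileB.csv, so an ID that occurs on several lines of fileA.csv contributes several lines to the result, and the result can then be longer than fileB.csv. fileB.csv itself may also contain repeated IDs. Below is a minimal sketch for checking this on a sample; the function name `summarize_ids` is hypothetical, and stripping whitespace around IDs is an assumption of this sketch, not something the original rule states.

```python
from collections import Counter


def summarize_ids(file_a_lines, file_b_lines):
    """Compare distinct IDs with kept lines (hypothetical helper, for diagnosis only)."""
    # Distinct IDs listed in fileB; blank lines are ignored (assumption).
    wanted = {line.strip() for line in file_b_lines if line.strip()}

    # How many lines of fileA carry each ID (text before the first comma).
    counts = Counter(
        line.split(",", 1)[0].strip() for line in file_a_lines if line.strip()
    )

    return {
        "distinct_ids_in_b": len(wanted),
        # Lines that survive the filter: every occurrence of a wanted ID.
        "kept_lines": sum(n for id_, n in counts.items() if id_ in wanted),
        # Distinct wanted IDs that actually occur in fileA.
        "distinct_ids_kept": sum(1 for id_ in counts if id_ in wanted),
    }
```

If `kept_lines` is larger than `distinct_ids_kept`, some IDs appear on more than one line of fileA.csv, which alone would make the result longer than the number of distinct IDs in fileB.csv.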
Solution
You may use this awk solution:
awk -F, 'FNR == NR { exists[$1]; next }
$1 in exists' fileB.csv fileA.csv > result.csv
cat result.csv
342424,"university of India, India"
82739, "Hero univetsity, greece"
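To make the idiom concrete, here is a rough Python sketch of what the awk command does; it is an illustration, not a replacement. `FNR == NR` holds only while awk reads the first file named on the command line (fileB.csv), so its IDs are stored as keys of `exists`. For the second file, a line is printed only if its first comma-separated field is one of those keys. The function name `awk_style_filter` and the use of in-memory lists of lines are hypothetical choices for this sketch.

```python
def awk_style_filter(files):
    """Mimic: awk -F, 'FNR == NR { exists[$1]; next } $1 in exists' fileB fileA.

    `files` is a list of line lists in command-line order (fileB first).
    """
    exists = set()   # plays the role of the awk array `exists`
    output = []
    nr = 0           # NR: records read so far across all files
    for lines in files:
        fnr = 0      # FNR: records read so far in the current file
        for line in lines:
            line = line.rstrip("\n")
            nr += 1
            fnr += 1
            first = line.split(",")[0]  # $1 when the separator is ","
            if fnr == nr:
                # Only true while reading the first non-empty file.
                # Caveat: if the first file is empty, this is also true
                # for the second file, and nothing is printed.
                exists.add(first)
                continue  # awk's `next`
            if first in exists:
                output.append(line)  # awk's default action: print the line
    return output
```

Called with the sample data as `awk_style_filter([file_b_lines, file_a_lines])`, it returns the fileA lines in their original order, keeping every line whose ID is listed in fileB, including repeated IDs.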
Answered By - anubhava Answer Checked By - Terry (WPSolving Volunteer)