Issue
I have two files:
candidates.csv:
id,value
1,123
4,1
2,5
50,5
blacklist.csv:
1
2
5
3
10
I'd like to remove all rows from candidates.csv in which the first column (id) has a value contained in blacklist.csv. id is always numeric. In this case I'd like my output to look like this:
id,value
4,1
50,5
So far, my script for identifying the duplicate lines looks like this:
cat candidates.csv | cut -d \, -f 1 | grep -f blacklist.csv -w
This gives me the output
1
2
Now I somehow need to pipe this information back into sed/awk/gawk/... to delete the duplicates, but I don't know how. Any ideas how I can continue from here? Or is there a better solution altogether? My only restriction is that it has to run in bash.
Solution
What about the following:
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
How does this work?
An awk program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
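As a quick illustration (the file data.txt and the threshold are hypothetical, purely for demonstration), a two-pair program could be:
awk '$2 > 10 { print $1 } END { print "done" }' data.txt
The first pair prints field one of every record whose second field exceeds 10; the special END pattern fires once after all input has been read.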
where condition is typically an expression and action a series of commands. Here, the two condition-action pairs read:
(NR==FNR){a[$1];next}
if the total record count NR equals the record count of the current file FNR (i.e. if we are reading the first file), store the first field as a key in array a and skip to the next record (do not do anything else).
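To see why NR==FNR singles out the first file, one can print both counters while awk reads the two files (a small diagnostic, not part of the solution):
awk '{ print FILENAME, NR, FNR }' blacklist.csv candidates.csv
For blacklist.csv both counters stay equal (1 1 through 5 5); once candidates.csv starts, FNR resets to 1 while NR keeps counting (6 1, 7 2, ...), so NR==FNR holds only while reading the first file.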
!($1 in a)
if the first field is not a key in array a, then perform the default action, which is to print the line. This only takes effect on the second file: while reading the first file, next skips this pattern entirely, and on the second file the condition of the first condition-action pair no longer holds.
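Running the one-liner on the two example files from the question confirms the desired output:
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
which prints:
id,value
4,1
50,5
Note that the header line survives because the string id never appears as a key in a.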
Answered By - kvantour