Tuesday, March 15, 2022

[SOLVED] Is there an easy and fast solution to compare two csv files in bash?

Issue

My problem: I have two large CSV files, each with millions of lines.

One file contains a backup of a database from my server and looks like:

securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...

Now I have another CSV file containing new codes, with the exact same schema.

I would like to compare the two and find only the codes that are not already on the server. Because a friend of mine generates the codes randomly, we want to be certain to only update codes that are not already on the server.

I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv

First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it consumed too many resources, so I figured there had to be a better way.
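As a sketch of a grep variant that yields the new-only codes directly, you can swap the file roles, add -v, and force the C locale, which often makes fixed-string matching much faster (note it still compares whole lines, and with millions of patterns it can still exhaust memory):

LC_ALL=C grep -F -x -v -f serverBackupSorted.csv newCodesSorted.csv > newOnly.csv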

I then used diff to only find new lines in newCodesSorted.csv like diff serverBackupSorted.csv newCodesSorted.csv.

I believe you can tell diff directly that you only want the lines unique to the second file, but I didn't understand how, so I grepped the output instead, knowing that I could cut/remove the unwanted characters later: diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
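As an aside, sed can do the '>' filtering and the prefix-stripping in one pass, so no later cleanup step is needed; a minimal sketch using the same files (the output name is just an example):

diff serverBackupSorted.csv newCodesSorted.csv | sed -n 's/^> //p' > newOnly.csv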

But I believe there has to be a better way.

So I'm asking: do you have any ideas for improving this method?

EDIT:

comm works great so far. But one thing I forgot to mention is that some of the codes on the server have already been scanned.

But new codes are always initialized with isScanned = false, so newCodes.csv would look something like:

securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...

I don't know whether it would be sufficient to use cut -d',' -f1 to reduce both files to just the codes and then use comm.

I tried that, once with grep and once with comm, and got different results. So I'm a bit unsure which one is the correct way ^^
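The differing results are most likely down to whole-line vs. code-only comparison: a code that exists on the server with isScanned = true won't whole-line-match its isScanned = false counterpart. A minimal sketch of the cut + comm idea, using process substitution so the unsorted originals can be fed in directly; comm -13 keeps only lines unique to the second input, and the shared securityCode header disappears because it occurs in both (newOnlyCodes.txt is just an example output name):

comm -13 <(cut -d',' -f1 serverBackup.csv | sort -u) <(cut -d',' -f1 newCodes.csv | sort -u) > newOnlyCodes.txt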


Solution

Yes! comm, a highly underrated tool, is great for this. Stolen examples from here.

Show lines that only exist in file a (i.e. what was deleted from a):
comm -23 a b

Show lines that only exist in file b (i.e. what was added to b):
comm -13 a b

Show lines that only exist in one file or the other (but not both):
comm -3 a b | sed 's/^\t//'

As noted in the comments, the files do need to be sorted beforehand for comm to work. The following sorts them as part of the command (-12 here shows the lines common to both files):
comm -12 <(sort a) <(sort b)
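Applied to the files from the question, the second variant above is the one you want, i.e. lines that exist only in the new-codes file. Note that comm compares whole lines, so if isScanned can differ between the files, use the cut -f1 variant sketched in the edit above:

comm -13 serverBackupSorted.csv newCodesSorted.csv > newOnly.csv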

If you do prefer to stick with diff, you can get it to do what you want without the grep; the %> group format emits only the lines unique to the second file:

diff --changed-group-format='%>' --unchanged-group-format='' 1.txt 2.txt

You could then alias that diff command to "comp" or something similar to allow you to just:

comp 1.txt 2.txt

That might be handy if this is a command you are likely to use often in future.
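A minimal sketch of that alias (watch the nested quoting; and since aliases only expand in interactive shells, a small function is the safer choice inside scripts):

alias comp="diff --changed-group-format='%>' --unchanged-group-format=''"
comp 1.txt 2.txt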



Answered By - axwr
Answer Checked By - Cary Denson (WPSolving Admin)