Issue
I have the following data containing a subset of record numbers formatting like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note, both files are big, so the sample shown above don't have duplicates, but I do expect and have confirmed at least 10-12k duplicates to show up in total)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results work fine, so I'm a little bit confused on why it is failing on the larger execution.
I also tried feeding it in as a string with -F thinking that the "," comma might be the source of issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly but at least it will help me cross out the regex possibility.
Solution
the -x or -w option is needed to do an exact match.
-x will match exact string, and -w will match exact substring and block non-word characters which works in my case to handle trailing numbers.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
Answered By - Ben C Wang