Issue
I am trying to clean duplicate lines out of my log file. First I tried the sort command with the uniq -d flag; it helped me find duplicates, but it did not solve my problem.
sort pnum.log | uniq -d
Output of the sort command:
PNUM-1233: [App] [Tracker] Text
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1234: [App] [Tracker] Tex 123 ssd
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
The sort command removes exact duplicates, but unfortunately I also need to remove lines with repeated PNUMs and keep only one line per PNUM, the one with the longest text. In the example output that would be "PNUM-1234: [App] [Tracker] Text 123 ssd vp", and the two other lines with PNUM-1234 should be removed from the file. How can this be achieved? Is there a Linux command, like sort, that could help me do this?
The expected result would be:
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
Solution
sort | uniq -d doesn't remove duplicates, it prints one of each batch of lines that are duplicates. You should probably be using sort -u instead; that will remove duplicates.
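To make the difference concrete, here is a minimal sketch using a hypothetical three-line sample file (dup_demo.txt is an invented name, not from the question):

```shell
# Create a small sample: one line duplicated, one line unique
printf '%s\n' 'PNUM-1233: a' 'PNUM-1233: a' 'PNUM-1235: b' > dup_demo.txt

# uniq -d prints one copy of each *duplicated* line only,
# silently dropping lines that appear just once:
sort dup_demo.txt | uniq -d

# sort -u keeps one copy of *every* distinct line:
sort -u dup_demo.txt
```

The first pipeline prints only "PNUM-1233: a" and loses "PNUM-1235: b" entirely; the second prints both distinct lines once each.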
But to answer the question you asked:
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
The first awk command just prepends each line with its length so the subsequent sort can sort all of the lines longest-first; then the second awk only outputs a line when it's the first occurrence of the key field value (which, after the sort, is the longest line with that key value); and finally the cut removes the line length that the first awk added.
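The same "keep the longest line per key" logic can also be done in a single awk pass, without sort or cut. This is a sketch, not the answer's own method; it recreates the question's sample input (with no trailing whitespace, so the two PNUM-1236 lines are exact duplicates here), and note that the END loop does not preserve any particular output order:

```shell
# Recreate the sample input from the question
cat > pnum.log <<'EOF'
PNUM-1233: [App] [Tracker] Text
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1234: [App] [Tracker] Tex 123 ssd
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
EOF

# Keep only the longest line per key ($1, e.g. "PNUM-1234:");
# on a length tie the first line seen wins because of the strict ">"
awk '
    length($0) > len[$1] { len[$1] = length($0); line[$1] = $0 }
    END { for (k in line) print line[k] }
' pnum.log
```

This prints one line per PNUM (the longest), but in arbitrary order; the sort-based pipeline above is the way to go if you care about ordering by length.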
In sequence:
$ awk '{print length($0), $0}' file
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
39 PNUM-1236: [App] [Tracker] Text ddfg
36 PNUM-1236: [App] [Tracker] Text ddfg
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
38 PNUM-1235: [App] [Tracker] Text 1dbg
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
$
$ awk '{print length($0), $0}' file | sort -k1,1rn
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
36 PNUM-1236: [App] [Tracker] Text ddfg
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++'
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
You didn't say which line to print if multiple lines for the same key value are the same length, so the above will just output one of them at random. If that's an issue then you can use GNU sort and add the -s argument (for stable sort), or change the command line to awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3- ; in both cases that would ensure the line output in such a conflict is the first one that was present in the input.
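A quick demonstration of the tie case with GNU sort -s, using a hypothetical two-line file (ties.txt is an invented name) where both lines have the same key and the same length:

```shell
# Two same-key, same-length lines; only input order distinguishes them
printf '%s\n' 'PNUM-9999: aaa' 'PNUM-9999: bbb' > ties.txt

# With -s (stable sort), equal sort keys keep their input order,
# so the first input line wins the tie in the dedup step:
awk '{print length($0), $0}' ties.txt | sort -s -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
```

This prints "PNUM-9999: aaa", the first of the tied lines from the input; without -s, which of the two you get is not guaranteed.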
Answered By - Ed Morton
Answer Checked By - Candace Johnson (WPSolving Volunteer)