Issue
I am trying to clean duplicate lines out of my log file. First I tried the sort command with the uniq -d flag; it helped me find duplicates, but it did not solve my problem.
sort pnum.log | uniq -d
Output of the sort command:
PNUM-1233: [App] [Tracker] Text
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1234: [App] [Tracker] Tex 123 ssd
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
The sort command removes exact duplicates, but unfortunately I also need to remove lines with repeated PNUMs and keep only one line per PNUM, the one with the longest text. In the example output that would be "PNUM-1234: [App] [Tracker] Text 123 ssd vp", and the two other lines with PNUM-1234 should be removed from the file. How can this be achieved? Is there a Linux command, like sort, that could help me do this?
The expected result would be:
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
Solution
sort | uniq -d doesn't remove duplicates, it prints one of each batch of lines that are duplicates. You should probably be using sort -u instead; that will remove duplicates.
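To make the difference concrete, here is a minimal sketch using a hypothetical three-line sample file (dup_demo.txt is an invented name, not from the question):

```shell
# Create a small sample: one line duplicated, one line unique
printf '%s\n' 'PNUM-1233: a' 'PNUM-1233: a' 'PNUM-1235: b' > dup_demo.txt

# uniq -d prints one copy of each *duplicated* line only,
# silently dropping lines that appear just once:
sort dup_demo.txt | uniq -d

# sort -u keeps one copy of *every* distinct line:
sort -u dup_demo.txt
```

The first pipeline prints only "PNUM-1233: a" and loses "PNUM-1235: b" entirely; the second prints both distinct lines once each.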
But to answer the question you asked:
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
The first awk command just prepends each line with its length so the subsequent sort can sort all of the lines longest-first; then the second awk only outputs a line when it's the first occurrence of the key field value (which, after the sort, is the longest line with that key value); and finally the cut removes the line length that the first awk added.
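The same "keep the longest line per key" logic can also be done in a single awk pass, without sort or cut. This is a sketch, not the answer's own method; it recreates the question's sample input (with no trailing whitespace, so the two PNUM-1236 lines are exact duplicates here), and note that the END loop does not preserve any particular output order:

```shell
# Recreate the sample input from the question
cat > pnum.log <<'EOF'
PNUM-1233: [App] [Tracker] Text
PNUM-1233: [App] [Tracker] Text
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1234: [App] [Tracker] Tex 123 ssd
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1234: [App] [Tracker] Text 123 ssd vp
EOF

# Keep only the longest line per key ($1, e.g. "PNUM-1234:");
# on a length tie the first line seen wins because of the strict ">"
awk '
    length($0) > len[$1] { len[$1] = length($0); line[$1] = $0 }
    END { for (k in line) print line[k] }
' pnum.log
```

This prints one line per PNUM (the longest), but in arbitrary order; the sort-based pipeline above is the way to go if you care about ordering by length.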
In sequence:
$ awk '{print length($0), $0}' file
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
39 PNUM-1236: [App] [Tracker] Text ddfg
36 PNUM-1236: [App] [Tracker] Text ddfg
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
38 PNUM-1235: [App] [Tracker] Text 1dbg
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
$
$ awk '{print length($0), $0}' file | sort -k1,1rn
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1234: [App] [Tracker] Tex 123 ssd
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
36 PNUM-1236: [App] [Tracker] Text ddfg
31 PNUM-1233: [App] [Tracker] Text
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++'
42 PNUM-1234: [App] [Tracker] Text 123 ssd vp
39 PNUM-1236: [App] [Tracker] Text ddfg
38 PNUM-1235: [App] [Tracker] Text 1dbg
31 PNUM-1233: [App] [Tracker] Text
$
$ awk '{print length($0), $0}' file | sort -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
PNUM-1234: [App] [Tracker] Text 123 ssd vp
PNUM-1236: [App] [Tracker] Text ddfg
PNUM-1235: [App] [Tracker] Text 1dbg
PNUM-1233: [App] [Tracker] Text
You didn't say which line to print if multiple lines for the same key value are the same length, so the above will just output one of them at random. If that's an issue then you can use GNU sort and add the -s argument (for stable sort), or change the command line to awk '{print length($0), NR, $0}' file | sort -k1,1rn -k2,2n | awk '!seen[$3]++' | cut -d' ' -f3- ; in both cases that would ensure the line output in such a conflict is the first one that was present in the input.
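A quick demonstration of the tie case with GNU sort -s, using a hypothetical two-line file (ties.txt is an invented name) where both lines have the same key and the same length:

```shell
# Two same-key, same-length lines; only input order distinguishes them
printf '%s\n' 'PNUM-9999: aaa' 'PNUM-9999: bbb' > ties.txt

# With -s (stable sort), equal sort keys keep their input order,
# so the first input line wins the tie in the dedup step:
awk '{print length($0), $0}' ties.txt | sort -s -k1,1rn | awk '!seen[$2]++' | cut -d' ' -f2-
```

This prints "PNUM-9999: aaa", the first of the tied lines from the input; without -s, which of the two you get is not guaranteed.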
Answered By - Ed Morton
Answer Checked By - Candace Johnson (WPSolving Volunteer)