Issue
I have a file with two columns separated by tabs as follows:
OG0000000 PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,PF00012,
I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
I tried to start this by using awk.
awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt
But my output looks like this, where duplicates still remain whenever the repeated string happens to be the first entry on the line:
OG0000000 PF03169,PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,PF07690,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,PF00012,
I realize the problem is that the first record awk grabs is everything up to the first comma, so it includes the first column, but I'm still rough with awk and couldn't figure out how to fix this without messing up that column. Thanks in advance!
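For example (a quick sketch, assuming the file is called file.txt and the columns really are tab-separated), printing the first few records awk sees with a comma record separator shows the problem:
awk 'BEGIN{RS=","} {printf "record %d: [%s]\n", NR, $0}' file.txt | head -4
record 1: [OG0000000 PF03169]
record 2: [PF03169]
record 3: [PF03169]
record 4: [MAC1_004431-T1]
The first column is stuck to the first entry in record 1, so that copy of PF03169 never matches the later ones, and the newline before the next ID ends up inside a later record.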
Solution
This awk should work for you:
awk -F '[\t,]' '
{
    # print the ID column followed by a tab
    printf "%s\t", $1

    # print each remaining field only the first time it appears on this line;
    # the trailing comma produces an empty last field, which is skipped
    for (i = 2; i <= NF; ++i)
        if ($i != "" && !seen[$i]++)
            printf "%s,", $i

    print ""       # terminate the output line
    delete seen    # reset the lookup table for the next line
}' file
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
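If you would rather keep the line split on tabs and only break apart the second column, an equivalent sketch (not from the original answer; the parts, out, and seen names are arbitrary) is:
awk 'BEGIN{FS=OFS="\t"} {
    n = split($2, parts, ",")          # break the second column on commas
    out = ""
    delete seen                        # reset the lookup table for each line
    for (i = 1; i <= n; ++i)
        if (parts[i] != "" && !seen[parts[i]]++)
            out = out parts[i] ","     # keep only the first occurrence
    $2 = out
    print                              # OFS keeps the tab between columns
}' file
Both versions keep the fields in their original order and preserve the trailing comma, matching the output you asked for.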
Answered By - anubhava