Issue
I have a file like this:
reference 25038 A G 39134 1 TPPH54 TPPH49 TPPH50 TPPHL51 TPPH52 TPPH53 TPPH55 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 TPPH49 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 TPPH30 TPPH32 p.Gly48Gly
and I would like to get:
reference 25038 A G 39134 1 TPPH54 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 p.Gly48Gly
How can I use awk/sed/grep to remove all the patterns after the first one (always $7) that have the same beginning?
I was thinking something like:
- only print the first 7 columns and the last one:
paste <(awk '{print $1, $2, $3, $4, $5, $6, $7}' file) <(awk '{print ????}' file-tmp) > file-final
but I don't know how to get the last one, because the number of columns can be different in each row
- or 'scan' the file for expressions beginning with 'TPPH', keep the first one and remove the others in each row. I'm not sure how to do that
Thank you very much in advance for your help!
Solution
Using GNU sed (each pass of the loop removes one extra TPPH field after $7, until none are left)
$ sed -E ':a;s/(([^ \t]*[ \t]+){6}TPPH[0-9]+)[ \t]+TPPH[^ \t]*[ \t]+/\1\t/;ta' input_file
reference 25038 A G 39134 1 TPPH54 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 p.Gly48Gly
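The two ideas from the question can also be sketched in awk. awk's built-in NF variable holds the number of fields on the current line, so $NF is the last field no matter how many TPPH entries a row has. The first command below assumes the annotation is always the last field; the second instead drops every field after $7 that starts with TPPH. The file name sample.txt and the variable names are made up for illustration, and the input is assumed to be whitespace-separated.

```shell
# Hypothetical sample file holding the three rows from the question.
cat > sample.txt <<'EOF'
reference 25038 A G 39134 1 TPPH54 TPPH49 TPPH50 TPPHL51 TPPH52 TPPH53 TPPH55 p.Thr10198Thr
reference 77940 T C 5131 1 TPPH54 TPPH49 p.Asn898Asp
reference 77940 T C 5131 1 TPPH29 TPPH30 TPPH32 p.Gly48Gly
EOF

# Idea 1: print fields 1-7 plus the last field ($NF).
by_last_field=$(awk '{print $1, $2, $3, $4, $5, $6, $7, $NF}' sample.txt)

# Idea 2: keep fields 1-7, then skip any later field that starts with TPPH.
by_prefix=$(awk '{
  out = $1
  for (i = 2; i <= NF; i++)
    if (i <= 7 || $i !~ /^TPPH/)
      out = out " " $i
  print out
}' sample.txt)

printf '%s\n' "$by_last_field"
```

Both commands join the kept fields with single spaces. The second one is the safer choice if a row might carry more than one non-TPPH field after $7.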
Answered By - HatLess
Answer Checked By - Gilberto Lyons (WPSolving Admin)