Thursday, October 20, 2022

[SOLVED] How do I retain only strings after a pattern in a column

October 20, 2022 awk, sed

Issue

I would appreciate any help with my file. I have a file with 4 columns:

Accession   Description     logFC  p-value
P78527  DNA-dependent protein kinase catalytic subunit OS=Homo sapiens GN=PRKDC PE=1 SV=3       -0.183343951    0.006947985
Q13085  Acetyl-CoA carboxylase 1 OS=Homo sapiens GN=ACACA PE=1 SV=2     -1.250658294    0.012223886
A0A1W7HHM5      Major DNA-binding protein OS=Epstein-Barr virus (strain GD1) GN=DBP PE=3 SV=1   0.176282017     2.69897E-05
A0A0S2YRG9      Major DNA-binding protein OS=Epstein-Barr virus (strain GD1) GN=BALF2 PE=3 SV=1 2.707961346     0.015657277

I want to retain only the gene name that follows the pattern "GN=" in column 2, to get output like this:

Accession   Description logFC   p-value
P78527  PRKDC   -0.183343951    0.006947985
Q13085  ACACA   -1.250658294    0.012223886
A0A1W7HHM5  DBP 0.176282017     2.69897E-05
A0A0S2YRG9  BALF2   2.707961346     0.015657277

I tried this code, but it dropped column 1 and still kept part of column 2:

awk -F"GN=" '/GN=/{print $2}' file
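For reference, here is a small sketch of what that command does. With -F"GN=" the whole line is split on the literal string GN=, so $1 is everything before it (accession included) and $2 is everything after it (the gene name plus the rest of the description and the numeric columns). The header line contains no GN=, so the /GN=/ filter skips it. The row below is taken from the sample; tab-separated columns and the variable names line, before and after are assumptions made for illustration.

```shell
# One sample row, assuming the columns are separated by tabs.
line=$(printf 'Q13085\tAcetyl-CoA carboxylase 1 OS=Homo sapiens GN=ACACA PE=1 SV=2\t-1.250658294\t0.012223886')

# Split the whole line on the literal string "GN=" and look at both halves.
before=$(printf '%s\n' "$line" | awk -F"GN=" '{ print $1 }')
after=$(printf '%s\n' "$line" | awk -F"GN=" '{ print $2 }')

# Brackets make leading/trailing whitespace visible.
printf 'before: [%s]\nafter:  [%s]\n' "$before" "$after"
```

Printing only $2 therefore loses the accession and keeps the trailing "PE=1 SV=2" text together with the last two columns.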

Solution

One awk idea:

awk -v ptn="GN" '                          # define our search pattern
BEGIN  { FS=OFS="\t" }                     # input/output field delimiter is a tab
FNR==1 { print; next }                     # print header record; skip to next line of input
       { n=split($2,a,/[[:space:]]/)       # split 2nd field on white space; store results in array a[]
         for (i=1;i<=n;i++) {              # loop through a[] array
             m=split(a[i],b,/=/)           # split each item on "="; store results in array b[]
             if (b[1]==ptn) {              # if we found "<ptn>=" then ...
                $2=b[2]                    # reset entire 2nd field to b[2] and ...
                break                      # break out of loop
             }
         }
         print                             # print current line to stdout
       }
' file

This generates:

Accession       Description     logFC   p-value
P78527  PRKDC   -0.183343951    0.006947985
Q13085  ACACA   -1.250658294    0.012223886
A0A1W7HHM5      DBP     0.176282017     2.69897E-05
A0A0S2YRG9      BALF2   2.707961346     0.015657277
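A shorter variant of the same idea, offered only as a sketch and not as part of the original answer: rather than splitting the description into words, it uses awk's match() to locate "GN=" followed by non-space characters, then keeps the part after the "=". It assumes the file is tab-separated and that each description holds at most one GN= token; the temporary file name in sample and the variable out are made up for the demonstration.

```shell
# Build a small tab-separated sample (rows taken from the question).
sample=$(mktemp)
printf 'Accession\tDescription\tlogFC\tp-value\n' > "$sample"
printf 'Q13085\tAcetyl-CoA carboxylase 1 OS=Homo sapiens GN=ACACA PE=1 SV=2\t-1.250658294\t0.012223886\n' >> "$sample"
printf 'A0A1W7HHM5\tMajor DNA-binding protein OS=Epstein-Barr virus (strain GD1) GN=DBP PE=3 SV=1\t0.176282017\t2.69897E-05\n' >> "$sample"

out=$(awk '
BEGIN { FS = OFS = "\t" }                     # tab-delimited input and output
FNR > 1 && match($2, /GN=[^ ]+/) {            # data rows whose 2nd field has GN=<name>
    $2 = substr($2, RSTART + 3, RLENGTH - 3)  # keep the name; skip the 3 chars of "GN="
}
{ print }                                     # header and unmatched rows pass through unchanged
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Rows without a GN= token are printed as they are, which matches the behaviour of the loop-based version above.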

Answered By - markp-fuso

Answer Checked By - Marie Seifert (WPSolving Admin)

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
