Issue
I would appreciate any help with my file. It has 4 columns:
Accession Description logFC p-value
P7852 DNA-dependent protein kinase catalytic subunit OS=Homo sapiens GN=PRKDC PE=1 SV=3 -0.183343951 0.006947985
Q13085 Acetyl-CoA carboxylase 1 OS=Homo sapiens GN=ACACA PE=1 SV=2 -1.250658294 0.012223886
A0A1W7HHM5 Major DNA-binding protein OS=Epstein-Barr virus (strain GD1) GN=DBP PE=3 SV=1 0.176282017 2.69897E-05
A0A0S2YRG9 Major DNA-binding protein OS=Epstein-Barr virus (strain GD1) GN=BALF2 PE=3 SV=1 2.707961346 0.015657277
I want to keep only the gene name that follows the pattern "GN=" in column 2, to get output like this:
Accession Description logFC p-value
P78527 PRKDC -0.183343951 0.006947985
Q13085 ACACA -1.250658294 0.012223886
A0A1W7HHM5 DBP 0.176282017 2.69897E-05
A0A0S2YRG9 BALF2 2.707961346 0.015657277
I tried this code, but it dropped column 1 and still kept part of column 2:
awk -F"GN=" '/GN=/{print $2}' file
Solution
One awk idea:
awk -v ptn="GN" '                          # define our search pattern
BEGIN  { FS=OFS="\t" }                     # input/output field delimiter is a tab
FNR==1 { print; next }                     # print header record; skip to next line of input
       { n=split($2,a,/[[:space:]]/)       # split 2nd field on white space; store results in array a[]
         for (i=1;i<=n;i++) {              # loop through a[] array
             m=split(a[i],b,/=/)           # split each item on "="; store results in array b[]
             if (b[1]==ptn) {              # if we found "<ptn>=" then ...
                $2=b[2]                    # reset entire 2nd field to b[2] and ...
                break                      # break out of loop
             }
         }
         print                             # print current line to stdout
       }
' file
This generates:
Accession Description logFC p-value
P7852 PRKDC -0.183343951 0.006947985
Q13085 ACACA -1.250658294 0.012223886
A0A1W7HHM5 DBP 0.176282017 2.69897E-05
A0A0S2YRG9 BALF2 2.707961346 0.015657277
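A shorter variation on the same idea, offered only as a sketch: instead of splitting the 2nd field and looping, match() can locate "GN=<name>" directly and substr() can pull out the name. This assumes the file really is tab-delimited (as the script above does) and that the gene name contains no spaces. The wrapper function name extract_gene is made up for illustration.

```shell
# Hypothetical helper: extract_gene is an illustrative name, not part of the original answer.
# Assumes tab-delimited input and a gene name without spaces.
extract_gene() {
    awk '
    BEGIN  { FS = OFS = "\t" }               # tab-delimited input and output
    FNR==1 { print; next }                   # pass the header through unchanged
    match($2, /GN=[^ ]+/) {                  # find "GN=" followed by non-space characters
        $2 = substr($2, RSTART + 3, RLENGTH - 3)   # keep only the text after "GN="
    }
    { print }                                # lines without "GN=" are printed as-is
    ' "$@"
}
```

Run it as "extract_gene file". With no file argument it reads from standard input.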
Answered By - markp-fuso Answer Checked By - Marie Seifert (WPSolving Admin)