Issue
I have a file that looks like this. It has 12 columns and 3244343 rows. Lets call this file 1.
variant_id gene_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se pval_nominal_threshold min_pval_nominal pval_beta
chr10_100000235_C_T_b38 ENSG00000227232.5 35211 73 74 0.061157 1.69779e-08 0.510322 0.0890939 0.0006160191.01823e-08 1.17701e-05
chr10_100002628_A_C_b38 ENSG00000227232.5 635545 126 130 0.107438 1.01823e-08 0.405406 0.0696647 0.0006160191.01823e-08 1.17701e-05
chr1_666028_G_A_b38 ENSG00000227232.5 636475 111 115 0.0950413 2.78462e-08 0.411513 0.0729864 0.0006160191.01823e-08 1.17701e-0
I have another file that looks like this with 7 headers and with 1633293 rows. This is file 2.
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 101762385
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
I am only interested in the variant_id
column. This is the first column in both files.
How can I compare these two columns and only print those variant_id
values in the first column that are not found in the second file. For the above example, the output should be
chr1_666028_G_A_b38
because it's found in the first file but not the second.
All the values of the variant_id
in the second file are also in the first file. But there are extra IDs in the first file not present in the second, which I want to identify.
I am using the command below:
$ awk '{print $1}' file2 > file2.names
$ grep -vf file2.names file1
However, the output is just all the values of variant_id in file 1.
Solution
1st solution: With your shown samples please try following awk
code, written and tested in GNU awk
. Both the codes will print 1st field from file1 if its NOT present in 2nd field, its taking care of addiotnal "
present before and after $1 in file2.
awk '
FNR==1 { next }
FNR==NR{
gsub(/^"|"$/,"",$1)
arr[$1]
next
}
!($1 in arr){
print $1
}
' file2 file1
2nd solution OR without using gsub
as used in above awk
code a slight different from above one, where I am not doing substitution of "
values present before and after in $1, I am simply wrapping $1 of file1 and checking if that is present in file2 or not here.
awk '
FNR==1 { next }
FNR==NR{
arr[$1]
next
}
!("\""$1"\"" in arr){
print $1
}
' file2 file1
Answered By - RavinderSingh13 Answer Checked By - Dawn Plyler (WPSolving Volunteer)