Issue
I have a VCF file whose header contains the sample IDs. It looks like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20220214
##source=PLINKv1.90
##contig=<ID=1,length=249212497>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-1117F_GTEX-1117F GTEX-111CU_GTEX-111CU GTEX-111FC_GTEX-111FC GTEX-111VG_GTEX-111VG GTEX-111YS_GTEX-111YS GTEX-1122O_GTEX-1122O GTEX-1128S_GTEX-1128S GTEX-113IC_GTEX-113IC GTEX-113JC_GTEX-113JC GTEX-117XS_GTEX-117XS
I want to edit it so that it contains only:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20220214
##source=PLINKv1.90
##contig=<ID=1,length=249212497>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-1117F GTEX-111CU GTEX-111FC GTEX-111VG GTEX-111YS GTEX-1122O GTEX-1128S GTEX-113IC GTEX-113JC GTEX-117XS
Basically, I want to remove everything from the underscore onward in each sample ID. For example, the ID GTEX-1117F_GTEX-1117F should become GTEX-1117F.
I used this command, but it's not giving me the desired output:
sed -e '$s/\[[[:digit:]]\+\]//g; s/_GTEX[[:digit:]]\+//g' chr1_impute_qc.vcf > chr1_impute_qc1.vcf
Can anyone help me with this one?
Solution
Using sed
$ sed 's/\(GTEX-[[:alnum:]]*\)_\1/\1/g' file
ID:GTEX-1117F
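The command above runs the substitution on every line of the file. As a minimal sketch, not part of the original answer, the same expression can be limited to the #CHROM header line with a sed address, so that data lines are never touched. The file names sample.vcf and sample_renamed.vcf are hypothetical, and the tiny sample file is made up purely for illustration:

```shell
# Build a small tab-separated sample file (hypothetical name: sample.vcf).
printf '##fileformat=VCFv4.2\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGTEX-1117F_GTEX-1117F\tGTEX-111CU_GTEX-111CU\n' >> sample.vcf
printf '1\t100\trs1_x\tA\tG\t.\tPASS\tPR\tGT\t0/1\t1/1\n' >> sample.vcf

# The /^#CHROM/ address applies the substitution only to the column header line.
# \(GTEX-[[:alnum:]]*\) captures one sample ID; _\1 matches the underscore
# followed by the same ID again; the replacement keeps a single copy.
sed '/^#CHROM/ s/\(GTEX-[[:alnum:]]*\)_\1/\1/g' sample.vcf > sample_renamed.vcf

# Show the rewritten column header line.
grep '^#CHROM' sample_renamed.vcf
```

With this sample, the two sample columns should come out as GTEX-1117F and GTEX-111CU, while the data line is written unchanged. Because the pattern relies on the back-reference \1, it only shortens IDs whose two halves are identical; an ID with differing halves is left as it is.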
Answered By - HatLess Answer Checked By - Terry (WPSolving Volunteer)