Issue
In the second line of the code, I'm attempting to merge all the *.out.tab files by columns.
The third line of the code extracts the first column and every 4th column after it (4th, 8th, 12th, 16th...), that is, the fourth column of each file.
Without a for loop, it would look like this:
paste 1.out.tab 2.out.tab 3.out.tab 4.out.tab | \
awk '{for(i=1;i<=NF;i+=4){printf "%s ",$i;} print ""}' | \
tail -n +5 > tmpfile
cat tmpfile | sed "s/^ENSG*//" >gene_count.txt
However, now I want to use a for loop to merge all the files.
for f in `./alignments/repaired_reads/*ReadsPerGene.out.tab | sed 's/.ReadsPerGene.out.tab//'`;
paste "$f"\.out.tab | \
awk '{for(i=1;i<=NF;i+=4){printf "%s ",$i;} print ""}' | \
tail -n +5 > tmpfile
cat tmpfile | sed "s/^ENSG*://" > gene_count.txt
Sample input:
head ./alignments/repaired_reads/SRR9200814ReadsPerGene.out.tab
N_unmapped 18517 18517 18517
N_multimapping 1620 1620 1620
N_noFeature 8046 33145 33275
N_ambiguous 5860 1201 1034
ENSG00000160072 0 0 0
ENSG00000279928 0 0 0
ENSG00000228037 0 0 0
ENSG00000142611 0 0 0
ENSG00000284616 0 0 0
ENSG00000157911 0 0 0
head ./alignments/repaired_reads/SRR9200815ReadsPerGene.out.tab
N_unmapped 124416 124416 124416
N_multimapping 19165 19165 19165
N_noFeature 40924 384454 392595
N_ambiguous 99220 21834 20712
ENSG00000160072 0 0 0
ENSG00000279928 0 0 0
ENSG00000228037 0 0 0
ENSG00000142611 35 22 13
ENSG00000284616 0 0 0
ENSG00000157911 24 22 8
Desired output:
N_unmapped 18517 124416
N_multimapping 1620 19165
N_noFeature 33275 392595
N_ambiguous 1034 20712
ENSG00000160072 0 0
ENSG00000279928 0 0
ENSG00000228037 0 0
ENSG00000142611 0 13
ENSG00000284616 0 0
ENSG00000157911 0 8
Solution
With the samples and attempts you have shown, please try the following awk code. It keeps the first field ($1) in the order in which it first appears, reads multiple files as input, and writes the output to a file named gene_count.txt.
awk '
!presence[$1]++{
  arr[++max]=$1
}
{
  for(i=4;i<=NF;i+=4){
    maxValue[$1]=(maxValue[$1]>$i?maxValue[$1]:$i)
  }
  overAllMax[$1]=(overAllMax[$1]?overAllMax[$1] OFS:"") maxValue[$1]
}
END{
  for(j=1;j<=max;j++){
    print arr[j],overAllMax[arr[j]]
  }
}
' *ReadsPerGene.out.tab > gene_count.txt
Explanation: a detailed explanation of the code above.
- The array presence is indexed by $1. When a value of $1 is seen for the first time, an entry is made for it, and the array arr, indexed by the incrementing variable max, is assigned $1.
- A for loop goes through every 4th field of each line (4th, 8th, 12th, and so on).
- The array maxValue, indexed by $1, keeps only the maximum value seen so far, using a ternary operator for the comparison.
- The array overAllMax, indexed by $1, has the current maxValue[$1] appended to it for every line with the same first field, separated by OFS.
- In the END block, a loop runs from 1 to the value of max to print all values.
- Inside the loop, arr[j] (which is $1) is printed, followed by overAllMax[arr[j]], which holds the values collected for that $1 across all files.
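For comparison with the paste pipeline in the question, no explicit for loop is needed, because the shell expands a glob into one argument per file and paste accepts them all at once. This is a rough sketch only: the function name paste_col4 is hypothetical, and it assumes every file has exactly four whitespace-separated fields with rows in the same order.

```shell
# Hypothetical helper: paste all files side by side, then keep field 1 and every 4th field.
# Assumptions: each file has exactly 4 fields per line and identical row order.
paste_col4() {
  paste "$@" | awk '{
    printf "%s", $1                                  # ID from the first file
    for (i = 4; i <= NF; i += 4) printf " %s", $i    # field 4 of each pasted file
    print ""                                         # end the output line
  }'
}
```

A call such as paste_col4 ./alignments/repaired_reads/*ReadsPerGene.out.tab > gene_count.txt would then produce one count column per file. Unlike the array-based awk program, this relies on row order rather than on matching IDs.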
Answered By - RavinderSingh13 Answer Checked By - Cary Denson (WPSolving Admin)