Sunday, December 31, 2023

[SOLVED] How to write a for loop to merge all the nth columns of files in bash

December 31, 2023 awk, bash, sed

Issue

In the second line of the code, I'm attempting to merge all the *.out.tab files by column. The third line is meant to extract the first column and then every fourth column after it (4th, 8th, 12th, 16th, ...), that is, the fourth column of each file.

Without a for loop, it would be like...

paste 1.out.tab 2.out.tab 3.out.tab 4.out.tab | \
awk '{for(i=1;i<=NF;i+=4){printf "%s ",$i;} print ""}' | \
tail -n +5 > tmpfile
cat tmpfile | sed "s/^ENSG*//" >gene_count.txt

However, now I want to use a for loop to merge all the files.

for f in `./alignments/repaired_reads/*ReadsPerGene.out.tab | sed 's/.ReadsPerGene.out.tab//'`;
paste "$f"\.out.tab | \
awk '{for(i=1;i<=NF;i+=4){printf "%s ",$i;} print ""}' | \
tail -n +5 > tmpfile
cat tmpfile | sed "s/^ENSG*://" > gene_count.txt
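A side note on the loop attempt above: as written it has no do ... done, and the backticks try to run the first matching file as a command rather than list the files. A minimal sketch of the loop syntax is shown below, under the assumption that the goal is to derive sample names by stripping the ReadsPerGene.out.tab suffix. The function name list_samples and the temporary directory are invented for the illustration.

```shell
# Hypothetical helper: print the sample name for each matching file in a directory.
list_samples() {
  dir=$1
  for f in "$dir"/*ReadsPerGene.out.tab; do
    [ -e "$f" ] || continue                # skip if the glob matched nothing
    basename "$f" ReadsPerGene.out.tab     # strip directory and suffix
  done
}

# Empty placeholder files named after the samples in the question.
tmpdir=$(mktemp -d)
touch "$tmpdir/SRR9200814ReadsPerGene.out.tab" "$tmpdir/SRR9200815ReadsPerGene.out.tab"

list_samples "$tmpdir"
```

The glob is expanded by the shell itself, so no command substitution or sed call is needed to build the file list.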

Sample input:

head ./alignments/repaired_reads/SRR9200814ReadsPerGene.out.tab

N_unmapped  18517   18517   18517
N_multimapping  1620    1620    1620
N_noFeature 8046    33145   33275
N_ambiguous 5860    1201    1034
ENSG00000160072 0   0   0
ENSG00000279928 0   0   0
ENSG00000228037 0   0   0
ENSG00000142611 0   0   0
ENSG00000284616 0   0   0
ENSG00000157911 0   0   0

head ./alignments/repaired_reads/SRR9200815ReadsPerGene.out.tab

N_unmapped  124416  124416  124416
N_multimapping  19165   19165   19165
N_noFeature 40924   384454  392595
N_ambiguous 99220   21834   20712
ENSG00000160072 0   0   0
ENSG00000279928 0   0   0
ENSG00000228037 0   0   0
ENSG00000142611 35  22  13
ENSG00000284616 0   0   0
ENSG00000157911 24  22  8

Desired output:

N_unmapped  18517  124416
N_multimapping  1620  19165
N_noFeature  33275  392595
N_ambiguous  1034  20712
ENSG00000160072  0  0
ENSG00000279928  0  0
ENSG00000228037  0  0
ENSG00000142611  0  13
ENSG00000284616  0  0
ENSG00000157911  0  8

Solution

With your shown samples and attempts, please try the following awk code. It keeps the values of $1 (the first field) in the order in which they first appear, reads multiple files as input, and writes the result to a file named gene_count.txt.

awk '
!presence[$1]++{
  arr[++max]=$1
}
{
  for(i=4;i<=NF;i+=4){
    maxValue[$1]=(maxValue[$1]>$i?maxValue[$1]:$i)
  }
  overAllMax[$1]=(overAllMax[$1]?overAllMax[$1] OFS:"") maxValue[$1]
}
END{
  for(j=1;j<=max;j++){
    print arr[j],overAllMax[arr[j]]
  }
}
' *ReadsPerGene.out.tab > gene_count.txt
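Since paste accepts any number of files, a glob can also stand in for the loop. The sketch below is one hedged way to do this, closer to the original paste approach than the awk-only code above. It assumes all files are tab-separated, have four columns, and list the same IDs in the same order. The function name merge_with_paste and the temporary paths are made up for the example.

```shell
# Hypothetical helper: keep column 1 plus every 4th column of the pasted files.
# Assumes tab-separated files that list the same IDs in the same order.
merge_with_paste() {
  paste "$@" |
    awk 'BEGIN { FS = OFS = "\t" }
         {
           line = $1                      # gene ID from the first file
           for (i = 4; i <= NF; i += 4)   # 4th column of each pasted file
             line = line OFS $i
           print line
         }'
}

# Build two tiny sample files from rows shown in the question.
tmpdir=$(mktemp -d)
printf 'N_ambiguous\t5860\t1201\t1034\nENSG00000142611\t0\t0\t0\n' \
  > "$tmpdir/SRR9200814ReadsPerGene.out.tab"
printf 'N_ambiguous\t99220\t21834\t20712\nENSG00000142611\t35\t22\t13\n' \
  > "$tmpdir/SRR9200815ReadsPerGene.out.tab"

merge_with_paste "$tmpdir"/*ReadsPerGene.out.tab > "$tmpdir/gene_count.txt"
cat "$tmpdir/gene_count.txt"
```

For these two rows the result is N_ambiguous with 1034 and 20712, and ENSG00000142611 with 0 and 13, which agrees with the desired output in the question.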

Explanation: a detailed explanation of the code above.

The array presence is indexed by $1. The first time a given $1 is seen (its count is still zero), the first block runs:
It stores $1 in the array arr under the next value of the counter max, so arr records the first fields in order of first appearance.
For every line, a for loop then visits every fourth field (4th, 8th, 12th, ...).
The array maxValue, indexed by $1, keeps the larger of its current value and the field, using the ternary operator. It is not reset between files, so it holds the maximum seen so far across all files read.
The array overAllMax, indexed by $1, appends the current maxValue[$1] to the values already collected for that first field, separated by OFS, giving one value per input file. For the samples shown this equals the fourth column of each file; if a later file held a smaller value than an earlier one, the earlier, larger value would be printed again.
The END block loops from 1 to max to print all values.
Inside the loop it prints arr[j] (which is $1) followed by overAllMax[arr[j]], the values collected for that $1 across all files.
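Because maxValue[$1] is not reset between files, the code above reports a running maximum, which matches the fourth column of each file only when values never decrease from one file to the next. As a hedged variant, the sketch below appends the fourth column of each file directly while keeping the same first-seen ordering idiom. It assumes tab-separated input in which every ID occurs in every file. The function name merge_col4 and the two tiny input files are made up for the illustration.

```shell
# Hypothetical helper: append the 4th column of every input file, keyed by $1.
merge_col4() {
  awk 'BEGIN { FS = OFS = "\t" }
       !seen[$1]++ { order[++n] = $1 }     # remember first-seen order of IDs
       { row[$1] = row[$1] OFS $4 }        # append the 4th column of the current file
       END { for (j = 1; j <= n; j++) print order[j] row[order[j]] }' "$@"
}

# Made-up input: geneA has a smaller value in the second file than in the first.
tmpdir=$(mktemp -d)
printf 'geneA\t1\t2\t9\ngeneB\t0\t0\t0\n' > "$tmpdir/a.tab"
printf 'geneA\t1\t2\t5\ngeneB\t3\t3\t3\n' > "$tmpdir/b.tab"

merge_col4 "$tmpdir/a.tab" "$tmpdir/b.tab"
```

Here geneA goes from 9 to 5 between the two files, and both values are kept in the output.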

Answered By - RavinderSingh13

Answer Checked By - Cary Denson (WPSolving Admin)

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
