Friday, September 2, 2022

[SOLVED] Concatenating many text files based on intermediate directory names

Issue

I would like to concatenate the contents of a set of files into a single new file. Each set can be identified by either a shared file name or a shared folder name. I have seen many examples of doing this when the files are in the same directory, but not when they are spread across subdirectories.

Input

my_project
+-- housecat
|   +-- 1234
|   |  +-- 1234_contigs.fasta
|   +-- 1290
|   |  +-- 1290_contigs.fasta
+-- jaguar
|   +-- 1234
|   |  +-- 1234_contigs.fasta
|   +-- 4567
|   |  +-- 4567_contigs.fasta
|   +-- 9876
|   |  +-- 9876_contigs.fasta
+-- puma
|   +-- 0987
|   |  +-- 0987_contigs.fasta
|   +-- 1029
|   |  +-- 1029_contigs.fasta
|   +-- 1234
|   |  +-- 1234_contigs.fasta
|   +-- 4567
|   |  +-- 4567_contigs.fasta

An example of building one of the output files by hand:

mkdir -p concats/1234
cat housecat/1234/1234_contigs.fasta jaguar/1234/1234_contigs.fasta puma/1234/1234_contigs.fasta >> concats/1234/1234_concat.fasta
more concats/1234/1234_concat.fasta

minimally reproducible contents of housecat 1234
minimally reproducible contents of jaguar 1234
minimally reproducible contents of puma 1234

I would like this to be done for every such group of files, even if there is only one file in the group (e.g. 1029 and 1290). I see that I can copy each of the files into a directory and concatenate from there, but I would like to avoid that. Is this possible, or should I just continue along the path of renaming the files, placing them into the same folder, and combining them there?

DESIRED OUTPUT (not showing the contents of all subdirectories)

my_project
+-- concats
|   +-- 0987/0987_concat.fasta
|   +-- 1029/1029_concat.fasta
|   +-- 1234/1234_concat.fasta
|   +-- 4567/4567_concat.fasta
|   +-- 1290/1290_concat.fasta
+-- jaguar
+-- housecat
+-- puma

What I have so far:

FILENAME=$(find . -print | grep -E '[0-9]{3,4}_contigs.fasta') # filter needed because many non-target files are present; it can move into the script later and is not the focus here

for i in $FILENAME; do
  FILE=$(basename "$i" | sed 's/_contigs//g')
  DIR=concats/${FILE%.*}
  ORGANISM=$(echo $i | cut -d/ -f 2)
  mkdir -p -- "$DIR"
  cp $i "${DIR}/${ORGANISM}_${FILE}" # rename the files here
done

for d in concats/*/ ; do
    LOCI=$(echo $d | cut -d/ -f 2)
    echo $d* > ${d}${LOCI}_concat.fasta
done

I was wondering whether, before running the second loop, I could use a cat-like command to combine these files, or whether I need to move them to the destination and then combine them. Mostly I am curious whether I can avoid copying the files.


Solution

A solution in plain bash would be:

cd /path/to/my_project || exit
for src in */*/*_contigs.fasta; do
    IFS=/ read -r _ dir file <<< "$src"
    mkdir -p "concats/$dir"
    cat "$src" >> "concats/$dir/${file/contigs/concat}"
done
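
To see how the loop above behaves, here is a minimal sketch that runs it against a throwaway copy of part of the example tree. The temporary directory, the `project` and `entry` variable names, and the one-line file contents are assumptions made for illustration only. The sketch also removes any existing `concats` directory first, which is an addition to the answer: because the loop uses `>>`, running it a second time would otherwise append every file again.

```bash
#!/usr/bin/env bash
# Build a small sample tree in a temporary directory (hypothetical contents).
project=$(mktemp -d)
cd "$project" || exit 1
for entry in housecat/1234 housecat/1290 jaguar/1234 puma/1234; do
    mkdir -p "$entry"
    # ${entry%/*} is the organism, ${entry#*/} is the numeric folder name.
    printf 'contents of %s %s\n' "${entry%/*}" "${entry#*/}" \
        > "$entry/${entry#*/}_contigs.fasta"
done

# Start from an empty output directory: ">>" appends, so a second run
# would otherwise duplicate every record.
rm -rf concats

for src in */*/*_contigs.fasta; do
    # src looks like organism/number/number_contigs.fasta; split it on "/".
    # The organism is discarded into "_"; dir and file take the other parts.
    IFS=/ read -r _ dir file <<< "$src"
    mkdir -p "concats/$dir"
    # Replace "contigs" with "concat" in the file name and append the contents.
    cat "$src" >> "concats/$dir/${file/contigs/concat}"
done

cat concats/1234/1234_concat.fasta
```

With this sample tree, the final `cat` prints the housecat, jaguar and puma lines for 1234 in that order, since the glob expands in sorted order. No files are copied or renamed along the way, and a group with a single member (1290 here) still gets its own output file.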


Answered By - M. Nejat Aydin
Answer Checked By - Robin (WPSolving Admin)