Sunday, October 9, 2022

[SOLVED] Unix: decompress files in parallel and store them

Issue

I have a directory /user/test with 2000 compressed files. For each file, I want to check whether it has exactly 5 records and, if so, store it in decompressed form.

I can do this serially, but the job takes a long time to finish.

This is the serial version:

for i in $(find /user/test -iname "abc*.gz");
do
    lines=$(zcat "$i" | wc -l)
    if [ "$lines" -eq 5 ]; then
        fname=$(basename -s ".$file_ext" "$i")
        echo "copying $fname to new path"
        zcat "$i" > "new_path/$fname"
        cnt=$((cnt+1))
    else
        echo "Ignoring file $i. Expecting 5 records. It has more or fewer records"
    fi
done

I want to do the same in parallel.

I explored GNU parallel and tried the command below:

find /user/test -iname "abc*.gz" |
parallel 'zcat {} | awk 'NR == 5 {print $0}' < {}.txt'

The command above does not work; it throws an error.


Solution

Untested:

doit() {
  zcat "$@" | awk 'NR == 5 {print $0}'
}
export -f doit
find /user/test -iname "abc*.gz" |
  parallel doit

Based on what you do serially:

doit() {
    i="$1"
    # Count the records (lines) in the decompressed stream.
    lines=$(zcat "$i" | wc -l)
    if [ "$lines" -eq 5 ]; then
        # Strip the directory and the ".$file_ext" suffix.
        fname=$(basename -s ".$file_ext" "$i")
        echo "copying $fname to new path"
        zcat "$i" > "new_path/$fname"
    else
        echo "Ignoring file $i. Expecting 5 records. It has more or fewer records"
    fi
}
export -f doit
export file_ext

find /user/test -iname "abc*.gz" | parallel doit

The general idea is to build a bash function that works on a single input, export the function (and any variables it needs), and then run the function in parallel.

The benefit is that it is pretty easy to test the function on a single input.
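
A quick single-input check might look like the sketch below. The scratch directory, the sample file abc_sample.gz, and the value of file_ext are made up for illustration, and new_path is placed under the scratch directory so nothing real is touched.

```shell
#!/usr/bin/env bash
# Hypothetical scratch setup: a temp dir holding one 5-line sample file.
tmp=$(mktemp -d)
mkdir "$tmp/new_path"
printf '%s\n' r1 r2 r3 r4 r5 | gzip > "$tmp/abc_sample.gz"

# Assumed value; the question does not show how file_ext is set.
file_ext=gz

doit() {
    local i="$1"
    local lines fname
    lines=$(zcat "$i" | wc -l)
    if [ "$lines" -eq 5 ]; then
        fname=$(basename -s ".$file_ext" "$i")
        echo "copying $fname to new path"
        zcat "$i" > "$tmp/new_path/$fname"
    else
        echo "Ignoring file $i. Expected 5 records."
    fi
}

# Call the function directly on one file; parallel is not involved yet.
doit "$tmp/abc_sample.gz"
ls "$tmp/new_path"
```

Once the function behaves as expected on one file, it can be exported and handed to parallel unchanged.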

When writing the function there is a small gotcha: the function must not write to hardcoded file names, because that creates a race condition (multiple instances writing to the same file at the same time). Write the function so that each instance writes to its own file.
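
One way to avoid the race, sketched under assumptions, is to derive each output name from the input name so that no two instances share a file. The scratch directory, the sample files, and the function name first_record are hypothetical, and plain background jobs stand in for GNU parallel here so the sketch runs without it installed.

```shell
#!/usr/bin/env bash
# Hypothetical scratch data: three small compressed files.
tmp=$(mktemp -d)
mkdir "$tmp/out"
for n in 1 2 3; do
    printf 'file%s-line1\nfile%s-line2\n' "$n" "$n" | gzip > "$tmp/abc$n.gz"
done

# Racy variant (do not use): every instance would write the same file.
#   zcat "$1" | head -n 1 > "$tmp/out/result.txt"

# Safe variant: the output name is derived from the input name,
# so each instance writes to its own file.
first_record() {
    local src="$1"
    local out
    out="$tmp/out/$(basename -s .gz "$src").txt"
    zcat "$src" | head -n 1 > "$out"
}

# Background jobs stand in for GNU parallel in this sketch.
for f in "$tmp"/abc*.gz; do
    first_record "$f" &
done
wait

ls "$tmp/out"
```

With GNU parallel, the same function would be exported with export -f and fed file names from find, as in the answer above.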



Answered By - Ole Tange
Answer Checked By - Robin (WPSolving Admin)