Issue

New to shell scripting..

I have a huge csv file, with a varying length f11, like

"000000aaad000000bhb200000uwwed..."
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
. .

After splitting the string in size of 10, I need 6-9 characters. then I have to join them back using delimiter '|' like

0aaa|0bhb|uwwe...
0aba|bbrb|0wwq|caba|0bhb|0qwe...

and join the processed f11 with other fields

this is the time taken for processing 10k records ->

real 4m43.506s
user 0m12.366s
sys 0m12.131s

20K records ->
real 5m20.244s
user 2m21.591s
sys 3m20.042s

80K records (around 3.7Million f11 split and merge with '|') ->

real 21m18.854s
user 9m41.944s
sys 13m29.019s

My expected time is 30mins for processing 650K records (around 56Million f11 split and merge). Any way to optimize ?

while read -r line1; do
    f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
    echo $f10 >> $path/other_fields
    
    f11=$( echo $line1 | cut -d',' -f11 )
    f11_trim=$(echo "$f11" | tr -d '"')
    echo $f11_trim | fold -w10 > $path/f11_extract 

    cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
    
    arr=($(cat $path/str_list_trim))
    printf "%s|" ${arr[@]} >> $path/str_list_serialized
    printf '\n' >> $path/str_list_serialized
    arr=()
    
    rm $path/f11_extract
    rm $path/str_list_trim

done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized

paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

Solution

Your code is not time-efficient due to:

invoking multiple commands including awk within the loop.
generating many intermediate temporal files.

You can do the job just with awk:

awk -F, -v OFS="," '                                    # assign input/output field separator to a comma
{
    len = length($11)                                   # length of the 11th field
    s = ""; d = ""                                      # clear output string and the delimiter
    for (i = 1; i <= len / 10; i++) {                   # iterate over the 11th field
        s = s d substr($11, (i - 1) * 10 + 6, 4)        # concatenate 6-9th substring of 10 characters long chunks
        d = "|"                                         # set the delimiter to a pipe character
    }
    $11 = "\"" s "\""                                   # assign the 11th field to the generated string
} 1' "$input"                                           # the final "1" tells awk to print all fields

Example of the input:

1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew

Output:

1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"

Answered By - tshiono

Thursday, October 28, 2021

[SOLVED] Reduce processing time for 'While read' loop

Issue

Solution