Friday, April 8, 2022

[SOLVED] Why does my nohup bash script that reads a file always stop outputting its count around 6k lines before the end of the file?

Issue

I use nohup to run a bash script that reads each line of a file (and extracts the info I need). I've used it on multiple files with different line counts, mostly between 50k and 100k. But no matter how many lines the file has, the output always stops around 6k lines before the last line.

My script, fetchStuff.sh:

#!/bin/bash

urlFile=$1
myHost='http://example.com'
useragent='me'
count=0
total_lines=$(wc -l < $urlFile)

while read url; do
    if [[ "$url" == *html ]]; then continue; fi

    reqURL=${myHost}${url}
    stuffInfo=$(curl -s -XGET -A "$useragent" "$reqURL" | jq -r '.stuff')
    [ "$stuffInfo" != "null" ] && echo ${stuffInfo/unwanted_garbage/} >> newversion-${urlFile}
    ((count++))
    if [ $(( $count%20 )) -eq 0 ]
    then
        sleep 1
    fi
    if [ $(( $count%100 )) -eq 0 ]; then echo "$urlFile read ${count} of $total_lines"; fi
done < $urlFile

I call it like so:

nohup ./fetchStuff.sh file1.txt &

I get count info in nohup.out, e.g. "file1 read 100 of 60000", "file1 read 200 of 60000", etc. But it always stops around 6k lines before the end of the file.

When I tail nohup.out after running the script on each file, these are the last lines I get:

file1.txt read 90000 of 96317  
file2.txt read 68000 of 73376  
file3.txt read 85000 of 91722  
file4.txt read 93000 of 99757  

I can't figure out why it always stops around 6k lines before the end of the file. (I put the sleep in to avoid flooding the API with too many requests.)


Solution

The loop skips lines that end with html, and because the continue happens before ((count++)), those lines are never counted in $count. So I'll bet there are 6317 lines in file1.txt that end with html, 5376 in file2.txt, and so on.
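
You can check that guess directly; assuming file1.txt is one of your input files, comparing these two counts should account for the gap:

# total lines in the file
wc -l < file1.txt
# lines the loop skips because they end with html
grep -c 'html$' file1.txt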

If you want $count to include them, put ((count++)) before the if statement that checks the suffix.

while read url; do
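    # count the line before the html check so skipped lines are still included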
    ((count++))
    if [[ "$url" == *html ]]; then continue; fi

    reqURL=${myHost}${url}
    stuffInfo=$(curl -s -XGET -A "$useragent" "$reqURL" | jq -r '.stuff')
    [ "$stuffInfo" != "null" ] && echo ${stuffInfo/unwanted_garbage/} >> newversion-${urlFile}
    if [ $(( $count%20 )) -eq 0 ]
    then
        sleep 1
    fi
    if [ $(( $count%100 )) -eq 0 ]; then echo "$urlFile read ${count} of $total_lines"; fi
done < $urlFile

Alternatively, you could leave the html lines out of total_lines with:

total_lines=$(grep -c -v 'html$' "$urlFile")

And you could do away with the if statement entirely by filtering the file before the loop:

grep -v 'html$' "$urlFile" | while read url; do
    ...
done
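
Putting the two suggestions together, here is a minimal sketch of the filtered version (with variables quoted and read -r added as general hardening; note that the pipeline runs the while loop in a subshell, so count is not visible after done, which doesn't matter here because nothing after the loop uses it):

#!/bin/bash

urlFile=$1
myHost='http://example.com'
useragent='me'
count=0
# count only the lines the loop will actually process
total_lines=$(grep -c -v 'html$' "$urlFile")

grep -v 'html$' "$urlFile" | while read -r url; do
    ((count++))
    reqURL=${myHost}${url}
    stuffInfo=$(curl -s -XGET -A "$useragent" "$reqURL" | jq -r '.stuff')
    [ "$stuffInfo" != "null" ] && echo "${stuffInfo/unwanted_garbage/}" >> "newversion-${urlFile}"
    # brief pause every 20 requests to avoid flooding the API
    if [ $(( count % 20 )) -eq 0 ]; then sleep 1; fi
    # progress report every 100 lines
    if [ $(( count % 100 )) -eq 0 ]; then echo "$urlFile read ${count} of $total_lines"; fi
done

Since total_lines now counts only the lines the loop actually processes, the progress messages end at (or just below) total_lines instead of stopping several thousand short.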


Answered By - Barmar
Answer Checked By - Pedro (WPSolving Volunteer)