Issue
Shell script is taking a long time to add indexing to JSON files.
We have a large number of JSON files (with all the required data) to which we add indexing lines before pushing them to Elasticsearch. Each file has 200K records. Adding the index lines to one file takes around 30 minutes, while the push to Elasticsearch takes around 20 seconds, which is fine. 30 minutes per file is too long, as we need to process around 5k files (approx. 80 GB of data).
We are not experts in shell scripting; can anyone help make this indexing logic faster? 30 minutes for one file seems very long.
We are reading each line from the JSON file and adding the line below for indexing:
"{\"index\":{\"_id\":$guid,\"_index\" : \"demoindex\", \"_type\" : \"usage\"}}\n$row\n";
And this is the code:
#!/bin/bash
FEED_DIR="/apps/elasticsearch/demo"
FEED_TMP_DIR="/apps/elasticsearch/demo/temp"
FEED_ARCHIVE="/apps/elasticsearch/demo/archive"
FEEDER_LOG="/apps/log/feeder/demofeeder.log"
SCRPT_HOME="/apps/bin/elasticsearch"
LOCKFILE=$SCRPT_HOME/system/demofeeder.lock
# Skip if another version of Feeder2 is still executing
if test -f "$LOCKFILE"; then
    echo "demofeeder.sh is still running."
    echo "$LOCKFILE exists."
    echo "Exiting !!"
    exit 1
fi
echo "Creating lock file: $LOCKFILE" > $FEEDER_LOG
touch $LOCKFILE
echo -e "\n Setting Replica to 0 before indexing!\n" >> $FEEDER_LOG
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/demoindex/_settings' -d '{ "number_of_replicas" : 0 }'
echo "Starting demoindex Feeding Data at "$(date -u)"." >> $FEEDER_LOG
echo "Starting demoindex Feeding Data at "$(date -u)"."
echo "Clearing Temp directory" >> $FEEDER_LOG
rm -f $FEED_TMP_DIR/*.json
cd $FEED_DIR
for feed in *.json
do
    echo "Parsing $feed"
    while IFS= read -r line; do
        guid=$(echo "$line" | awk -F "\"" '{print $4}')
        echo "{\"index\": {\"_index\": \"demoindex\", \"_id\": \"$guid\", \"_type\": \"usage\"}}" >> "$FEED_TMP_DIR/$feed"
        echo "$line" >> "$FEED_TMP_DIR/$feed"
    done < "$feed"
    echo "Loading parsed files into elasticsearch: $FEED_TMP_DIR/$feed"
    curl -s -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/demoindex/usage/_bulk' --data-binary "@$FEED_TMP_DIR/$feed" >> $FEEDER_LOG
    echo "Deleting parsed json: $FEED_TMP_DIR/$feed (not enabled)"
    #rm "$FEED_TMP_DIR/$feed"
    echo "Moving $feed to Archive Folder (not enabled)"
    #mv "$FEED_DIR/$feed" "$FEED_ARCHIVE/$feed"
done
echo "Data Feeding demoindex Completed at "$(date -u)"." >> $FEEDER_LOG
echo "Removing lock file: $LOCKFILE" >> $FEEDER_LOG
rm -f $LOCKFILE
echo "Data Feeding demoindex Completed at "$(date -u)"."
Solution
As far as I understand from your code, you take a value from the 4th column of each line in each file, build a string from that value, and then print the new string followed by the original line, right? This can be done with a simple awk:
$ ls *.txt
file1.txt file2.txt file3.txt file4.txt
$ cat *.txt
1 2 3 v1 5 6 7 8 9 10
1 2 3 v2 5 6 7 8 9 10
1 2 3 v3 5 6 7 8 9 10
1 2 3 v4 5 6 7 8 9 10
$ awk '{print "some text", $4; print}' *.txt
some text v1
1 2 3 v1 5 6 7 8 9 10
some text v2
1 2 3 v2 5 6 7 8 9 10
some text v3
1 2 3 v3 5 6 7 8 9 10
some text v4
1 2 3 v4 5 6 7 8 9 10
File with 200k strings:
$ wc -l f1
200000 f1
$ time awk '{print "some text", $4; print}' f1
...
some text v1
1 2 3 v1 5 6 7 8 9 10
some text v1
1 2 3 v1 5 6 7 8 9 10
real 0m1,345s
user 0m0,242s
sys 0m0,680s
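Applied to the script in the question, the entire while-read loop (which spawns an awk and two echo redirections for every one of the 200K lines) can be replaced by a single awk call per file. This is only a sketch, assuming the GUID really is the 4th double-quote-delimited field of each record, exactly as the original awk -F "\"" extraction implies; it reuses the $feed and $FEED_TMP_DIR variables already defined in the script:
# One awk process per file instead of one awk plus two echoes per line.
# Assumes the GUID is the 4th field when the line is split on double quotes.
awk -F '"' '{
    printf "{\"index\": {\"_index\": \"demoindex\", \"_id\": \"%s\", \"_type\": \"usage\"}}\n", $4
    print
}' "$feed" > "$FEED_TMP_DIR/$feed"
If the temp file is only needed for the upload, the awk output could also be piped straight into curl (--data-binary @- reads the bulk body from stdin), which skips the intermediate write entirely.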
Answered By - Ivan
Answer Checked By - Marie Seifert (WPSolving Admin)