Issue
Shell script is taking a long time to add indexing to JSON files.
We have a large number of JSON files (with all the required data) to which we add indexing lines before pushing them to Elasticsearch. Each file has 200K records. Adding the index lines to one file takes around 30 minutes, while the push to Elasticsearch takes around 20 seconds, which is fine. 30 minutes per file is too long, as we need to process around 5k files (approx. 80 GB of data).
We are not experts in shell scripting; can anyone help make this indexing logic faster? 30 minutes for one file seems very long.
We are reading each line from the JSON file and adding the line below for indexing:
"{\"index\":{\"_id\":$guid,\"_index\" : \"demoindex\", \"_type\" : \"usage\"}}\n$row\n";
And this is the code:
#!/bin/bash
FEED_DIR="/apps/elasticsearch/demo"
FEED_TMP_DIR="/apps/elasticsearch/demo/temp"
FEED_ARCHIVE="/apps/elasticsearch/demo/archive"
FEEDER_LOG="/apps/log/feeder/demofeeder.log"
SCRPT_HOME="/apps/bin/elasticsearch"
LOCKFILE=$SCRPT_HOME/system/demofeeder.lock
# Skip if another version of Feeder2 is still executing
if test -f "$LOCKFILE"; then
    echo "demofeeder.sh is still running."
    echo "$LOCKFILE exists."
    echo "Exiting !!"
    exit 1
fi
echo "Creating lock file: $LOCKFILE" > $FEEDER_LOG
touch $LOCKFILE
echo -e "\n Setting Replica to 0 before indexing!\n" >> $FEEDER_LOG
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/demoindex/_settings' -d '{ "number_of_replicas" : 0 }'
echo "Starting demoindex Feeding Data at "$(date -u)"." >> $FEEDER_LOG
echo "Starting demoindex Feeding Data at "$(date -u)"."
echo "Clearing Temp directory" >> $FEEDER_LOG
rm -f $FEED_TMP_DIR/*.json
cd $FEED_DIR
for feed in *.json
do
    echo "Parsing $feed"
    while IFS= read -r line; do
        guid=$(echo "$line" | awk -F "\"" '{print $4}')
        echo "{\"index\": {\"_index\": \"demoindex\", \"_id\": \"$guid\", \"_type\": \"usage\"}}" >> "$FEED_TMP_DIR/$feed"
        echo "$line" >> "$FEED_TMP_DIR/$feed"
    done < "$feed"
    echo "Loading parsed files into elasticsearch: $FEED_TMP_DIR/$feed"
    curl -s -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/demoindex/usage/_bulk' --data-binary "@$FEED_TMP_DIR/$feed" >> $FEEDER_LOG
    echo "Deleting parsed json: $FEED_TMP_DIR/$feed (not enabled)"
    #rm "$FEED_TMP_DIR/$feed"
    echo "Moving $feed to Archive Folder (not enabled)"
    #mv "$FEED_DIR/$feed" "$FEED_ARCHIVE/$feed"
done
echo "Data Feeding demoindex Completed at "$(date -u)"." >> $FEEDER_LOG
echo "Removing lock file: $LOCKFILE" >> $FEEDER_LOG
rm -f $LOCKFILE
echo "Data Feeding demoindex Completed at "$(date -u)"."
Solution
As far as I understand from your code, you take a value from the 4th column of each line in each file, build a string from that value, and then print the new string followed by the original line, right? This can be done with a simple awk:
$ ls *.txt
file1.txt file2.txt file3.txt file4.txt
$ cat *.txt
1 2 3 v1 5 6 7 8 9 10
1 2 3 v2 5 6 7 8 9 10
1 2 3 v3 5 6 7 8 9 10
1 2 3 v4 5 6 7 8 9 10
$ awk '{print "some text", $4; print}' *.txt
some text v1
1 2 3 v1 5 6 7 8 9 10
some text v2
1 2 3 v2 5 6 7 8 9 10
some text v3
1 2 3 v3 5 6 7 8 9 10
some text v4
1 2 3 v4 5 6 7 8 9 10
File with 200k strings:
$ wc -l f1
200000 f1
$ time awk '{print "some text", $4; print}' f1
...
some text v1
1 2 3 v1 5 6 7 8 9 10
some text v1
1 2 3 v1 5 6 7 8 9 10
real 0m1,345s
user 0m0,242s
sys 0m0,680s
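Applied to the script in the question, the entire while-read loop (which spawns an awk and two echo redirections for every one of the 200K lines) can be replaced by a single awk call per file. This is only a sketch, assuming the GUID really is the 4th double-quote-delimited field of each record, exactly as the original awk -F "\"" extraction implies; it reuses the $feed and $FEED_TMP_DIR variables already defined in the script:
# One awk process per file instead of one awk plus two echoes per line.
# Assumes the GUID is the 4th field when the line is split on double quotes.
awk -F '"' '{
    printf "{\"index\": {\"_index\": \"demoindex\", \"_id\": \"%s\", \"_type\": \"usage\"}}\n", $4
    print
}' "$feed" > "$FEED_TMP_DIR/$feed"
If the temp file is only needed for the upload, the awk output could also be piped straight into curl (--data-binary @- reads the bulk body from stdin), which skips the intermediate write entirely.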
Answered By - Ivan
Answer Checked By - Marie Seifert (WPSolving Admin)