Thursday, April 28, 2022

[SOLVED] Improving performance when using jq to process large files

jq, json, sed, split

Issue

Use Case

I need to split large files (~5 GB) of JSON data into smaller files of newline-delimited JSON in a memory-efficient way (i.e., without having to read the entire JSON blob into memory). The JSON data in each source file is an array of objects.

Unfortunately, the source data is not newline-delimited JSON and in some cases there are no newlines in the files at all. This means I can't simply use the split command to split the large file into smaller chunks by newline. Here are examples of how the source data is stored in each file:

Example of a source file with newlines.

[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
...
,{"id": 9, "name": "qux"}]

Example of a source file without newlines.

[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}, ...{"id": 9, "name": "qux"}]

Here's an example of the desired format for a single output file:

{"id": 1, "name": "foo"}
{"id": 2, "name": "bar"}
{"id": 3, "name": "baz"}

Current Solution

I'm able to achieve the desired result by using jq and split as described in this SO Post. This approach is memory efficient thanks to the jq streaming parser. Here's the command that achieves the desired result:

cat large_source_file.json \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m --numeric-suffixes - split_output_file
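
To see what the jq filter in this pipeline does, it can help to run it on a tiny array. The following is a minimal sketch, assuming jq 1.5 or later is installed; the sample data and the variable names are made up for illustration.

```shell
# Made-up sample: a two-element array on a single line.
sample='[{"id":1,"name":"foo"},{"id":2,"name":"bar"}]'

# --stream emits [path, leaf] events while the document is being parsed.
events=$(printf '%s\n' "$sample" | jq -c --stream .)
printf '%s\n' "$events"
# [[0,"id"],1]
# [[0,"name"],"foo"]
# [[0,"name"]]          <- closes element 0
# [[1,"id"],2]
# [[1,"name"],"bar"]
# [[1,"name"]]          <- closes element 1
# [[1]]                 <- closes the top-level array

# truncate_stream strips the leading array index from each path (and drops
# events that are no deeper than the array itself); fromstream then emits
# each element as soon as its closing event arrives.
items=$(printf '%s\n' "$sample" | jq -cn --stream 'fromstream(1|truncate_stream(inputs))')
printf '%s\n' "$items"
# {"id":1,"name":"foo"}
# {"id":2,"name":"bar"}
```

Because each element is rebuilt from many small events, only one element needs to be held in memory at a time, which is where the low memory use comes from.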

The Problem

The command above takes ~47 minutes to process the entire source file. This seems quite slow, especially compared to sed, which can produce the same output much faster.

Here are some performance benchmarks to show processing time with jq vs. sed.

export SOURCE_FILE=medium_source_file.json  # smaller 250MB

# using jq
time cat ${SOURCE_FILE} \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m - split_output_file

real    2m0.656s
user    1m58.265s
sys     0m6.126s

# using sed
time cat ${SOURCE_FILE} \
  | sed -E 's#^\[##g' \
  | sed -E 's#^,\{#\{#g' \
  | sed -E 's#\]$##g' \
  | sed 's#},{#}\n{#g' \
  | split --line-bytes=1m - sed_split_output_file

real    0m25.545s
user    0m5.372s
sys     0m9.072s
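
One caveat with the sed approach is that it is purely textual: it splits wherever the characters },{ appear, even inside a string value. A minimal sketch of that failure mode, assuming GNU sed (for \n in the replacement) and a made-up record:

```shell
# Made-up record whose "note" string happens to contain the text },{
tricky='[{"id":1,"note":"a},{b"},{"id":2,"note":"c"}]'

# Same idea as the sed pipeline above: strip the brackets, then break on },{
sed_lines=$(printf '%s\n' "$tricky" \
  | sed -E 's#^\[##' \
  | sed -E 's#\]$##' \
  | sed 's#},{#}\n{#g')
printf '%s\n' "$sed_lines"
# {"id":1,"note":"a}
# {b"}
# {"id":2,"note":"c"}
```

The array holds two objects, but the output has three lines, none of which is guaranteed to be valid JSON. A real JSON parser would not split inside the string.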

Questions

Is this slower processing speed expected for jq compared to sed? It makes sense that jq would be slower, given that it's doing a lot of validation under the hood, but more than 4x slower (2m0s vs. 0m25s on the 250MB file) doesn't seem right.
Is there anything I can do to improve the speed at which jq processes this file? I'd prefer to use jq because I'm confident it could seamlessly handle other line output formats, but given that I'm processing thousands of files each day, it's hard to justify the speed difference I've observed.

Solution

jq's streaming parser (the one invoked with the --stream command-line option) intentionally sacrifices speed for reduced memory requirements, as illustrated in the metrics section below. A tool that strikes a different balance (one that seems closer to what you're looking for) is jstream, whose homepage is https://github.com/bcicen/jstream

Running the following sequence of commands in a bash or bash-like shell:

cd
go get github.com/bcicen/jstream
cd go/src/github.com/bcicen/jstream/cmd/jstream/
go build

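will result in an executable (on recent Go releases, where go get no longer builds binaries, go install github.com/bcicen/jstream/cmd/jstream@latest is likely the equivalent step), which you can invoke like so: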

jstream -d 1 < INPUTFILE > STREAM

Assuming INPUTFILE contains a (possibly ginormous) JSON array, the above will behave like jq's .[], with jq's -c (compact) command-line option. In fact, this is also the case if INPUTFILE contains a stream of JSON arrays, or a stream of JSON non-scalars ...
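
For reference, this is the jq behavior being compared against, shown on tiny made-up inputs. The sketch runs jq only; jstream's exact output formatting is not reproduced here because it has not been verified.

```shell
# A single array: .[] with -c prints one compact item per line.
one_array=$(printf '[{"id":1},{"id":2}]\n' | jq -c '.[]')
printf '%s\n' "$one_array"
# {"id":1}
# {"id":2}

# A stream of several arrays is flattened the same way.
many_arrays=$(printf '[1,2]\n[3]\n' | jq -c '.[]')
printf '%s\n' "$many_arrays"
# 1
# 2
# 3
```

Unlike the --stream variant, this form makes jq parse each top-level value completely before emitting anything, which is why its memory use grows with the size of the array.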

Illustrative space-time metrics

Summary

For the task at hand (streaming the top-level items of an array), where mrss is the maximum resident set size and u+s is user plus system CPU time in seconds:

                  mrss       u+s (s)
jq --stream:          2 MB       447
jstream    :          8 MB       114
jq         :      5,582 MB        39

In words:

space: jstream is economical with memory, but not as much as jq's streaming parser.
time: jstream runs about 3 times slower than jq's regular parser (114 s vs. 39 s) but about 4 times faster than jq's streaming parser (114 s vs. 447 s).

Interestingly, space*time is about the same for the two streaming parsers.
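
The ratios behind these statements can be recomputed from the rounded figures in the summary table. This is only a small arithmetic sketch using awk; the numbers are copied from the table, not measured again.

```shell
# mrss in MB and u+s in seconds, as listed in the summary table.
ratios=$(awk 'BEGIN {
  printf "jq --stream time / jstream time: %.1f\n", 447 / 114
  printf "jstream time / jq time: %.1f\n", 114 / 39
  printf "jq --stream space*time: %d MB*s\n", 2 * 447
  printf "jstream space*time: %d MB*s\n", 8 * 114
}')
printf '%s\n' "$ratios"
# jq --stream time / jstream time: 3.9
# jstream time / jq time: 2.9
# jq --stream space*time: 894 MB*s
# jstream space*time: 912 MB*s
```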

Characterization of the test file

The test file consists of an array of 10,000,000 simple objects:

[
{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
,{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
...
]

$ ls -l input.json
-rw-r--r--  1 xyzzy  staff  980000002 May  2  2019 input.json

$ wc -l input.json
 10000001 input.json

jq times and mrss

$ /usr/bin/time -l jq empty input.json
       43.91 real        37.36 user         4.74 sys
4981452800  maximum resident set size

$ /usr/bin/time -l jq length input.json
10000000
       48.78 real        41.78 user         4.41 sys
4730941440  maximum resident set size

$ /usr/bin/time -l jq type input.json
"array"
       37.69 real        34.26 user         3.05 sys
5582196736  maximum resident set size

$ /usr/bin/time -l jq 'def count(s): reduce s as $i (0;.+1); count(.[])' input.json
10000000
       39.40 real        35.95 user         3.01 sys
5582176256  maximum resident set size

$ /usr/bin/time -l jq -cn --stream 'fromstream(1|truncate_stream(inputs))' input.json | wc -l
      449.88 real       444.43 user         2.12 sys
   2023424  maximum resident set size
 10000000

jstream times and mrss

$ /usr/bin/time -l jstream -d 1 < input.json > /dev/null
       61.63 real        79.52 user        16.43 sys
   7999488  maximum resident set size

$ /usr/bin/time -l jstream -d 1 < input.json | wc -l
       77.65 real        93.69 user        20.85 sys
   7847936  maximum resident set size
 10000000

Answered By - peak

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.