Thursday, October 28, 2021

[SOLVED] sed not working on large file [Looking for other options]

Issue

I have a gigantic JSON file that was accidentally written without newline characters between the JSON entries, so it is being treated as one giant single line. What I tried was a find and replace with sed to insert a newline before each entry:

sed 's/{"seq_id"/\n{"seq_id"/g' my_giant_json.json

It doesn't output anything.

However, I know my sed expression is correct: if I operate on just a small part of the file, it works fine.

head -c 1000000 my_giant_json.json |  sed 's/{"seq_id"/\n{"seq_id"/g'

I have also tried using Python with this gnarly one-liner:

'\n{"seq_id'.join(open(json_file,'r').readlines()[0].split('{"seq_id')).lstrip()

But this loads the entire file into memory because of the readlines() call. I don't know how to iterate through a giant single line of characters in chunks and do a find and replace.
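For reference, a chunked find-and-replace can be done in Python without loading the whole file. This is only a sketch (the function name, chunk size, and output path are illustrative, not from the original post): it streams the file in fixed-size chunks and carries over a small tail so a marker that straddles a chunk boundary isn't missed. Like the answer below, it also inserts a newline before the very first record.

```python
def insert_newlines(src, dst, marker='{"seq_id"', chunk_size=1 << 20):
    """Stream src in chunks, writing dst with '\\n' inserted before each
    occurrence of marker, without holding the whole file in memory."""
    keep = len(marker) - 1  # max length of a partial marker at a chunk edge
    carry = ""
    with open(src) as fin, open(dst, "w") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                # carry is too short to hold a full marker; flush as-is
                fout.write(carry.replace(marker, "\n" + marker))
                break
            buf = carry + chunk
            # everything up to the end of the last complete marker is safe
            j = buf.rfind(marker)
            flush_end = j + len(marker) if j != -1 else 0
            fout.write(buf[:flush_end].replace(marker, "\n" + marker))
            # the remainder has no complete marker; keep only enough of
            # its tail to complete a marker split across the boundary
            rest = buf[flush_end:]
            if len(rest) > keep:
                fout.write(rest[:len(rest) - keep])
                carry = rest[len(rest) - keep:]
            else:
                carry = rest
```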

Any thoughts?


Solution

Perl will let you change the input record separator ($/) from a newline to another character. You can take advantage of this to get some convenient chunking. (sed, by contrast, reads input a line at a time, so a file with no newlines forces it to buffer everything as one enormous line, which is likely why it produced no output.)

perl -pe'BEGIN{$/="}"}s/^({"seq_id")/\n$1/' my_giant_json.json

That sets the input record separator to "}", so Perl reads the file one }-terminated chunk at a time. It then looks for chunks that start with {"seq_id" and prefixes them with a newline.

Note that this puts an unnecessary blank line at the beginning. You could complicate the program to eliminate it, or just delete it manually afterwards.



Answered By - Bo Borgerson