Wednesday, February 7, 2024

[SOLVED] Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

Issue

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often each line is repeated; however, the 10 GB file is too large to sort and my computer runs out of memory.
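For reference, the pipeline I have in mind looks roughly like this (the -nr on the final sort is just to list the most repeated lines first):

    # Count how often each line occurs, most frequent first; the first
    # sort is where the machine runs out of memory on the 10 GB file.
    sort FILE | uniq -c | sort -nr > line_counts.txt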

Is there a way to compress the file while preserving newlines (or an entirely different method altogether) that would reduce the file to a small enough size to sort, yet still leave it in a sortable condition?

Or is there any other way of finding out how many times each line is repeated inside a large (~10 GB, CSV-like) file?

Thanks for any help!


Solution

There are some possible solutions:

1 - Use a text-processing language (Perl, awk) to read each line, record the line number together with a hash of that line, and then compare the hashes to find repeats.
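A minimal sketch of that idea, assuming Perl with the core Digest::MD5 module is available: every line is replaced by a fixed-width MD5 digest before sorting, which shrinks the sort input considerably when the lines themselves are long.

    # Replace each line with its MD5 digest, then count how often each
    # digest occurs. Identical lines give identical digests, so the
    # counts match the original line counts (barring hash collisions).
    perl -MDigest::MD5=md5_hex -ne 'print md5_hex($_), "\n"' FILE \
        | sort | uniq -c | sort -nr > hash_counts.txt

The counts come out keyed by hash, not by line; to see which actual line a frequent hash belongs to, you would scan the file again, or also print the line number alongside each hash as suggested above.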

2 - Can / want to remove the duplicate lines, leaving just one occurrence of each line in the file? You could use a command like: awk '!x[$0]++' oldfile > newfile
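The !x[$0]++ expression is true only the first time a given line is seen, which is why the command above keeps one copy of each line without any sorting. If the goal is counts rather than de-duplication, a closely related awk sketch (my variation, not part of the original answer) keeps one counter per distinct line in memory, which works as long as the number of distinct lines is much smaller than the total number of lines:

    # Count how many times each distinct line appears; memory use grows
    # with the number of *distinct* lines, not with the file size.
    awk '{count[$0]++} END {for (line in count) print count[line], line}' FILE \
        | sort -nr > line_counts.txt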

3 - Why not split the file according to some criterion? Supposing all your lines begin with letters:
- break your original_file into smaller files, one per starting letter, e.g. grep "^a" original_file > a_file
- sort each small file (a_file, b_file, and so on)
- verify the duplicates, count them, and do whatever you want.
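A sketch of that split-then-count approach, assuming bash and that the lines really do start with lowercase letters (the bucket file names are my own):

    # Split the big file into one bucket per starting letter, then sort
    # and count duplicates inside each (much smaller) bucket.
    for letter in {a..z}; do
        grep "^$letter" original_file > "${letter}_file"
        sort "${letter}_file" | uniq -c | sort -nr > "${letter}_counts"
    done

Every line lands in exactly one bucket, so the per-bucket counts are also the global counts. Lines that start with something other than a lowercase letter would need an extra bucket, and a single awk pass that writes each line to a file named after its first character would avoid re-reading the 10 GB file 26 times.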



Answered By - woliveirajr
Answer Checked By - Robin (WPSolving Admin)