Issue
I’m working with large text datasets of about 1 GB each (the smallest file has about 2 million lines). Each line is supposed to be split into a number of columns. I say supposed because there are exceptions: while normal lines end with \r\n, a good number of them are incorrectly divided onto 2 or 3 lines.
Given that there are 10 columns, each line is supposed to have the following format:
col_1 | col_2 | col_3 | ... | col_10\r\n
The exceptions have this format:
1. col_1 | col_2 | col_3 ...\n
... | col_10\r\n
2. col_1 | col_2 | col_3 ...\n
... | col_10\n
\r\n
What would be the fastest way to correct these exceptions? I did a simple find/replace in a text editor (TextMate, on Mac) on a sample of 1000 lines using the regular expression (^[^\r\n]*)\n, replacing with $1, and it works perfectly. But the text editor apparently cannot handle the big files (>= 2 million lines). Can this be done with sed or grep (or some other command-line tool, or even Python) using equivalent regular expressions, and how?
Solution
Your regex translates directly to perl ($1 is the idiomatic form of \1 in the replacement):
perl -pe 's/(^[^\r\n]*)\n/$1/' input > output
Or, use a negative lookbehind to delete every \n that is not preceded by \r:
perl -pe 's/(?<!\r)\n//' input > output
Or, remove all \n and then replace each \r with \r\n:
perl -pe 's/\n//; s/\r/\r\n/' input > output
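Since the question also asks about Python, the last variant (drop every \n, then restore \r\n after each \r) can be sketched there as well; it even handles the trailing blank line of exception case 2. The function names join_broken_lines and fix_file are my own, not from the original answer:

```python
def join_broken_lines(text: str) -> str:
    # Mirror the third perl one-liner: delete every \n, then turn each
    # remaining \r back into a proper \r\n record terminator.
    return text.replace("\n", "").replace("\r", "\r\n")

def fix_file(src: str, dst: str, chunk_size: int = 1 << 20) -> None:
    # For ~1 GB inputs, stream the file in binary chunks rather than
    # loading it all into memory. Both substitutions are single-byte,
    # so chunk boundaries cannot split a match.
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            f_out.write(chunk.replace(b"\n", b"").replace(b"\r", b"\r\n"))
```

join_broken_lines("a|b\r\nc|\nd\r\n") returns "a|b\r\nc|d\r\n", i.e. the split record is rejoined and the good record is untouched.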
Answered By - Nikita Kouevda Answer Checked By - David Goodson (WPSolving Volunteer)