Issue
I have inherited this sed script snippet that attempts to remove certain empty spaces:
s/[\s\t]*|/|/g
s/|[\s\t]*/|/g
s/[\s] *$//g
s/^|/null|/g
It operates on a file that is around 1 GB in size. The script runs for 2 hours on our Unix server. Any ideas how to speed it up?
Note that \s stands for a space and \t stands for a tab; the actual script uses a literal space and a literal tab, not those symbols.
The input file is pipe-delimited and is stored locally, not on the network. The 4 lines are in a file executed with sed -f.
Solution
The best I was able to do with sed was this script:
s/[\s\t]*|[\s\t]*/|/g
s/[\s\t]*$//
s/^|/null|/
In my tests, this ran about 30% faster than your sed script. The increase in performance comes from combining the first two regexen and omitting the "g" flag where it's not needed.
However, 30% faster is only a mild improvement (it should still take about an hour and a half to run the above script on your 1GB data file). I wanted to see if I could do any better.
In the end, no other method I tried (awk, perl, and other approaches with sed) fared any better, except -- of course -- a plain ol' C implementation. As would be expected with C, the code is a bit verbose for posting here, but if you want a program that's likely going to be faster than any other method out there, you may want to take a look at it.
In my tests, the C implementation finishes in about 20% of the time it takes for your sed script. So it might take about 25 minutes or so to run on your Unix server.
I didn't spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I don't know if it's possible to shave a significant amount of time beyond what it already achieves. If anything, I think it certainly places an upper limit on what kind of performance you can expect from other methods (sed, awk, perl, python, etc).
Edit: The original version had a minor bug that caused it to possibly print the wrong thing at the end of the output (e.g. could print a "null" that shouldn't be there). I had some time today to take a look at it and fixed that. I also optimized away a call to strlen() that gave it another slight performance boost.
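The C program itself is not reproduced in this post. To give a rough idea of what a single-pass approach can look like, here is a minimal sketch. It is not the code that was benchmarked above, and the names clean_line, process_stream and MAX_LINE are made up for illustration. It assumes every input line fits in MAX_LINE bytes and that the only blanks to strip are a literal space and tab, as in the sed script.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 65536  /* assumption: no input line is longer than this */

/* Rewrite one NUL-terminated line (without its newline) in place:
 * drop spaces/tabs on both sides of every '|' and at the end of the line.
 * Returns the new length. */
size_t clean_line(char *line)
{
    char *r = line;     /* read position */
    char *w = line;     /* write position, never ahead of r */
    char *keep = line;  /* one past the last non-blank byte written */

    while (*r != '\0') {
        char c = *r++;
        if (c == '|') {
            w = keep;                 /* discard blanks written before the pipe */
            *w++ = '|';
            keep = w;
            while (*r == ' ' || *r == '\t')
                r++;                  /* skip blanks after the pipe */
        } else {
            *w++ = c;
            if (c != ' ' && c != '\t')
                keep = w;
        }
    }
    *keep = '\0';                     /* cuts off trailing blanks */
    return (size_t)(keep - line);
}

/* Apply clean_line to every line of `in`, writing the result to `out`.
 * A line that starts with '|' after cleaning gets "null" prepended,
 * like s/^|/null|/ in the sed script. Returns 0 on success, -1 on I/O error. */
int process_stream(FILE *in, FILE *out)
{
    static char buf[MAX_LINE];

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);
        int had_newline = (len > 0 && buf[len - 1] == '\n');

        if (had_newline)
            buf[len - 1] = '\0';
        len = clean_line(buf);
        if (buf[0] == '|')
            fputs("null", out);
        fwrite(buf, 1, len, out);
        if (had_newline)
            fputc('\n', out);
    }
    return (ferror(in) || ferror(out)) ? -1 : 0;
}
```

To use it as a filter, call process_stream(stdin, stdout) from main and redirect the data file into the program. Each byte is read once and written at most once, with no regex engine involved, which is the general reason a hand-written loop like this can beat sed on this kind of job.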
Answered By - Dan Moulding