Wednesday, March 16, 2022

[SOLVED] AWK very slow when splitting large file based on column value and using close command

Issue

I need to split a large log file into smaller ones based on the ID found in the first column. This solution worked wonders and ran very fast for months:

awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;

Where $nome is a file and directory name.

It's very fast and worked until the log file reached several million lines (a 2+ GB text file), then it started to show

"Too many open files"

The solution is indeed very simple: adding the close() call:

awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;

The problem is that now it's VERY slow; it's taking forever to do something that used to be done in seconds, and I need to optimize this.

AWK is not mandatory; I can use an alternative, I just don't know how.


Solution

Untested since you didn't provide any sample input/output to test with, but this should do it:

sort -t';' -k1,1 "${nome}.all"  |
awk -v dir="$nome" -F\; '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'

Your first script:

awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;

had 3 problems:

  1. It wasn't closing the output files as it went, and so it exceeded the open-files limit you saw.
  2. It had an unparenthesized expression on the right side of output redirection, which is undefined behavior per POSIX.
  3. It wasn't quoting the shell variable ${nome} in the input file name.

It's worth mentioning that gawk would be able to handle 1 and 2 without failing, but it would slow down as the number of open files grew and it had to manage the opens/closes internally.
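
For illustration only (untested, same caveat as above), this is roughly what the first script would look like with just problems 2 and 3 fixed, i.e. the redirection target parenthesized and the shell variable quoted; without close() it would still hit the open-files limit on a large input:

awk -v dir="$nome" -F\; '{print > (dir"/"dir"_"$1".log")}' "${nome}.all"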

Your second script:

awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;

though now closing the output file, still had problems 2 and 3 and added 2 new problems:

  1. It was opening and closing the output files once per input line instead of only when the output file name had to change.
  2. It was overwriting the output file for each $1 on every line written to it instead of appending to it (see the sketch after this list).
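
Purely as a sketch of that second point (not the recommended fix), saving the output name in a variable and appending with >> would avoid the overwrite, but it still pays the per-line open/close cost and would also append to any output files left over from a previous run:

awk -v dir="$nome" -F\; '{out = dir"/"dir"_"$1".log"; print >> out; close(out)}' "${nome}.all"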

The sort-based solution above assumes you have multiple lines of input for each $1, and so each output file will have multiple lines. Otherwise the slowdown you saw when closing the output files wouldn't have happened.

The above sort could rearrange the relative order of input lines for each $1. If that's a problem, add -s for "stable sort" if you have GNU sort, or let us know, as it's easy to work around with POSIX tools.
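
For example, with GNU sort the stable variant of the same pipeline would look something like this:

sort -s -t';' -k1,1 "${nome}.all" |
awk -v dir="$nome" -F\; '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'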



Answered By - Ed Morton
Answer Checked By - David Marino (WPSolving Volunteer)