Issue
I have a log file (about 50K rows) in the format:
[email protected]:address0:some_details0
[email protected]:address1:some_details1
[email protected]:address2:some_details2
[email protected]:address3:some_details3
I am trying to read this file and split it into two folders (gmail.com and yahoo.com), and then write each row to a unique file named after the email-ID. My code below works, but it is very slow. Can someone please help me make this faster and more efficient? It would be appreciated.
#!/bin/sh
grep -hv -P "[^[:ascii:]]" * |
awk -F":" '
{
if ($1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]/ && NF>1 && $NF!="")
{
split($1, arr, "@")
system("mkdir -p "tolower(arr[2]))
print $0 >> tolower(arr[2])"/"tolower(arr[1])
}
}'
PS: The regex is a basic check to ensure the email address is valid; I am not doing an overly heavy check. At first I thought the regex was making my code slower, but it isn't: even without the regex the code is super slow. I think the I/O is making this slow. How can we improve it?
Solution
It's mostly spawning a new subshell to call mkdir once per input line that's making your code run so slow. Do something like this instead:
filename = tolower(arr[1])
dirname = tolower(arr[2])
if ( !seen[dirname]++ ) {
system("mkdir -p \047" dirname "\047")
}
print > (dirname "/" filename)
so you only spawn a subshell to call mkdir once per directory.
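As a minimal sketch of that idiom in isolation, the following runs the seen[] guard over three rows and counts how often mkdir is actually invoked. The sample addresses, the scratch directory, and the calls counter are all made up for illustration and are not part of the original program:

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample rows: two addresses share a domain, one does not.
mkdir_calls=$(
printf '%s\n' \
    'alice@gmail.com:address0:details0' \
    'bob@gmail.com:address1:details1' \
    'carol@yahoo.com:address2:details2' |
awk -F':' '
{
    split($1, arr, "@")
    filename = tolower(arr[1])
    dirname  = tolower(arr[2])
    # seen[dirname]++ evaluates to 0 only the first time a directory
    # name appears, so mkdir is spawned once per directory, not per row.
    if ( !seen[dirname]++ ) {
        system("mkdir -p \047" dirname "\047")
        calls++    # illustration only: count the mkdir invocations
    }
    print > (dirname "/" filename)
}
END { print calls }
'
)
echo "mkdir calls: $mkdir_calls"
```

With three rows spread over two domains, mkdir runs twice rather than three times; on a 50K-row log with only a couple of domains the saving is proportionally much larger.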
Note that unless you're using GNU awk you'll hit a "too many open files" error when you've created about a dozen output files. Even with GNU awk, things get slower the more output files you have open, so that may also be impacting your code's performance. The common solution is to sort your input file by email address first and then close the current output file every time the email address (and so the output file name) changes.
Given that, here's how I'd really write your program:
#!/usr/bin/env bash
grep -hv -P '[^[:ascii:]]' "${@:--}" |
sort -t':' -k1,1 -s |
awk -F':' '
!($1 ~ /^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]/ && NF>1 && $NF!="") { next }
{ curr = tolower($1) }
curr != prev {
close(out)
split(curr, arr, /@/)
filename = arr[1]
dirname = arr[2]
if ( !seen[dirname]++ ) {
system("mkdir -p \047" dirname "\047")
}
out = dirname "/" filename
prev = curr
}
{ print > out }
'
I used GNU sort above for -s, for "stable sort". If you don't have that and care about the relative order of input lines for a given email address being retained in the output, there are other ways to handle that, e.g. awk -v OFS=':' '{print NR, $0}' | sort -t':' -k2,2 -k1,1n | cut -d':' -f2-.
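To illustrate that decorate-sort-undecorate fallback, here is a small sketch on made-up input (the addresses and the sorted variable are hypothetical). Each row is prefixed with its input line number, sorted by address and then numerically by that number, and the prefix is stripped again:

```shell
#!/bin/sh
# Hypothetical input: two addresses, each with a "first" and a "second" row,
# interleaved so that an unstable sort could reorder rows within an address.
# awk decorates each row with its input line number, sort orders by address
# (field 2) and then by that line number (field 1, numeric), and cut removes
# the line-number prefix again.
sorted=$(
printf '%s\n' \
    'bob@yahoo.com:addr:first' \
    'alice@gmail.com:addr:first' \
    'bob@yahoo.com:addr:second' \
    'alice@gmail.com:addr:second' |
awk -v OFS=':' '{ print NR, $0 }' |
sort -t':' -k2,2 -k1,1n |
cut -d':' -f2-
)
printf '%s\n' "$sorted"
```

Rows for the same address come out grouped together and in their original relative order, without relying on -s.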
Answered By - Ed Morton