Issue
I have an input file with billions of records and a header. Header consists of meta info, total number of rows and sum of the sixth column. I am splitting the file into small sizes, due to which my header record must be updated as the sum of sixth column and total rows is changed.
This is the sample record
filename: testFile.text
00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
Expected:
filename: testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
filename: testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
I am able to split the file and calculate the sum but unable to replace the value in header part. This is the script I have made
#!/bin/bash
splitRowCount=$1
transactionColumn=$2
filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"
echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"
awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount')+1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename
ls *.txt | while read line
do
firstLine=$(head -n 1 $line);
awk -F '|' 'NR !=1 {sum += '$transactionColumn'}END {print sum} ' $line
done
Solution
Here's an awk
solution for splitting the original file into files of n
records. The idea is to accumulate the records until the given count is reached then generate a file with the updated header and the accumulated records:
n=3
file=./testFile.text
awk -v numRecords="$n" '
BEGIN {
FS = OFS = "|"
if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
filePrefix = substr(ARGV[1],1,RSTART)
fileSuffix = substr(ARGV[1],RSTART+1)
} else {
filePrefix = ARGV[1]
fileSuffix = ""
}
if (getline headerStr <= 0)
exit 1
split(headerStr, headerArr)
}
(NR-2) % numRecords == 0 && recordsCount {
outfile = filePrefix "_" ++filesCount fileSuffix
print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
printf("%s", records) > outfile
close(outfile)
records = ""
recordsCount = recordsSum = 0
}
{
records = records $0 ORS
recordsCount++
recordsSum += $6
}
END {
if (recordsCount) {
outfile = filePrefix "_" ++filesCount fileSuffix
print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
printf("%s", records) > outfile
close(outfile)
}
}
' "$file"
With the given sample you'll get:
testFile_1.text
00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
testFile_2.text
00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP
Answered By - Fravadona Answer Checked By - Senaida (WPSolving Volunteer)