Issue
I have a file that has the following data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
I want to find every line that contains CACR and then divide the col2, col4, col5, and col6 values by that row's col6 value (skipping the division if col6 is 0 or blank), using the Linux terminal. So my output looks like the following:
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOX","30","10",
"CACR","50","CO","40","5","0"
I am trying to use grep and awk:
grep CACR file.csv | awk -F "," '$6 != 0; $6 == "" {$2 = $2/$6; $4= $4/$6; $5 = $5/$6; $6 = $6/$6}1'
But I couldn't get the desired output.
Solution
As outlined in a comment, the primary problem is that the double quotes around the fields mean that when a field is interpreted as a number (e.g. with a division), the value is zero. I think you need to write Awk functions to remove and reinstate the double quotes. With those in place, it's mostly a SMOP — a Simple Matter of Programming.
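You can see the root cause in isolation before looking at the full script. Awk converts a string to a number by reading leading numeric characters, so a value that starts with a double quote converts to zero:

```shell
# A leading '"' stops Awk's string-to-number conversion immediately,
# so "50" (with quotes) has numeric value 0, while bare 50 is 50.
printf '%s\n' '"50"' | awk '{ print $1 + 0 }'   # prints 0
printf '%s\n' '50'   | awk '{ print $1 + 0 }'   # prints 50
```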
Here's my version. It could be written more succinctly (fewer newlines, fewer spaces), but I prefer clarity over brevity.
script.awk
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}
function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}
BEGIN { FS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    printf("%s,%s,%s,%s,%s,%s\n",
           $1,
           add_quotes(strip_quotes($2) / div),
           $3,
           add_quotes(strip_quotes($4) / div),
           add_quotes(strip_quotes($5) / div),
           add_quotes(strip_quotes($6) / div))
}
data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
Output
$ awk -f script.awk data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
$
Variant script3.awk
This code also sets the output field separator OFS to comma, and resets the values of $2, $4, $5, and $6 before using print to print the modified $0.
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}
function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
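For completeness, here's one way to exercise the variant end-to-end and confirm it produces the same output as script.awk; the heredoc file names (script3.awk, data) just mirror the names used above:

```shell
# Write the variant script and the sample data, then run it.
cat > script3.awk <<'EOF'
function strip_quotes(s) { gsub(/"/, "", s); return s }
function add_quotes(s)   { return sprintf("\"%s\"", s) }
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
EOF
cat > data <<'EOF'
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
EOF
awk -f script3.awk data    # same four output lines as script.awk
```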
Data validation
Both versions of the script could be more stringent, validating that there are 5 or 6 columns (rejecting lines with more or fewer columns, or complaining about them). The check for the headings could insist on 6 columns. It might be sensible to check that div is a non-zero number, and that each of $2, $4, $5, and $6 is a number. The divisors (column 6) in the sample data are convenient; you might need to do some work if the number is not as simple, such as 7, where the result could have many decimal places. You'd need to decide how such numbers should be formatted (the default might be OK, or it might not). It might also be worth checking that the data in each field matches the regex /^"[^"]*"$/ (so each value is surrounded by double quotes).
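As a sketch, a separate validation pass might look like this. The exact rules are assumptions to adjust for your data: exactly 6 fields per line (note that a trailing comma still splits into 6 fields under FS=",", with $6 empty), and every field double-quoted except for a possibly empty $6:

```shell
# validate.awk: a sketch of the checks described above.
cat > validate.awk <<'EOF'
BEGIN { FS = "," }
# Every line should split into exactly 6 fields
# (a trailing comma still yields 6, with $6 empty).
NF != 6 { printf("line %d: %d columns, want 6\n", NR, NF); next }
{
    # Each field should be double-quoted; an empty $6 is allowed.
    for (i = 1; i <= NF; i++)
        if ($i !~ /^"[^"]*"$/ && !(i == 6 && $6 == ""))
            printf("line %d, field %d: not quoted: %s\n", NR, i, $i)
}
EOF
printf '%s\n' '"col1","col2","col3","col4","col5","col6"' \
              '"CACR","25","NOx","30","10",' \
              '"X","1","Y"' > sample
awk -f validate.awk sample     # flags only the malformed third line
```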
Trailing white space
The rule $6 == "" || $6 == "\"0\"" { print; next } does not handle trailing white space very well. It can be revised to:
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
That recognizes trailing white space and treats it as zero. It would be possible, and probably sensible, to add:
if (div == 0) { print; next }
after the assignment to div. If the value found is zero, there is a problem. It would be possible to complain too, producing an error message diagnosing 'malformed data'.
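Putting both refinements together might look like the sketch below. The + 0 in the guard is my addition: it forces a numeric comparison, so a divisor like "0.0" (which strip_quotes returns as the string "0.0") is also caught as zero:

```shell
# script4.awk: whitespace-tolerant guard plus a divisor sanity check.
cat > script4.awk <<'EOF'
function strip_quotes(s) { gsub(/"/, "", s); return s }
function add_quotes(s)   { return sprintf("\"%s\"", s) }
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    # The + 0 forces a numeric comparison, catching divisors like "0.0".
    if (div + 0 == 0) { print; next }
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
EOF
printf '%s\n' '"col1","col2","col3","col4","col5","col6"' \
              '"CACR","10","SO2","50","6","2.0"' \
              '"CACR","9","X","9","9","0.0"' > sample2
awk -f script4.awk sample2
# "col1","col2","col3","col4","col5","col6"
# "CACR","5","SO2","25","3","1"
# "CACR","9","X","9","9","0.0"
```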
How much of the validation and error prevention is worthwhile depends on how unruly your input data is. If you're dealing with human-generated data, you have to deal with humans' propensity for bending the rules and feeding erratic or erroneous data to programs, and you probably need to handle (diagnose) unexpected inputs. If you're dealing with machine-generated data, it is typically more uniform, and you can get away with less validation work.
Most solutions that depend on regexes have to strike a balance between working sufficiently well and breaking on erratic inputs. The more erratic the inputs, the harder it is to devise bomb-proof (fool-proof) regexes. As the saying goes, "if you make something idiot-proof, someone will just make a better idiot".
Answered By - Jonathan Leffler Answer Checked By - Cary Denson (WPSolving Admin)