Issue
I have a file that has the following data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
I want to find every line that contains CACR and then divide the col2, col4, col5, and col6 values by that row's col6 value (skipping the division if col6 is 0 or blank), using the Linux terminal. So my output looks like the following:
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOX","30","10",
"CACR","50","CO","40","5","0"
I am trying to use grep and awk:
grep CACR file.csv | awk -F "," '$6 != 0; $6 == "" {$2 = $2/$6; $4= $4/$6; $5 = $5/$6; $6 = $6/$6}1'
But I couldn't get the desired output.
Solution
As outlined in a comment, the primary problem is that the double quotes around the fields mean that when a field is interpreted as a number (e.g. with a division), the value is zero. I think you need to write Awk functions to remove and reinstate the double quotes. With those in place, it's mostly a SMOP — a Simple Matter of Programming.
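You can see the root cause in isolation before looking at the full script. Awk converts a string to a number by reading leading numeric characters, so a value that starts with a double quote converts to zero:

```shell
# A leading '"' stops Awk's string-to-number conversion immediately,
# so "50" (with quotes) has numeric value 0, while bare 50 is 50.
printf '%s\n' '"50"' | awk '{ print $1 + 0 }'   # prints 0
printf '%s\n' '50'   | awk '{ print $1 + 0 }'   # prints 50
```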
Here's my version. It could be written more succinctly (fewer newlines, fewer spaces), but I prefer clarity over brevity.
script.awk
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}
function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}
BEGIN { FS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    printf("%s,%s,%s,%s,%s,%s\n",
           $1,
           add_quotes(strip_quotes($2) / div),
           $3,
           add_quotes(strip_quotes($4) / div),
           add_quotes(strip_quotes($5) / div),
           add_quotes(strip_quotes($6) / div))
}
data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
Output
$ awk -f script.awk data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
$
Variant script3.awk
This code also sets the output field separator OFS to comma, and resets the values of $2, $4, $5, and $6 before using print to print the modified $0.
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}
function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
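For completeness, here's one way to exercise the variant end-to-end and confirm it produces the same output as script.awk; the heredoc file names (script3.awk, data) just mirror the names used above:

```shell
# Write the variant script and the sample data, then run it.
cat > script3.awk <<'EOF'
function strip_quotes(s) { gsub(/"/, "", s); return s }
function add_quotes(s)   { return sprintf("\"%s\"", s) }
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
EOF
cat > data <<'EOF'
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
EOF
awk -f script3.awk data    # same four output lines as script.awk
```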
Data validation
Both versions of the script could be more stringent, validating that there are 5 or 6 columns (rejecting lines with more or fewer columns, or complaining about them). The check for the headings could insist on 6 columns. It might be sensible to check that div is a non-zero number, and that each of $2, $4, $5, and $6 is a number. The divisors (column 6) in the sample data are convenient; you might need to do some work if the number is not as simple, such as 7, where the result could have many decimal places. You'd need to decide how such numbers should be formatted (the default might be OK, or it might not). It might also be worth checking that the data in each field matches the regex /^"[^"]*"$/ (so each value is surrounded by double quotes).
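As a sketch, a separate validation pass might look like this. The exact rules are assumptions to adjust for your data: exactly 6 fields per line (note that a trailing comma still splits into 6 fields under FS=",", with $6 empty), and every field double-quoted except for a possibly empty $6:

```shell
# validate.awk: a sketch of the checks described above.
cat > validate.awk <<'EOF'
BEGIN { FS = "," }
# Every line should split into exactly 6 fields
# (a trailing comma still yields 6, with $6 empty).
NF != 6 { printf("line %d: %d columns, want 6\n", NR, NF); next }
{
    # Each field should be double-quoted; an empty $6 is allowed.
    for (i = 1; i <= NF; i++)
        if ($i !~ /^"[^"]*"$/ && !(i == 6 && $6 == ""))
            printf("line %d, field %d: not quoted: %s\n", NR, i, $i)
}
EOF
printf '%s\n' '"col1","col2","col3","col4","col5","col6"' \
              '"CACR","25","NOx","30","10",' \
              '"X","1","Y"' > sample
awk -f validate.awk sample     # flags only the malformed third line
```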
Trailing white space
The rule $6 == "" || $6 == "\"0\"" { print; next } does not handle trailing white space very well. It can be revised to:
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
That recognizes trailing white space and treats it as zero. It would be possible, and probably sensible, to add:
if (div == 0) { print; next }
after the assignment to div. If the value found is zero, there is a problem. It would be possible to complain too, producing an error message diagnosing 'malformed data'.
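Putting both refinements together might look like the sketch below. The + 0 in the guard is my addition: it forces a numeric comparison, so a divisor like "0.0" (which strip_quotes returns as the string "0.0") is also caught as zero:

```shell
# script4.awk: whitespace-tolerant guard plus a divisor sanity check.
cat > script4.awk <<'EOF'
function strip_quotes(s) { gsub(/"/, "", s); return s }
function add_quotes(s)   { return sprintf("\"%s\"", s) }
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
{
    div = strip_quotes($6)
    # The + 0 forces a numeric comparison, catching divisors like "0.0".
    if (div + 0 == 0) { print; next }
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
EOF
printf '%s\n' '"col1","col2","col3","col4","col5","col6"' \
              '"CACR","10","SO2","50","6","2.0"' \
              '"CACR","9","X","9","9","0.0"' > sample2
awk -f script4.awk sample2
# "col1","col2","col3","col4","col5","col6"
# "CACR","5","SO2","25","3","1"
# "CACR","9","X","9","9","0.0"
```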
How much of the validation and error prevention is worthwhile depends on how unruly your input data is. If you're dealing with human-generated data, you have to deal with humans' propensity for bending the rules and feeding erratic or erroneous data to programs, and you probably need to handle (diagnose) unexpected inputs. If you're dealing with machine-generated data, it is typically more uniform, and you can get away with less validation work.
Most solutions that depend on regexes have to strike a balance between working sufficiently well and breaking on erratic inputs. The more erratic the inputs, the harder it is to devise bomb-proof (fool-proof) regexes. As the saying goes, "if you make something idiot-proof, someone will just make a better idiot".
Answered By - Jonathan Leffler Answer Checked By - Cary Denson (WPSolving Admin)