Issue
My assignment is to create a script called script.sed
that must set the conditions below:
- Erase the two first columns (already achieved in the code below)
- Erase the lines with a 0 in the column "frequency" (already achieved in the code below)
- Keep the headers for the fields not modified, erase the titles of the columns previously erased, and change the name of the column "freq_prop_headlines" to "%_headlines" (already achieved in the code below)
- Move the decimal point of
freq_prop_headlines
two places back to obtain the percentage instead of the percentage by one, keeping all the digits and remove any unnecessary zeros before the decimal point. Some records are in scientific notation (all raised to -5) and they must also be treated, being displayed in decimal notation. (Note by the OP: For example, if there is a number like 0.000814664 it has to turn into 0.0814664)
So, I only have the last point remaining to resolve.
The CSV file I have to work with is called headlines_words.csv
, and its 10 first rows are:
Unnamed: 0;Unnamed: 0.1;year;country;word;frequency;count;freq_prop_headlines;word_len;freq_rank;hfreq_rank;theme
12;277;2010;India;cricketer;0;1584;0.0;9;20;20;empowerment
13;278;2011;India;cricketer;0;2438;0.0;9;20;20;empowerment
14;279;2012;India;cricketer;0;3634;0.0;9;20;20;empowerment
15;280;2013;India;cricketer;4;4910;0.000814664;9;20;20;empowerment
16;281;2014;India;cricketer;6;7502;0.0007997869999999;9;20;20;empowerment
17;282;2015;India;cricketer;11;10532;0.001044436;9;20;20;empowerment
18;283;2016;India;cricketer;14;14012;0.000999144;9;20;20;empowerment
19;284;2017;India;cricketer;48;17097;0.00280751;9;20;20;empowerment
20;285;2018;India;cricketer;40;19170;0.002086594;9;20;20;empowerment
21;286;2019;India;cricketer;66;20849;0.003165619;9;20;20;empowerment
The code of the script.sed
I already got is:
# Erase the two first columns:
s/^[^;]*;[^;]*;//
# Erase all rows with count 0 in the frequency column
/^.*;0;/d
# Rename freq_prop_headlines to %_headlines
s/freq_prop_headlines/%_headlines/
# Show the first 10 rows (to ease the code checking by the output)
10q
I have to run the command below (it's mandatory by the statement):
sed -f script.sed headlines_words.csv
Once run my code I get this:
year;country;word;frequency;count;%_headlines;word_len;freq_rank;hfreq_rank;theme
2013;India;cricketer;4;4910;0.000814664;9;20;20;empowerment
2014;India;cricketer;6;7502;0.0007997869999999;9;20;20;empowerment
2015;India;cricketer;11;10532;0.001044436;9;20;20;empowerment
2016;India;cricketer;14;14012;0.000999144;9;20;20;empowerment
2017;India;cricketer;48;17097;0.00280751;9;20;20;empowerment
2018;India;cricketer;40;19170;0.002086594;9;20;20;empowerment
The expected output must be:
year;country;word;frequency;count;%_headlines;word_len;freq_rank;hfreq_rank;theme
2013;India;cricketer;4;4910;0.0814664;9;20;20;empowerment
2014;India;cricketer;6;7502;0.07997869999999;9;20;20;empowerment
2015;India;cricketer;11;10532;0.1044436;9;20;20;empowerment
2016;India;cricketer;14;14012;0.0999144;9;20;20;empowerment
2017;India;cricketer;48;17097;0.280751;9;20;20;empowerment
2018;India;cricketer;40;19170;0.2086594;9;20;20;empowerment
Now how can I set the last condition of the statement?
Solution
Assuming all the values are less than 0.1 and that the operation can thus be completed simply by moving the decimal point or changing the exponent in the case of a value in scientific notation, and that we can rely on the field which needs to be adjusted to always be in the sixth column (after you deleted the first two), try
s/^\([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;\)0\.0\([0-9]\)\([0-9]*;\)/\1\2.\3/
s/^\([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;1\.[0-9]*[eE]\)-5;/\1-3;/
If you have access to sed -E
or -r
the regex can be simplified considerably, for improved legibility and maintainability:
s/^(([^;]*;){5})0\.0([0-9])([0-9]*;)/\1\3.\4/
s/^(([^;]*;){5}1\.[0-9]*[eE])-5;/\1-3;/
In other words, capture the first five columns just so we can replace them with themselves, and capture the parts of the sixth column which we need to keep and reorder. The captured groups are numbered from the left opening parenthesis, so \1
refers to what the first group captured (in these cases, the first five columns), \2
to whatever the second leftmost parentheses captured, etc.
As an aside, the expression which renames one of the headers should strictly speaking only be applied to the first line.
1s/freq_prop_headlines/%_headlines/
Even more strictly speaking, you should perhaps anchor the eypression to avoid matching a substring of another field name; but I'll leave that as an exercise.
Answered By - tripleee Answer Checked By - Marie Seifert (WPSolving Admin)