Issue
I have this table dump from a MySQL system, and although it follows RFC standards, it appears to have added unwanted space in columns where HTML text are stored. For example:
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
This is one out of about 30K rows, so I am trying to figure out a smart way to remove the space between the " and <div (and possibly others) here. I tried out:
awk '{$1=$1;printf $0}'
And this kind of works, but it mashes everything into one line which is not what I want. I would like to preserve the line breaks in the CSV dump. I am very curious to hear your ideas on how to tackle this.
Solution
You can do this with perl
:
perl -0777 -i -pe 's/"\K\s+(?=<div)//g' file
Details
0777
slurps the file into a single string so that the pattern could match line break sequences-i
- file inline replacement is on"\K\s+(?=<div)
- matches a"
char that is dropped from the match value with\K
, then one or more whitespaces are consumed (with\s+
) and then<div
must follow immediately and the match is replaced with an empty stringg
replaces all occurrences.
You can achieve the same with a GNU sed
:
sed -i -Ez 's/"\s+<div/"<div/g' file
where -i
enables inplace file replacement and -E
enables the POSIX ERE regex syntax, and z
pulls the file text into pattern space where line breaks are "visible" for the regex pattern.
Answered By - Wiktor Stribiżew Answer Checked By - Marilyn (WPSolving Volunteer)