Issue
I have a set of txt files that I need to parse and insert into MySQL. I'm trying to write a bash script to sanitise the data before attempting the SQL INSERTs.
[email protected]:hashed_password
In some cases, as in the above example, the delimiter is a colon ':'
- Yay, nice and easy...
However some files use a comma ',' or a semicolon ';' as the delimiter instead.
- Hmm not so easy.
As an example of a more varied file:
[email protected];hashed_password
[email protected],hashed_password
[email protected],hashed_password
-- note the varied use of delimiters.
I am attempting to write a bash script that checks each line in the txt files. For each line, the script should detect which delimiter is in use, and adjust the line accordingly.
I decided to use three colons ':::' as the field delimiter for the SQL import. This was because many hashed passwords themselves contain further occurrences of the original delimiter character.
Here's my original bash script:
#!/bin/bash
DIR="/file/path/location"
for file in "$DIR"/*.txt
do
    echo "Processing file $file"
    sed -i 's/:/:::/' "$file"
    echo "Importing to SQL db"
    mysql -uroot -p'password' <<EOF
USE collection1;
LOAD DATA LOCAL INFILE '$file' INTO TABLE test_table CHARACTER SET utf8mb4 FIELDS TERMINATED BY ':::' LINES TERMINATED BY '\n';
EOF
    echo "Import complete, archiving file..."
    sudo mv "$file" processed/
done
This first attempt worked where files had a colon delimiter, but of course failed to work with files that used a semicolon or comma.
I guess I need some pattern matching? Well, OK then let's get stuck in.
I experimented with some pattern matching as I'm new to it. I ended up with this line:
grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' test_data.txt
Yay, this produced a nice long list of email addresses in my terminal:
[email protected]
[email protected]
[email protected]
etc
OK, but I need to detect the delimiter, which follows these email addresses. Well I made a simple addition:
grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b.{0,1}' test_data.txt
Another win, this time I get a list of email addresses with the delimiter at the end of each line:
[email protected]:
[email protected];
[email protected]:
etc
But, how can I make a decision/branch, based on what the delimiter is? Am I headed in the right direction with this approach?
What I'll do is:
if [ "$delimiter" = ":" ]; then
    sed -i 's/:/:::/' "$file"
elif [ "$delimiter" = ";" ]; then
    sed -i 's/;/:::/' "$file"
elif [ "$delimiter" = "," ]; then
    sed -i 's/,/:::/' "$file"
fi
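For what it's worth, here is a rough sketch of how that branching idea might look when done per line rather than per file. It assumes bash's `[[ =~ ]]` operator and the `BASH_REMATCH` array; the function names `detect_delim` and `convert_line`, the simplified pattern, and the sample lines are all made up for illustration and are not taken from the real files.

```bash
#!/bin/bash

# detect_delim (hypothetical helper): print the first ',', ';' or ':' that
# follows the '@' on a line; return 1 if no such delimiter is found.
detect_delim() {
    local re='@[^,;:]*([,;:])'   # simplified pattern, an assumption
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# convert_line (hypothetical helper): swap the detected delimiter for ':::'.
convert_line() {
    local line=$1 delimiter
    if delimiter=$(detect_delim "$line"); then
        # Replaces only the first occurrence of that character in the line,
        # so a quoted local-part containing it would still trip this up.
        printf '%s\n' "${line/"$delimiter"/:::}"
    else
        printf '%s\n' "$line"    # leave lines without a delimiter untouched
    fi
}

# Made-up sample lines, one per delimiter.
convert_line 'user1@example.com:hash:with:colons'
convert_line 'user2@example.com;hash,with,commas'
convert_line 'user3@example.com,hash;with;semicolons'
```

Once the delimiter sits in a variable, the three branches collapse into a single substitution, so the if/elif chain is not strictly needed in this form.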
I hope this makes sense! Any help much appreciated.
Solution
Why bother determining the delimiter? Just allow either delimiter.
Here we replace the first ',', ';' or ':' we encounter after each @ with your custom delimiter ':::'
sed -i -E 's/(@[^,;:]*)[,;:]/\1:::/' "$file"
This is safe even though the local-part (before the @) can contain ',', ';' or ':' when quoted. The domain part (after the @) can contain neither ',' nor ';' nor ':'. Since we only look at that part, there are no problems.
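To see the substitution at work, here is a small sketch that runs the same sed expression over a few lines fed on stdin instead of editing a file in place. The wrapper name `normalise_delims` and the sample addresses and hashes are invented for the demonstration; the last sample line assumes a quoted local-part containing a ';'.

```bash
#!/bin/bash

# normalise_delims (hypothetical name): read lines on stdin and replace the
# first ',', ';' or ':' after the first '@' on each line with ':::'.
normalise_delims() {
    sed -E 's/(@[^,;:]*)[,;:]/\1:::/'
}

# Made-up sample data; the last line has a quoted local-part containing ';'.
normalise_delims <<'EOF'
user1@example.com:hash:with:colons
user2@example.com;hash,with,commas
user3@example.com,hash;with;semicolons
"odd;name"@example.com:hash
EOF
```

Only the separator directly after the domain changes; delimiter characters inside the hash, or before the @, are left alone. In the original loop, the same expression with -i and "$file" would simply take the place of the existing sed line.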
Answered By - Socowi Answer Checked By - Timothy Miller (WPSolving Admin)