Friday, January 28, 2022

[SOLVED] Bash scripting: detect delimiter after email address, and do branching

January 28, 2022 awk, bash, regex, scripting, sed

Issue

I have a set of txt files that I need to parse and insert to MySQL. I'm trying to write a bash script to sanitise the data, before attempting SQL INSERTs.

Files are often in format:

[email protected]:hashed_password

In some cases, as in the above example, the delimiter is a colon ':'

Yay, nice and easy...

However some files use a comma ',' or a semicolon ';' as the delimiter instead.

Hmm not so easy.

As an example of a more varied file:

[email protected];hashed_password
[email protected],hashed_password
[email protected],hashed_password

-- note the varied use of delimiters.

I am attempting to write a bash script that checks each line in the txt files. For each line, the script should detect which delimiter is in use, and adjust the line accordingly.

I decided to use three colons ':::' as the field delimiter for the SQL import, because many hashed passwords themselves contain one of the original delimiter characters.

Here's my original bash script:

#!/bin/bash

DIR="/file/path/location"
for file in "$DIR"/*.txt
do
    echo "Processing file $file"
    sed -i 's/:/:::/' "$file"
    echo "Importing to SQL db"
    mysql -uroot -p'password' <<EOF
USE collection1;
LOAD DATA LOCAL INFILE '$file' INTO TABLE test_table CHARACTER SET utf8mb4 FIELDS TERMINATED BY ':::' LINES TERMINATED BY '\n';
EOF
    echo "Import complete, archiving file..."
    sudo mv "$file" processed/
done

This first attempt worked where files had a colon delimiter, but of course failed to work with files that used a semicolon or comma.

I guess I need some pattern matching? Well, OK then let's get stuck in.

I experimented with some pattern matching as I'm new to it. I ended up with this line:

grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' test_data.txt

Yay, this produced a nice long list of email addresses in my terminal:

[email protected]
[email protected]
[email protected]
etc

OK, but I need to detect the delimiter, which follows these email addresses. Well I made a simple addition:

grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b.{0,1}' test_data.txt

Another win, this time I get a list of email addresses with the delimiter at the end of each line:

[email protected]:
[email protected];
[email protected]:
etc

But, how can I make a decision/branch, based on what the delimiter is? Am I headed in the right direction with this approach?

What I'd like to do is something like:

if [ "$delimiter" = ":" ]; then
    sed -i 's/:/:::/' "$file"
elif [ "$delimiter" = ";" ]; then
    sed -i 's/;/:::/' "$file"
elif [ "$delimiter" = "," ]; then
    sed -i 's/,/:::/' "$file"
fi

I hope this makes sense! Any help much appreciated.

Solution

Why bother determining the delimiter? Just accept any of them.
Here we replace the first , or ; or : that follows the @ on each line with your custom delimiter :::

sed -i -E 's/(@[^,;:]*)[,;:]/\1:::/' "$file"
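
To check what the substitution does before editing files in place, the same expression can be run on sample input through a pipe. This is only a minimal sketch: the function name normalize_delims, the example.com addresses and the hash strings are made up for illustration and are not from the question.

```shell
#!/bin/bash

# Hypothetical helper: applies the same substitution as above to stdin
# instead of editing a file in place (no -i).
normalize_delims() {
    sed -E 's/(@[^,;:]*)[,;:]/\1:::/'
}

# Made-up sample lines, one per delimiter; the first "hash" contains a colon.
printf '%s\n' \
    'alice@example.com:salt:hash' \
    'bob@example.com;hash2' \
    'carol@example.com,hash3' | normalize_delims
```

This prints alice@example.com:::salt:hash, bob@example.com:::hash2 and carol@example.com:::hash3, one per line. Only the first delimiter after the @ is rewritten, so the colon inside the first hash is left alone.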

This is safe even though the local part (before the @) can contain , or ; or : when quoted. The domain part (after the @) can contain neither , nor ; nor :. Since we only look at that part, there are no problems.
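
If the delimiter itself is still needed, for example to log which one a line used, it can be captured per line with bash's =~ operator and branched on with case. The following is a rough sketch rather than part of the answer above; detect_delim and the sample lines are hypothetical, and it assumes a bash version that supports [[ =~ ]] and BASH_REMATCH.

```shell
#!/bin/bash

# Hypothetical helper: prints the first ',', ';' or ':' that follows the
# '@' on the given line, or nothing if there is no such character.
detect_delim() {
    local re='@[^,;:]*([,;:])'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

# Branch on the detected delimiter for a few made-up lines.
while IFS= read -r line; do
    case "$(detect_delim "$line")" in
        ':') echo "colon:     $line" ;;
        ';') echo "semicolon: $line" ;;
        ',') echo "comma:     $line" ;;
        *)   echo "none:      $line" ;;
    esac
done <<'EOF'
alice@example.com:salt:hash
bob@example.com;hash2
carol@example.com,hash3
EOF
```

For the import itself the single sed substitution is simpler, since all three branches would end up doing the same replacement.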

Answered By - Socowi

Answer Checked By - Timothy Miller (WPSolving Admin)

This answer was collected from Stack Overflow and tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
