Issue
I am writing a bash script that matches emails using a regex, but I only want to match emails with a single top-level domain, not emails with several.
For example, these emails should match:
[email protected]
[email protected]
[email protected]
But this email should NOT match because it has two top-level domains (.co.fr):
[email protected]
I tried the following:
grep -E -o '[A-Za-z0-9.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}(?!\.[A-Za-z])' log.txt > mails.txt
But the (?!\.[A-Za-z]) part is not working in bash. My understanding is that it rejects the match if it finds a second domain after the first one.
It works fine when I try it in online tools: https://regex101.com/r/H4ftC3/1
I also tried using $ at the end: [A-Za-z0-9.]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}$
but this one doesn't match anything.
How can I match only single top level domains?
Thanks
Solution
tl;dr
grep -P -i '[\w+.-]+@[\w-]+\.[a-z]{2,}$' file.txt
-P - use Perl-compatible regular expressions (needed for \w inside a bracket expression)
-i - ignore case (matches @xyz.com or @xyz.COM)
For input file: file.txt
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Resulting:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
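As a minimal sketch, the same filter can be tried on made-up data. The addresses and the function name single_tld are hypothetical, the pattern is written with the hyphen last in each bracket expression and without + in the domain class, and GNU grep with -P support is assumed.

```shell
#!/usr/bin/env bash
# single_tld is a hypothetical helper: it reads lines on stdin and keeps
# those that end in user@label.tld, i.e. exactly one dot after the @.
single_tld() {
  grep -P -i '[\w+.-]+@[\w-]+\.[a-z]{2,}$'
}

# Made-up sample addresses (not the original input file).
printf '%s\n' \
  'alice@example.com' \
  'bob.smith@example.org' \
  'carol+news@my-host.io' \
  'dave@example.co.fr' \
  'erin@mail.example.com' | single_tld
# prints:
# alice@example.com
# bob.smith@example.org
# carol+news@my-host.io
```

The last two sample lines are dropped because the domain class [\w-] contains no dot, so only one dot may follow the @ before the end of the line.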
No fancy characters, please.
In order to answer your question it's important to make some assumptions.
- E-mail regexes are tricky, and you have already read this answer on Stack Overflow (1), as well as this article on Wikipedia (2).
- The local part (a.k.a. user name) of your e-mails contains only the following characters: letters A-Za-z, digits 0 to 9, the special characters +-_ (a very reduced subset of the allowed set), and the dot . in the middle.
- No fancy utf-8 or utf-16 characters. Not even Latin ones (e.g. ç, ñ).
This assumption covers 99.73% of all e-mail addresses known so far.
Allowed chars
username_allowed_chars = [A-Za-z0-9_+.-]
The hyphen goes last so that it is taken literally instead of forming a range.
In fact, I assume you're using GNU grep, so you may use grep -P (Perl-style regex) and the class \w, which is equivalent to [A-Za-z0-9_], hence:
username_allowed_chars = [\w+.-]
As for the domain part, remove + and the dot ., hence:
domain_allowed_chars = [\w-]
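Where the hyphen sits inside the brackets matters. Written as +-. it denotes the range from + to ., which in ASCII also contains the comma. The sketch below illustrates the difference. The helper name matches is hypothetical and GNU grep with -P is assumed.

```shell
#!/usr/bin/env bash
# matches PATTERN STRING: prints "yes" if STRING contains a match, else "no".
# (Hypothetical helper, only for this illustration.)
matches() {
  if printf '%s\n' "$2" | grep -q -P -- "$1"; then
    echo yes
  else
    echo no
  fi
}

matches '[+-.]' 'a,b'   # yes: '+-.' is the range + , - . so the comma matches
matches '[+.-]' 'a,b'   # no:  hyphen last is literal, the comma is not in the set
matches '[+.-]' 'a-b'   # yes: the literal hyphen still matches
```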
Finally, we use + for one or more repetitions of the allowed chars:
grep -P -i '[\w+.-]+@[\w-]+\.[a-z]{2,}$' file.txt
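I'll break this regex into parts. First, the character class \w, which is used extensively.
- \w - Translates to [A-Za-z0-9_], the word identifier, a.k.a. the allowed chars for variable names in programming parlance. In practice it disallows punctuation and other unusual characters in the e-mail user name;
- \. - literal dot .;
- [\w+.-]+ - One or more word characters, +, dot or hyphen (the hyphen is placed last in the brackets so it is taken literally), which allows the period in user names, e.g. [email protected];
- @ - literal @ to separate the username from the domain name;
- [\w-]+ - One or more word characters or hyphens for the domain name. No dot is allowed here, which is what rejects a second domain level such as .co.fr;
- [a-z]{2,}$ - No fewer than two letters (case-insensitive because of -i) up to the end of the string (marked by $).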
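The $ anchor only helps when each address ends its line. If the addresses are embedded in longer log lines, as the grep -o on log.txt in the question suggests, a negative lookahead can take the place of $. The following is a sketch under that assumption and not part of the original answer. The function name extract_single_tld and the log lines are hypothetical, and GNU grep with -P is again assumed.

```shell
#!/usr/bin/env bash
# extract_single_tld is a hypothetical helper: it prints (-o) every address
# on stdin that has exactly one dot after the @.
# (?!\.?[\w-]) rejects a match that is followed by a word character or a
# hyphen, optionally preceded by a dot. This blocks ".fr" after ".co" and
# also stops the engine from backtracking to a shorter TLD ("co" out of
# "com.br"), while still tolerating a sentence-ending period.
extract_single_tld() {
  grep -o -P -i '[\w+.-]+@[\w-]+\.[a-z]{2,}(?!\.?[\w-])'
}

# Made-up log lines for illustration.
printf '%s\n' \
  'login ok for alice@example.com from 10.0.0.1' \
  'bounce: dave@example.co.fr rejected' \
  'contact bob@example.org.' | extract_single_tld
# prints:
# alice@example.com
# bob@example.org
```

Redirecting the output, as in the question (> mails.txt), then gives one address per line.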
References
(1) Stack Overflow
(2) Wikipedia
Answered By - Jayr Magave Answer Checked By - David Marino (WPSolving Volunteer)