Issue
I am trying to find repeated strings (not words) in a text.
x = 'This is a sample text and this is lowercase text that is repeated.'
In this example, the string ' text ' should not be returned, because only 6 characters match one another. The string 'his is ', however, is an expected return value.
I tried using range, Counter and regular expressions.
import re
from collections import Counter
duplist = list()
for i in range(1, 30):
    mylist = re.findall('.{1,'+str(i)+'}', x)
    duplist.append([k for k,v in Counter(mylist).items() if v>1])
Solution
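You can use a quantifier of {7,} to ensure that a match is more than 6 characters long, and use a positive lookahead pattern with a backreference to assert that the captured string is repeated later in the text. The re.S flag lets . match newlines as well, so the pattern also works on multi-line input: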
import re
x = 'This is a sample text and this is lowercase text that is repeated.'
print(re.findall(r'(.{7,})(?=.*\1)', x, re.S))
This outputs:
['his is ', 'e text ']
Demo: https://ideone.com/jZvQR5
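If the minimum length needs to vary, the same pattern can be wrapped in a small helper. This is only a sketch: the function name find_repeated and its min_len parameter are made up for illustration and are not part of the original answer. Note that re.findall returns non-overlapping matches, so a repeat that overlaps an earlier match will not be reported separately.

```python
import re

def find_repeated(text, min_len=7):
    """Return substrings of at least min_len characters that occur again later in text.

    Hypothetical helper: it only parameterizes the {7,} quantifier of the
    pattern shown above.
    """
    # (.{N,})   capture N or more characters (greedy, backtracks as needed)
    # (?=.*\1)  succeed only if the captured text appears again further on
    pattern = r'(.{%d,})(?=.*\1)' % min_len
    return re.findall(pattern, text, re.S)

x = 'This is a sample text and this is lowercase text that is repeated.'
print(find_repeated(x))             # ['his is ', 'e text ']
print(find_repeated(x, min_len=8))  # []
```

Raising min_len to 8 returns an empty list for this sample, since no substring of 8 or more characters occurs twice.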
Answered By - blhsing Answer Checked By - Clifford M. (WPSolving Volunteer)