Python Forum
find and group similar words with re? - Printable Version



find and group similar words with re? - cartonics - Oct-27-2023

I have two lists of names, for example:
list1 ="Augsburg II, Turkgucu Munchen,Bayern II, Burghausen, Memmingen, Wurzburger Kickers, Ansbach, Buchbach,Aschaffenburg, Schweinfurt ,Illertissen, Bamberg,Schalding, Bayreuth ,Aubstadt, Furth II ,Vilzing, Nurnberg II "

list2 ="Augsburg II,Turkgucu Munich,Bayern Munich II,Wacker Burghausen,Memmingen ,Kickers Würzburg,Aubstadt ,SpVgg Greuther Furth II,FV Illertissen, Eintracht Bamberg 2010,Schalding-Heining Passau, SpVgg Bayreuth,SpVgg Ansbach,TSV Buchbach,Viktoria Aschaffenburg , 1. FC Schweinfurt,Vilzing , 1. FC Norimberga II"

Is there some re command that outputs the names that are similar but not equal, like this?

list3 ="Turkgucu Munchen = Turkgucu Munich, Bayern II =Bayern Munich II, Wurzburger Kickers= Kickers Würzburg ... and so on "

I was looking at these functions:
re.search(pattern, string, flags=0)
re.search(pattern, sequence).group()



RE: find and group similar words with re? - Gribouillis - Oct-27-2023

The re module cannot do that. You could perhaps find specialized modules on PyPI that help, such as textdistance (untested).
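
As a rough illustration of approximate matching using only the standard library (this is difflib, not textdistance, and the sample names and the 0.6 cutoff are just assumptions for the demo), something like this picks the closest candidate for one name:

```python
import difflib

# A few names taken from the second list in the question
names = ["Turkgucu Munich", "Bayern Munich II", "Wacker Burghausen"]

# get_close_matches returns up to n candidates whose similarity ratio
# (0.0 to 1.0) is at least cutoff, best match first.
matches = difflib.get_close_matches("Turkgucu Munchen", names, n=1, cutoff=0.6)
print(matches)  # ['Turkgucu Munich']
```

The cutoff plays the same role as a similarity threshold in other fuzzy-matching libraries, so it usually needs tuning against real data.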


RE: find and group similar words with re? - snippsat - Oct-27-2023

A library similar to the one Gribouillis posted is TheFuzz (earlier called fuzzywuzzy).
A quick test:
from thefuzz import fuzz

list1 = ["Augsburg II", "Turkgucu Munchen", "Bayern II"]
list2 = ["Augburg II", "Turkgucu Munich", "Baye II"]
>>> fuzz.ratio(list1[0], list2[0])
95
>>> fuzz.ratio(list1[1], list2[1])
90
>>> fuzz.ratio(list1[2], list2[2])
88
You can then decide what ratio counts as similar enough. Let's say we choose 90:
from thefuzz import fuzz

list1 = ["Augsburg II", "Turkgucu Munchen", "Bayern II"]
list2 = ["Augburg II", "Turkgucu Munich", "Baye II"]


list3 = []
for l1, l2 in zip(list1, list2):
    if fuzz.ratio(l1, l2) >= 90:
        #print(f'{l1} = {l2}')
        list3.append(f'{l1} = {l2}')

print(list3)
Output:
['Augsburg II = Augburg II', 'Turkgucu Munchen = Turkgucu Munich']



RE: find and group similar words with re? - cartonics - Oct-27-2023

from thefuzz import fuzz
 
list1 = ["Augsburg II","Turkgucu Munchen","Bayern II","Burghausen","Memmingen","Wurzburger Kickers","Ansbach","Buchbach","Aschaffenburg","Schweinfurt","Illertissen","Bamberg","Schalding"," Bayreuth","Aubstadt","Furth II","Vilzing","Nurnberg II"]
list2 = ["Augsburg II","Turkgucu Munich","Bayern Munich II","Wacker Burghausen","Memmingen ","Kickers Würzburg","Aubstadt","SpVgg Greuther Furth II","FV Illertissen","Eintracht Bamberg 2010","Schalding-Heining Passau","SpVgg Bayreuth","SpVgg Ansbach","TSV Buchbach","Viktoria Aschaffenburg","1. FC Schweinfurt 05","Vilzing","1. FC Norimberga II"]
 
 
list3 = []
for l1, l2 in zip(list1, list2):
    if fuzz.ratio(l1, l2) >= 35:
        #print(f'{l1} = {l2}')
        list3.append(f'{l1} = {l2}')
 
print(list3)
Output:
'Ansbach = Aubstadt', 'Buchbach = SpVgg Greuther Furth II', 'Aschaffenburg = FV Illertissen', 'Schweinfurt = Eintracht Bamberg 2010', 'Illertissen = Schalding-Heining Passau', 'Bamberg = SpVgg Bayreuth', 'Schalding = SpVgg Ansbach', ' Bayreuth = TSV Buchbach', 'Aubstadt = Viktoria Aschaffenburg', 'Furth II = 1. FC Schweinfurt 05'
Can something be done to get a better output? With a high threshold (>= 75) I get no pairs of words, while if I decrease it to 35 the matching fails for these ones!


RE: find and group similar words with re? - deanhystad - Oct-27-2023

There are lots of matches with a score > 75. My guess is that you are not comparing list1 and list2: you have different lists, and they are not ordered like the lists in your example. You probably cannot use zip, which only pairs items at the same position, and instead need to compare every word in list1 to every word in list2. Like this:
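import random
from thefuzz import fuzz
# list1 and list2 are the lists from the previous post
random.shuffle(list2)  # simulate lists that are not in matching order
scores = [(fuzz.ratio(w1, w2), w1, w2) for w2 in list2 for w1 in list1]
print(*sorted(scores, reverse=True)[:30], sep="\n")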
Output:
(100, 'Vilzing', 'Vilzing')
(100, 'Augsburg II', 'Augsburg II')
(100, 'Aubstadt', 'Aubstadt')
(95, 'Memmingen', 'Memmingen ')
(90, 'Turkgucu Munchen', 'Turkgucu Munich')
(88, 'Illertissen', 'FV Illertissen')
(80, 'Buchbach', 'TSV Buchbach')
(78, ' Bayreuth', 'SpVgg Bayreuth')
(74, 'Burghausen', 'Wacker Burghausen')
(74, 'Aschaffenburg', 'Viktoria Aschaffenburg')
(72, 'Bayern II', 'Bayern Munich II')
(71, 'Schweinfurt', '1. FC Schweinfurt 05')
(70, 'Ansbach', 'SpVgg Ansbach')
(64, 'Nurnberg II', 'Augsburg II')
I saved the match score along with the words and sorted the list. Doing something like this will help you set your threshold value.

Matching words like this will never be perfect.
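
To get from the all-pairs scores to "name = name" strings like the ones asked for in the first post, one option is to keep only the best-scoring candidate for each name. Here is a minimal sketch of that idea. It uses difflib from the standard library instead of thefuzz, so the scores will not necessarily equal fuzz.ratio; the helper names score and best_matches, the short sample lists, and the threshold of 70 are all made up for the example.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Similarity from 0 to 100, ignoring case and surrounding whitespace."""
    a, b = a.strip().casefold(), b.strip().casefold()
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(names1, names2, threshold=70):
    """Pair each name in names1 with its highest-scoring name in names2."""
    pairs = []
    for n1 in names1:
        best = max(names2, key=lambda n2: score(n1, n2))
        s = score(n1, best)
        # Keep names that are similar but not identical (score below 100)
        if threshold <= s < 100:
            pairs.append(f"{n1.strip()} = {best.strip()}")
    return pairs


# Shortened sample data; list2 is deliberately in a different order
list1 = ["Augsburg II", "Turkgucu Munchen", "Illertissen", "Vilzing"]
list2 = ["Vilzing", "FV Illertissen", "Turkgucu Munich", "Augsburg II"]
print(best_matches(list1, list2))
# ['Turkgucu Munchen = Turkgucu Munich', 'Illertissen = FV Illertissen']
```

Note that this is greedy: two names in the first list can end up paired with the same name in the second list, and a poor threshold will still produce wrong pairs, so the result is worth checking by eye.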