[SOLVED] [regex] Why isn't possible substring ignored?
Python Forum (https://python-forum.io), General Coding Help, thread /thread-39750.html
[SOLVED] [regex] Why isn't possible substring ignored? - Winfried - Apr-08-2023

Hello,

I need to loop through a list of URLs to grab each page's title, which might contain a substring I want to ignore. For some reason, the substring isn't removed:

    import html
    import re
    import requests

    with open('list.txt') as f:
        for line in f:
            print(line.replace('\n', ''))
            n = requests.get(line)
            al = n.text
            # Doesn't remove possible ( - dummy)?
            d = re.search('<\W*title\W*(.*)( - dummy)?</title', al, re.IGNORECASE)
            title = html.unescape(d.group(1))
            print(title)

How is my regex wrong?

Thank you.

RE: [regex] Why isn't possible substring ignored? - Gribouillis - Apr-08-2023

What do you mean by "the substring isn't removed"? Can you give a concrete example of the data?

RE: [regex] Why isn't possible substring ignored? - Winfried - Apr-08-2023

Some titles look like this:

    <title>My title - dummy</title>

Others look like this:

    <title>My title</title>

If it's there, how can I get rid of the " - dummy" part? I expected this to work, but it's ignored:

    ( - dummy)?

RE: [regex] Why isn't possible substring ignored? - snippsat - Apr-08-2023

(Apr-08-2023, 01:43 PM)Winfried Wrote: If it's there, how can I get rid of the " - dummy" part?

Yes, if the format is the same in all titles.

    import re

    with open('url_lst.txt') as f:
        for line in f:
            d = re.search('<\W*title\W*(.*?)( - \w.*)?</title', line)
            title = d.group(1)
            print(title)

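A note on why the original pattern keeps the suffix: the greedy (.*) first grabs as much as it can, and because ( - dummy)? is optional, the engine is happy to let it match nothing as long as </title follows. The lazy (.*?) grows one character at a time instead, so the optional group gets a chance to match before the capture swallows it. The sketch below compares the two on the sample titles from this thread; the helper name first_group and the variable names are only for illustration.

```python
import re

# Sample titles taken from the thread.
titles = [
    "<title>My title - dummy</title>",
    "<title>My title</title>",
]

# Original pattern: greedy capture, optional suffix group.
greedy = re.compile(r"<\W*title\W*(.*)( - dummy)?</title", re.IGNORECASE)
# Same pattern with a lazy capture; no trailing ".*" in the suffix group.
lazy = re.compile(r"<\W*title\W*(.*?)( - dummy)?</title", re.IGNORECASE)


def first_group(pattern, text):
    """Return group(1) of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


greedy_results = [first_group(greedy, t) for t in titles]
lazy_results = [first_group(lazy, t) for t in titles]

print(greedy_results)  # ['My title - dummy', 'My title']
print(lazy_results)    # ['My title', 'My title']
```

On these two samples the lazy capture alone is enough. The "\w.*" in the suggested pattern is not required for " - dummy"; it appears to generalize the suffix to any " - something" text.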
RE: [regex] Why isn't possible substring ignored? - Winfried - Apr-08-2023

Thanks, it works, although I don't understand why I need to 1) make it ungreedy, since the part can only occur as the last token, and 2) add a trailing ".*" for it to work, since it's a single word and nothing can possibly follow.

I'll investigate further.

--
Edit: This works. Maybe it's not a plain space that separates "dummy" from the rest.

    d = re.search('<title>(.+) (- dummy)?</title', al, re.IGNORECASE)
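On the pattern from the edit: it seems to work on titles that carry the suffix because the literal space after (.+) is mandatory, which forces the greedy capture to back off to a space. The same mandatory space means a title without the suffix may not match at all, so re.search returns None and d.group(1) would raise AttributeError. The sketch below checks both sample titles, then shows one possible alternative: capture the whole title and strip the suffix afterwards. The names TITLE_RE, SUFFIX_RE and clean_title are hypothetical, and using \s for the separators is an assumption that would also cover a non-breaking space, in case the separator really is not a plain space.

```python
import re

with_suffix = "<title>My title - dummy</title>"
without_suffix = "<title>My title</title>"

# Pattern from the edit above (raw string added, otherwise unchanged).
edited = re.compile(r"<title>(.+) (- dummy)?</title", re.IGNORECASE)

m1 = edited.search(with_suffix)
m2 = edited.search(without_suffix)
print(m1.group(1))  # My title
print(m2)           # None: the mandatory space is never followed by </title

# Alternative sketch (hypothetical names): grab the title, then strip the suffix.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title", re.IGNORECASE | re.DOTALL)
# \s matches any Unicode whitespace in str patterns, including U+00A0.
SUFFIX_RE = re.compile(r"\s+-\s+dummy\s*$", re.IGNORECASE)


def clean_title(page_text):
    """Return the page title without a trailing ' - dummy', or None."""
    m = TITLE_RE.search(page_text)
    if m is None:
        return None
    return SUFFIX_RE.sub("", m.group(1).strip())


print(clean_title(with_suffix))     # My title
print(clean_title(without_suffix))  # My title
```

Stripping the suffix in a second step keeps the title pattern simple and avoids relying on greedy versus lazy behaviour. For anything beyond simple pages, an HTML parser is usually a safer way to read the title than a regex.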