Python Forum
[SOLVED] [regex] Why isn't possible substring ignored?


[SOLVED] [regex] Why isn't possible substring ignored? - Winfried - Apr-08-2023

Hello,

I need to loop through a list of URLs to grab each page's title, which might contain a substring I want to ignore.

For some reason, the substring isn't removed:

import html
import re
import requests

with open('list.txt') as f:
	for line in f:
		url = line.strip()  # drop the trailing newline before requesting
		print(url)
		n = requests.get(url)
		al = n.text
		#Doesn't remove possible ( - dummy)?
		d = re.search(r'<\W*title\W*(.*)( - dummy)?</title', al, re.IGNORECASE)
		title = html.unescape(d.group(1))
		print(title)
How is my regex wrong?

Thank you.


RE: [regex] Why isn't possible substring ignored? - Gribouillis - Apr-08-2023

What do you mean by "the substring isn't removed"? Can you give a concrete example of the data?


RE: [regex] Why isn't possible substring ignored? - Winfried - Apr-08-2023

Some titles look like this:
<title>My title - dummy</title>

Others look like this:
<title>My title</title>

If it's there, how can I get rid of the " - dummy" part?

I expected this to work, but it's ignored: ( - dummy)?


RE: [regex] Why isn't possible substring ignored? - snippsat - Apr-08-2023

(Apr-08-2023, 01:43 PM)Winfried Wrote: If it's there, how can I get rid of the " - dummy" part?
Yes, if the format is the same in all titles.
Contents of url_lst.txt:
<title>My title - dummy</title>
<title>Site about cars - car 99</title>
<title>Numbers - 12345 678</title>
import re

with open('url_lst.txt') as f:
    for line in f:
        # (.*?) is lazy, so the optional " - ..." suffix gets a chance to match
        d = re.search(r'<\W*title\W*(.*?)( - \w.*)?</title', line)
        title = d.group(1)
        print(title)
Output:
My title
Site about cars
Numbers
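
To see why the lazy quantifier matters, here is a minimal sketch comparing the greedy pattern from the first post with a lazy one. The helper name extract_title and the sample strings are made up for illustration, and the patterns are simplified to a literal <title> tag.

```python
import re

# Greedy: (.*) first consumes as much as possible, so the optional group
# is left with nothing to match and the suffix stays inside group 1.
GREEDY = re.compile(r'<title>(.*)( - dummy)?</title', re.IGNORECASE)

# Lazy: (.*?) grows one character at a time, so the optional group gets
# a chance to match " - dummy" before group 1 swallows it.
LAZY = re.compile(r'<title>(.*?)( - dummy)?</title', re.IGNORECASE)


def extract_title(pattern, html_text):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    m = pattern.search(html_text)
    return m.group(1) if m else None


samples = ['<title>My title - dummy</title>', '<title>My title</title>']
greedy_results = [extract_title(GREEDY, s) for s in samples]
lazy_results = [extract_title(LAZY, s) for s in samples]
print(greedy_results)  # ['My title - dummy', 'My title']
print(lazy_results)    # ['My title', 'My title']
```

An optional group such as ( - dummy)? is allowed to match nothing, so the regex engine is satisfied as soon as the greedy group has taken the whole title. Position in the string does not change that.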



RE: [regex] Why isn't possible substring ignored? - Winfried - Apr-08-2023

Thanks, it works, although I don't understand why I need to 1) make it non-greedy, since the part can only occur as the last token, and 2) add a trailing ".*", since it's a single word and nothing can follow it.

I'll investigate further.

--
Edit: This works. Maybe it's not a plain space that separates "dummy" from the rest.

d = re.search('<title>(.+) (- dummy)?</title', al, re.IGNORECASE)
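
A caution on the pattern in the edit, as a minimal sketch with made-up sample strings: because the space now sits outside the optional group, it is mandatory. That forces the greedy group to back off far enough for "- dummy" to match, which is probably why it appears to work, but a title without the suffix then fails to match and re.search returns None. Keeping the whole " - dummy" suffix optional and making the title group lazy seems to handle both cases. The names EDIT_PATTERN and LAZY_PATTERN are illustrative only.

```python
import re

# Pattern from the edit: the space before "(- dummy)?" is mandatory.
EDIT_PATTERN = re.compile(r'<title>(.+) (- dummy)?</title', re.IGNORECASE)

# Alternative: lazy title group, whole suffix optional (non-capturing).
LAZY_PATTERN = re.compile(r'<title>(.+?)(?: - dummy)?</title', re.IGNORECASE)

with_suffix = '<title>My title - dummy</title>'
without_suffix = '<title>My title</title>'

edit_with = EDIT_PATTERN.search(with_suffix)
edit_without = EDIT_PATTERN.search(without_suffix)
print(edit_with.group(1))  # My title
print(edit_without)        # None

lazy_titles = [LAZY_PATTERN.search(s).group(1)
               for s in (with_suffix, without_suffix)]
print(lazy_titles)  # ['My title', 'My title']
```

Since a failed search returns None, checking the result before calling .group(1) avoids an AttributeError on pages whose title does not fit the pattern.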