[SOLVED] [regex] Why isn't possible substring ignored?

Winfried · (This post was last modified: Apr-08-2023, 06:12 PM by Winfried.)

Hello,

I need to loop through a list of URLs to grab each page's title, which might contain a substring I want to ignore.

For some reason, the substring isn't removed:

import html
import re

import requests

with open('list.txt') as f:
	for line in f:
		url = line.strip()  # drop the trailing newline before requesting the page
		print(url)
		n = requests.get(url)
		al = n.text
		#Doesn't remove possible ( - dummy)?
		d = re.search(r'<\W*title\W*(.*)( - dummy)?</title', al, re.IGNORECASE)
		title = html.unescape(d.group(1))
		print(title)

How is my regex wrong?

Thank you.

**Gribouillis** · (This post was last modified: Apr-08-2023, 01:36 PM by Gribouillis.)

What do you mean by "the substring isn't removed"? Can you give a concrete example of the data?

Winfried · Apr-08-2023, 01:43 PM

Some titles look like this:
<title>My title - dummy</title>

Others look like this:
<title>My title</title>

If it's there, how can I get rid of the " - dummy" part?

I expected this to work, but it's ignored: ( - dummy)?

***snippsat*** · Apr-08-2023, 04:58 PM

(Apr-08-2023, 01:43 PM)Winfried Wrote: If it's there, how can I get rid of the " - dummy" part?

Yes, if the format is the same in all titles.

Output:
<title>My title - dummy</title>
<title>Site about cars - car 99</title>
<title>Numbers - 12345 678</title>

import re

with open('url_lst.txt') as f:
    for line in f:
        d = re.search(r'<\W*title\W*(.*?)( - \w.*)?</title', line)
        title = d.group(1)
        print(title)

Output:
My title
Site about cars
Numbers
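Note that the pattern above cuts the title at the first " - ", whatever follows it. If only the literal " - dummy" suffix should go, a narrower variant could look like the sketch below. The helper name `extract_title`, its `suffix` parameter, and the `<title[^>]*>` opening are assumptions made for illustration, not something taken from the thread.

```python
import html
import re


def extract_title(page, suffix=" - dummy"):
    """Return the page title without a trailing `suffix`, or None if no <title> is found.

    Hypothetical helper: the name and the default suffix are illustrative only.
    """
    # (.*?) is non-greedy, so the optional suffix group gets a chance to match
    # before the closing tag; re.escape keeps the suffix literal.
    pattern = r"<title[^>]*>\s*(.*?)(?:" + re.escape(suffix) + r")?\s*</title"
    m = re.search(pattern, page, re.IGNORECASE | re.DOTALL)
    if m is None:
        return None
    return html.unescape(m.group(1))


print(extract_title("<title>My title - dummy</title>"))
print(extract_title("<title>Site about cars - car 99</title>"))
print(extract_title("<p>no title here</p>"))
```

With this variant, "Site about cars - car 99" is kept whole because its ending is not the literal suffix, and a page without a title tag gives `None` instead of raising on `d.group(1)`.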

Winfried · (This post was last modified: Apr-08-2023, 06:36 PM by Winfried.)

Thanks, it works, although I don't understand why I need to 1) make it ungreedy since the part can only occur as the last token, and 2) add a trailing ".*" for it to work since it's a single word and nothing can possibly follow.

I'll investigate further.
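On the first point, a small sketch may help. A greedy `(.*)` takes as much text as it can while still letting the rest of the pattern match, and since `( - dummy)?` is optional, the rest can match with that group empty. The suffix therefore ends up inside group 1. Making the group non-greedy reverses the priority. On the second point, with a literal ` - dummy` no trailing `.*` seems to be needed; it is only required in the ` - \w.*` form because `\w` matches a single character. The sample strings come from this thread; the names `greedy`, `lazy` and `first_group` are made up for the illustration.

```python
import re

# Sample titles taken from the examples earlier in this thread.
samples = [
    "<title>My title - dummy</title>",
    "<title>My title</title>",
]

# Same pattern twice; the only difference is (.*) versus (.*?).
greedy = re.compile(r"<title>(.*)( - dummy)?</title", re.IGNORECASE)
lazy = re.compile(r"<title>(.*?)( - dummy)?</title", re.IGNORECASE)


def first_group(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


for s in samples:
    print("greedy:", repr(first_group(greedy, s)), "| lazy:", repr(first_group(lazy, s)))
```

For the first sample the greedy pattern returns 'My title - dummy' and the non-greedy one returns 'My title'; for the second sample both return 'My title'.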

--
Edit: This works. Maybe it's not a plain space that separates "dummy" from the rest.

d = re.search('<title>(.+) (- dummy)?</title', al, re.IGNORECASE)
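For the sample titles in this thread, the edited pattern appears to work for a different reason than an unusual space character: the mandatory space after `(.+)` forces the greedy group to back off until the optional group can match. The quick check below, using only the two sample titles posted earlier, also suggests a side effect: a title without the suffix no longer matches at all, so `d.group(1)` would raise an `AttributeError` on such pages.

```python
import re

# The pattern from the edit above, compiled once for both checks.
edited = re.compile(r"<title>(.+) (- dummy)?</title", re.IGNORECASE)

with_suffix = edited.search("<title>My title - dummy</title>")
without_suffix = edited.search("<title>My title</title>")

print(with_suffix.group(1))  # My title
print(without_suffix)        # None: no space directly before </title>
```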
