Posts: 141
Threads: 76
Joined: Jul 2017
Hi,
I have some HTML pages and I want to find particular text plus the text that follows it. I am using a regex but I am getting lots of empty lists.
These are the links:
https://www.99acres.com/mailers/mmm_html...7-558.html https://www.99acres.com/mailers/mmm_html...-2016.html https://www.99acres.com/mailers/mmm_html...7-553.html
The text I am searching for is "Area Range:" and its following text, "Possession:" and its following text (for example "Possession: 2019"), and "Price:" and its following text.
Below is my code:
import requests
from bs4 import BeautifulSoup
import csv
import re

file = {}
final_data = []

def readfile(alldata, filename):
    # write the collected rows out to a CSV file
    with open("./" + filename, "w", newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        for row in alldata:
            writer.writerow(row)

def parsedata(url, values):
    r = requests.get(url, params=values)
    return r.text

def getresults():
    global final_data, file
    with open("Mailers.csv", "r") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            ids = row[0]
            link = row[1]
            html = parsedata(link, {})
            soup = BeautifulSoup(html, "html.parser")
            titles = soup.title.text
            td = soup.find_all("td")
            for i in td:
                sublist = []
                data = i.text
                pattern = r'(Possession:)(.)(.+)'
                x1 = re.findall(pattern, data)
                sublist.append(x1)
                sublist.append(link)
                final_data.append(sublist)
    print(final_data)
    return final_data

def main():
    getresults()
    readfile(final_data, "Data.csv")

main()
Posts: 3,458
Threads: 101
Joined: Sep 2016
Not all of those pages have the word "Possession" in them, and the pages that do have it don't have it in every cell. Since you don't check whether there were any matches, your list gets an empty entry for every td that doesn't match.
Posts: 141
Threads: 76
Joined: Jul 2017
How can I check whether the word is present, and how can I remove those empty lists?
Posts: 3,458
Threads: 101
Joined: Sep 2016
Just check if there's anything there, and if there isn't, don't add it to your list:

>>> import re
>>> data = ['<td width="40"></td>', '<td height="50"></td>', '<td width="40"></td>']
>>> # random sample data from the first link
...
>>> pattern = r'(Possession:)(.)(.+)'
>>> for cell in data:
...     x1 = re.findall(pattern, cell)
...     print(x1)
...
[]
[]
[]
>>> for cell in data:
...     x1 = re.findall(pattern, cell)
...     if x1:
...         print(x1)
...
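Applied to the loop from your script, the same guard keeps the empty entries out of `final_data`. A minimal sketch, using hard-coded cell texts in place of the `.text` values your BeautifulSoup loop extracts (the `cells` list and the `link` URL here are made-up stand-ins):

```python
import re

pattern = r'(Possession:)(.)(.+)'
link = "https://example.com/mailer.html"  # placeholder for the real page URL

# stand-ins for td.text values; most cells have no "Possession:" text
cells = ["", "Possession: September 2019", "Price: 50 Lac"]

final_data = []
for text in cells:
    x1 = re.findall(pattern, text)
    if x1:  # skip cells with no match instead of appending an empty list
        final_data.append([x1, link])

print(final_data)
```

With the `if x1:` check in place, only the one matching cell ends up in `final_data`.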
Posts: 141
Threads: 76
Joined: Jul 2017
I found this useful, but it is repeating the same match multiple times. How can I solve that too?
Posts: 3,458
Threads: 101
Joined: Sep 2016
I don't know what you mean. Can you share what some of the output is now?
Posts: 141
Threads: 76
Joined: Jul 2017
Here is the output:
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'New Launch</td>')]
Posts: 3,458
Threads: 101
Joined: Sep 2016
If the order doesn't matter, you could use a set instead of a list, so duplicates will just be ignored.
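A minimal sketch of that idea. The `cells` list below just imitates the repeated matches in your output, and I've narrowed the pattern to a single capture group so the set holds plain strings; your original three-group tuples are hashable too, so they would work the same way:

```python
import re

pattern = r'Possession:\s*(.+)'

# stand-ins for the repeated td texts that produced duplicate matches
cells = [
    "Possession: September 2019",
    "Possession: September 2019",
    "Possession: December 2017",
]

seen = set()
for text in cells:
    for match in re.findall(pattern, text):
        seen.add(match)  # adding a duplicate to a set is a no-op

print(seen)
```

Each distinct possession date appears in `seen` once, no matter how many cells repeated it.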