How to find particular text from td tag using bs4
Python Forum thread: https://python-forum.io/thread-13030.html
How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

Hi, I have some HTML links and I want to find some particular text and the text that follows it. I am using regex but receiving lots of empty lists.

These are the links:
https://www.99acres.com/mailers/mmm_html/eden-park-14mar2017-558.html
https://www.99acres.com/mailers/mmm_html/ats-golf-meadows-13april-2016.html
https://www.99acres.com/mailers/mmm_html/spaze-privy-the-address-10mar2017-553.html

The text I am looking for:
- Area Range: and the text after it
- Possession: and the text after it (for example, "Possession: 2019")
- Price: and the text after it

Below is my code:

    import requests
    from bs4 import BeautifulSoup
    import csv
    import json
    import itertools
    import re

    file = {}
    final_data = []
    final = []
    textdata = []

    def readfile(alldata, filename):
        with open("./"+filename, "w") as csvfile:
            csvfile = csv.writer(csvfile, delimiter=",")
            for i in range(0, len(alldata)):
                csvfile.writerow(alldata[i])

    def parsedata(url, values):
        r = requests.get(url, values)
        data = r.text
        return data

    def getresults():
        global final_data, file
        with open("Mailers.csv", "r") as f:
            reader = csv.reader(f)
            next(reader)
            for row in reader:
                ids = row[0]
                link = row[1]
                html = parsedata(link, {})
                soup = BeautifulSoup(html, "html.parser")
                titles = soup.title.text
                td = soup.find_all("td")
                for i in td:
                    sublist = []
                    data = i.text
                    pattern = r'(Possession:)(.)(.+)'
                    x1 = re.findall(pattern, data)
                    sublist.append(x1)
                    sublist.append(link)
                    final_data.append(sublist)
        print(final_data)
        return final_data

    def main():
        getresults()
        readfile(final_data, "Data.csv")

    main()

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

Not all of those pages have the word "Possession" in them, and the pages that do have it don't have it in every cell. Since you don't check whether there were any matches, your list gets an empty entry for every td that has no match.
RE: How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

How can I check whether the word is present? And how can I remove those empty lists?

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

Just check if there's anything there, and if there isn't, don't add it to your list:

    >>> import re
    >>> data = ['<td width="40"></td>', '<td height="50"></td>', '<td width="40"></td>']
    >>> # random sample data from the first link
    ...
    >>> pattern = r'(Possession:)(.)(.+)'
    >>> for cell in data:
    ...     x1 = re.findall(pattern, cell)
    ...     print(x1)
    ...
    []
    []
    []
    >>> for cell in data:
    ...     x1 = re.findall(pattern, cell)
    ...     if x1:
    ...         print(x1)
    ...

RE: How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

I found this useful, but the results are repeated multiple times. How can I solve that too?

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

I don't know what you mean. Can you share what some of the output is now?

RE: How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

output....
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'September 2019</td>')]
    [('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
    [('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
    [('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
    [('Possession:', ' ', 'New Launch</td>')]

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

If the order doesn't matter, you could use a set instead of a list, so duplicates will just be ignored.
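
The set suggestion could be sketched roughly as follows. This is only an illustration, not code from the thread: the `cells` list is made-up sample data shortened from the output above, and `unique_matches` is a hypothetical helper name. One detail worth noting is that `re.findall` returns a list, which cannot be put in a set directly, so the sketch stores the individual match tuples instead. It also keeps a list alongside the set so the first-seen order survives, in case the order does matter.

```python
import re

# Hypothetical sample cells, shortened from the output posted above.
cells = [
    '<td>Possession: September 2019</td>',
    '<td>Possession: September 2019</td>',
    '<td width="40"></td>',
    '<td>Possession: New Launch</td>',
]

# Same pattern as used in the thread.
pattern = r'(Possession:)(.)(.+)'


def unique_matches(cells, pattern):
    """Collect regex matches from each cell, skipping empties and duplicates."""
    seen = set()      # match tuples are hashable, so a set can track them
    ordered = []      # keeps matches in the order they were first seen
    for cell in cells:
        # findall returns [] for cells without a match, so nothing is added
        for match in re.findall(pattern, cell):
            if match not in seen:
                seen.add(match)
                ordered.append(match)
    return ordered


results = unique_matches(cells, pattern)
for match in results:
    print(match)
```

If the order really is irrelevant, the `ordered` list can be dropped and the set returned on its own.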