How to find particular text from td tag using bs4

How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

Hi,

I have some HTML links, and I want to find some particular text along with the text that follows it. I am using regex but am getting lots of empty lists.

These are the links:

https://www.99acres.com/mailers/mmm_html/eden-park-14mar2017-558.html
https://www.99acres.com/mailers/mmm_html/ats-golf-meadows-13april-2016.html
https://www.99acres.com/mailers/mmm_html/spaze-privy-the-address-10mar2017-553.html

The text I am looking for, together with whatever follows each label:

Area Range: and the text after it
Possession: and the text after it, for example "Possession: 2019"
Price: and the text after it

Below is my code:

import requests
from bs4 import BeautifulSoup
import csv
import json
import itertools
import re
file = {}
final_data = []
final = []
textdata = []
def readfile(alldata, filename):
    with open("./"+filename, "w") as csvfile:
        csvfile = csv.writer(csvfile, delimiter=",")
        for i in range(0, len(alldata)):
            csvfile.writerow(alldata[i])
def parsedata(url, values):
    r = requests.get(url, values)
    data = r.text
    return data

def getresults():
    global final_data, file
    with open("Mailers.csv", "r") as f:
        reader = csv.reader(f)
        next(reader)
        for row in reader:
            ids = row[0]
            link = row[1]
            html = parsedata(link, {})
            soup = BeautifulSoup(html, "html.parser")
            titles = soup.title.text
            td = soup.find_all("td")
            for i in td:
                sublist = []
                data = i.text
                pattern = r'(Possession:)(.)(.+)'
                x1 = re.findall(pattern, data)
                sublist.append(x1)
                sublist.append(link)
                final_data.append(sublist)
    print(final_data)
    return final_data
def main():
    getresults()
    readfile(final_data, "Data.csv")
main()

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

Not all of those pages have the word "Possession" in them, and the pages that do have it don't have it in every cell. Since you don't check whether there were any matches, your list gets an empty entry for every td that has no match.

RE: How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

How can I check whether the word is present? And how can I remove those empty lists?

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

Just check if there's anything there, and if there isn't, don't add it to your list:

>>> import re
>>> data = ['<td width="40"></td>', '<td height="50"></td>', '<td width="40"></td>']
>>> # random sample data from the first link
...
>>> pattern = r'(Possession:)(.)(.+)'
>>> for cell in data:
...   x1 = re.findall(pattern, cell)
...   print(x1)
...
[]
[]
[]
>>> for cell in data:
...   x1 = re.findall(pattern, cell)
...   if x1:
...     print(x1)
...
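
The same check can be folded into a loop that collects results instead of printing them. This is only a rough sketch: the cell texts are made-up sample data (roughly what calling get_text() on each td might give you), collect_matches is a hypothetical helper name, and the combined pattern for all three labels is my assumption about what you want. It also assumes each label sits on its own line, because .+ grabs everything up to the end of the line.

```python
import re

# Hypothetical cell texts, e.g. what td.get_text() might return for each cell.
cells = [
    "",
    "Area Range: 1200 - 1800 sq. ft.",
    "",
    "Possession: September 2019",
    "Price: On Request",
]

# Group 1 is the label, group 2 is the text that follows it on the same line.
pattern = re.compile(r"(Area Range|Possession|Price):\s*(.+)")


def collect_matches(cell_texts):
    """Return (label, value) tuples, skipping cells that have no match."""
    found = []
    for text in cell_texts:
        matches = pattern.findall(text)
        if matches:  # an empty list is falsy, so nothing is added for it
            found.extend(matches)
    return found


print(collect_matches(cells))
```

Using extend rather than append keeps the result a flat list of tuples, which is easier to write out as CSV rows later.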

RE: How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

I found this useful, but the results are repeated multiple times. How can I solve that too?

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

I don't know what you mean. Can you share what some of the output is now?

RE: How to find particular text from td tag using bs4 - Prince_Bhatia - Sep-24-2018

Output:

[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'New Launch</td>')]

RE: How to find particular text from td tag using bs4 - nilamo - Sep-24-2018

If the order doesn't matter, you could use a set instead of a list, so duplicates will just be ignored.
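
A rough sketch of both options, using made-up rows shaped like the output above. One thing to watch: a set can only hold hashable items, so store the match tuples themselves (or tuples built from them) rather than lists. If the order does matter, dict.fromkeys keeps the first occurrence of each row in order on Python 3.7+. The repeats may come from nested tables, where an outer td contains the same text as the inner one, but that is a guess about those pages.

```python
# Hypothetical rows shaped like the output above: (label, separator, value).
rows = [
    ("Possession:", " ", "September 2019"),
    ("Possession:", " ", "September 2019"),
    ("Possession:", " ", "New Launch"),
    ("Possession:", " ", "September 2019"),
]

# A set drops duplicates but does not preserve the original order.
unique_unordered = set(rows)

# dict.fromkeys keeps the first occurrence of each row, in insertion order.
unique_ordered = list(dict.fromkeys(rows))

print(len(unique_unordered))
print(unique_ordered)
```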