Same Data Showing Several Times With Beautifulsoup Query - Python Forum (https://python-forum.io/thread-37342.html)
Same Data Showing Several Times With Beautifulsoup Query - eddywinch82 - May-29-2022

Hi there,

I have the following Python code:

import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import xlrd
import re

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content,'lxml')

BBMF_2022 = []

#BBMF_elem = soup3.find_all('a', string=re.compile(r'between|Flypast'))

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    print(li1)
    #print(li2)
    #BBMF_2022.append(li1)

#check if links are in dataframe
#df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
#df

The issue I have is that when I run the code, the data for the 15 entries from May 28th to May 29th is printed several times, and I am not sure why. Could someone suggest the reason, and tell me what I need to change in the code so that the data is printed only once?

I am trying to scrape data from a website, keeping only the entries that contain the word "between" or "Flypast".

When I use the following piece of code instead:

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    #print(li1)
    #print(li2)
    BBMF_2022.append(li1)

df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
df

the first entry, for the 28th May, appears in the DataFrame 15 times, instead of the 15 separate entries I mentioned before.

Any help would be much appreciated.

Best Regards

Eddie Winch

RE: Same Data Showing Several Times With Beautifulsoup Query - Larz60+ - May-29-2022

You are using a redirected URL; instead use: https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm
This code will get all the data and save it as a JSON file, without any filtering. You can add filters and any other data you need.

import requests
from bs4 import BeautifulSoup
import os
import json
import sys


class airshowdata:
    def __init__(self):
        self.airshow_details = {}
        self.cd = CreateDict()
        self.jsonfile = 'airshow.json'

    def get_links(self):
        url = 'https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm'
        res3 = requests.get(url)
        if res3.status_code == 200:
            soup3 = BeautifulSoup(res3.content, 'lxml')
        else:
            print(f"Cannot load page {url}")
            sys.exit(-1)

        # One node per link, keyed by the link text
        # (links that share the same text overwrite each other).
        links = soup3.find_all('a')
        for link in links:
            anode = self.cd.add_node(self.airshow_details, link.text.strip())
            self.cd.add_cell(anode, 'url', link.get('href'))

        with open(self.jsonfile, 'w') as fp:
            json.dump(self.airshow_details, fp)

        # following not needed and can be removed (displays dictionary contents)
        self.cd.display_dict(self.airshow_details)


class CreateDict:
    """
    CreateDict.py - Contains methods to simplify node and cell creation within
                    a dictionary

    Usage:

        new_dict(dictname) - Creates a new dictionary instance with the name
            contained in dictname

        add_node(parent, nodename) - Creates a new node (nested dictionary)
            named in nodename, in parent dictionary.

        add_cell(nodename, cellname, value) - Creates a leaf node within node
            named in nodename, with a cell name of cellname, and value of value.

        display_dict(dictname) - Recursively displays a nested dictionary.

    Requirements:

        Python standard library:
            os

    Author: Larz60+ -- May 2019.
    """
    def __init__(self):
        # Work from the script's own directory so airshow.json is written next to it
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

    def new_dict(self, dictname):
        setattr(self, dictname, {})

    def add_node(self, parent, nodename):
        node = parent[nodename] = {}
        return node

    def add_cell(self, nodename, cellname, value):
        cell = nodename[cellname] = value
        return cell

    def display_dict(self, dictname, level=0):
        indent = " " * (4 * level)
        for key, value in dictname.items():
            if isinstance(value, dict):
                print(f'\n{indent}{key}')
                # Recurse one level deeper without altering this level's counter
                self.display_dict(value, level + 1)
            else:
                print(f'{indent}{key}: {value}')


def main():
    airs = airshowdata()
    airs.get_links()


if __name__ == '__main__':
    main()

RE: Same Data Showing Several Times With Beautifulsoup Query - eddywinch82 - May-29-2022

Many thanks for that code, Larz60+. It's very much appreciated; thank you for taking the time to type it.

I chose the web.archive.org link because it holds the data as it was a week ago, on the 21st May; that data was removed from the live website the other day.

Does anyone have any idea how I can change my code to solve the issue I am having with it?

Any help would be very much appreciated.

Regards

Eddie Winch