Thread: Beautiful Soup (suddenly) doesn't get full webpage html (/thread-28253.html)
Python Forum (https://python-forum.io), Web Scraping & Web Development (https://python-forum.io/forum-13.html)

Beautiful Soup (suddenly) doesn't get full webpage html - j.crater - Jul-11-2020

Hello all,
a few months ago I dabbled in Beautiful Soup for the first time, so I still lack much understanding of the module and the entire subject.
The issue I'm having is this: back then, B.S. parsed the whole page HTML just fine. This time, when re-running the same code, I only get a partial HTML response, with most of the response being lines of JavaScript.
The code is:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.youtube.com/results?search_query=python"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # print(soup)

Any tip will be much appreciated.
Thanks and best regards,
JC

RE: Beautiful Soup (suddenly) doesn't get full webpage html - HarleyQuin - Jul-11-2020

Hey bro,
I would recommend using the code below. It returns the full response. It's hard for me to know what is missing compared to the original you ran months ago, though. The query below works nicely as far as I am aware.

    import requests
    from bs4 import BeautifulSoup

    url = requests.get("https://www.youtube.com/results?search_query=python").content
    soup = BeautifulSoup(url, 'lxml')  # You can use html.parser here alternatively - depends on what you want to achieve
    print(soup)

RE: Beautiful Soup (suddenly) doesn't get full webpage html - snippsat - Jul-11-2020

j.crater Wrote: most of the response being lines of JavaScript.

Look at Web-scraping part-2 under:

snippsat Wrote: JavaScript, why do I not get all content

So, to give a demo of using both BS and Selenium to parse:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup
    import time

    #--| Setup
    options = Options()
    #options.add_argument("--headless")
    #options.add_argument("--window-size=1980,1020")
    # Selenium 3 style; Selenium 4 takes the driver path via service=Service(...)
    browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
    #--| Parse or automation
    url = "https://www.youtube.com/results?search_query=python"
    browser.get(url)
    time.sleep(2)
    # Use Bs to Parse
    soup = BeautifulSoup(browser.page_source, 'lxml')
    first_title = soup.find('a', id="video-title")
    print(first_title.text.strip())
    print('-' * 50)
    # Use Selenium to parse
    # Selenium 3 style; in Selenium 4 use browser.find_elements(By.XPATH, ...)
    second_title_sel = browser.find_elements_by_xpath('//*[@id="video-title"]')
    print(second_title_sel[1].text)

YouTube also has an API, the YouTube Data API, that can be used from Python. For an example, see this post.

RE: Beautiful Soup (suddenly) doesn't get full webpage html - j.crater - Jul-11-2020

Thank you both for the answers.
@HarleyQuin The code I ran months ago was the same as I posted here, but the result was not the same. As stated, on my first attempt I got all the HTML contents, while this time I didn't. Also, replacing the parser with the lxml parser didn't make a difference. Do you have any idea, from experience, why there is such a difference?
@snippsat Your code returns all the HTML contents of the page, if I print the soup. Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML? However, to reiterate, my original run of the code returned the complete HTML contents. Could it be that the website's rendering just got slower for some reason since my last attempt at parsing (a few months ago)?

RE: Beautiful Soup (suddenly) doesn't get full webpage html - snippsat - Jul-11-2020

(Jul-11-2020, 11:28 AM)j.crater Wrote: Your code returns all the HTML contents of the page, if I print the soup.
Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML? However,

The 2-second sleep has nothing to do with this; it is just there for safety (to make sure the whole page has loaded). You can comment it out and it still works. It's Selenium that's the important part here. From the link Web-scraping part-2:

snippsat Wrote: JavaScript is used all over the web because of its unique position to run in the browser (client side).

When you just parse with Requests and BS, you will not get the executed JavaScript, only the raw content. Then you will not find, for example, this tag at all:

    soup.find('a', id="video-title")

because you are getting raw JavaScript back. The title will be in a script tag. Here is a cleaned-up version, with a lot deleted, to show where the title is:

    <script>
    window["ytInitialData"] .... = "title":{"runs":[{"text":"Learn Python - Full Course for Beginners [Tutorial]"}],"accessibility":{"accessibilityData":{"label":"Learn Python "viewCountText":{"simpleText":"Sett 16 184 859 ganger"},.....
    window["ytInitialPlayerResponse"] = null;
    if (window.ytcsi) {window.ytcsi.tick("pdr", null, '');}
    </script>

Parsing this raw JavaScript is almost impossible; that's why you use Selenium to get the executed JavaScript back.

RE: Beautiful Soup (suddenly) doesn't get full webpage html - HarleyQuin - Jul-11-2020

(Jul-11-2020, 11:52 AM)j.crater Wrote: Thank you both for answers.

Hey again,
From experience I have noticed that not sending a User-Agent header makes it very easy for YouTube to immediately identify you as a web scraper and to handle your request differently from how a conventional user would be served by the site. That is something that made a difference when I first started scraping. For example,
I use this in my code:

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        "Content-Type": "application/x-www-form-urlencoded"}
    url = "https://whatsmyua.info/"
    webpage = requests.get(url, headers=headers).text
    print(webpage)

Sorry if I have been of no use! I hope you solve your issue, buddy.
Regards,
Harley

RE: Beautiful Soup (suddenly) doesn't get full webpage html - j.crater - Jul-11-2020

@HarleyQuin This is a very clever approach; I will probably be using preset headers from now on. I would probably never even have considered the effects websites might have on "robot" users.
@snippsat Your code works well and I can definitely continue from here. Frankly, I have no idea what was different on my attempt this time, since using requests.get() and B.S. gave good results on my first try.
Given your examples with B.S. and Selenium, can Selenium replace B.S. entirely for scraping/navigating websites? In that case, I will stick to Selenium down the road, to avoid overhead and invest in learning one tool well instead.

RE: Beautiful Soup (suddenly) doesn't get full webpage html - snippsat - Jul-11-2020

(Jul-11-2020, 02:43 PM)j.crater Wrote: Your code works well and I can definitely continue from here. Frankly, I have no idea what was different on my attempt this time, since using requests.get() and B.S. gave good results on my first try.

They may have changed the source, so now almost all of the content is generated by JavaScript.

(Jul-11-2020, 02:43 PM)j.crater Wrote: Given your examples with B.S. and Selenium, can Selenium replace B.S. entirely for use with scraping/navigating websites?

You use Selenium only when it's necessary and you cannot get the content using only Requests/BS. This is usually the case with heavy sites, for example stock exchange sites, which we have many threads about.
To better understand what the JavaScript DOM (Document Object Model) does in the browser:
Use this address as before:
https://www.youtube.com/results?search_query=python
Now turn off JavaScript in the browser, then reload the page. What do you see now?

RE: Beautiful Soup (suddenly) doesn't get full webpage html - j.crater - Jul-11-2020

Quote: They may have changed the source, so now almost all of the content is generated by JavaScript.

This is most likely the case indeed.

Quote: Now turn off JavaScript in the browser, then reload the page. What do you see now?

And this seems to prove it. By disabling JavaScript and then checking the page source, I see the results I got from B.S.
Thanks a lot for the help and tips.
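
A side note on the raw response discussed in this thread: the titles arrive inside a script tag as a JavaScript object assigned to window["ytInitialData"]. When that object is valid JSON, it can sometimes be read without a browser. The following is only a minimal sketch under stated assumptions, not a supported way to scrape YouTube: the marker string is taken from the excerpt in the thread and may differ on the live site, the SAMPLE_HTML page and its "contents"/"videoRenderer" keys are made up for illustration, and the helper names are hypothetical. It uses only the standard library, so it runs without network access.

```python
import json

# Assumption: marker copied from the <script> excerpt quoted in the thread.
MARKER = 'window["ytInitialData"]'


def extract_embedded_json(html, marker=MARKER):
    """Return the JSON object assigned to `marker` in `html`, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores the text after it
        data, _end = json.JSONDecoder().raw_decode(html, brace)
    except json.JSONDecodeError:
        return None
    return data


def find_titles(node):
    """Recursively collect every title -> runs -> text string."""
    titles = []
    if isinstance(node, dict):
        title = node.get("title")
        if isinstance(title, dict):
            runs = title.get("runs")
            if isinstance(runs, list):
                for run in runs:
                    if isinstance(run, dict) and "text" in run:
                        titles.append(run["text"])
        for value in node.values():
            titles.extend(find_titles(value))
    elif isinstance(node, list):
        for item in node:
            titles.extend(find_titles(item))
    return titles


# Hypothetical page: the nesting is invented, only the title/runs/text
# shape follows the excerpt in the thread.
SAMPLE_HTML = '''<html><body>
<script>
window["ytInitialData"] = {"contents": [{"videoRenderer": {"title": {"runs": [{"text": "Learn Python - Full Course for Beginners [Tutorial]"}]}}}]};
window["ytInitialPlayerResponse"] = null;
</script>
</body></html>'''

data = extract_embedded_json(SAMPLE_HTML)
titles = find_titles(data)
print(titles)
```

With a real page, the html argument would be the text of the requests response. If the marker is missing or the object is not plain JSON, the function returns None, and Selenium (or the YouTube Data API mentioned above) remains the fallback.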