Python Forum
Webscraper for multiple urls - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: Code Review (https://python-forum.io/forum-46.html)
+--- Thread: Webscraper for multiple urls (/thread-29826.html)



Webscraper for multiple urls - Milan - Sep-21-2020

Hello team,

I would like to share a script I have created.

It gets the name and price of a product for each url.

It looks like this:

# USED LIBRARIES
import urllib.request
from bs4 import BeautifulSoup

#URLS FROM WHICH THE NAME AND PRICE OF EACH PRODUCT ARE RETRIEVED. ALL PAGES SHOULD HAVE THE SAME FORMAT
urls = ['https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671',
       'https://gigatron.rs/ssd/wd-ssd-blue-250gb-25-sata-iiiwds250g2b0a-250gb-25-sata-iii-do-550-mbs-125220',
       'https://gigatron.rs/ssd/silicon-power-ssd-512gb-25-sata-iii-ace-a55sp512gbss3a55s25-512gb-25-sata-iii-do-560-mbs-144553',
       'https://gigatron.rs/ssd/crucial-ssd-bx500-serijact120bx500ssd1-165010']

#LIST WHERE THE NAME AND PRICE ARE STORED
data = []    

#THE MAGIC HAPPENS HERE
for i in urls:
    page = urllib.request.urlopen(i)
    soup = BeautifulSoup(page, features='lxml')
    name = soup.find('h1', {'itemprop':'name'}).text
    price = price = soup.find('span', {'itemprop':'price'}).text
    p = [name, price]
    data.append(p)

#DISPLAYS RESULTS
for j in data:
    print(j)

Any input on how to improve it, or simply a discussion about it, is welcome.


RE: Webscraper for multiple urls - Larz60+ - Sep-21-2020

Although urllib suffices in this instance, I'd suggest using requests (in future code) rather than urllib.
Requests provides a higher-level HTTP client interface.
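
A minimal sketch of the same fetch done with requests, reusing the first url from the script above; the raise_for_status() call is an extra safeguard, not something the original code does:

import requests
from bs4 import BeautifulSoup

# Fetch a single product page with requests instead of urllib
url = 'https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671'
response = requests.get(url)
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(response.text, features='lxml')
print(soup.find('h1', {'itemprop': 'name'}).text)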


RE: Webscraper for multiple urls - scidam - Sep-21-2020

1) I would recommend using meaningful variable names (e.g. url instead of i): for url in urls:
2) Typo (Line No 19): price = price =.
3) What if the content of the web page changed and there were no such things as the h1 and price tags anymore? What would the program do in that case?
4) What if the url doesn't exist?
5) You can try to process several urls in "parallel" (e.g. using threads) or asynchronously; a rough sketch follows below.
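
A rough sketch of point 5 using concurrent.futures; scrape_one is a hypothetical helper wrapping the body of the loop above, and the timeout and worker count are arbitrary choices:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

urls = ['https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671']  # plus the rest of the list above

def scrape_one(url):
    # Fetch one page and extract [name, price], as in the loop above
    page = requests.get(url, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, features='lxml')
    name = soup.find('h1', {'itemprop': 'name'}).text
    price = soup.find('span', {'itemprop': 'price'}).text
    return [name, price]

# Run up to four downloads concurrently; map preserves the input order
with ThreadPoolExecutor(max_workers=4) as executor:
    data = list(executor.map(scrape_one, urls))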


RE: Webscraper for multiple urls - Milan - Sep-22-2020

(Sep-21-2020, 11:33 PM)scidam Wrote: 1) I would recommend using meaningful variable names (e.g. url instead of i): for url in urls:
2) Typo (Line No 19): price = price =.
3) What if the content of the web page changed and there were no such things as the h1 and price tags anymore? What would the program do in that case?
4) What if the url doesn't exist?
5) You can try to process several urls in "parallel" (e.g. using threads) or asynchronously.


So this is the version with the suggested amendments.

"""
@author: Milan Grujicic
"""

import requests
from bs4 import BeautifulSoup

urls = ['https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671',
       'https://gigatron.rs/ssd/wd-ssd-blue-250gb-25-sata-iiiwds250g2b0a-250gb-25-sata-iii-do-550-mbs-125220',
       'https://gigatron.rs/ssd/silicon-power-ssd-512gb-25-sata-iii-ace-a55sp512gbss3a55s25-512gb-25-sata-iii-do-560-mbs-144553',
       'https://gigatron.rs/ssd/crucial-ssd-bx500-serijact120bx500ssd1-165010']

data = []    

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, features='lxml')
    
    try:
        name = soup.find('h1', {'itemprop':'name'}).text
    except AttributeError:
        print('h1 tag with name does not exist')
    
    try:
        price = soup.find('span', {'itemprop':'price'}).text
    except AttributeError:
        print('Span tag with price does not exist')
    
    p = [name, price]
    data.append(p)

for products in data:
    print(products)

Now it displays a message if the tags are not found, among other minor changes.

The last two items have been puzzling me.

4) Based on what can I tell that a url doesn't exist?
5) You mean each url in its own thread? How can I retrieve the urls from the list to do that?


RE: Webscraper for multiple urls - buran - Sep-22-2020

Note that if you hit one of the except blocks you introduced, you will get an error if this is the first url (name and/or price will not be defined), or they will have an incorrect value carried over from a previous iteration of the loop.
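
One way to address that, as a sketch: skip the url entirely when either tag is missing, so name and price are never undefined or carried over from a previous iteration. The message wording here is just an example:

import requests
from bs4 import BeautifulSoup

urls = ['https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671']  # plus the rest of the list above

data = []
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, features='lxml')
    name_tag = soup.find('h1', {'itemprop': 'name'})
    price_tag = soup.find('span', {'itemprop': 'price'})
    if name_tag is None or price_tag is None:
        print('Missing name or price tag, skipping', url)
        continue  # never append undefined or stale values
    data.append([name_tag.text, price_tag.text])

for product in data:
    print(product)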