Python Forum
Parsing html page and working with checkbox (on a captcha)
#1
Hello, I am new to Python programming and am currently trying to write my very first Python program.
I am using Python 3.9 (beautifulsoup4 4.9.3, certifi 2020.12.5, chardet 4.0.0, idna 2.10, lxml 4.6.2, pip 20.3.3, requests 2.25.1, selenium 3.141.0, setuptools 49.2.1, soupsieve 2.1, urllib3 1.26.2), PyCharm 2020.3.2 (Community Edition) and Google Chrome on Windows 8.1.
Two questions came up:
1. I want to analyze my photo portfolio, which consists of N pages of 100 photos each.
from urllib.request import urlopen

url = 'https://www.shutterstock.com/ru/g/Ivanov+Oleg'
page = urlopen(url)
data = page.read().decode()
print(data)
I then plan to parse data, but the problem is that the decoded HTML of every page (?page=1, ?page=2, etc.) is only correct up to the 20th photo: photos 1-20 come back as <img class="z_h_9d80b z_h_2f2f0">, while photos 21-100 come back as <img class="z_h_9d80b"> and the pictures are not displayed (see portfolio.jpg). In the original page, as shown in the browser, all photos in the portfolio have class="z_h_9d80b z_h_2f2f0".
In addition, comparing the decoded page with the one saved from the browser ("Save as") shows significant differences (see comparison.jpg and diff.zip).
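A comparison like the one described above can also be done in code. The following is only a minimal sketch using the standard library's difflib; the two strings are made-up stand-ins for the fetched page and the "Save as" copy, and html_diff is a hypothetical helper name.

```python
import difflib


def html_diff(fetched, saved, context=0):
    """Return unified-diff lines between two HTML strings, compared line by line."""
    return list(
        difflib.unified_diff(
            fetched.splitlines(),
            saved.splitlines(),
            fromfile="fetched",
            tofile="saved",
            lineterm="",
            n=context,  # number of unchanged context lines to keep around each change
        )
    )


# Made-up sample data, not the real Shutterstock markup
fetched = '<img class="z_h_9d80b">\n<p>same</p>'
saved = '<img class="z_h_9d80b z_h_2f2f0">\n<p>same</p>'

diff = html_diff(fetched, saved)
for line in diff:
    print(line)
```

With real pages it probably helps to pretty-print both documents first, so that each tag ends up on its own line before diffing.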

Interestingly, fetching the page with requests and parsing it with BeautifulSoup gives the same result: the markup of the 21st and all subsequent images is different from what the browser shows.
import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)
Another variant, sending browser-like headers, gives the same result:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
}
url_get = requests.get(url, headers=headers)
parser = url_get.content
soup = BeautifulSoup(parser, "html.parser")
print(soup)
The question is: what could be causing this difference between the downloaded HTML and the page in the browser, and how can I get these pages decoded correctly?
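To check exactly where the markup changes, it may help to count the <img> tags by their class attribute. Below is a minimal sketch, assuming the page source is already available as a string; ImgClassCounter and count_img_classes are hypothetical names, the sample HTML is made up, and the standard library parser is used only so the snippet runs without extra packages (the same count could be done with BeautifulSoup). If the fetched HTML really contains only 20 fully classed images, one possible explanation, which is an assumption and not something verified here, is that the browser adds the second class with JavaScript after the page loads.

```python
from collections import Counter
from html.parser import HTMLParser


class ImgClassCounter(HTMLParser):
    """Count <img> tags grouped by the exact value of their class attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; a missing class is counted as "".
            self.counts[dict(attrs).get("class") or ""] += 1


def count_img_classes(html):
    """Parse an HTML string and return a Counter of img class attribute values."""
    parser = ImgClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample standing in for the downloaded page
sample = (
    '<div>'
    '<img class="z_h_9d80b z_h_2f2f0" src="a.jpg">'
    '<img class="z_h_9d80b z_h_2f2f0" src="b.jpg">'
    '<img class="z_h_9d80b">'
    '</div>'
)

counts = count_img_classes(sample)
print(counts)
```

Running the same function on the fetched page and on the "Save as" copy would show whether the two sources differ only in the second class or in more than that.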

2. In fact, I could extract much more information if I logged into my account, but for this I need to enter a username / password and pass a captcha. Everything is clear to me except the captcha, whose checkbox I need to tick (something like checkbutton.select()).
from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
. . .
driver = webdriver.Chrome()
driver.get(url)

elem_name = driver.find_element_by_name("username")
elem_name.send_keys("user_х@gmail.com")

elem_pass = driver.find_element_by_name("password")
elem_pass.send_keys("qwerty")

# doesn't work - elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
# doesn't work - elem_capt = driver.find_element_by_class_name("rc-anchor-center-item rc-anchor-checkbox-holder")
elem_capt = driver.find_elements_by_class_name("recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox")

# after this it is not clear how to select the checkbox and tick it

elem_name.send_keys(Keys.RETURN)
. . .
(see captcha.jpg)

Please help me to select a checkbox and check it!
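One likely reason the locators above find nothing: a class-name locator takes a single class name, so a string with several space-separated classes does not match. A compound CSS selector is the usual alternative. The helper below is a hypothetical sketch that builds such a selector from the class attribute. The commented lines are an unexecuted outline for the Selenium 3 driver used above; the iframe locator is an assumption (the reCAPTCHA widget is normally rendered inside an iframe, so the driver would have to switch into it first) and has not been tested against the real page.

```python
def css_selector_for_classes(class_attr):
    """Turn a space-separated class attribute into a compound CSS selector."""
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    return "".join("." + name for name in names)


selector = css_selector_for_classes(
    "recaptcha-checkbox goog-inline-block "
    "recaptcha-checkbox-unchecked rc-anchor-checkbox"
)
print(selector)

# Unexecuted outline, assuming `driver` is the webdriver.Chrome() instance above
# and that the widget sits in an iframe whose src contains "recaptcha":
#   frame = driver.find_element_by_css_selector("iframe[src*='recaptcha']")
#   driver.switch_to.frame(frame)
#   driver.find_element_by_css_selector(selector).click()
#   driver.switch_to.default_content()
```

Even if the click succeeds, the widget may answer with an image challenge instead of a tick, so this only covers locating and clicking the checkbox.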


Messages In This Thread
Parsing html page and working with checkbox (on a captcha) - by straannick - Jan-15-2021, 09:35 AM

