Parsing html page and working with checkbox (on a captcha)

straannick · (This post was last modified: Jan-27-2021, 12:30 PM by straannick.)

Hello, I am new to Python programming and am currently trying to write my very first Python program.
I am using Python 3.9 (beautifulsoup4 4.9.3, certifi 2020.12.5, chardet 4.0.0, idna 2.10, lxml 4.6.2, pip 20.3.3, requests 2.25.1, selenium 3.141.0, setuptools 49.2.1, soupsieve 2.1, urllib3 1.26.2), PyCharm 2020.3.2 (Community Edition) and Google Chrome on Windows 8.1.
Two questions came up:
1. I want to analyze my photo portfolio, which consists of N pages of 100 photos each.

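from urllib.request import urlopen

url = 'https://www.shutterstock.com/ru/g/Ivanov+Oleg'
page = urlopen(url)
# read() returns bytes; decode() converts them to str (UTF-8 by default)
data = page.read().decode()
print(data)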

The plan is to parse data next, but the problem is that the decoded HTML of any page (?page=1, ?page=2, etc.) is only correct up to the 20th photo: photos 1-20 have <img class="z_h_9d80b z_h_2f2f0">, while photos 21-100 have <img class="z_h_9d80b"> and the pictures are not displayed (see portfolio.jpg), although in the original page all photos in the portfolio have class="z_h_9d80b z_h_2f2f0".
In addition, comparing the decoded page with the one saved from the browser ("Save as") shows significant differences (see comparison.jpg and diff.zip).
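
To put numbers on this difference, something like the following could be run on both the decoded page and the saved copy. It is only a sketch using the standard library; `ImgClassCounter`, `count_img_classes` and the sample markup are made-up names for illustration, not anything taken from the real page.

```python
from collections import Counter
from html.parser import HTMLParser


class ImgClassCounter(HTMLParser):
    """Tally the class attribute of every <img> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Images without a class attribute are counted under "".
            self.counts[dict(attrs).get("class") or ""] += 1


def count_img_classes(html):
    """Return a Counter mapping each <img> class string to its frequency."""
    parser = ImgClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample mimicking the symptom: one full class list, two short ones.
sample = (
    '<img class="z_h_9d80b z_h_2f2f0" src="1.jpg">'
    '<img class="z_h_9d80b" src="21.jpg">'
    '<img class="z_h_9d80b" src="22.jpg">'
)
print(count_img_classes(sample))
```

Calling `count_img_classes(data)` on the decoded page and on the text of the saved file should show how many images carry each class string in each version.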

Interestingly, if I fetch the page with requests and parse it with BeautifulSoup, I get the same result: the markup of the 21st and all subsequent images is distorted.

import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)

Another variant, with browser-like headers, gives the same result:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0(Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
}
url_get = requests.get(url, headers=headers)
parser = url_get.content
soup = BeautifulSoup(parser, "html.parser")
print(soup)

The question is: what could be causing this decode() behavior, and how can I retrieve these pages correctly?

2. In fact, I could extract much more information if I logged into my account, but for that I need to enter a username and password and pass a captcha. Everything is clear to me except the captcha, whose checkbox I need to tick (something like checkbutton.select()).

from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
. . .
driver = webdriver.Chrome()
driver.get(url)

elem_name = driver.find_element_by_name("username")
elem_name.send_keys("user_х@gmail.com")

elem_pass = driver.find_element_by_name("password")
elem_pass.send_keys("qwerty")

# doesn't work - elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
# doesn't work - elem_capt = driver.find_element_by_class_name("rc-anchor-center-item rc-anchor-checkbox-holder")
elem_capt = driver.find_elements_by_class_name("recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox")

# then it is not clear how to select the checkbox and check it!

elem_name.send_keys(Keys.RETURN)
. . .

(see captcha.jpg)

Please help me to select a checkbox and check it!
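
Some context on the attempts above, as a hedged sketch rather than a confirmed solution: in Selenium 3, `find_element_by_class_name` expects a single class name, so a space-separated class list has to be converted to a CSS selector first, and the reCAPTCHA checkbox is normally rendered inside an iframe that must be entered before the element can be found. The helper names `classes_to_css` and `click_recaptcha_checkbox` and the iframe selector are assumptions for illustration, and the click may still be followed by an image challenge.

```python
def classes_to_css(class_attr):
    """Turn a space-separated class attribute into a CSS selector.

    "a b c" becomes ".a.b.c", which matches elements carrying all three classes.
    """
    return "." + ".".join(class_attr.split())


def click_recaptcha_checkbox(driver):
    """Sketch only: `driver` is a Selenium 3 WebDriver already on the login page."""
    # Assumption: the widget lives in the first iframe whose src contains "recaptcha".
    frame = driver.find_element_by_css_selector("iframe[src*='recaptcha']")
    driver.switch_to.frame(frame)
    try:
        # Class names taken from the attempts in the post.
        selector = classes_to_css("recaptcha-checkbox rc-anchor-checkbox")
        driver.find_element_by_css_selector(selector).click()
    finally:
        # Always return to the main document so the login form is reachable again.
        driver.switch_to.default_content()


print(classes_to_css("recaptcha-checkbox goog-inline-block rc-anchor-checkbox"))
```

With the `driver` from the snippet above, the call would be `click_recaptcha_checkbox(driver)` after filling in the username and password fields.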
