Posts: 7,107
Threads: 122
Joined: Sep 2016
(Aug-26-2018, 11:18 AM)eddywinch82 Wrote: How do I do that, snippsat? Thanks guys, for all your input.
Are you a member with a working username and password for that site?
You can see in @DeaD_EyE's post #3 that he tries to log in.
This can be hard to figure out for some sites.
I would use Selenium to do the login if there is too much of a struggle with Requests.
Then hand the page source to BeautifulSoup for parsing.
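The Selenium route boils down to: drive a real browser through the login form, then hand driver.page_source to BeautifulSoup. A minimal sketch, assuming hypothetical form field ids (navbar_username, navbar_password) since the real ones would need checking in the page source:

```python
from bs4 import BeautifulSoup

def login_and_get_source(url, username, password):
    """Log in with a real browser, then return the rendered HTML.
    The field ids here are assumptions, not verified against the site."""
    from selenium import webdriver
    driver = webdriver.Firefox()
    driver.get(url)
    driver.find_element_by_id('navbar_username').send_keys(username)
    driver.find_element_by_id('navbar_password').send_keys(password)
    driver.find_element_by_id('navbar_password').submit()
    return driver.page_source

def zip_links(html):
    """Hand the page source to BeautifulSoup and pull out the .zip links."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].endswith('.zip')]

# The parsing half works on any HTML string:
sample = '<a href="/files/a300.zip">A300</a><a href="/about">About</a>'
print(zip_links(sample))  # ['/files/a300.zip']
```

The browser half only runs when you call it, so the parsing half can be developed and tested separately.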
Posts: 218
Threads: 27
Joined: May 2018
Aug-26-2018, 03:16 PM
(This post was last modified: Aug-26-2018, 03:16 PM by eddywinch82.)
Hi guys. snippsat, I tried logging in with Selenium instead of Requests, i.e. I used import selenium, and I can't log in with that module either; I get the same error message I got when running the code with Requests. Also, I have tried running both your and Larz60+'s code for getting the file path data etc., and both give a syntax error when I run them in Python. I am assuming the code worked for both of you?
I have checked the code, and I have copied both versions correctly.
Also, snippsat, you said: "Or write code that goes through all pages (a simple page system: 2, 3, 4, etc.) and downloads." How do I do that?
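The usual pattern for "go through all pages" is to generate one search URL per page number and scrape each in turn. A sketch, assuming a hypothetical page= query parameter (the real parameter name would need checking in the site's pager links):

```python
# Assumed URL shape; verify against the actual "next page" links on the site.
BASE = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62&page={}'

def page_urls(last_page):
    """Yield one search-results URL per page, 1 through last_page."""
    for n in range(1, last_page + 1):
        yield BASE.format(n)

for url in page_urls(3):
    print(url)
    # each url would then be fetched with requests.get() and parsed
```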
Hi snippsat, I have attempted to adapt the code you wrote for me a while back to download the Project AI website .zip files, but it hasn't worked. Where am I going wrong? Here is the adapted code:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm, trange
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plane_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plane_link):
        url_file_id = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65857709{}'.format(ref)
        yield url_file_id

def download(all_planes_pages):
    '''Download the zip for one plane; feed with more urls to download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    last_253 = islice(all_planes_pages(), 0, 253)
    for plane_page_url in last_253:
        url_get = requests.get(plane_page_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)
Eddie
Posts: 11,883
Threads: 474
Joined: Sep 2016
It worked for me, but today I see that the same url is no longer accessible without a password, so someone has tightened the security.
Scraping is always touchy, and what works today often will not work tomorrow.
If you haven't done so already, you should run through snippsat's tutorials here:
part1
part2
Posts: 218
Threads: 27
Joined: May 2018
Aug-26-2018, 08:10 PM
(This post was last modified: Aug-26-2018, 08:57 PM by Larz60+.)
Hi guys, I combined code I found from someone on the Internet for web-scraping zip files with your code, DeaD_EyE. Here is the combined code:
import sys
import getpass
import hashlib
import requests

BASE_URL = 'https://www.flightsim.com/'

def do_login(credentials):
    session = requests.Session()
    session.get(BASE_URL)
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'}, data=credentials)
    if req.status_code != 200:
        print('Login not successful')
        sys.exit(1)
    # session is now logged in
    return session

def get_credentials():
    username = input('Username: ')
    password = getpass.getpass()
    password_md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'cookieuser': 1,
        'do': 'login',
        's': '',
        'securitytoken': 'guest',
        'vb_login_md5_password': password_md5,
        'vb_login_md5_password_utf': password_md5,
        'vb_login_password': '',
        'vb_login_password_hint': 'Password',
        'vb_login_username': username,
    }

credentials = get_credentials()
session = do_login(credentials)

import urllib2
from urllib2 import Request, urlopen, URLError
#import urllib
import os
from bs4 import BeautifulSoup

#Create a new directory to put the files into
#Get the current working directory and create a new directory in it named test
cwd = os.getcwd()
newdir = cwd + "\\test"
print "The current working directory is " + cwd
os.mkdir(newdir, 0777)
print "Created new directory " + newdir

newfile = open('zipfiles.txt', 'w')
print newfile
print "Running script.. "

#Set variable for the page to be opened and the url to be concatenated
url = "http://www.flightsim.com"
page = urllib2.urlopen('https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62').read()

#File extension to be looked for.
extension = ".zip"

#Use BeautifulSoup to clean up the page
soup = BeautifulSoup(page)
soup.prettify()

#Find all the links on the page that end in .zip
for anchor in soup.findAll('a', href=True):
    links = url + anchor['href']
    if links.endswith(extension):
        newfile.write(links + '\n')
newfile.close()

#Read what is saved in zipfiles.txt and output it to the user
#This is done to create persistent data
newfile = open('zipfiles.txt', 'r')
for line in newfile:
    print line + '\n'
newfile.close()

#Read through the lines in the text file and download the zip files.
#Handle exceptions and print exceptions to the console
with open('zipfiles.txt', 'r') as url:
    for line in url:
        if line:
            try:
                ziplink = line
                #Removes the first 48 characters of the url to get the name of the file
                zipfile = line[48:]
                #Removes the last 4 characters to remove the .zip
                zipfile2 = zipfile[:-4]
                print "Trying to reach " + ziplink
                response = urllib2.urlopen(ziplink)
            except URLError as e:
                if hasattr(e, 'reason'):
                    print 'We failed to reach a server.'
                    print 'Reason: ', e.reason
                    continue
                elif hasattr(e, 'code'):
                    print 'The server couldn\'t fulfill the request.'
                    print 'Error code: ', e.code
                    continue
            else:
                zipcontent = response.read()
                completeName = os.path.join(newdir, zipfile2 + ".zip")
                with open(completeName, 'w') as f:
                    print "downloading.. " + zipfile
                    f.write(zipcontent)
print "Script completed"
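The login half of the combined script hinges on one detail: the vBulletin form expects the MD5 hex digest of the password, not the plain text. That step in isolation, with the field names copied from the code above:

```python
import hashlib

def vb_credentials(username, password):
    """Build the login form fields; the password travels as an MD5 hex digest."""
    md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'do': 'login',
        'securitytoken': 'guest',
        'vb_login_username': username,
        'vb_login_md5_password': md5,
        'vb_login_md5_password_utf': md5,
    }

print(vb_credentials('guest', 'secret')['vb_login_md5_password'])
# 5ebe2294ecd0e0f08eab7690d2a6ee69
```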
But I get the following traceback error. The code runs okay initially, allowing me to type my username, but I get this error message after I hit Enter:
Error: Traceback (most recent call last):
File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 38, in <module>
credentials = get_credentials()
File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 22, in get_credentials
username = input('Username: ')
File "<string>", line 1, in <module>
NameError: name '......' is not defined
Any ideas where I am going wrong ?
Eddie
Posts: 11,883
Threads: 474
Joined: Sep 2016
Line 22 is where you input your user name (I removed the actual username):
username = input('Username: ')
This is where the traceback is showing the error:
Error: File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 22, in get_credentials
username = input('Username: ')
File "<string>", line 1, in <module>
NameError: name '......' is not defined
The last line number in a traceback is usually where the error is encountered; however, I don't see an issue here.
One last note: if you must use Python 2, would you at least put the print statements in parentheses?
Posts: 218
Threads: 27
Joined: May 2018
Aug-26-2018, 09:19 PM
(This post was last modified: Aug-26-2018, 09:19 PM by eddywinch82.)
My username is eddywinch82; where do I type that on line 22? Should I type:
eddywinch82 = input('Username: ')
Posts: 11,883
Threads: 474
Joined: Sep 2016
You enter it in real time, while the script is running.
Posts: 218
Threads: 27
Joined: May 2018
That's what I was doing. Do you have any idea what the issue is here?
Posts: 11,883
Threads: 474
Joined: Sep 2016
You're using antique Python; in Python 2 it's raw_input.
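In Python 2, input() passes the typed text to eval(), so typing eddywinch82 raises NameError: name 'eddywinch82' is not defined; raw_input() (and Python 3's input()) returns the text as a plain string. A small compatibility shim that works on both:

```python
# Pick whichever "read a line of text" builtin this interpreter has.
try:
    read_line = raw_input   # Python 2: returns the typed text verbatim
except NameError:
    read_line = input       # Python 3: input() already behaves this way

# usage: username = read_line('Username: ')
```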
Posts: 218
Threads: 27
Joined: May 2018
I was using Python 3.43 before, and the same problem was occurring then.