Posts: 7,107
Threads: 122
Joined: Sep 2016
(Aug-26-2018, 11:18 AM)eddywinch82 Wrote: How do I do that, snippsat? Thanks guys, for all your input.
Are you a member with a working username and password for that site?
You can see in @DeaD_EyE's post #3 that he tries to log in.
This can be hard to figure out for some sites.
I would use Selenium to do the login if there is too much of a struggle with Requests.
Then hand the page source to BeautifulSoup for parsing.
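The Selenium route boils down to: drive a real browser through the login form, then hand driver.page_source to BeautifulSoup. A minimal sketch, assuming hypothetical form field ids (navbar_username, navbar_password) since the real ones would need checking in the page source:

```python
from bs4 import BeautifulSoup

def login_and_get_source(url, username, password):
    """Log in with a real browser, then return the rendered HTML.
    The field ids here are assumptions, not verified against the site."""
    from selenium import webdriver
    driver = webdriver.Firefox()
    driver.get(url)
    driver.find_element_by_id('navbar_username').send_keys(username)
    driver.find_element_by_id('navbar_password').send_keys(password)
    driver.find_element_by_id('navbar_password').submit()
    return driver.page_source

def zip_links(html):
    """Hand the page source to BeautifulSoup and pull out the .zip links."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].endswith('.zip')]

# The parsing half works on any HTML string:
sample = '<a href="/files/a300.zip">A300</a><a href="/about">About</a>'
print(zip_links(sample))  # ['/files/a300.zip']
```

The browser half only runs when you call it, so the parsing half can be developed and tested separately.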
Posts: 218
Threads: 27
Joined: May 2018
Aug-26-2018, 03:16 PM
(This post was last modified: Aug-26-2018, 03:16 PM by eddywinch82.)
Hi guys. snippsat, I tried logging in with Selenium instead of Requests, i.e. I used import selenium, and I can't log in with that module either; I get the same error message I got when running the code with Requests. Also, I have tried running both your and Larz60+'s code for getting the file path data etc., and both give a syntax error when I run them in Python. I am assuming the code worked for both of you?
I have checked the code, and I have copied both versions correctly.
Also, snippsat, you said: "Or write code that goes through all pages (a simple page system: 2, 3, 4, etc.) and downloads." How do I do that?
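The usual pattern for "go through all pages" is to generate one search URL per page number and scrape each in turn. A sketch, assuming a hypothetical page= query parameter (the real parameter name would need checking in the site's pager links):

```python
# Assumed URL shape; verify against the actual "next page" links on the site.
BASE = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62&page={}'

def page_urls(last_page):
    """Yield one search-results URL per page, 1 through last_page."""
    for n in range(1, last_page + 1):
        yield BASE.format(n)

for url in page_urls(3):
    print(url)
    # each url would then be fetched with requests.get() and parsed
```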
Hi snippsat, I have attempted to adapt the code you wrote for me a while back to download the Project AI website .zip files, but it hasn't worked. Where am I going wrong? Here is the adapted code:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm, trange
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plane_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plane_link):
        url_file_id = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65857709{}'.format(ref)
        yield url_file_id

def download(all_planes_pages):
    '''Download the zip for one plane; feed with more urls to download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    last_253 = islice(all_planes_pages(), 0, 253)
    for plane_page_url in last_253:
        url_get = requests.get(plane_page_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)
Eddie
Posts: 11,883
Threads: 474
Joined: Sep 2016
It worked for me, but today I see that the same url is no longer accessible without a password, so someone has tightened the security.
Scraping is always touchy, and what works today often will not work tomorrow.
If you haven't done so already, you should run through snippsat's tutorials here:
part1
part2
Posts: 218
Threads: 27
Joined: May 2018
Aug-26-2018, 08:10 PM
(This post was last modified: Aug-26-2018, 08:57 PM by Larz60+.)
Hi guys, I combined code I found from someone on the Internet for web-scraping zip files with your code, DeaD_EyE. Here is the combined code:
import sys
import getpass
import hashlib
import requests

BASE_URL = 'https://www.flightsim.com/'

def do_login(credentials):
    session = requests.Session()
    session.get(BASE_URL)
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'}, data=credentials)
    if req.status_code != 200:
        print('Login not successful')
        sys.exit(1)
    # session is now logged in
    return session

def get_credentials():
    username = input('Username: ')
    password = getpass.getpass()
    password_md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'cookieuser': 1,
        'do': 'login',
        's': '',
        'securitytoken': 'guest',
        'vb_login_md5_password': password_md5,
        'vb_login_md5_password_utf': password_md5,
        'vb_login_password': '',
        'vb_login_password_hint': 'Password',
        'vb_login_username': username,
    }

credentials = get_credentials()
session = do_login(credentials)

import urllib2
from urllib2 import Request, urlopen, URLError
#import urllib
import os
from bs4 import BeautifulSoup

#Create a new directory to put the files into
#Get the current working directory and create a new directory in it named test
cwd = os.getcwd()
newdir = cwd + "\\test"
print "The current working directory is " + cwd
os.mkdir(newdir, 0777)
print "Created new directory " + newdir

newfile = open('zipfiles.txt', 'w')
print newfile
print "Running script.. "

#Set variable for the page to be opened and the url to be concatenated
url = "http://www.flightsim.com"
page = urllib2.urlopen('https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62').read()

#File extension to be looked for.
extension = ".zip"

#Use BeautifulSoup to clean up the page
soup = BeautifulSoup(page)
soup.prettify()

#Find all the links on the page that end in .zip
for anchor in soup.findAll('a', href=True):
    links = url + anchor['href']
    if links.endswith(extension):
        newfile.write(links + '\n')
newfile.close()

#Read what is saved in zipfiles.txt and output it to the user
#This is done to create persistent data
newfile = open('zipfiles.txt', 'r')
for line in newfile:
    print line + '\n'
newfile.close()

#Read through the lines in the text file and download the zip files.
#Handle exceptions and print exceptions to the console
with open('zipfiles.txt', 'r') as url:
    for line in url:
        if line:
            try:
                ziplink = line
                #Removes the first 48 characters of the url to get the name of the file
                zipfile = line[48:]
                #Removes the last 4 characters to remove the .zip
                zipfile2 = zipfile[:-4]
                print "Trying to reach " + ziplink
                response = urllib2.urlopen(ziplink)
            except URLError as e:
                if hasattr(e, 'reason'):
                    print 'We failed to reach a server.'
                    print 'Reason: ', e.reason
                    continue
                elif hasattr(e, 'code'):
                    print 'The server couldn\'t fulfill the request.'
                    print 'Error code: ', e.code
                    continue
            else:
                zipcontent = response.read()
                completeName = os.path.join(newdir, zipfile2 + ".zip")
                with open(completeName, 'w') as f:
                    print "downloading.. " + zipfile
                    f.write(zipcontent)
print "Script completed"
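The login half of the combined script hinges on one detail: the vBulletin form expects the MD5 hex digest of the password, not the plain text. That step in isolation, with the field names copied from the code above:

```python
import hashlib

def vb_credentials(username, password):
    """Build the login form fields; the password travels as an MD5 hex digest."""
    md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'do': 'login',
        'securitytoken': 'guest',
        'vb_login_username': username,
        'vb_login_md5_password': md5,
        'vb_login_md5_password_utf': md5,
    }

print(vb_credentials('guest', 'secret')['vb_login_md5_password'])
# 5ebe2294ecd0e0f08eab7690d2a6ee69
```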
But I get the following traceback error. The code runs okay initially, allowing me to type my username, but I get this error message after I hit Enter:
Error: Traceback (most recent call last):
File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 38, in <module>
credentials = get_credentials()
File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 22, in get_credentials
username = input('Username: ')
File "<string>", line 1, in <module>
NameError: name '......' is not defined
Any ideas where I am going wrong ?
Eddie
Posts: 11,883
Threads: 474
Joined: Sep 2016
Line 22 is where you input your user name (I removed the actual username):
username = input('Username: ')
This is where the traceback is showing the error:
Error: File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 22, in get_credentials
username = input('Username: ')
File "<string>", line 1, in <module>
NameError: name '......' is not defined
The last line number in a traceback is usually where the error is encountered; however, I don't see an issue here.
One last note: if you must use Python 2, would you at least put the print statements in parentheses?
Posts: 218
Threads: 27
Joined: May 2018
Aug-26-2018, 09:19 PM
(This post was last modified: Aug-26-2018, 09:19 PM by eddywinch82.)
My username is eddywinch82; where do I type that on line 22? Should I type:
eddywinch82 = input('Username: ')
Posts: 11,883
Threads: 474
Joined: Sep 2016
You enter it in real time, while the script is running.
Posts: 218
Threads: 27
Joined: May 2018
That's what I was doing. Do you have any idea what the issue is here?
Posts: 11,883
Threads: 474
Joined: Sep 2016
You're using antique Python; in Python 2 it's raw_input.
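In Python 2, input() passes the typed text to eval(), so typing eddywinch82 raises NameError: name 'eddywinch82' is not defined; raw_input() (and Python 3's input()) returns the text as a plain string. A small compatibility shim that works on both:

```python
# Pick whichever "read a line of text" builtin this interpreter has.
try:
    read_line = raw_input   # Python 2: returns the typed text verbatim
except NameError:
    read_line = input       # Python 3: input() already behaves this way

# usage: username = read_line('Username: ')
```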
Posts: 218
Threads: 27
Joined: May 2018
I was using Python 3.43 before, and the same problem was occurring then.