Multiprocessing OSError 'too many open files'
#4
(Dec-27-2019, 04:36 PM)micseydel Wrote: I expect you to know by now that it's incredibly difficult for us to help with code when we're provided only 10 lines of who knows how much. What you should do is make a copy of your code and simplify it until there's nothing left that can be removed. Any user input should be hard-coded to reproduce the problem, and any line of code that can be removed without preventing the problem from being reproduced should be removed.

Once you've gotten to that point, you'll have code that we can actually look at and start trying to figure out what's wrong. The fewer lines of code you're able to reduce it to, the more likely you are to get a satisfactory response.
That's fair enough. I was trying to keep the code to a minimum.
I've removed unnecessary code and added a few hardcoded values.

Here's the code - I believe the libraries you need are requests, psaw and lxml (urllib, gzip and json come with the standard library).
from contextlib import closing
from multiprocessing import Process
from urllib.request import urlopen, Request
from urllib.error import HTTPError
import gzip
import json

import requests
from lxml import etree
from psaw import PushshiftAPI

# OPTIONS (a dict with a 'max_proc' entry) and CONTEXT (an ssl.SSLContext)
# are hardcoded elsewhere in the full script and are not shown here.

class Links():
    def start_sorting_links():
        processes = []
        index = 0

        api = PushshiftAPI()
        SUBMISSIONS = api.search_submissions(subreddit='wellthatsucks', filter=['url', 'over_18'], limit=500)

        urls = [subs.d_['url'] for subs in SUBMISSIONS] #get all the urls and put in array

        while len(processes) < len(urls): #keep going until a process has been created for every url
            if len(processes) - len([p for p in processes if not p.is_alive()]) < OPTIONS['max_proc']: #only spawn while the number of live processes is below the cap
                p = Process(target=Links.process_link, args=(urls[index], )) #create a new process
                processes.append(p) #add it to array
                p.start()

                index += 1
                    
        for p in processes:
            p.join() #dont continue main script until processes have finished

    def process_link(url):
        with closing(gzip.GzipFile('/links.data.gz', 'a')) as lnk: #close file after we have finished with it
            if('/imgur.com/' in url): #imgur plain
                tree = Links.parse_link(url)
                Links.Imgur.imgur_plain(tree, lnk)
            elif('/i.imgur.com/' in url): #imgur direct
                Links.Imgur.imgur_direct(url, lnk)
            elif('/gfycat.com/' in url): #gfycat plain
                tree = Links.parse_link(url)
                Links.Gfycat.gfycat_plain(tree, lnk)
            elif('/thumbs.gfycat.com/' in url): #gfycat direct
                Links.Gfycat.gfycat_direct(url, lnk)
            elif('/i.redd.it/' in url): #reddit image / direct
                Links.Reddit.i_reddit(url, lnk)
            elif('/v.redd.it' in url): #reddit video / plain
                Links.Reddit.v_reddit(url, lnk)
            elif('/giphy.com/' in url): #giphy plain
                tree = Links.parse_link(url)
                Links.Giphy.giphy_plain(tree, lnk)
            elif('/media.giphy.com/' in url): #giphy partial direct 
                Links.Giphy.giphy_part_direct(url, lnk)
            elif('/i.giphy.com/' in url): #giphy full direct 
                Links.Giphy.giphy_full_direct(url, lnk)
            else:
                pass #clearly not anything we want here!
            lnk.flush() #flush the buffer
        
    class Imgur():
        def imgur_plain(tree, file):
            l = tree.xpath('/html/head/link[12]')
            try:
                direct_link = [i.attrib['href'] for i in l][0]
            except Exception:
                return
            n = tree.xpath('/html/body/div[7]/p[2]')
            nsfw = [i.text for i in n]
            file.write('{0}\n'.format(str(direct_link)).encode())
        def imgur_direct(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())
            
    class Gfycat():
        def gfycat_plain(tree, file):
            l = tree.xpath('/html/head/meta[51]')
            direct_link = [i.attrib['content'] for i in l][0]
            #almost impossible to find out whether gif is nsfw or not on gfycat
            file.write('{0}\n'.format(str(direct_link)).encode()) #just gonna risk it
        def gfycat_direct(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode()) #just gonna risk it

    class Reddit(): ####NSFW
        def i_reddit(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

        def v_reddit(url, file): #v.reddit is video, and will direct you back to normal reddit page so i need to get direct link
            full_url_jsonified = str(requests.get(url).url) + '.json'#v.reddit redirects to normal reddit, and i need that link, so i get it. add json to the end, so i get json data about video
            try:
                req = urlopen(full_url_jsonified, context=CONTEXT)
            except HTTPError:
                return
            data = json.load(req) #load the request into a json format
            direct_link = data[0]['data']['children'][0]['data']['secure_media']['reddit_video']['fallback_url'] #it uses the fallback url which is just a direct url to the video      
            file.write('{0}\n'.format(str(direct_link)).encode())

    class Giphy():
        def giphy_plain(tree, file):
            l = tree.xpath('/html/head/meta[19]')
            direct_link = [i.attrib['content'] for i in l][0].replace('media', 'i', 1) #replace first instance of 'media' with 'i' and that will get you direct link
            #very few giphy gifs have a pg rating / nsfw tag
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_part_direct(url, file):
            suffix = url.split(".")[-1]
            direct_link = url.replace('media', 'i', 1) #changing one of the 'media's makes it a direct link
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_full_direct(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

    def parse_link(url):
        htmlparser = etree.HTMLParser() #create parser
        with closing(urlopen(Request(url, headers={'User-Agent': 'Mozilla/3.0'}), context=CONTEXT)) as response: #get the html of the page - the header makes the request look like it comes from a browser, otherwise some sites can detect that it is Python accessing the page and block the connection
            return etree.parse(response, htmlparser) #create an element tree out of it

(Dec-27-2019, 04:46 PM)ibreeden Wrote: A Python error message usually also tells you on which line the error occurred. Can you share that information?
I'm actually struggling to reproduce the error.
I've run it twice, and the error hasn't occurred. I'm wondering if I was getting the error because there were too many processes left over from earlier runs of the program.
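One way I could sanity-check that theory is to look at the per-process open-file limit and at how many descriptors are currently open. A minimal sketch, assuming Linux or macOS (and /proc/self/fd only exists on Linux):

import os
import resource

# the soft limit is what actually triggers 'too many open files'
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open file limit: soft={0}, hard={1}'.format(soft, hard))

try:
    # on Linux, /proc/self/fd lists this process's currently open descriptors
    print('currently open descriptors:', len(os.listdir('/proc/self/fd')))
except FileNotFoundError:
    pass # /proc is not available on this platform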

Multiprocessing makes a huge speed difference in this part of the code, but if it's going to be this finicky, I might not even continue with it.
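If I do keep it, one option would be to swap the hand-rolled Process management for a fixed-size multiprocessing.Pool, so only a handful of workers (and their open files) exist at any one time. A rough sketch with placeholder names, not my actual process_link:

from multiprocessing import Pool

def process_link(url):
    pass # placeholder: download/parse one link and append the result to the output file

if __name__ == '__main__':
    urls = ['https://i.imgur.com/example.gif'] # placeholder input
    with Pool(processes=4) as pool: # at most 4 worker processes alive at once
        pool.map(process_link, urls) # blocks until every url has been handled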