Dec-27-2019, 04:50 PM
(This post was last modified: Dec-27-2019, 05:10 PM by DreamingInsanity.)
(Dec-27-2019, 04:36 PM)micseydel Wrote: I expect you to know by now, it's incredibly difficult for us to help with code when we're provided only 10 lines of who knows how much. What you should do is make a copy of your code and simplify it until there's nothing that can be removed. Any user input should be hard-coded to reproduce the problem, and any line of code that can be removed without preventing the problem being reproduced should be removed. Once you've gotten to that point, you'll have code that we can actually look at and start trying to figure out what's wrong. The fewer lines of code you're able to reduce to, the more likely you will be to get a satisfactory response.

That's fair enough. I was trying to keep the code to a minimum.
I've removed unnecessary code and added a few hardcoded values.
Here's the code - I believe the third-party libraries you need are requests, psaw and lxml; everything else (urllib, gzip, json, multiprocessing) comes with Python.
import gzip
import json
import ssl

import requests
from contextlib import closing
from multiprocessing import Process
from urllib.request import urlopen, Request
from urllib.error import HTTPError

from lxml import etree
from psaw import PushshiftAPI

OPTIONS = {'max_proc': 10}               # hardcoded stand-in for my real config
CONTEXT = ssl.create_default_context()   # SSL context passed to urlopen

class Links():
    def start_sorting_links():
        processes = []
        index = 0
        api = PushshiftAPI()
        SUBMISSIONS = api.search_submissions(subreddit='wellthatsucks', filter=['url', 'over_18'], limit=500)
        urls = [subs.d_['url'] for subs in SUBMISSIONS]  # get all the urls and put them in a list
        while len(processes) < len(urls):  # keep going until every link has a process
            # only start a new process if fewer than max_proc are currently alive
            if len(processes) - len([p for p in processes if not p.is_alive()]) < OPTIONS['max_proc']:
                p = Process(target=Links.process_link, args=(urls[index],))  # create a new process
                processes.append(p)  # add it to the list
                p.start()
                index += 1
        for p in processes:
            p.join()  # don't continue the main script until all processes have finished

    def process_link(url):
        with closing(gzip.GzipFile('/links.data.gz', 'a')) as lnk:  # close the file once we're done with it
            if '/imgur.com/' in url:  # imgur plain
                tree = Links.parse_link(url)
                Links.Imgur.imgur_plain(tree, lnk)
            elif '/i.imgur.com/' in url:  # imgur direct
                Links.Imgur.imgur_direct(url, lnk)
            elif '/gfycat.com/' in url:  # gfycat plain
                tree = Links.parse_link(url)
                Links.Gfycat.gfycat_plain(tree, lnk)
            elif '/thumbs.gfycat.com/' in url:  # gfycat direct
                Links.Gfycat.gfycat_direct(url, lnk)
            elif '/i.redd.it/' in url:  # reddit image / direct
                Links.Reddit.i_reddit(url, lnk)
            elif '/v.redd.it' in url:  # reddit video / plain
                Links.Reddit.v_reddit(url, lnk)
            elif '/giphy.com/' in url:  # giphy plain
                tree = Links.parse_link(url)
                Links.Giphy.giphy_plain(tree, lnk)
            elif '/media.giphy.com/' in url:  # giphy partial direct
                Links.Giphy.giphy_part_direct(url, lnk)
            elif '/i.giphy.com/' in url:  # giphy full direct
                Links.Giphy.giphy_full_direct(url, lnk)
            else:
                pass  # clearly not anything we want here!
            lnk.flush()  # flush the buffer

    class Imgur():
        def imgur_plain(tree, file):
            l = tree.xpath('/html/head/link[12]')
            try:
                direct_link = [i.attrib['href'] for i in l][0]
            except Exception:
                return
            n = tree.xpath('/html/body/div[7]/p[2]')
            nsfw = [i.text for i in n]  # (nsfw flag, currently unused)
            file.write('{0}\n'.format(str(direct_link)).encode())

        def imgur_direct(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

    class Gfycat():
        def gfycat_plain(tree, file):
            l = tree.xpath('/html/head/meta[51]')
            direct_link = [i.attrib['content'] for i in l][0]  # almost impossible to tell whether a gif is nsfw on gfycat
            file.write('{0}\n'.format(str(direct_link)).encode())  # just gonna risk it

        def gfycat_direct(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())  # just gonna risk it

    class Reddit():  # ### NSFW
        def i_reddit(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

        def v_reddit(url, file):
            # v.redd.it is video and redirects back to the normal reddit page, so I need
            # the direct link; appending .json to the redirected url gives json data about the video
            full_url_jsonified = str(requests.get(url).url) + '.json'
            try:
                req = urlopen(full_url_jsonified, context=CONTEXT)
            except HTTPError:
                return
            data = json.load(req)  # load the response as json
            direct_link = data[0]['data']['children'][0]['data']['secure_media']['reddit_video']['fallback_url']  # the fallback url is just a direct url to the video
            file.write('{0}\n'.format(str(direct_link)).encode())

    class Giphy():
        def giphy_plain(tree, file):
            l = tree.xpath('/html/head/meta[19]')
            direct_link = [i.attrib['content'] for i in l][0].replace('media', 'i', 1)  # replacing the first 'media' with 'i' gives the direct link
            # very few giphy gifs have a pg rating / nsfw tag
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_part_direct(url, file):
            suffix = url.split('.')[-1]
            direct_link = url.replace('media', 'i', 1)  # changing one of the 'media's makes it a direct link
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_full_direct(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

    def parse_link(url):
        htmlparser = etree.HTMLParser()  # create the parser
        # the User-Agent header makes the request look like a browser; otherwise some
        # sites can detect that python is accessing the page and block the connection
        with closing(urlopen(Request(url, headers={'User-Agent': 'Mozilla/3.0'}), context=CONTEXT)) as response:
            return etree.parse(response, htmlparser)  # build an element tree from the page
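For what it's worth, I think the throttling loop in start_sorting_links could also be done with multiprocessing.Pool, which caps the number of live workers for you and cleans them up when the with-block exits. This is just a sketch under the same assumptions (the Links.process_link and OPTIONS above), not something I've tested as a drop-in replacement:

def start_sorting_links_pooled(urls):
    from multiprocessing import Pool
    # Pool keeps at most max_proc workers alive at once, so the busy-wait
    # loop over is_alive() isn't needed
    with Pool(processes=OPTIONS['max_proc']) as pool:
        pool.map(Links.process_link, urls)  # blocks until every url is processed
    # leaving the with-block terminates the workers, so none are left hanging

On Windows this (like the Process version) would need to be called from under an if __name__ == '__main__': guard.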
(Dec-27-2019, 04:46 PM)ibreeden Wrote: A Python error message usually also tells on which line the error occurs. Can you share that information?

I'm actually struggling to reproduce the error.
I've run it twice, and the error hasn't occurred. I'm wondering if I was getting the error because there were too many processes left over from other times I have run the program.
Multiprocessing makes a huge speed difference in this part of the code, but if it's going to be this finicky, I might not even continue with it.
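If leftover processes do turn out to be the culprit, one thing I could check at the end of a run is something like the sketch below. Note that active_children() only sees children of the current interpreter, so orphans from earlier runs would have to be hunted down with OS tools like ps instead:

import multiprocessing

def report_stragglers():
    # active_children() also has the side effect of joining any children
    # that have already finished
    for p in multiprocessing.active_children():
        print('still running: pid={0}, name={1}'.format(p.pid, p.name))
        p.join(timeout=5)  # give each one a few seconds to wrap up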