Dec-27-2019, 04:50 PM
(This post was last modified: Dec-27-2019, 05:10 PM by DreamingInsanity.)
(Dec-27-2019, 04:36 PM)micseydel Wrote: I expect you to know by now, it's incredibly difficult for us to help with code when we're provided only 10 lines of who knows how much. What you should do is make a copy of your code and simplify it until there's nothing that can be removed. Any user input should be hard-coded to reproduce the problem, and any line of code that can be removed without preventing the problem being reproduced should be removed. Once you've gotten to that point, you'll have code that we can actually look at and start trying to figure out what's wrong. The fewer lines of code you're able to reduce to, the more likely you will be to get a satisfactory response.

That's fair enough. I was trying to keep the code to a minimum.
I've removed unnecessary code and added a few hardcoded values.
Here's the code - I believe the third-party libraries you need are requests, psaw and lxml; everything else (urllib, gzip, json, multiprocessing) comes with Python.
import gzip
import json
import ssl

import requests
from contextlib import closing
from multiprocessing import Process
from urllib.request import urlopen, Request
from urllib.error import HTTPError

from lxml import etree
from psaw import PushshiftAPI

OPTIONS = {'max_proc': 10}               # hardcoded stand-in for my real config
CONTEXT = ssl.create_default_context()   # SSL context passed to urlopen

class Links():
    def start_sorting_links():
        processes = []
        index = 0
        api = PushshiftAPI()
        SUBMISSIONS = api.search_submissions(subreddit='wellthatsucks', filter=['url', 'over_18'], limit=500)
        urls = [subs.d_['url'] for subs in SUBMISSIONS]  # get all the urls and put them in a list
        while len(processes) < len(urls):  # keep going until every link has a process
            # only start a new process if fewer than max_proc are currently alive
            if len(processes) - len([p for p in processes if not p.is_alive()]) < OPTIONS['max_proc']:
                p = Process(target=Links.process_link, args=(urls[index],))  # create a new process
                processes.append(p)  # add it to the list
                p.start()
                index += 1
        for p in processes:
            p.join()  # don't continue the main script until all processes have finished

    def process_link(url):
        with closing(gzip.GzipFile('/links.data.gz', 'a')) as lnk:  # close the file once we're done with it
            if '/imgur.com/' in url:  # imgur plain
                tree = Links.parse_link(url)
                Links.Imgur.imgur_plain(tree, lnk)
            elif '/i.imgur.com/' in url:  # imgur direct
                Links.Imgur.imgur_direct(url, lnk)
            elif '/gfycat.com/' in url:  # gfycat plain
                tree = Links.parse_link(url)
                Links.Gfycat.gfycat_plain(tree, lnk)
            elif '/thumbs.gfycat.com/' in url:  # gfycat direct
                Links.Gfycat.gfycat_direct(url, lnk)
            elif '/i.redd.it/' in url:  # reddit image / direct
                Links.Reddit.i_reddit(url, lnk)
            elif '/v.redd.it' in url:  # reddit video / plain
                Links.Reddit.v_reddit(url, lnk)
            elif '/giphy.com/' in url:  # giphy plain
                tree = Links.parse_link(url)
                Links.Giphy.giphy_plain(tree, lnk)
            elif '/media.giphy.com/' in url:  # giphy partial direct
                Links.Giphy.giphy_part_direct(url, lnk)
            elif '/i.giphy.com/' in url:  # giphy full direct
                Links.Giphy.giphy_full_direct(url, lnk)
            else:
                pass  # clearly not anything we want here!
            lnk.flush()  # flush the buffer

    class Imgur():
        def imgur_plain(tree, file):
            l = tree.xpath('/html/head/link[12]')
            try:
                direct_link = [i.attrib['href'] for i in l][0]
            except Exception:
                return
            n = tree.xpath('/html/body/div[7]/p[2]')
            nsfw = [i.text for i in n]  # (nsfw flag, currently unused)
            file.write('{0}\n'.format(str(direct_link)).encode())

        def imgur_direct(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

    class Gfycat():
        def gfycat_plain(tree, file):
            l = tree.xpath('/html/head/meta[51]')
            direct_link = [i.attrib['content'] for i in l][0]  # almost impossible to tell whether a gif is nsfw on gfycat
            file.write('{0}\n'.format(str(direct_link)).encode())  # just gonna risk it

        def gfycat_direct(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())  # just gonna risk it

    class Reddit():  # ### NSFW
        def i_reddit(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

        def v_reddit(url, file):
            # v.redd.it is video and redirects back to the normal reddit page, so I need
            # the direct link; appending .json to the redirected url gives json data about the video
            full_url_jsonified = str(requests.get(url).url) + '.json'
            try:
                req = urlopen(full_url_jsonified, context=CONTEXT)
            except HTTPError:
                return
            data = json.load(req)  # load the response as json
            direct_link = data[0]['data']['children'][0]['data']['secure_media']['reddit_video']['fallback_url']  # the fallback url is just a direct url to the video
            file.write('{0}\n'.format(str(direct_link)).encode())

    class Giphy():
        def giphy_plain(tree, file):
            l = tree.xpath('/html/head/meta[19]')
            direct_link = [i.attrib['content'] for i in l][0].replace('media', 'i', 1)  # replacing the first 'media' with 'i' gives the direct link
            # very few giphy gifs have a pg rating / nsfw tag
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_part_direct(url, file):
            suffix = url.split('.')[-1]
            direct_link = url.replace('media', 'i', 1)  # changing one of the 'media's makes it a direct link
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_full_direct(url, file):  # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

    def parse_link(url):
        htmlparser = etree.HTMLParser()  # create the parser
        # the User-Agent header makes the request look like a browser; otherwise some
        # sites can detect that python is accessing the page and block the connection
        with closing(urlopen(Request(url, headers={'User-Agent': 'Mozilla/3.0'}), context=CONTEXT)) as response:
            return etree.parse(response, htmlparser)  # build an element tree from the page
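For what it's worth, I think the throttling loop in start_sorting_links could also be done with multiprocessing.Pool, which caps the number of live workers for you and cleans them up when the with-block exits. This is just a sketch under the same assumptions (the Links.process_link and OPTIONS above), not something I've tested as a drop-in replacement:

def start_sorting_links_pooled(urls):
    from multiprocessing import Pool
    # Pool keeps at most max_proc workers alive at once, so the busy-wait
    # loop over is_alive() isn't needed
    with Pool(processes=OPTIONS['max_proc']) as pool:
        pool.map(Links.process_link, urls)  # blocks until every url is processed
    # leaving the with-block terminates the workers, so none are left hanging

On Windows this (like the Process version) would need to be called from under an if __name__ == '__main__': guard.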
(Dec-27-2019, 04:46 PM)ibreeden Wrote: A Python error message usually also tells on which line the error occurs. Can you share that information?

I'm actually struggling to reproduce the error.
I've run it twice, and the error hasn't occurred. I'm wondering if I was getting the error because there were too many processes left over from other times I have run the program.
Multiprocessing makes a huge speed difference in this part of the code, but if it's going to be this finicky, I might not even continue with it.
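If leftover processes do turn out to be the culprit, one thing I could check at the end of a run is something like the sketch below. Note that active_children() only sees children of the current interpreter, so orphans from earlier runs would have to be hunted down with OS tools like ps instead:

import multiprocessing

def report_stragglers():
    # active_children() also has the side effect of joining any children
    # that have already finished
    for p in multiprocessing.active_children():
        print('still running: pid={0}, name={1}'.format(p.pid, p.name))
        p.join(timeout=5)  # give each one a few seconds to wrap up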