a better way to code this to fix a URL? - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: General (https://python-forum.io/forum-1.html) +--- Forum: News and Discussions (https://python-forum.io/forum-31.html) +--- Thread: a better way to code this to fix a URL? (/thread-41788.html) Pages:
1
2
|
a better way to code this to fix a URL? - Skaperen - Mar-19-2024 i have this silly code meant to fix a URL that is missing as many as 7 start characters: if url.startswith('//'): url = 'https:' + url if url.startswith('://'): url = 'https' + url if url.startswith('s://'): url = 'http' + url if url.startswith('ps://'): url = 'htt' + url if url.startswith('tps://'): url = 'ht' + url if url.startswith('ttps://'): url = 'h' + urlis there a nice clean and pythonic way to refactor this code into 3 lines or fewer? i hope you got a good laugh out of that code above. RE: a better way to code this to fix a URL? - perfringo - Mar-19-2024 It seems that url will always contain // so why not just split and join and not bother whether is missing something or not: >>> urls = ["//example.com", "://example.com", "s://example.com", "https://example.com"] >>> for url in urls: ... print("".join(["https://", url.split("//")[-1]])) ... https://example.com https://example.com https://example.com https://example.com RE: a better way to code this to fix a URL? - DeaD_EyE - Mar-19-2024 If the url should always start with https even, if only a hostname is given, you could use this:from urllib.parse import urlsplit, urlunsplit def conv2(url): return urlunsplit(("https", *urlsplit(url)[1:]))But I would prefer an error message, that this scheme is not supported or that the url is mistyped. Automatic correction of human input, leads to funny outcomes. And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date. RE: a better way to code this to fix a URL? - Skaperen - Mar-22-2024 if characters are not missing, i want them to be correct. correctness will be tested later. this code needs to just pass provided characters as is so they can be tested. later code will test for "http://" and file:///" and whatever else i might want to look for. it's "https://" that i am focusing on. the typical errors come from copy and paste sloppiness. RE: a better way to code this to fix a URL? - Skaperen - Mar-22-2024 (Mar-19-2024, 09:15 AM)DeaD_EyE Wrote: And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.so that must be why i have so much difficulty with decorators RE: a better way to code this to fix a URL? - Gribouillis - Mar-22-2024 You could try this perhaps import re url = re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url) RE: a better way to code this to fix a URL? - Skaperen - Mar-25-2024 (Mar-22-2024, 04:58 PM)Gribouillis Wrote: re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)i don't understand it, but it seems to work. RE: a better way to code this to fix a URL? - Gribouillis - Mar-25-2024 (Mar-25-2024, 03:26 AM)Skaperen Wrote: i don't understand itIt is easy to understand. With a regular expression, if you want to write « spam optionally preceded by eggs », you write r'(eggs)?spam'This matches 'spam' and also 'eggsspam'. The problem is that (eggs) is a capturing group, which you don't need. So you can replace it by a non capturing group by using the (?: syntax. It givesr'(?:eggs)?spam'Now, instead of « spam optionally preceded by eggs », the above regular expression means
RE: a better way to code this to fix a URL? - Skaperen - Mar-25-2024 now, i'm totally lost. i just want to keep it simple and obvious or well explained in comments. RE: a better way to code this to fix a URL? - Gribouillis - Mar-26-2024 (Mar-25-2024, 11:59 PM)Skaperen Wrote: i just want to keep it simple and obvious or well explained in comments.You can simplify it a bit because you don't need to match the 'h'. Here it is with a clear comment import re # replace any of //, ://, s://, ps://, tps://, ttps:// by https:// at beginning of url url = re.sub(r'^(?:(?:(?:(?:t?t)?p)?s)?\:)?//', 'https://', url) |