Python Forum
a better way to code this to fix a URL? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: General (https://python-forum.io/forum-1.html)
+--- Forum: News and Discussions (https://python-forum.io/forum-31.html)
+--- Thread: a better way to code this to fix a URL? (/thread-41788.html)

Pages: 1 2


a better way to code this to fix a URL? - Skaperen - Mar-19-2024

i have this silly code meant to fix a URL that is missing as many as 7 start characters:
if url.startswith('//'): url = 'https:' + url
if url.startswith('://'): url = 'https' + url
if url.startswith('s://'): url = 'http' + url
if url.startswith('ps://'): url = 'htt' + url
if url.startswith('tps://'): url = 'ht' + url
if url.startswith('ttps://'): url = 'h' + url
is there a nice clean and pythonic way to refactor this code into 3 lines or fewer?

i hope you got a good laugh out of that code above.


RE: a better way to code this to fix a URL? - perfringo - Mar-19-2024

It seems that url will always contain // so why not just split and join and not bother whether is missing something or not:

>>> urls = ["//example.com", "://example.com", "s://example.com", "https://example.com"]
>>> for url in urls:
...     print("".join(["https://", url.split("//")[-1]]))
...
https://example.com
https://example.com
https://example.com
https://example.com



RE: a better way to code this to fix a URL? - DeaD_EyE - Mar-19-2024

If the url should always start with https even, if only a hostname is given, you could use this:

from urllib.parse import urlsplit, urlunsplit


def conv2(url):
    return urlunsplit(("https", *urlsplit(url)[1:]))
But I would prefer an error message, that this scheme is not supported or that the url is mistyped.
Automatic correction of human input, leads to funny outcomes.

And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.


RE: a better way to code this to fix a URL? - Skaperen - Mar-22-2024

if characters are not missing, i want them to be correct. correctness will be tested later. this code needs to just pass provided characters as is so they can be tested. later code will test for "http://" and file:///" and whatever else i might want to look for. it's "https://" that i am focusing on. the typical errors come from copy and paste sloppiness.


RE: a better way to code this to fix a URL? - Skaperen - Mar-22-2024

(Mar-19-2024, 09:15 AM)DeaD_EyE Wrote: And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.
so that must be why i have so much difficulty with decorators Cry


RE: a better way to code this to fix a URL? - Gribouillis - Mar-22-2024

You could try this perhaps
import re

url = re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)



RE: a better way to code this to fix a URL? - Skaperen - Mar-25-2024

(Mar-22-2024, 04:58 PM)Gribouillis Wrote: re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)
i don't understand it, but it seems to work.


RE: a better way to code this to fix a URL? - Gribouillis - Mar-25-2024

(Mar-25-2024, 03:26 AM)Skaperen Wrote: i don't understand it
It is easy to understand. With a regular expression, if you want to write « spam optionally preceded by eggs », you write
r'(eggs)?spam'
This matches 'spam' and also 'eggsspam'. The problem is that (eggs) is a capturing group, which you don't need. So you can replace it by a non capturing group by using the (?: syntax. It gives
r'(?:eggs)?spam'
Now, instead of « spam optionally preceded by eggs », the above regular expression means
Output:
beginning of the string then '//' optionally preceded by ( ':' optionally preceded by ( 's' optionally preceded by ( 'p' optionallly preceded by ( 't' optionally preceded by ( 't' optionally preceded by ( 'h' ))))))



RE: a better way to code this to fix a URL? - Skaperen - Mar-25-2024

now, i'm totally lost. i just want to keep it simple and obvious or well explained in comments.


RE: a better way to code this to fix a URL? - Gribouillis - Mar-26-2024

(Mar-25-2024, 11:59 PM)Skaperen Wrote: i just want to keep it simple and obvious or well explained in comments.
You can simplify it a bit because you don't need to match the 'h'. Here it is with a clear comment
import re
 
# replace any of //, ://, s://, ps://, tps://, ttps:// by https:// at beginning of url
url = re.sub(r'^(?:(?:(?:(?:t?t)?p)?s)?\:)?//', 'https://', url)