Recently I wanted to make a Python program which can crawl a website. I want to join the two components which should give the following output using urllib.parse.urljoin
https://test.com/endpoint + test.php = https://test.com/endpoint/test.php
My code:
urllib.parse.urljoin('https://test.com/endpoint','test.php')
However, it is showing the following output:
https://test.com/test.php
Is there any way which can help me to get my desired output?
The purpose of urljoin is to replace the last part of the path in the base URL. If that's not what you want, probably use a different function. Regular string joining would work well here, perhaps with a provision for normalizing slashes.
def joinurl(baseurl, path):
return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])
This is rather similar to os.path.join; maybe consider using that instead. (Of course, on Windows, where the system path separator is not a slash, it will do the wrong thing for URLs.)
That' because urllib.parse.urljoin is not made for this use case.
Example from the docs (https://docs.python.org/fr/3/library/urllib.parse.html#module-urllib.parse):
from urllib.parse import urljoin
new_url = urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
print(new_url)
Output:
http://www.cwi.nl/%7Eguido/FAQ.html
As written in the doc, urllib.parse.urljoin constructs
a full ("absolute") URL by combining a "base URL" (base) with another URL (url).
In your example, you give "https://test.com/endpoint" as first parameter, so urllib.parse.urljoin will consider that the "base url" is "https://test.com/", and it will add what you pass as a second parameter (that is "test.php"), that's why your output is "https://test.com/test.php".
I think that you best option is to use the joinurl function posted by @tripleee, because it will not produce results like "endpoint//test.php" or "endpointtest.php".
But you should not use os.path.join if your code has to be cross platform. On Windows, you will get a backslash instead of a slash ("https://test.com/endpoint\test.php").
Here is a code sample for testing purposes:
def joinurl(baseurl, path):
return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])
url_base = "https://test.com/endpoint"
web_page_name = "/test.php"
desired_output = "https://test.com/endpoint/test.php"
assert(joinurl("https://test.com/endpoint", "test.php") == desired_output)
assert(joinurl("https://test.com/endpoint/", "test.php") == desired_output)
assert(joinurl("https://test.com/endpoint", "/test.php") == desired_output)
assert(joinurl("https://test.com/endpoint/", "/test.php") == desired_output)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With