Problem while joining two URL components with urllib [duplicate]

Tags:

python

urllib

I am writing a Python program to crawl a website. I want to join two URL components using urllib.parse.urljoin so that I get the following result:

https://test.com/endpoint + test.php =  https://test.com/endpoint/test.php

My code:

import urllib.parse
urllib.parse.urljoin('https://test.com/endpoint', 'test.php')

However, it is showing the following output:

https://test.com/test.php

Is there a way to get my desired output?

Faiyaz Ahmad asked May 19 '26 19:05


2 Answers

urljoin resolves its second argument as a relative reference against the base URL, so when the base does not end in a slash, the last segment of its path is replaced. If that is not what you want, use a different function. Plain string joining works well here, with some care to normalize the slashes.

def joinurl(baseurl, path):
    return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])

This is rather similar to os.path.join, which you could consider using instead. (Note, however, that on Windows, where the system path separator is a backslash rather than a slash, it will do the wrong thing for URLs.)

tripleee answered May 22 '26 07:05


That's because urllib.parse.urljoin is not made for this use case.

Example from the docs (https://docs.python.org/fr/3/library/urllib.parse.html#module-urllib.parse):

from urllib.parse import urljoin

new_url = urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
print(new_url)

Output:

http://www.cwi.nl/%7Eguido/FAQ.html

As written in the doc, urllib.parse.urljoin constructs

a full ("absolute") URL by combining a "base URL" (base) with another URL (url).

In your example, the first argument is "https://test.com/endpoint", which has no trailing slash. urllib.parse.urljoin therefore treats "endpoint" as the last segment of the base path, the same way it treats "Python.html" in the docs example, and replaces it with the second argument ("test.php"). That is why your output is "https://test.com/test.php".
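To make the role of the trailing slash concrete, here is a small sketch that calls urljoin with a few variations of the URLs from the question. The join_under helper is hypothetical (it does not appear in either answer or in urllib); it only shows one way to make urljoin append instead of replace.

```python
from urllib.parse import urljoin

# No trailing slash: "endpoint" is the last path segment and gets replaced.
without_slash = urljoin("https://test.com/endpoint", "test.php")

# Trailing slash: "endpoint/" is treated as a directory, so the name is appended.
with_slash = urljoin("https://test.com/endpoint/", "test.php")

# A leading slash on the second argument makes it an absolute path,
# which replaces the whole base path even when the base ends in a slash.
absolute_path = urljoin("https://test.com/endpoint/", "/test.php")


def join_under(baseurl, path):
    """Hypothetical helper: normalize the seam so urljoin appends the path."""
    return urljoin(baseurl.rstrip("/") + "/", path.lstrip("/"))


print(without_slash)   # https://test.com/test.php
print(with_slash)      # https://test.com/endpoint/test.php
print(absolute_path)   # https://test.com/test.php
print(join_under("https://test.com/endpoint", "/test.php"))
# https://test.com/endpoint/test.php
```

Unlike plain string joining, join_under still lets urljoin interpret the second argument, so a value such as "../other.php" or a full URL would be resolved rather than appended verbatim. Whether that is desirable depends on where the paths come from.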

I think your best option is to use the joinurl function posted by @tripleee, because it will not produce results like "endpoint//test.php" or "endpointtest.php".

However, you should not use os.path.join if your code has to be cross-platform. On Windows, you will get a backslash instead of a slash ("https://test.com/endpoint\test.php").
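The platform difference can be reproduced on any machine, because the standard library exposes both flavours of os.path as separate modules: ntpath (Windows rules) and posixpath (POSIX rules). The following is only an illustrative sketch using the URLs from the question; the variable names are made up for the example.

```python
import ntpath     # os.path as it behaves on Windows
import posixpath  # os.path as it behaves on Linux and macOS

base = "https://test.com/endpoint"

# Windows rules insert a backslash between the two parts.
windows_style = ntpath.join(base, "test.php")

# POSIX rules insert a forward slash, which happens to suit URLs.
posix_style = posixpath.join(base, "test.php")

# Even with POSIX rules, a leading slash on the second part discards the base.
posix_leading_slash = posixpath.join(base, "/test.php")

print(windows_style)
print(posix_style)
print(posix_leading_slash)
```

The last case is one reason a purpose-built function that strips slashes at the seam tends to be safer for URLs than any of the path-joining functions.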

Here is a code sample for testing purposes:

def joinurl(baseurl, path):
    # Strip slashes at the seam so exactly one '/' separates the two parts.
    return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])

desired_output = "https://test.com/endpoint/test.php"

# Every combination of trailing and leading slash yields the same URL.
assert joinurl("https://test.com/endpoint", "test.php") == desired_output
assert joinurl("https://test.com/endpoint/", "test.php") == desired_output
assert joinurl("https://test.com/endpoint", "/test.php") == desired_output
assert joinurl("https://test.com/endpoint/", "/test.php") == desired_output

Rivers answered May 22 '26 07:05


