urlparse() on a Windows file scheme URI leaves extra slash at start

I'm making an application that needs to read URIs from drag-and-drop input. I'm trying to process each URI with urllib.parse.urlparse().

urlparse() deals with Internet URLs as expected:

>>> import urllib.parse
>>> urllib.parse.urlparse('https://www.google.com/advanced_search')
ParseResult(scheme='https', netloc='www.google.com', path='/advanced_search', params='', query='', fragment='')

But using it on local Windows files leaves an extra slash at the beginning of the path:

>>> urllib.parse.urlparse('file:///C:/Program%20Files/Python36/LICENSE.txt')
ParseResult(scheme='file', netloc='', path='/C:/Program%20Files/Python36/LICENSE.txt', params='', query='', fragment='')

And indeed, functions expecting a local file path don't seem to like this extra slash:

>>> from pathlib import Path
>>> Path('/C:/Program%20Files/Python36/LICENSE.txt').exists()
Traceback (most recent call last):
...
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\\C:\\Program%20Files\\Python36\\LICENSE.txt'

I could write a special case of my own for file:///<Windows drive letter>:, but as a matter of cleanliness: is there a better Python function for splitting URIs in general, not just URLs? Or is there something else I'm missing?

Using Python 3.6.1.

Asked by S. Kirby, Sep 19 '25 08:09


1 Answer

On Windows, passing the path component to urllib.request.url2pathname() strips the leading slash, converts forward slashes to backslashes, and decodes percent-escapes such as %20.

>>> import urllib.parse
>>> import urllib.request
>>> path = urllib.parse.urlparse('file:///C:/Program%20Files/Python36/LICENSE.txt').path
>>> path
'/C:/Program%20Files/Python36/LICENSE.txt'
>>> urllib.request.url2pathname(path)
'C:\\Program Files\\Python36\\LICENSE.txt'

So my URI-processing application should use url2pathname() on the path if the urlparse() result's scheme is file.
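
A minimal sketch of that rule, assuming all that's needed is a pathlib.Path for file URIs and nothing for other schemes. The helper name uri_to_local_path is hypothetical (not part of urllib), and it ignores netloc entirely:

```python
import urllib.parse
import urllib.request
from pathlib import Path


def uri_to_local_path(uri):
    """Return a Path for a file: URI, or None for any other scheme.

    Hypothetical helper for illustration; not part of urllib.
    """
    parsed = urllib.parse.urlparse(uri)
    if parsed.scheme != 'file':
        return None
    # url2pathname() decodes percent-escapes; on Windows it also drops the
    # slash in front of the drive letter and switches to backslashes.
    return Path(urllib.request.url2pathname(parsed.path))
```

Returning None for non-file schemes lets the caller decide what to do with ordinary Internet URLs from the same drag-and-drop input.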


Thanks to @eryksun for pointing out in a comment that pip uses url2pathname(). pip also shows how to generalize the code to handle Windows UNC paths, which are used for things like Windows shared network folders. It seems that a UNC path can be detected when the scheme is 'file' and netloc is non-empty; in that case, prepend two backslashes and the netloc to the path before passing it to url2pathname().

>>> parse_result = urllib.parse.urlparse('file://some-host/Shared Travel Photos/20170312_112803.jpg')
>>> parse_result
ParseResult(scheme='file', netloc='some-host', path='/Shared Travel Photos/20170312_112803.jpg', params='', query='', fragment='')
>>> urllib.request.url2pathname(r'\\' + parse_result.netloc + parse_result.path)
'\\\\some-host\\Shared Travel Photos\\20170312_112803.jpg'
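
The drive-letter and UNC cases can probably be folded into one function. This is only a sketch under assumptions: file_uri_to_path is a hypothetical name, and the netloc handling mirrors the snippet above rather than pip's actual code. It is written with Windows in mind and makes no special case for a netloc such as localhost.

```python
import urllib.parse
import urllib.request


def file_uri_to_path(uri):
    """Convert a file: URI to a local path string (hypothetical helper)."""
    parsed = urllib.parse.urlparse(uri)
    if parsed.scheme != 'file':
        raise ValueError('not a file URI: %r' % (uri,))
    path = parsed.path
    if parsed.netloc:
        # Non-empty netloc: treat it as the host of a UNC path and prepend
        # two backslashes plus the host name, as in the snippet above.
        path = r'\\' + parsed.netloc + path
    return urllib.request.url2pathname(path)
```

Raising ValueError for other schemes is a design choice for this sketch; returning None would work just as well if non-file URIs are expected input.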

Answered by S. Kirby, Sep 20 '25 23:09