Strip URL in Python

I'm quite new to Python. I'm trying to parse a file of URLs and keep only a specific part of each URL (the part shown in bold).

Here are some examples of the URLs I am working with:

http://www.mega.pk/**washingmachine**-dawlance/
http://www.mega.pk/**washingmachine**-haier/
http://www.mega.pk/**airconditioners**-acson/
http://www.mega.pk/**airconditioners**-lg/
http://www.mega.pk/**airconditioners**-samsung/

I have tried some regular expressions, but they get very complicated. What I have in mind is to remove "http://www.mega.pk/" from all the URLs, since it is common to them, and then remove everything from the "-" onward, including all slashes. But I don't know how to do that.

Mansoor Akram asked Oct 19 '25 13:10



1 Answer

Use the urllib.parse module (called urlparse in Python 2). It's built for exactly this purpose.

from urllib.parse import urlparse

url = "http://www.mega.pk/washingmachine-dawlance/"

path = urlparse(url).path  # get the path from the URL ("/washingmachine-dawlance/")
path = path[:path.index("-")]  # keep only what comes before the first '-'
path = path[1:]  # remove the leading '/' (just before 'washingmachine')

The path variable will then hold the value washingmachine.
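
Since the question is about a whole list of URLs, the same steps could be wrapped in a small function. This is only a sketch: the name category_from_url and the third URL (one without a hyphen) are made up for illustration. It uses str.partition instead of index, so a URL with no '-' returns its whole first path segment rather than raising ValueError.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the first path segment of url, cut at its first '-'."""
    path = urlparse(url).path                # e.g. "/washingmachine-dawlance/"
    segment = path.strip("/").split("/")[0]  # first path segment, no slashes
    # partition returns the whole string as the head when '-' is absent
    category, _, _ = segment.partition("-")
    return category


urls = [
    "http://www.mega.pk/washingmachine-dawlance/",
    "http://www.mega.pk/airconditioners-lg/",
    "http://www.mega.pk/laptops/",  # hypothetical URL with no '-'
]
categories = [category_from_url(u) for u in urls]
print(categories)  # ['washingmachine', 'airconditioners', 'laptops']
```

When the URLs come from a file, the same function can be called on each line after stripping the trailing newline with line.strip().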

Cheers!

narendranathjoshi answered Oct 22 '25 08:10
