I'm quite new to Python. I'm trying to parse a file of URLs and keep only a specific part of each URL (the part shown in bold).
Here are some examples of the URLs I am working with:
http://www.mega.pk/**washingmachine**-dawlance/
http://www.mega.pk/**washingmachine**-haier/
http://www.mega.pk/**airconditioners**-acson/
http://www.mega.pk/**airconditioners**-lg/
http://www.mega.pk/**airconditioners**-samsung/
I have tried some regular expressions, but they get very complicated. What I have in mind is to remove "http://www.mega.pk/" from every URL, since it is common to all of them, and then remove everything from the "-" onward, including the trailing slash. But I don't know how to do that.
Use the urllib.parse module (called urlparse in Python 2). It's built specifically for this purpose.
from urllib.parse import urlparse
url = "http://www.mega.pk/washingmachine-dawlance/"
path = urlparse(url).path # get the path from the URL ("/washingmachine-dawlance/")
path = path[:path.index("-")] # keep only what comes before the '-' (the '-' itself is dropped too)
path = path[1:] # remove the '/' at the start of the path (just before 'washing')
The path variable will now have the value washingmachine.
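Since the question is about a whole file of URLs, the same steps can be wrapped in a function and applied line by line. This is only a minimal sketch: the names category_from_url and categories_from_lines are made up for illustration, and it uses str.partition instead of index so that a URL without a "-" does not raise a ValueError.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the part of the URL path that comes before the first '-'."""
    # strip("/") drops the leading and trailing slashes of the path
    path = urlparse(url).path.strip("/")
    # partition returns (before, separator, after); if there is no '-',
    # 'before' is simply the whole path
    return path.partition("-")[0]


def categories_from_lines(lines):
    """Apply category_from_url to every non-empty line (hypothetical helper)."""
    return [category_from_url(line.strip()) for line in lines if line.strip()]


urls = [
    "http://www.mega.pk/washingmachine-dawlance/",
    "http://www.mega.pk/airconditioners-lg/",
]
print(categories_from_lines(urls))
```

To read from a file, an open file object can be passed in directly, for example categories_from_lines(f) inside a with open("urls.txt") as f: block, where urls.txt is an assumed filename with one URL per line.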
Cheers!