I have this sample vector containing URLs. My goal is to obtain the path of the URL.
sample1 <- c("http://tercihblog.com/indirisu/docugard/", "http://funerariagomez.com/js/ggogle/a201209e3f79b740337b7bdb521630fe/", 
      "http://www.t-online.de/contacts/2015/08/atlas.html/", "http://mgracetimber.ie/wp-content/themes/Banner/db/box/", 
      "http://zamartrade.com/cs/DHL/DHL%20_%20Tracking.htm/", "http://dunhamengineering.com/menu/Auto-loadgoogleDrive/Document.Index/", 
      "http://www.indiegogo.com/guide/forum/2014/09/forgot-password/", 
      "http://raetc.com/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/", 
      "http://www.lidanhang.com/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&hwjklxlamp;ssl=0&dest/", 
      "http://www.sudaener.com/wp-includes/js/crop/dropbox/", "https://zeustracker.abuse.ch/blocklist.php/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=hostsdeny/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=iptablesblocklist/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=snort/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=squiddomain/"
    )
My initial try was this:
gsub('http://[^/]+/','/',sample1)
However this won't work with URLs that have https://. A suitable solution would be to drop everything before the third occurrence of"/". I was wondering how to use regexto do this and also if there is a way to do it using substring.
Thanks
It is really advisable to go with gsub here since the code is cleaner and more straightforward. 
If you want to remove all before the 3rd /, use
> gsub('^(?:[^/]*/){3}','/',sample1)
 [1] "/indirisu/docugard/"                                                                              
 [2] "/js/ggogle/a201209e3f79b740337b7bdb521630fe/"                                                     
 [3] "/contacts/2015/08/atlas.html/"                                                                    
 [4] "/wp-content/themes/Banner/db/box/"                                                                
 [5] "/cs/DHL/DHL%20_%20Tracking.htm/"                                                                  
 [6] "/menu/Auto-loadgoogleDrive/Document.Index/"                                                       
 [7] "/guide/forum/2014/09/forgot-password/"                                                            
 [8] "/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/"                                      
 [9] "/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&hwjklxlamp;ssl=0&dest/"
[10] "/wp-includes/js/crop/dropbox/"                                                                    
[11] "/blocklist.php/"                                                                                  
[12] "/blocklist.php?download=hostsdeny/"                                                               
[13] "/blocklist.php?download=iptablesblocklist/"                                                       
[14] "/blocklist.php?download=snort/"                                                                   
[15] "/blocklist.php?download=squiddomain/"   
The ^(?:[^/]*/){3} matches:
^ - start of string(?:[^/]*/){3} - exactly 3 occurrences of:
[^/]* - zero or more characters other than /
/ - a literal / character.Cath suggests a more precise your regex fix, but perhaps, you'd like to add ^ at the start to only match at the beginning of the string:
gsub('^https?://[^/]+/','/',sample1)
      ^     ^
The ? (greedy) quantifier means one or zero occurrences, thus making the s after http optional. It is identical to (but more efficient than) gsub('^(https|http)://[^/]+/','/',sample1).
You may also want to make your regex case-insensitive, add ignore.case = TRUE.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With