I have a list that contains a set of URLs similar to
These are stored in a list and then displayed in a ListView after a crawling process. I tried different regex patterns but still couldn't achieve exactly what I need, because the query string became a problem.
Here is one of the patterns I tried:
(http://?)(w*)(\.*)(\w*)(\.)(\w*)
Let me show how I need the above URLs to be filtered.
As you can see, pages that are the same but have different query strings have been removed. This is what I want to achieve. Please note that the links above do contain http://, but I left it out because Stack Overflow flags them as spam. Could anyone kindly help me out with this? Thanks in advance.
Instead of parsing the URLs manually, you can use the Uri class and HttpUtility.ParseQueryString to do the parsing. Here's an example that uses the LINQ .GroupBy method to collect similar URLs into groups and then selects the first URL from each group.
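    // Requires: using System; using System.Linq; using System.Web;
    // Assumes urls is an IEnumerable<string> of absolute URLs.
    var distinctUrls = urls.GroupBy(u =>
        {
            var uri = new Uri(u);
            var query = HttpUtility.ParseQueryString(uri.Query);
            // Scheme, host and path identify the page, ignoring the query string.
            var baseUri = uri.Scheme + "://" + uri.Host + uri.AbsolutePath;
            return new {
                Uri = baseUri,
                // Sorted parameter names only, so values and ordering do not matter.
                QueryStringKeys = string.Join("&", query.AllKeys.OrderBy(ak => ak))
            };
        })
        .Select(g => g.First())   // keep the first URL of each group
        .ToList();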
Sample Output of distinctUrls:
http://somesite.com/index.php?id=12
http://example.com/view.php?image=441
http://somesite.com/page.php?id=1
http://example.com/view.php?ivideo=4
This also correctly handles the case where two URLs have an identical set of query string parameters in a different order, such as example.com/view.php?image=441&order=asc and example.com/view.php?order=desc&image=441, and treats them as similar. Note that only the parameter names are compared, not their values.
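To see why those two URLs land in the same group, here is a minimal sketch that builds the grouping key as a plain string and compares it for both. The class name UrlKeyDemo and the helper GroupingKey are made up for illustration and are not part of the snippet above. It also assumes HttpUtility is available (a reference to System.Web on .NET Framework, or a recent .NET where it ships in the box).

```csharp
using System;
using System.Linq;
using System.Web;

public static class UrlKeyDemo
{
    // Hypothetical helper: flattens the same information used as the
    // GroupBy key (base URL plus sorted parameter names) into one string.
    public static string GroupingKey(string url)
    {
        var uri = new Uri(url);
        var query = HttpUtility.ParseQueryString(uri.Query);
        // Sort the parameter names so their order in the URL is irrelevant.
        var keys = string.Join("&", query.AllKeys.OrderBy(k => k, StringComparer.Ordinal));
        return uri.Scheme + "://" + uri.Host + uri.AbsolutePath + "?" + keys;
    }

    public static void Main()
    {
        var a = GroupingKey("http://example.com/view.php?image=441&order=asc");
        var b = GroupingKey("http://example.com/view.php?order=desc&image=441");
        Console.WriteLine(a);       // http://example.com/view.php?image&order
        Console.WriteLine(a == b);  // True
    }
}
```

Since the parameter values never enter the key, two URLs that differ only in values (for example a different id on the same page) collapse into one group as well. If you need to keep those apart, you would have to include the values in the key.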