I'm using the following queries to detect duplicates in a database.
Using a LINQ join doesn't work very well because Company X may also be listed as CompanyX, therefore I'd like to amend this to detect "near duplicates".
var results = result
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
Could anybody suggest a way in which I have better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)
My main goals are:
For your explicit "ignoring spaces" case, you can simply call
var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
public bool Equals(Company x, Company y)
{
return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
}
public int GetHashCode(Company obj)
{
return obj.Name.Replace(" ", "").GetHashCode();
}
}
which you could use as
var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.
Just note that your defintion of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With