Detecting "near duplicates" using a LINQ/C# query

Question

I'm using the following query to detect duplicates in a database.

Grouping on the exact company name doesn't work very well because "Company X" may also be listed as "CompanyX", so I'd like to amend this to detect "near duplicates".

var results = result
                .GroupBy(c => new {c.CompanyName})
                .Select(g => new CompanyGridViewModel
                    {
                        LeadId = g.First().LeadId,
                        Qty = g.Count(),
                        CompanyName = g.Key.CompanyName,
                    }).ToList();

Could anybody suggest a way to get better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)?

My main goals are:

To list the first record with a subset of all duplicates (or "near duplicates")
To have some flexibility over the fields and text comparisons I use for my duplicates.

Rawling · Accepted Answer

For your explicit "ignoring spaces" case, you can simply call

var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
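If case or other whitespace should be ignored too, the key selector can call a small normalizing helper. The following is only a sketch: the Company class is a hypothetical stand-in for the question's entity, NormalizeName is an illustrative name, and it assumes the rows are already in memory (a LINQ provider that talks to the database may not be able to translate a custom method to SQL).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's entity.
public class Company
{
    public int LeadId { get; set; }
    public string Name { get; set; }
}

public static class NormalizedKeyGrouping
{
    // Hypothetical helper: drops all whitespace and upper-cases the rest,
    // so "Company X", "CompanyX" and "company x" share one key.
    public static string NormalizeName(string name)
    {
        if (name == null) return string.Empty;
        var kept = name.Where(ch => !char.IsWhiteSpace(ch)).ToArray();
        return new string(kept).ToUpperInvariant();
    }

    public static void Main()
    {
        var companies = new List<Company>
        {
            new Company { LeadId = 1, Name = "Company X" },
            new Company { LeadId = 2, Name = "CompanyX" },
            new Company { LeadId = 3, Name = "company x" },
            new Company { LeadId = 4, Name = "Other Ltd" },
        };

        // Group on the normalized name; keep the first record and the count.
        var groups = companies
            .GroupBy(c => NormalizeName(c.Name))
            .Select(g => new { First = g.First(), Qty = g.Count() });

        foreach (var g in groups)
            Console.WriteLine($"{g.First.LeadId}: {g.First.Name} (x{g.Qty})");
        // Prints:
        // 1: Company X (x3)
        // 4: Other Ltd (x1)
    }
}
```

Because the grouping key is an ordinary string, the default string equality does the rest and no custom comparer is needed.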

However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:

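public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
    // Two companies are equal when their names match once spaces are removed.
    public bool Equals(Company x, Company y)
    {
        return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
    }

    // Hash the same normalized value that Equals compares,
    // so that equal companies always produce equal hash codes.
    public int GetHashCode(Company obj)
    {
        return obj.Name.Replace(" ", "").GetHashCode();
    }
}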

which you could use as

var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
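Put together with the projection from the question, the comparer-based grouping might look like the sketch below. Company and CompanyGridViewModel are hypothetical stand-ins with only the properties used here, FindDuplicates is an illustrative name, and the Where filter (which keeps only names that occur more than once) is an optional addition that is not in the original query. It again assumes LINQ to Objects, since a database-backed provider generally cannot translate a custom IEqualityComparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the types in the question and answer.
public class Company
{
    public int LeadId { get; set; }
    public string Name { get; set; }
}

public class CompanyGridViewModel
{
    public int LeadId { get; set; }
    public int Qty { get; set; }
    public string CompanyName { get; set; }
}

// Same idea as the comparer above, repeated so the sample compiles on its own.
// A null company or null name is treated as an empty name.
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
    private static string Key(Company c) => (c?.Name ?? "").Replace(" ", "");
    public bool Equals(Company x, Company y) => Key(x) == Key(y);
    public int GetHashCode(Company obj) => Key(obj).GetHashCode();
}

public static class ComparerGrouping
{
    public static List<CompanyGridViewModel> FindDuplicates(IEnumerable<Company> source)
    {
        return source
            .GroupBy(c => c, new CompanyNameIgnoringSpaces())
            .Where(g => g.Count() > 1)            // optional: drop names that occur once
            .Select(g => new CompanyGridViewModel
            {
                LeadId = g.First().LeadId,        // first record of the group
                Qty = g.Count(),                  // number of near duplicates
                CompanyName = g.First().Name,
            })
            .ToList();
    }

    public static void Main()
    {
        var companies = new List<Company>
        {
            new Company { LeadId = 1, Name = "Company X" },
            new Company { LeadId = 2, Name = "CompanyX" },
            new Company { LeadId = 3, Name = "Other Ltd" },
        };

        foreach (var row in FindDuplicates(companies))
            Console.WriteLine($"{row.LeadId}: {row.CompanyName} (x{row.Qty})");
        // Prints:
        // 1: Company X (x2)
    }
}
```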

It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.
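One way to build such a library without writing a new class per rule is a comparer that delegates to a key selector. This is a sketch, not part of the original answer: KeyEqualityComparer, Squash and the Postcode property are all hypothetical names chosen for illustration, and the tuple key assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity; Postcode is an invented second field for the example.
public class Company
{
    public int LeadId { get; set; }
    public string Name { get; set; }
    public string Postcode { get; set; }
}

// Hypothetical reusable comparer: two items are equal when their keys are equal.
// Assumes items are non-null if the key selector dereferences them.
public class KeyEqualityComparer<T, TKey> : IEqualityComparer<T>
{
    private readonly Func<T, TKey> _keySelector;
    private readonly IEqualityComparer<TKey> _keyComparer;

    public KeyEqualityComparer(Func<T, TKey> keySelector,
                               IEqualityComparer<TKey> keyComparer = null)
    {
        _keySelector = keySelector ?? throw new ArgumentNullException(nameof(keySelector));
        _keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;
    }

    public bool Equals(T x, T y) =>
        _keyComparer.Equals(_keySelector(x), _keySelector(y));

    public int GetHashCode(T obj) =>
        _keyComparer.GetHashCode(_keySelector(obj));
}

public static class KeyComparerDemo
{
    // Removes spaces and ignores case; null becomes an empty string.
    public static string Squash(string s) =>
        (s ?? "").Replace(" ", "").ToUpperInvariant();

    public static int GroupCount(IEnumerable<Company> source)
    {
        // Companies match only when both the name and the postcode match.
        var byNameAndPostcode = new KeyEqualityComparer<Company, (string, string)>(
            c => (Squash(c.Name), Squash(c.Postcode)));

        return source.GroupBy(c => c, byNameAndPostcode).Count();
    }

    public static void Main()
    {
        var companies = new List<Company>
        {
            new Company { LeadId = 1, Name = "Company X", Postcode = "AB1 2CD" },
            new Company { LeadId = 2, Name = "CompanyX",  Postcode = "ab12cd" },
            new Company { LeadId = 3, Name = "Company X", Postcode = "ZZ9 9ZZ" },
        };

        Console.WriteLine(GroupCount(companies)); // prints 2
    }
}
```

Since equality is defined as "same key", any comparer built this way is reflexive, symmetric and transitive by construction.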

Just note that your definition of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
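The "within 5" problem can be seen in a small sketch. WithinFive is the non-transitive comparer from the paragraph above; SameBucketOfFive is a hypothetical transitive alternative that compares a derived key (the value divided by 5) instead of a distance. Both class names are illustrative, and the constant hash code in WithinFive is there only so that Equals is always consulted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// NOT transitive: 0~5 and 5~10 hold, but 0~10 does not.
public class WithinFive : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => Math.Abs(x - y) <= 5;
    public int GetHashCode(int obj) => 0; // constant, so Equals always decides
}

// Transitive: two numbers are equal when they fall into the same bucket of five.
public class SameBucketOfFive : IEqualityComparer<int>
{
    private static int Bucket(int n) => n / 5;
    public bool Equals(int x, int y) => Bucket(x) == Bucket(y);
    public int GetHashCode(int obj) => Bucket(obj).GetHashCode();
}

public static class TransitivityDemo
{
    public static int CountGroups(IEnumerable<int> values, IEqualityComparer<int> comparer) =>
        values.GroupBy(v => v, comparer).Count();

    public static void Main()
    {
        var within = new WithinFive();
        Console.WriteLine(within.Equals(0, 5));   // True
        Console.WriteLine(within.Equals(5, 10));  // True
        Console.WriteLine(within.Equals(0, 10));  // False

        // With a non-transitive comparer the grouping depends on input order.
        Console.WriteLine(CountGroups(new[] { 0, 5, 10 }, within)); // 2
        Console.WriteLine(CountGroups(new[] { 5, 0, 10 }, within)); // 1

        // With a key-based comparer the result is the same in any order.
        var bucket = new SameBucketOfFive();
        Console.WriteLine(CountGroups(new[] { 0, 5, 10 }, bucket)); // 3
        Console.WriteLine(CountGroups(new[] { 5, 0, 10 }, bucket)); // 3
    }
}
```

The trade-off is that bucketing separates close values that straddle a boundary (4 and 5 land in different groups), so it suits cases where a stable grouping matters more than strict distance.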

Tags: c#, duplicates, linq

Nick
