How can I find the set of items that occur in 2 or more sequences in a sequence of sequences?
In other words, I want the distinct values that occur in at least 2 of the passed in sequences.
Note: This is not the intersect of all sequences but rather, the union of the intersect of all pairs of sequences.
Note 2: The does not include the pair, or 2 combination, of a sequence with itself. That would be silly.
I have made an attempt myself,
public static IEnumerable<T> UnionOfIntersects<T>(
                                  this IEnumerable<IEnumerable<T>> source)
{
    var pairs =
            from s1 in source
            from s2 in source
            select new { s1 , s2 };
    var intersects = pairs
        .Where(p => p.s1 != p.s2)
        .Select(p => p.s1.Intersect(p.s2));
    return intersects.SelectMany(i => i).Distinct();
}
but I'm concerned that this might be sub-optimal, I think it includes intersects of pair A, B and pair B, A which seems inefficient. I also think there might be a more efficient way to compound the sets as they are iterated.
I include some example input and output below:
{ { 1, 1, 2, 3, 4, 5, 7 }, { 5, 6, 7 }, { 2, 6, 7, 9 } , { 4 } }
returns
{ 2, 4, 5, 6, 7 }
and
{ { 1, 2, 3} } or { {} } or { }
returns
{ }
I'm looking for the best combination of readability and potential performance.
EDIT
I've performed some initial testing of the current answers, my code is here. Output below.
Original valid:True
DoomerOneLine valid:True
DoomerSqlLike valid:True
Svinja valid:True
Adricadar valid:True
Schmelter valid:True
Original 100000 iterations in 82ms
DoomerOneLine 100000 iterations in 58ms
DoomerSqlLike 100000 iterations in 82ms
Svinja 100000 iterations in 1039ms
Adricadar 100000 iterations in 879ms
Schmelter 100000 iterations in 9ms
At the moment, it looks as if Tim Schmelter's answer performs better by at least an order of magnitude.
// init sequences
var sequences = new int[][]
    { 
        new int[] { 1, 2, 3, 4, 5, 7 },
        new int[] { 5, 6, 7 },
        new int[] { 2, 6, 7, 9 },
        new int[] { 4 }
    };
One-line way:
var result = sequences
    .SelectMany(e => e.Distinct())
    .GroupBy(e => e)
    .Where(e => e.Count() > 1)
    .Select(e => e.Key);
// result is { 2 4 5 7 6 }
Sql-like way (with ordering):
var result = (
          from e in sequences.SelectMany(e => e.Distinct())
          group e by e into g
          where g.Count() > 1
          orderby g.Key
          select g.Key);
// result is { 2 4 5 6 7 }
May be fastest code (but not readable), complexity O(N):
var dic = new Dictionary<int, int>();
var subHash = new HashSet<int>();
int length = array.Length;
for (int i = 0; i < length; i++)
{
    subHash.Clear();
    int subLength = array[i].Length;
    for (int j = 0; j < subLength; j++)
    {
        int n = array[i][j];
        if (!subHash.Contains(n))
        {
            int counter;
            if (dic.TryGetValue(n, out counter))
            {
                // duplicate
                dic[n] = counter + 1;
            }
            else
            {
                // first occurance
                dic[n] = 1;
            }
        }
        else
        {
            // exclude duplucate in sub array
            subHash.Add(n);
        }
    }
}
This should be very close to optimal - how "readable" it is depends on your taste. In my opinion it is also the most readable solution.
        var seenElements = new HashSet<T>();
        var repeatedElements = new HashSet<T>();
        foreach (var list in source)
        {
            foreach (var element in list.Distinct())
            {
                if (seenElements.Contains(element))
                {
                    repeatedElements.Add(element);
                }
                else
                {
                    seenElements.Add(element);
                }
            }
        }
        return repeatedElements;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With