We've got a slight performance issue in a section of code that's using LINQ and it's raised a question about how LINQ works in terms of lookups
My question is this (please note that I have changed all the code so this is an indicative example of the code, not the real scenario):
Given
public class Person {
 int ID;
 string Name;
 DateTime Birthday; 
 int OrganisationID;
}
If I had a list of say 100k Person objects and then a list of dates, say 1000, and I ran this code:
var personBirthdays = from Person p in personList
    where p.OrganisationID = 123
    select p.Birthday;
foreach (DateTime d in dateList)
{
    if (personBirthdays.Contains(d))
        Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}
The resultant code would be an iteration of:
100k (the loop that needs to be done to find the users with the organisationID 123)
multiplied by
1000 (the amount of dates in the list)
multiplied by
x (the amount of users who have the organisationID 123 to be checked against for the date)  
This is a lot of iterations!
If I changed the code the personBirthdays to this:
List<DateTime> personBirthdays = 
        (from Person p in personList
        where p.OrganisationID = 123
        select p.Birthday).ToList();
This should remove the 100k as a multiple by, and just do it once?
So you would have 100k + (1000 * x) instead of (100k * 1000 * x).
The question is that this seems too easy, and I'm sure the LINQ is doing something clever somewhere that should mean that this doesn't happen.
If no one answers, I'll run some tests and report back.
Clarity update:
We're not considering Database lookups, the personList object is an In Memory list object. This all LINQ-to-Objects.
This should remove the 10k as a multiple by, and just do it once?
What it means is that instead of iterating personList 100k times, performing the where and select operations for each of those iterations that you'll be iterating the resulting List 100k times, and that the where and select operations will only have been performed on the underlying data source once.
The question is that this seems too easy, and I'm sure the LINQ is doing something clever somewhere that should mean that this doesn't happen.
Nope, your first query is simply something that you shouldn't be doing using LINQ, you should be taking the results of the query and placing them into a data structure if you plan to iterate over them many times (which is what you changed).
You can improve this query even more by using the appropriate data structure.  Searching on a List is rather inefficient, as it needs to do a linear search.  It would be preferable to use a HashSet to store the results of the query.  A HashSet has O(1) search speed in the average case, as opposed to O(n) search time of a List.
var dates = new HashSet<DateTime>(from Person p in personList
                                  where p.OrganisationID = 123
                                  select p.Birthday);
foreach (DateTime d in dateList.Where(date => dates.Contains(date)))
{
    Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}
This is typical select n+1 problem, and after you applied .ToList() you have partially solved it. Next step could be following: you're constantly iterating over personBirthdays list, replace it with HashSet and you could perform Contains(d) much much faster and remove duplicates:
var personBirthdays = new HashSet<DateTime>((from Person p in personList
    where p.OrganisationID = 123
    select p.Birthday).ToArray());
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With