Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently filtering a large collection of POCO entities

I'm working on improving the performance of a linq filter against a large collection of POCOs, but local testing is indicating a CPU bottleneck.

I was initially trying to do this to reduce the load on a SQL server by retrieving a large result set and loading it into memory on a separate processing server, then filtering this result set in .Net.

Here is the demo code:

public class CustomClass
{
    public int Id { get; set; }
    public int OtherId { get; set;}
    public DateTime Date { get; set; }
}

public void DoStuff()
{        
    // approx 800,000 items
    List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();

    foreach (OtherCustomClass foo in _bar)
    {
        // original linq-to-entities query,
        // get most recent Ids that apply to OtherId
        List<CustomClass> filteredItems = (
            from item in allItems
            where item.OtherId == foo.OtherId && item.Date <= foo.Date
            group item by item.Id into groupItems
            select groupItems.OrderByDescending(i => i.Date).First()).ToList();

        DoOtherStuff(filteredItems);
    }
}

This spikes my 4 cores to 100% CPU for 1m30s, and is not feasible for a production system. I ran the performance analyzer in VS2012 and 30% of the time is the get call to item.OtherId.

I started to rewrite the linq to plain code to see if I could get any speed improvement, but so far I have not had any luck. Here is the plain code rewrite:

private List<CustomClass> FilterCustomClassByIdAndDate(
    List<CustomClass> items, int id, DateTime date)
{
    var mostRecentCustomClass = new Dictionary<int, CustomClass>();

    foreach (CustomClass item in items)
    {
        if (item.Id != id || item.Date > date) { continue; }

        CustomClass mostRecent;
        if (mostRecentCustomClass.TryGetValue(item.Id, out mostRecent) &&
            mostRecent.Date >= item.Date) 
        { continue; }

        mostRecentCustomClass[item.Id] = item;
    }

    var filteredItems = new List<CustomClass>();

    foreach (KeyValuePair<int, CustomClass> pair in mostRecentCustomClass)
    {
        filteredItems.Add(pair.Value);
    }

    return filteredItems;
}

This is still hitting 100% CPU and 30% on the item.OrderId call. Has anyone had a similar issue in the past or perhaps has some idea on how to improve this?

Edit: Code showing a massive improvement

Thanks to @FastAl, this code ran through the _bar -> DoOtherStuff(filteredItems) loop in under a second:

public void DoStuff()
{        
    // approx 800,000 items
    List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();

    var indexedItems = new Dictionary<int, List<CustomClass>>();

    foreach (CustomClass item in allItems)
    {
        List<CustomClass> allByOtherId;

        if (!indexedItems.TryGetValue(item.OtherId, out allByOtherId)) 
        {
            allByOtherId = new List<CustomClass>();
            indexedItems[item.OtherId] = allByOtherId;
        }

        allByOtherId.Add(item);
    }

    foreach (OtherCustomClass foo in _bar)
    {
        List<CustomClass> filteredItems;

        if (!indexedItems.ContainsKey(foo.OtherId))
        {
            filteredItems = new List<CustomClass>();
        }
        else
        {
            List<CustomClass> filteredItems = (
                from item in indexedItems[foo.OtherId]
                where item.Date <= foo.Date
                group item by item.Id into groupItems
                select groupItems.OrderByDescending(i => i.Date).First())
                .ToList();
        }

        DoOtherStuff(filteredItems);
    }
}
like image 758
Laurence Avatar asked Oct 19 '25 03:10

Laurence


1 Answers

Use a dictionary of lists.

After you load your items, loop thru them once to build a dictionary of list . Note the inserted Loop and change in where Clause.

Please forgive my errors, I only had 4 minutes ;-) Learn to love the dictionary. It's wicked fast - uses one of the fastest search / insert methods there is. Really wonderful little tool from M$.

my honest suggestion - do it on the database. ask yourself - did your try it there? I've been at this a while and I can never still tell which of two unknowns would be faster without actually testing it first (unless it's really obvious, but you wouldn't have posted this here if it were). Double check the DB had an index on OtherID or else it's facing the same problem your linq statement is (linear search).

public class CustomClass
{
    public int Id { get; set; }
    public int OtherId { get; set;}
    public DateTime Date { get; set; }
}

public void DoStuff()
{        
    // approx 800,000 items
    List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();
    var index1 = new Dictionary <int, CustomClass>; 
    foreach (OtherCustomClass foo1 in allItems)
    {
        List<CustomClass> allOtherIDs ;
        allOtherIDs=null;
        if (!index1.TryGetValue(foo1.OtherID,allOtherIDs))
         {
            allOtherIDs=new List<CustomClass>;
            index1.add(foo1.OtherID,allOtherIDs);
        }
        allOtherIDs(foo1.OtherID)=foo1;
    }


    foreach (OtherCustomClass foo in _bar)
    {
        // original linq-to-entities query,
        // get most recent Ids that apply to OtherId
        List<CustomClass> filteredItems = (
            from item in allOtherIDs(foo.OtherID)
            where item.Date <= foo.Date
            group item by item.Id into groupItems
            select groupItems.OrderByDescending(i => i.Date).First()).ToList();

        DoOtherStuff(filteredItems);
    }
}
like image 104
FastAl Avatar answered Oct 21 '25 15:10

FastAl



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!