I'm working on improving the performance of a linq filter against a large collection of POCO
s, but local testing is indicating a CPU bottleneck.
I was initially trying to do this to reduce the load on a SQL server by retrieving a large result set and loading it into memory on a separate processing server, then filtering this result set in .Net.
Here is the demo code:
public class CustomClass
{
public int Id { get; set; }
public int OtherId { get; set;}
public DateTime Date { get; set; }
}
public void DoStuff()
{
// approx 800,000 items
List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();
foreach (OtherCustomClass foo in _bar)
{
// original linq-to-entities query,
// get most recent Ids that apply to OtherId
List<CustomClass> filteredItems = (
from item in allItems
where item.OtherId == foo.OtherId && item.Date <= foo.Date
group item by item.Id into groupItems
select groupItems.OrderByDescending(i => i.Date).First()).ToList();
DoOtherStuff(filteredItems);
}
}
This spikes my 4 cores to 100% CPU for 1m30s, and is not feasible for a production system. I ran the performance analyzer in VS2012 and 30% of the time is the get
call to item.OtherId
.
I started to rewrite the linq to plain code to see if I could get any speed improvement, but so far I have not had any luck. Here is the plain code rewrite:
private List<CustomClass> FilterCustomClassByIdAndDate(
List<CustomClass> items, int id, DateTime date)
{
var mostRecentCustomClass = new Dictionary<int, CustomClass>();
foreach (CustomClass item in items)
{
if (item.Id != id || item.Date > date) { continue; }
CustomClass mostRecent;
if (mostRecentCustomClass.TryGetValue(item.Id, out mostRecent) &&
mostRecent.Date >= item.Date)
{ continue; }
mostRecentCustomClass[item.Id] = item;
}
var filteredItems = new List<CustomClass>();
foreach (KeyValuePair<int, CustomClass> pair in mostRecentCustomClass)
{
filteredItems.Add(pair.Value);
}
return filteredItems;
}
This is still hitting 100% CPU and 30% on the item.OrderId
call. Has anyone had a similar issue in the past or perhaps has some idea on how to improve this?
Edit: Code showing a massive improvement
Thanks to @FastAl, this code ran through the _bar
-> DoOtherStuff(filteredItems)
loop in under a second:
public void DoStuff()
{
// approx 800,000 items
List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();
var indexedItems = new Dictionary<int, List<CustomClass>>();
foreach (CustomClass item in allItems)
{
List<CustomClass> allByOtherId;
if (!indexedItems.TryGetValue(item.OtherId, out allByOtherId))
{
allByOtherId = new List<CustomClass>();
indexedItems[item.OtherId] = allByOtherId;
}
allByOtherId.Add(item);
}
foreach (OtherCustomClass foo in _bar)
{
List<CustomClass> filteredItems;
if (!indexedItems.ContainsKey(foo.OtherId))
{
filteredItems = new List<CustomClass>();
}
else
{
List<CustomClass> filteredItems = (
from item in indexedItems[foo.OtherId]
where item.Date <= foo.Date
group item by item.Id into groupItems
select groupItems.OrderByDescending(i => i.Date).First())
.ToList();
}
DoOtherStuff(filteredItems);
}
}
After you load your items, loop thru them once to build a dictionary of list . Note the inserted Loop and change in where Clause.
Please forgive my errors, I only had 4 minutes ;-) Learn to love the dictionary. It's wicked fast - uses one of the fastest search / insert methods there is. Really wonderful little tool from M$.
my honest suggestion - do it on the database. ask yourself - did your try it there? I've been at this a while and I can never still tell which of two unknowns would be faster without actually testing it first (unless it's really obvious, but you wouldn't have posted this here if it were). Double check the DB had an index on OtherID or else it's facing the same problem your linq statement is (linear search).
public class CustomClass
{
public int Id { get; set; }
public int OtherId { get; set;}
public DateTime Date { get; set; }
}
public void DoStuff()
{
// approx 800,000 items
List<CustomClass> allItems = _repo.GetCustomClassItemsFromDatabase();
var index1 = new Dictionary <int, CustomClass>;
foreach (OtherCustomClass foo1 in allItems)
{
List<CustomClass> allOtherIDs ;
allOtherIDs=null;
if (!index1.TryGetValue(foo1.OtherID,allOtherIDs))
{
allOtherIDs=new List<CustomClass>;
index1.add(foo1.OtherID,allOtherIDs);
}
allOtherIDs(foo1.OtherID)=foo1;
}
foreach (OtherCustomClass foo in _bar)
{
// original linq-to-entities query,
// get most recent Ids that apply to OtherId
List<CustomClass> filteredItems = (
from item in allOtherIDs(foo.OtherID)
where item.Date <= foo.Date
group item by item.Id into groupItems
select groupItems.OrderByDescending(i => i.Date).First()).ToList();
DoOtherStuff(filteredItems);
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With