I have a simple method to compare an array of FileInfo objects against a list of filenames to check what files have been already been processed. The unprocessed list is then returned.
The loop of this method iterates for about 250,000 FileInfo objects. This is taking an obscene amount of time to compete.
The inefficiency is obviously the Contains method call on the processedFiles collection.
First how can I check to make sure my suspicion is true about the cause and secondly, how can I improve the method to speed the process up?
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
{
List<FileInfo> unprocessedFiles = new List<FileInfo>();
foreach (FileInfo fileInfo in allFiles)
{
    if (!processedFiles.Contains(fileInfo.Name))
    {
        unprocessedFiles.Add(fileInfo);
    }
    }
    return unprocessedFiles;
}
You can improve work efficiency by setting compelling goals, learning how to manage your time and developing thoughtful habits. To learn more about achieving professional success and further mastering work efficiency, attend Business Mastery.
Spend your time wisely on tasks that align with goals. Organize your notes, inbox and workspaces for increased focus, motivation and time management. Take breaks throughout the day to avoid burnout. Identify motivators such as tasks, goals or colleagues.
A List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence / non-existence of an item. I would suggest you use aHashSet<string> or similar instead. A HashSet<T>'s Containsmethod is designed to run in constant O(1) time, i.e it shouldn't depend on the number of items in the set. 
This small change should make the entire method run in linear time:
public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, 
                                         List<string> processedFiles)
{
   List<FileInfo> unprocessedFiles = new List<FileInfo>();
   HashSet<string> processedFileSet = new HashSet<string>(processedFiles);
   foreach (FileInfo fileInfo in allFiles)
   {
       if (!processedFileSet.Contains(fileInfo.Name))
       {
           unprocessedFiles.Add(fileInfo);
       }
    }
   return unprocessedFiles;
}
I would suggest 3 improvements, if possible:
ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.string and FileInfo) in this fashion. Pick one and go with it.HashSet<T>.ExceptWith method instead of doing the looping yourself. Bear in mind that this will mutate the collection.If you can use LINQ, and you can afford to build up a set on every call, here's another way:
public static IEnumerable<string> GetUnprocessedFiles
 (IEnumerable<string> allFiles, IEnumerable<string> processedFiles)
{
  // null-checks here
  return allFiles.Except(processedFiles);     
}
I would try to convert the processedFiles List to a HashSet. With a list, it needs to iterate the list everytime you call contains. A HashSet is an O(1) operation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With