I'm building a sort of web parser, except it targets just one website and needs to process many different pages at once.
Currently there are about 300,000 pages I need to parse, reasonably quickly. I'm only grabbing a tiny amount of information from each page, and each page takes about 3 seconds at most on my network. At that rate, 300,000 pages is 900,000 seconds, or roughly 10 days, which is terrible performance. I would like to reduce this to a couple of hours at most; I'm flexible about the exact time-to-request ratio, but it still needs to be fast. I also know that I can't just fire off all 300,000 requests at once, or the website will block all of my requests, so there will have to be a delay of a few seconds between requests.
I currently process everything in a single foreach loop, without taking advantage of any multithreading whatsoever. I know I could take advantage of it, but I'm not sure which path to take: thread pools, or some other threading system or design.
Basically, I'm looking for someone to point me in the right direction for using multithreading efficiently, so that I can cut down the time it takes to parse that many pages on my end; some sort of system or structure for the threading.
Thanks
Check out the answer to this question, as it sounds like you might want Parallel.ForEach.
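As a rough sketch (the URL list and the page-handling logic here are placeholders, not your actual site), Parallel.ForEach with a capped MaxDegreeOfParallelism lets you run several downloads at once without flooding the server:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class Scraper
{
    static void Main()
    {
        // Placeholder list - substitute your 300,000 page URLs here.
        var uris = new List<string>
        {
            "http://example.com/page1",
            "http://example.com/page2",
        };

        // Cap the number of simultaneous requests so you don't get blocked.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

        Parallel.ForEach(uris, options, uri =>
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString(uri);
                // ... extract the small piece of data you need from html ...
            }

            // Per-thread pause to stay under the site's rate limit.
            Thread.Sleep(1000);
        });
    }
}
```

With 8 workers and a 1-second pause per request, 300,000 pages works out to a few hours rather than days; tune the degree of parallelism and the delay against what the site tolerates.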
There are various other methods of achieving what you want to do in a multi-threaded fashion. To give yourself an idea of how this works:
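For instance, here is one sketch (the URLs, the concurrency limit of 8, and the one-second delay are all placeholder assumptions) that starts the work as tasks but uses a SemaphoreSlim to cap how many requests are in flight at once, which also gives a natural place for the per-request delay the site demands:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class ThrottledScraper
{
    static async Task Main()
    {
        var uris = new List<string>
        {
            "http://example.com/a",  // placeholder URLs
            "http://example.com/b",
        };

        // Allow at most 8 requests in flight at any moment.
        var gate = new SemaphoreSlim(8);

        var tasks = uris.Select(async uri =>
        {
            await gate.WaitAsync();
            try
            {
                using (var client = new WebClient())
                {
                    string html = await client.DownloadStringTaskAsync(new Uri(uri));
                    // ... parse the bit you need out of html ...
                }
                await Task.Delay(1000);  // polite delay before freeing the slot
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}
```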
In fact, here is one of the asynchronous (async/await) examples that works with URIs:
// The await keyword is really useful when you want to run something in a loop. For instance:
string[] uris =
{
    "http://linqpad.net",
    "http://linqpad.net/downloadglyph.png",
    "http://linqpad.net/linqpadscreen.png",
    "http://linqpad.net/linqpadmed.png",
};

// Try doing the following without the await keyword!
int totalLength = 0;
foreach (string uri in uris)
{
    string html = await new WebClient().DownloadStringTaskAsync(new Uri(uri));
    totalLength += html.Length;
}
totalLength.Dump();  // Dump() is LINQPad's built-in output method
// The continuation is not just 'totalLength += html.Length', but the rest of the loop! (And that final
// call to 'totalLength.Dump()' at the end.)
// Logically, execution EXITS THE METHOD and RETURNS TO THE CALLER upon reaching the await statement. Rather
// like a 'yield return' (in fact, the compiler uses the same state-machine engine to rewrite asynchronous
// functions as it does iterators).
//
// When the task completes, the continuation kicks off and execution jumps back into the middle of the
// loop - right where it left off!
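One caveat: the loop above still downloads the pages one at a time, because each await completes before the next request begins. To actually overlap the downloads, start every task first and then await them all together with Task.WhenAll. A sketch using the same sample URIs, written as a plain console program rather than a LINQPad script:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

class Overlapped
{
    static async Task Main()
    {
        string[] uris =
        {
            "http://linqpad.net",
            "http://linqpad.net/downloadglyph.png",
        };

        // Kick off every download before awaiting any of them.
        // (One WebClient per request; a sketch, so disposal is omitted.)
        Task<string>[] downloads = uris
            .Select(uri => new WebClient().DownloadStringTaskAsync(new Uri(uri)))
            .ToArray();

        // All requests now run concurrently; await them as a batch.
        string[] pages = await Task.WhenAll(downloads);
        int totalLength = pages.Sum(p => p.Length);
        Console.WriteLine(totalLength);
    }
}
```

Combined with a throttle such as SemaphoreSlim, this pattern gives you concurrency while still respecting the site's rate limit.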