I have a C# app that needs to scrape many many pages within a certain domain as fast as possible. I have a Parallel.Foreach that loops through all of the urls (multi-threaded) and scrapes them using the code below:
private string ScrapeWebpage(string url, DateTime? updateDate)
        {
            HttpWebRequest request = null;
            HttpWebResponse response = null;
            Stream responseStream = null;
            StreamReader reader = null;
            string html = null;
            try
            {
                //create request (which supports http compression)
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Pipelined = true;
                request.KeepAlive = true;
                request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
                if (updateDate != null)
                    request.IfModifiedSince = updateDate.Value;
                //get response.
                response = (HttpWebResponse)request.GetResponse();
                responseStream = response.GetResponseStream();
                if (response.ContentEncoding.ToLower().Contains("gzip"))
                    responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
                else if (response.ContentEncoding.ToLower().Contains("deflate"))
                    responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
                //read html.
                reader = new StreamReader(responseStream, Encoding.Default);
                html = reader.ReadToEnd();
            }
            catch
            {
                throw;
            }
            finally
            {//dispose of objects.
                request = null;
                if (response != null)
                {
                    response.Close();
                    response = null;
                }
                if (responseStream != null)
                {
                    responseStream.Close();
                    responseStream.Dispose();
                }
                if (reader != null)
                {
                    reader.Close();
                    reader.Dispose();
                }
            }
            return html;
        }
As you can see, I have http compression support and have set request.keepalive and request.pipelined to true. I'm wondering if the code I'm using is the fastest way to scrape many web pages within the same site or if there's a better way that will keep the session open for multiple requests. My code is creating a new request instance for each page I hit, should I be trying to use just one request instance to hit all of the pages? Is it ideal to have pipelined and keepalive enabled?
It turns out what I was missing was this:
ServicePointManager.DefaultConnectionLimit = 1000000;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With