Best practice for a concurrent web crawler in .NET 4.0

I need to load many pages through a proxy. What is best for creating a multi-threaded web crawler?

Is Parallel.For / Parallel.ForEach good enough, or is it better suited to CPU-heavy tasks?

What do you think of the following code?

var multyProxy = new MultyProxy();
multyProxy.LoadProxyList();

Task[] taskArray = new Task[1000];

for (int i = 0; i < taskArray.Length; i++)
{
    taskArray[i] = new Task(
        (obj) =>
        {
            multyProxy.GetPage((string)obj);
        },
        (object)"http://google.com");
    taskArray[i].Start();
}

Task.WaitAll(taskArray);

It works terribly. It is very slow, and I do not know why.

This code also works poorly.

System.Threading.Tasks.Parallel.For(
    0, 1000,
    new System.Threading.Tasks.ParallelOptions() { MaxDegreeOfParallelism = 30 },
    loop =>
    {
        multyProxy.GetPage("http://google.com");
    });

Well, I think I'm doing something wrong.

When I run my script, it uses only 2%-4% of the network.

+3
3 answers

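Basically, you are spending threads on IO-bound work: even though the operations run in parallel, each one still occupies a ThreadPool thread, which is mainly intended for CPU-bound work.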

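What you need is an asynchronous pattern for the downloads, so that no thread sits blocked while waiting on the network. If you are using WebRequest, that means the BeginGetResponse() and EndGetResponse() methods.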
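As a rough illustration of the Begin/End pattern on .NET 4.0 (no async/await), here is a minimal sketch that wraps the pair in a Task with Task.Factory.FromAsync. The names AsyncFetch and DownloadAsync and the optional proxy parameter are hypothetical and not from the question; the continuation still reads the body with a blocking ReadToEnd, so only the wait for the response is asynchronous.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

static class AsyncFetch
{
    // Hypothetical helper: starts the request without blocking a thread
    // while the response is pending, then reads the body in a continuation.
    public static Task<string> DownloadAsync(string url, IWebProxy proxy)
    {
        WebRequest req = WebRequest.Create(url);
        if (proxy != null)
        {
            req.Proxy = proxy; // assumption: one proxy per request
        }

        return Task.Factory
            .FromAsync<WebResponse>(req.BeginGetResponse, req.EndGetResponse, null)
            .ContinueWith(t =>
            {
                // t.Result rethrows (wrapped) if the request failed.
                using (WebResponse rsp = t.Result)
                using (var reader = new StreamReader(rsp.GetResponseStream()))
                {
                    return reader.ReadToEnd(); // blocking read of the body
                }
            });
    }
}
```

A caller could start one task per URL, collect them in an array, and pass it to Task.WaitAll, which throws an AggregateException if any download failed.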

I would suggest looking at Reactive Extensions for this, for example:

IEnumerable<string> urls = ... get your urls here...;
var results = from url in urls.ToObservable()
             let req = WebRequest.Create(url)
             from rsp in Observable.FromAsyncPattern<WebResponse>(
                  req.BeginGetResponse, req.EndGetResponse)()
             select ExtractResponse(rsp);

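Here ExtractResponse would probably just use StreamReader.ReadToEnd to get the result as a string, if that is what you are after.

You can also look at the .Retry operator, which makes it easy to retry on connection problems and the like.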
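A sketch of how the retry could be wired in, assuming an Rx build where the operators live in System.Reactive.Linq (early releases used a different namespace). RxFetch, Download, DownloadAll, attempts, and maxConcurrent are hypothetical names. The request is created inside Observable.Defer because a WebRequest cannot be reused after it fails, so each retry needs a fresh one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Reactive.Linq;

static class RxFetch
{
    // One URL: up to `attempts` subscriptions in total, each with a new request.
    public static IObservable<string> Download(string url, int attempts)
    {
        return Observable.Defer(() =>
            {
                WebRequest req = WebRequest.Create(url);
                return Observable.FromAsyncPattern<WebResponse>(
                    req.BeginGetResponse, req.EndGetResponse)();
            })
            .Retry(attempts)
            .Select(rsp => ExtractResponse(rsp));
    }

    // Reads the whole body as a string, as the answer describes.
    static string ExtractResponse(WebResponse rsp)
    {
        using (rsp)
        using (var reader = new StreamReader(rsp.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Many URLs: at most `maxConcurrent` downloads are subscribed at once.
    public static IObservable<string> DownloadAll(
        IEnumerable<string> urls, int maxConcurrent)
    {
        return urls.Select(u => Download(u, 3)).Merge(maxConcurrent);
    }
}
```

Nothing runs until the result is subscribed to. A URL that still fails after its last attempt ends the merged sequence with an error, so a real crawler would probably add a Catch per URL.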

+7

Add this at the start of your program:

System.Net.ServicePointManager.DefaultConnectionLimit = 100;

That way you will have more concurrent connections.
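For context, client applications on the desktop .NET Framework default to 2 connections per host, which would be consistent with the low network usage in the question. The sketch below, with the hypothetical names ConnectionLimitDemo, Raise, and LimitFor, shows one way to set the limit and then check what a host would get. The limit should be raised before the first request, because a ServicePoint that already exists keeps the value it was created with.

```csharp
using System;
using System.Net;

static class ConnectionLimitDemo
{
    // Raises the default for ServicePoint objects created from now on.
    public static void Raise(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
    }

    // Returns the connection limit of the ServicePoint that handles this URL.
    public static int LimitFor(string url)
    {
        ServicePoint sp = ServicePointManager.FindServicePoint(new Uri(url));
        return sp.ConnectionLimit;
    }
}
```

Calling Raise(100) first and then LimitFor with one of the crawler's target URLs is a quick way to confirm the setting took effect.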

+1

This can help you when using a large number of connections (add to app.config or web.config):

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="50"/>
    </connectionManagement>
  </system.net>
</configuration>

Replace 50 with the number of simultaneous connections you want.

Learn more about this at http://msdn.microsoft.com/en-us/library/fb6y0fyc.aspx

0