HttpClient Multithreaded Performance

I have an application that downloads more than 4,500 HTML pages from 62 target hosts using HttpClient (4.1.3 or 4.2-beta). It runs on 64-bit Windows 7 with a Core i7 2600K processor. The network bandwidth is 54 Mbps.

At the moment it uses the following settings:

  • DefaultHttpClient and PoolingClientConnectionManager;
  • It also has the IdleConnectionMonitorThread from
    http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html;
  • Maximum total connections = 80;
  • Default maximum connections per route = 5;
  • For thread management it uses ForkJoinPool with parallelism
    level = 5 (do I understand correctly that this is the number of worker threads?)
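On the parallelism question: according to the ForkJoinPool Javadoc, the parallelism level is the number of worker threads the pool targets, and the pool makes no guarantee to add threads when tasks block on I/O. The following is a minimal sketch using only the JDK, not the application's code. The class name ParallelismDemo, the 20 ms sleep standing in for a blocking HTTP call, and the task count are all assumptions made for illustration. It records how many simulated downloads run at the same time.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelismDemo {
    static final AtomicInteger running = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();
    static final AtomicInteger completed = new AtomicInteger();

    /** Halves an index range until single items remain, like ContentDownloader does. */
    static class BlockingTask extends RecursiveAction {
        private final int from;
        private final int to;

        BlockingTask(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= 1) {
                if (to > from) {
                    simulateDownload();
                }
                return;
            }
            int mid = (from + to) >>> 1;
            invokeAll(new BlockingTask(from, mid), new BlockingTask(mid, to));
        }

        private void simulateDownload() {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // remember the highest concurrency seen
            try {
                Thread.sleep(20); // hypothetical stand-in for a blocking httpClient.execute(...)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                running.decrementAndGet();
                completed.incrementAndGet();
            }
        }
    }

    /** Runs the given number of simulated downloads and returns the peak concurrency observed. */
    static int run(int parallelism, int tasks) {
        running.set(0);
        peak.set(0);
        completed.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new BlockingTask(0, tasks));
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int observedPeak = run(5, 40);
        System.out.println("completed=" + completed.get() + ", peak concurrent=" + observedPeak);
    }
}
```

The observed peak should stay close to the parallelism level, although the pool may briefly add compensating threads while workers wait in join(), so treat the number as approximate. If that holds for the real application, a parallelism level of 5 means roughly 5 requests in flight at any moment, whatever the connection pool allows.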

With these settings, network usage (as shown in the Windows Task Manager) does not exceed 2.5%, and downloading the 4,500 pages takes 70 minutes. The HttpClient logs contain entries like this:

DEBUG ForkJoinPool-2-worker-1 [org.apache.http.impl.conn.PoolingClientConnectionManager]: Connection leased: [id: 209][route: {}->http://stackoverflow.com][total kept alive: 6; route allocated: 1 of 5; total allocated: 10 of 80]

The total number of allocated connections never rises above 10-12, even though I have allowed up to 80. If I raise the parallelism level to 20 or 80, network usage stays the same, but a lot of connection timeouts are generated.
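The figures above can be turned into rough rates with a little arithmetic. This is only a sketch: the class and method names are made up for illustration, Task Manager's percentage is assumed to be relative to the 54 Mbps link, and the per-worker figure assumes all 5 workers were busy for the whole 70 minutes.

```java
public class ThroughputEstimate {
    /** Pages completed per second over the whole run. */
    static double pagesPerSecond(int pages, double minutes) {
        return pages / (minutes * 60.0);
    }

    /** Bytes per second actually used, given link speed in Mbps and utilization in percent. */
    static double usedBytesPerSecond(double linkMbps, double utilizationPercent) {
        return linkMbps * 1_000_000.0 * (utilizationPercent / 100.0) / 8.0;
    }

    /** Average wall-clock seconds one worker spends per page, assuming every worker is always busy. */
    static double secondsPerPagePerWorker(int pages, double minutes, int workers) {
        return minutes * 60.0 * workers / pages;
    }

    public static void main(String[] args) {
        double pps = pagesPerSecond(4500, 70);
        double bps = usedBytesPerSecond(54, 2.5);
        System.out.println("pages per second:            " + pps);
        System.out.println("bytes per second used:       " + bps);
        System.out.println("implied bytes per page:      " + (bps / pps));
        System.out.println("seconds per page per worker: " + secondsPerPagePerWorker(4500, 70, 5));
    }
}
```

Under those assumptions the run averages about 1.07 pages per second, and each worker spends between 4 and 5 seconds on a page. That would point at per-request latency rather than bandwidth as the limit.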

I have read the guides on hc.apache.org (including the HttpClient Threading Guide), but they did not help.

The task code looks like this:

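import java.util.List;
import java.util.concurrent.RecursiveAction;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

public class ContentDownloader extends RecursiveAction {
    // The original snippet does not show how logger is declared; commons-logging is assumed here.
    private static final Log logger = LogFactory.getLog(ContentDownloader.class);

    private final HttpClient httpClient;
    private final HttpContext context;
    private final List<Entry> entries; // Entry is the application's own class (not shown)

    public ContentDownloader(HttpClient httpClient, List<Entry> entries) {
        this.httpClient = httpClient;
        this.context = new BasicHttpContext(); // each task instance gets its own context
        this.entries = entries;
    }

    // Downloads a single page and stores its HTML in the entry.
    private void computeDirectly(Entry entry) {
        final HttpGet get = new HttpGet(entry.getLink());
        try {
            HttpResponse response = httpClient.execute(get, context);
            int statusCode = response.getStatusLine().getStatusCode();

            if (statusCode >= 400) {
                logger.error("Couldn't get content from " + get.getURI() + "\n" + response);
            } else {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // toString() reads the entity to the end and closes its stream
                    entry.setHtml(EntityUtils.toString(entity).trim());
                }
            }
        } catch (Exception e) {
            // Log instead of silently swallowing, so timeouts and I/O errors are visible
            logger.error("Request failed for " + get.getURI(), e);
        } finally {
            get.releaseConnection(); // make sure the connection goes back to the pool
        }
    }

    @Override
    protected void compute() {
        if (entries.isEmpty()) {
            return; // nothing to do; avoids get(0) on an empty list
        }
        if (entries.size() == 1) {
            computeDirectly(entries.get(0));
            return;
        }
        // Split the list in half and process both halves as subtasks
        int split = entries.size() / 2;
        invokeAll(new ContentDownloader(httpClient, entries.subList(0, split)),
                new ContentDownloader(httpClient, entries.subList(split, entries.size())));
    }
}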

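Is it possible to raise network usage by tuning the parameters of HttpClient, the thread pool, or the ConnectionManager? And why does it not use all 80 connections?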


+3
3

, , ( 1), . concurrency .

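Your per-route maximum is 5. With 10-12 connections allocated in total, that suggests you are only fetching from 2-3 hosts at any one time.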
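The relationship between the per-route limit and the total can be checked with the question's settings (5 connections per route, 80 in total). The helper below is hypothetical and simply caps hosts × per-route limit at the pool size. It assumes each host corresponds to exactly one route (no proxy).

```java
public class ConnectionCeiling {
    /** Upper bound on pooled connections when only activeHosts hosts are being fetched. */
    static int maxUsableConnections(int activeHosts, int maxPerRoute, int maxTotal) {
        return Math.min(activeHosts * maxPerRoute, maxTotal);
    }

    public static void main(String[] args) {
        int maxPerRoute = 5; // default max per route from the question
        int maxTotal = 80;   // max total connections from the question
        for (int hosts : new int[] {1, 2, 3, 16, 62}) {
            System.out.println(hosts + " active host(s) -> at most "
                    + maxUsableConnections(hosts, maxPerRoute, maxTotal) + " connections");
        }
    }
}
```

With these numbers, 2 active hosts allow at most 10 connections and 3 allow at most 15, which brackets the observed 10-12. The pool limit of 80 only becomes the binding constraint once 16 or more hosts are being fetched concurrently.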

+4

IP-. , .

, , robots.txt ip, , .

( http://www.example.com/[whatever]) , 5 "". ( , , .)

+1

Apache HttpClient , loopback. , , . HTML , , . , HTML String , , , .

0