I am making HEAD requests to anywhere from 100,000 to 500,000 URLs to get the size and the HTTP status code of each. I have tried four different methods: a threadpool, an asynchronous twisted client, a grequests implementation, and a solution based on concurrent.futures. In a previous question similar to this one, the threadpool implementation is said to finish in 6 to 10 minutes. Running that exact code on a dummy list of 100,000 URLs takes more than 4 hours on my machine. My twisted solution (different from the one mentioned in the linked question) takes about 3.5 hours to complete, and so does the concurrent.futures solution.
I am fairly sure that I wrote the implementations correctly, especially where I copied and pasted the code from the previous example. How can I diagnose where the slowdown occurs? I assume it happens when establishing the connection, but I do not know how to prove that, or how to fix it if that is the problem. I am pretty sure this is not a CPU-bound problem, since the CPU time after 100,000 URLs is only 3 minutes. Any help in figuring out how to diagnose the problem, and in turn fix it, would be much appreciated.
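One way I could test the connection assumption is to time each phase of a single HEAD request separately. The following is only a minimal sketch, not code from any of my four implementations: the helper name `time_head_phases` is made up for illustration, and it assumes plain HTTP (no TLS) and uses raw sockets from the standard library instead of requests or treq.

```python
import socket
import time


def time_head_phases(host, port=80, path="/", timeout=10.0):
    """Time DNS lookup, TCP connect, and the wait for a HEAD response.

    Hypothetical helper for diagnosis only; assumes plain HTTP, no TLS.
    """
    t0 = time.perf_counter()
    # Phase 1: name resolution (slow or throttled resolvers show up here).
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # Phase 2: TCP handshake to the resolved address.
        sock.connect(sockaddr)
        t2 = time.perf_counter()

        # Phase 3: send the HEAD request and wait for the first response bytes.
        request = (
            f"HEAD {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        first_bytes = sock.recv(4096)
        t3 = time.perf_counter()
    finally:
        sock.close()

    status_line = first_bytes.split(b"\r\n", 1)[0].decode("latin-1")
    return {
        "dns": t1 - t0,
        "connect": t2 - t1,
        "response": t3 - t2,
        "status_line": status_line,
    }
```

Running this over a sample of a few hundred hosts from the list and summing each phase should show whether name resolution, connection setup, or the server's response time accounts for most of the wall-clock time.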
Additional Information:
- I am using requests to make the queries, or treq with twisted.
- Appending the results to a list (with the garbage collector disabled) or to a pandas dataframe does not seem to make a difference in speed.
- I have experimented with anywhere between 4 and 200 workers/threads in my various tests, and 15 seems to be optimal.
- , , 16 (100 MBPS)