Will threading affect robots.txt?

I'm new to scraping, and I recently realized that threading is probably the way to crawl a site quickly. Before I start hacking on this, I figured it would be wise to find out whether it would get me throttled. So the question is: if I rewrite my program to use threads for faster crawling, will it violate the robots.txt of most sites?

+3
2 answers

It depends: if your threads each have their own separate queue of URLs to crawl and there is no synchronization of any kind between the queues, you may end up violating a site's robots.txt when two (or more) threads try to crawl URLs on the same site in quick succession. Of course, a well-designed crawler would not do that!
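To illustrate the kind of synchronization meant here, below is a minimal sketch of a per-host throttle that all worker threads share. The class name `HostThrottle`, the method `wait_turn`, and the one-second default delay are assumptions made up for this example, not part of any library; the real delay should come from whatever the site asks for.

```python
import threading
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: spaces out requests to the same host across threads."""

    def __init__(self, default_delay=1.0):
        self._default_delay = default_delay  # assumed politeness delay, in seconds
        self._next_allowed = {}              # host -> earliest monotonic time of next request
        self._lock = threading.Lock()

    def wait_turn(self, url, delay=None):
        """Block until this thread may request `url`; return the seconds slept."""
        host = urlsplit(url).netloc
        delay = self._default_delay if delay is None else delay
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot while holding the lock so no other thread can claim it.
            self._next_allowed[host] = start + delay
        pause = start - now
        if pause > 0:
            time.sleep(pause)  # sleep outside the lock so other hosts are not blocked
        return pause
```

Each worker would call `throttle.wait_turn(url)` right before fetching. Two threads hitting the same host are then serialized, while threads working on different hosts do not wait on each other.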

Very "simple" crawlers have some kind of shared priority queue: work is enqueued in accordance with the various robots exclusion protocols, and all threads pull the URLs to crawl from that one queue. There are many problems with this approach, especially when trying to scale up and crawl the entire World Wide Web.
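A rough sketch of that "simple" design, assuming Python's standard `urllib.robotparser` and `queue` modules. The user agent `example-bot`, the inline robots.txt rules, and the helper names are invented for illustration, a plain FIFO `queue.Queue` stands in for the priority queue, and the fetch itself is replaced by appending to a list. Note that `Crawl-delay` is a non-standard directive that not every site or crawler honors.

```python
import queue
import threading
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-bot"  # hypothetical user agent

# Parsed from inline lines here; a real crawler would download each site's robots.txt.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])


def enqueue_allowed(frontier, urls):
    """Put only the URLs that robots.txt permits onto the shared queue."""
    for url in urls:
        if robots.can_fetch(USER_AGENT, url):
            frontier.put(url)


def worker(frontier, fetched, lock):
    """Pull URLs from the shared queue until it is empty."""
    while True:
        try:
            url = frontier.get_nowait()
        except queue.Empty:
            return
        with lock:
            fetched.append(url)  # stand-in for the real HTTP request


def crawl(urls, n_threads=4):
    frontier = queue.Queue()
    enqueue_allowed(frontier, urls)
    fetched, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(frontier, fetched, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched
```

Here the robots.txt check happens once, at enqueue time, so the threads never see a disallowed URL. `robots.crawl_delay(USER_AGENT)` returns the requested delay when the file specifies one, which the workers would still need to respect per host.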

"" (. BEAST), : -, robots.txt, .. !

+1

. robots.txt , . ", ".

0
