Robotparser doesn't seem to parse correctly

I am writing a crawler, and for this I need a robots.txt parser. I am using the standard library module robotparser.

It seems that robotparser is not parsing correctly. I am debugging my crawler using Google's robots.txt.

(The following are examples from IPython)

In [1]: import robotparser

In [2]: x = robotparser.RobotFileParser()

In [3]: x.set_url("http://www.google.com/robots.txt")

In [4]: x.read()

In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it is disallowed
Out[5]: False

In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it is allowed
Out[6]: False

In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False

This is ridiculous, because sometimes it seems to work and sometimes it seems to fail. I also tried the same thing with the robots.txt files from Facebook and Stack Overflow. Is this a bug in the robotparser module, or am I doing something wrong? If so, what?

I was wondering if this error had something related

+5

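After a few Google searches I found nothing about this robotparser issue. I ended up with something else: a module called reppy, which I tested and which seems to handle these cases. You can install it with pip: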

pip install reppy

Here are some examples (from IPython) using reppy, again with Google's robots.txt:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it is allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it is not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it is allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

In [10]: # There is also an x.disallowed function, the opposite of x.allowed
+2

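This is not a bug, but rather a difference in interpretation. According to the draft robots.txt specification (which was never approved):

"To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed."

(Section 3.2.2, "The Allow and Disallow Lines")

Under that interpretation, "/catalogs/p?" should be rejected, because a "Disallow: /catalogs" directive appears earlier in the record.

At some point Google started interpreting robots.txt differently from that specification. Their method appears to be: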

Check for Allow. If it matches, crawl the page.
Check for Disallow. If it matches, don't crawl.
Otherwise, crawl.
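
The two interpretations can be compared side by side. The following is only a minimal sketch: the `RULES` record is a hypothetical excerpt modeled on the rules discussed in this thread, the function names are made up for illustration, and matching is plain prefix matching with no wildcard support. It is not the code of robotparser, reppy, or Google.

```python
# Hypothetical record, in file order: (rule kind, path prefix).
RULES = [
    ("disallow", "/catalogs"),
    ("allow", "/catalogs/about"),
    ("allow", "/catalogs/p?"),
]


def first_match_allowed(path, rules):
    """1996 draft reading: try rules in file order, the first match wins."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched, so the URL is allowed by default


def allow_first_allowed(path, rules):
    """Reading described above: any matching Allow wins, then Disallow."""
    if any(kind == "allow" and path.startswith(prefix) for kind, prefix in rules):
        return True
    if any(kind == "disallow" and path.startswith(prefix) for kind, prefix in rules):
        return False
    return True  # no rule matched, so the URL is allowed by default


for path in ("/catalogs", "/catalogs/about", "/catalogs/p?"):
    print(path, first_match_allowed(path, RULES), allow_first_allowed(path, RULES))
```

With this record, both functions reject "/catalogs", but they disagree on "/catalogs/about" and "/catalogs/p?", which is the disagreement seen in the question.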

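The problem is that there is no formal agreement on how to interpret robots.txt. I have seen crawlers that use the Google method and others that follow the 1996 draft standard. When I ran a crawler, I got complaints from webmasters when I used the Google interpretation, because I crawled pages they thought should not be crawled, and complaints from others when I used the other interpretation, because pages they thought should be indexed were not.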

+4

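Interesting question. I looked at the source (I only have the Python 2.4 source available, but I expect it has not changed), and the code normalizes the URL being tested by executing: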

urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) 

which is the source of your problem:

>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo"))[2]) 
'/foo'
>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo?"))[2]) 
'/foo'

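So it is either a bug in Python's library, or Google is breaking the robots.txt specification by including a "?" character in a rule (which is a bit unusual).

[In case it is not clear, I will say it again another way. The code above is used by the robotparser library as part of checking the URL. So when the URL ends in a "?", that character is dropped. When you checked /catalogs/p?, the test actually executed was for /catalogs/p. Hence your surprising result.]

I would suggest filing a bug with the Python developers (you can post a link here as part of the explanation) [edit: thanks]. And then use the other library you found...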
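
For readers on Python 3, where `quote`, `unquote` and `urlparse` live in `urllib.parse`, the same normalization expression can be reproduced as below. This is a sketch of the expression quoted above, not the current source of `urllib.robotparser`; the function name `normalize_like_old_robotparser` is made up for illustration.

```python
from urllib.parse import quote, unquote, urlparse


def normalize_like_old_robotparser(url):
    """Apply the normalization expression quoted above to a URL."""
    # Index 2 of the parse result is the path component only, so the
    # query string (including a bare trailing "?") is discarded.
    return quote(urlparse(unquote(url))[2])


for url in ("/foo", "/foo?", "/catalogs/p?", "http://www.google.com/catalogs/p?"):
    print(repr(url), "->", repr(normalize_like_old_robotparser(url)))
```

Both the relative and the absolute form of /catalogs/p? are reduced to /catalogs/p before any rule is consulted.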

+2

We had introduced a bug that caused this problem. Version 0.2.2, now on pip and on master, includes a fix.

Version 0.2 contains a small change to the interface: you now have to create a RobotsCache object, which exposes the same interface that reppy originally had. This was done mainly to make caching explicit and to allow different caches within the same process. But now it works again!

from reppy.cache import RobotsCache
cache = RobotsCache()
cache.allowed('http://www.google.com/catalogs', 'foo')
cache.allowed('http://www.google.com/catalogs/p', 'foo')
cache.allowed('http://www.google.com/catalogs/p?', 'foo')
+1