Scrapy spider: handling pages with incorrect character encoding

Update: this error can be reproduced by running the following from the command line:

scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future

I use Scrapy to crawl a website. Every page I scrape claims to be UTF-8 encoded:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

But sometimes the pages contain bytes that are not valid UTF-8, and I get Scrapy errors, for example:

exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte

I still need to scrape these pages, even though they contain bytes that cannot be decoded. Is there a way to tell Scrapy to override the declared page encoding and use a different one (say UTF-16) instead?

Here is where the exception is thrown:

2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
        result = method(response=response, result=result, spider=spider)
3 answers

This is handled in the dev version of Scrapy (0.15).

Scrapy decodes the page through response.body_as_unicode. In Scrapy 0.15, this relies on w3lib.encoding.html_to_unicode.
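
The general idea of trying the declared encoding first and falling back when it fails can be sketched with the standard library alone. This is not how w3lib.encoding.html_to_unicode is implemented; the helper name decode_with_fallback and the choice of cp1252 as a fallback are assumptions made for illustration (byte 0xe8 from the error above is "è" in cp1252 and Latin-1). The sketch assumes Python 3.

```python
def decode_with_fallback(body, declared="utf-8", fallbacks=("cp1252",)):
    """Return (encoding_used, text) for a raw response body.

    Hypothetical helper: tries the declared encoding, then each fallback,
    and finally decodes with replacement characters so it never raises.
    """
    for encoding in (declared,) + tuple(fallbacks):
        try:
            return encoding, body.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the declared encoding and substitute U+FFFD
    # for every byte sequence that cannot be decoded.
    return declared, body.decode(declared, errors="replace")


# A body that claims to be UTF-8 but actually contains cp1252 bytes.
raw = b"<p>Cr\xe8me br\xfbl\xe9e</p>"
encoding, text = decode_with_fallback(raw)
```

Here the UTF-8 attempt fails on byte 0xe8, so the text is decoded as cp1252 instead. Whether cp1252 is the right guess depends on the site being crawled.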


, Pipeline, Downloader.

Clean the data (replace the bytes that cannot be decoded) before you populate your items.
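
A minimal sketch of that cleaning step, assuming Python 3 and using only the built-in bytes.decode with errors="replace". The helper name clean_bytes, the sample bytes, and the plain dict standing in for a Scrapy item are all hypothetical.

```python
def clean_bytes(raw, encoding="utf-8"):
    """Decode raw bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


# Sample field value with a stray 0xe8 byte that is not valid UTF-8.
raw_title = b"Straight Talk \xe8 About Your Future"

# A plain dict stands in for the item that would be populated.
item = {"title": clean_bytes(raw_title)}
```

Passing errors="ignore" instead would drop the offending bytes rather than mark them, which hides the fact that data was lost.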
