Update: this error can be reproduced simply by running the following from the command line:
scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future
I use Scrapy to crawl a website. Every page I scrape claims to be UTF-8 encoded:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
But sometimes the pages contain bytes that are not valid UTF-8, and I get Scrapy errors, for example:
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
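For illustration, the same kind of error can be triggered outside Scrapy with plain Python. This is only a minimal sketch: the byte string `raw` is a made-up sample, not taken from the actual page, and it assumes Python 3, where the codec is reported as 'utf-8' rather than 'utf8'.

```python
# Hypothetical sample bytes, not from the real page. 0xe8 starts a three-byte
# UTF-8 sequence, so it must be followed by continuation bytes (0x80-0xBF).
# Here it is followed by a space (0x20), which makes the sequence invalid.
raw = b"caf\xe8 au lait"

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # Keep the message so it can be compared with the Scrapy error above.
    error_message = str(exc)

print(error_message)
```

The message names the offending byte and its position, just like the Scrapy error does.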
I still need to scrape these pages, even though they contain undecodable characters. Is there a way to tell Scrapy to override the declared page encoding and use a different one (say UTF-16) instead?
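To make clear what kind of behaviour I am after, here is a rough sketch in plain Python, not Scrapy's own mechanism. The helper name `decode_with_fallback` is hypothetical, and the choice of cp1252 as the fallback is only an assumption about what the stray bytes might be. Inside a spider callback it could presumably be applied to the raw bytes in `response.body`.

```python
def decode_with_fallback(body, declared="utf-8", fallback="cp1252"):
    """Decode bytes with the declared encoding; fall back if that fails.

    Hypothetical helper: the fallback codec is an assumption, not something
    the page declares.
    """
    try:
        return body.decode(declared)
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for any byte that the fallback
        # codec cannot map either, so this branch never raises.
        return body.decode(fallback, errors="replace")


# Valid UTF-8 is decoded as declared; invalid UTF-8 goes through the fallback.
print(decode_with_fallback("é".encode("utf-8")))
print(decode_with_fallback(b"caf\xe8 au lait"))
```

With this approach a page that really is UTF-8 is left untouched, and only the pages that fail to decode are reinterpreted.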
Here is where the exception is thrown:
2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
    self._startRunCallbacks(result)
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
    result = method(response=response, result=result, spider=spider)