I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation shows that you can call soup.get_text() for this. When I tried it on reddit.com, I got this error:
UnicodeEncodeError in soup.py:16
'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence
I get errors like this on most of the sites I checked.
I had similar errors with soup.prettify() too, but I fixed them by changing the call to soup.prettify('UTF-8'). Is there any way to fix this for soup.get_text() as well? Thanks in advance!
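For reference, the prettify('UTF-8') workaround amounts to encoding the text explicitly instead of letting print choose the console codec. Below is a minimal sketch of the same idea applied to extracted text. It assumes the text is a unicode string; the helper name write_utf8 and the sample string are hypothetical and only stand in for the real soup.get_text() result.

```python
# -*- coding: utf-8 -*-
import io


def write_utf8(text, stream):
    """Encode unicode text as UTF-8 and write the bytes to a binary stream."""
    data = text.encode('utf-8')
    stream.write(data)
    return data


# Hypothetical stand-in for the result of soup.get_text(); it contains
# U+00A0 (no-break space), the character named in the traceback.
sample = u'reddit\xa0front page'

# io.BytesIO stands in for a binary output such as a file opened with 'wb'.
buf = io.BytesIO()
written = write_utf8(sample, buf)
```

The same bytes could presumably be written to a file opened in binary mode. Whether they display correctly in a console still depends on the console's own code page.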
June 24th Update
I found some code that seems to work for other people, but I still need it to use UTF-8 instead of the default encoding. The code:
import re

texts = soup.findAll(text=True)

def visible(element):
    # Skip text inside tags that are never rendered
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)): return False
    # Skip text that starts with a newline
    elif re.match('\n', str(element)): return False
    return True

visible_texts = filter(visible, texts)
print visible_texts
The error is different. Progress?
UnicodeEncodeError in soup.py:29
'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)
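One possible reading of this second traceback, offered as an assumption rather than a confirmed diagnosis: under Python 2, calling str() on a unicode object encodes it with the ascii codec, which cannot represent u'\xbb'. The sketch below reproduces that encoding failure directly and shows the newline check applied to the unicode text without str(). The name starts_with_newline and the sample string are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical text node containing U+00BB, the character from the traceback.
text = u'\xbb next'

# Encoding to ASCII fails for any code point above 127.
try:
    text.encode('ascii')
    ascii_ok = True
    reason = None
except UnicodeEncodeError as exc:
    ascii_ok = False
    reason = exc.reason


def starts_with_newline(element_text):
    """Same check as re.match('\n', ...), applied to unicode text directly."""
    return re.match(u'\n', element_text) is not None
```

If this reading is right, matching against the unicode text itself, instead of str(element), would sidestep the implicit ASCII conversion inside visible().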