Using soup.get_text() with UTF-8

I need to get all the text from a page using BeautifulSoup. The BeautifulSoup documentation shows that you can use soup.get_text() for this. When I tried it on reddit.com, I got this error:


UnicodeEncodeError in soup.py:16
  'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence

I get errors like this on most of the sites I checked.
I had similar errors with soup.prettify() too, but I fixed those by changing the call to soup.prettify('UTF-8'). Is there a way to fix this for soup.get_text()? Thanks in advance!

June 24th Update
I found some code that seems to work for other people, but I still need it to use UTF-8 instead of the default encoding. The code:


texts = soup.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    elif re.match('\n', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

print visible_texts

The error is different. Progress?


UnicodeEncodeError in soup.py:29
  'ascii' codec can't encode character u'\xbb' in position 1: ordinal not in range(128)

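soup.get_text() returns a Unicode string, and that is why you are getting the error: printing it makes Python encode the text with the console's codec, which cannot represent every character.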
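To illustrate the failure mode, here is a minimal sketch. It is written for Python 3, which is an assumption on my part (the code in this thread is Python 2), and the helper name try_encode is made up for the example. It encodes a string containing the same U+00A0 character from the traceback with two different codecs.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
# Encoding text with a codec that lacks a character raises
# UnicodeEncodeError; print does this encoding step implicitly.

text = "a\xa0b"  # contains U+00A0 (no-break space), as in the traceback


def try_encode(s, codec):
    """Hypothetical helper: return encoded bytes, or None if encoding fails."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(text, "ascii")  # U+00A0 is outside ASCII
utf8_result = try_encode(text, "utf-8")   # UTF-8 can represent this character

print(ascii_result, utf8_result)
```

The same thing happens with any console codec that has no mapping for a character on the page, which is why the error shows up on most sites.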

You can solve this in several ways. One is to set the encoding at the shell level:

export PYTHONIOENCODING=UTF-8
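One way to check that the variable takes effect is to start a child interpreter with it set and ask which encoding it chose for stdout. This is only a sketch, assuming Python 3 and that sys.executable points at a working interpreter; the variable names are illustrative.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable, much as `export`
# would do for every command started from that shell.
env = dict(os.environ, PYTHONIOENCODING="UTF-8")

# Ask a child interpreter which encoding it picked for stdout.
child_code = "import sys; print(sys.stdout.encoding)"
output = subprocess.check_output([sys.executable, "-c", child_code], env=env)
child_encoding = output.decode("ascii").strip()

print(child_encoding)
```

The exact spelling of the reported name (upper or lower case) can differ between Python versions, so compare it case-insensitively.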

You can also reload sys and set the default encoding by including this in your script:

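import sys

if __name__ == "__main__":
    # Python 2 only: reload(sys) restores setdefaultencoding, which site.py deletes at startup
    reload(sys)
    sys.setdefaultencoding("utf-8")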
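reload(sys) and sys.setdefaultencoding() do not exist in Python 3. If you are on Python 3 (an assumption; the question uses Python 2), a rough equivalent is to wrap the binary output stream in a text layer that always encodes as UTF-8. The sketch below uses an in-memory buffer so it can run anywhere; utf8_writer is a hypothetical helper name.

```python
import io


def utf8_writer(binary_stream):
    """Hypothetical helper: text layer that always encodes as UTF-8."""
    # newline="\n" disables platform-specific newline translation.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# An in-memory buffer stands in for the real output stream here.
buf = io.BytesIO()
writer = utf8_writer(buf)
writer.write("\xbb visible text\n")  # U+00BB, the character from the second traceback
writer.flush()

encoded = buf.getvalue()
print(encoded)
```

In a real script you would pass sys.stdout.buffer instead of the BytesIO object and print through the returned writer.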

Or you can encode the string as UTF-8 in code. For your reddit problem, something like the following would work:

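import urllib
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/python"
html = urllib.urlopen(url).read()  # Python 2 API
soup = BeautifulSoup(html, "html.parser")

# get text (a Unicode string)
text = soup.get_text()

# encode explicitly so print does not fall back to the console codec
print(text.encode('utf-8'))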

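As for the second error: you are calling str() on elements that are unicode objects. In Python 2, str() encodes a unicode object with the default 'ascii' codec, which fails on non-ASCII characters such as u'\xbb'. Use unicode() instead of str().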
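As a rough sketch of the same filter without any str() conversion, here is a version for Python 3 (an assumption; there, str is already a Unicode type, so no conversion is needed). To stay runnable without third-party packages it works on plain (parent name, text) pairs rather than BeautifulSoup nodes, and it treats any whitespace-only string as invisible, which is slightly broader than the newline check in the question. The names HIDDEN_PARENTS and nodes are made up for the example.

```python
# Python 3 sketch: keep the text as text and never force it through
# a narrow codec while filtering.

# Parent tags whose text is not shown on the page (from the question).
HIDDEN_PARENTS = {"style", "script", "[document]", "head", "title"}


def visible(parent_name, text):
    """Return True if text under the given parent tag should be kept."""
    if parent_name in HIDDEN_PARENTS:
        return False
    if not text.strip():  # whitespace-only, including a bare newline
        return False
    return True


# Hypothetical sample data standing in for soup.findAll(text=True).
nodes = [
    ("p", "\xbb Next page"),
    ("script", "var x = 1;"),
    ("div", "\n"),
]

visible_texts = [text for parent, text in nodes if visible(parent, text)]
print(visible_texts)
```

With real BeautifulSoup nodes, the pair would come from element.parent.name and the element itself.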
