How to open a Windows-1252 encoded HTML page in BeautifulSoup

I am trying to parse an HTML document using BeautifulSoup, but I am running into encoding problems. What is the best way to open an HTML document that uses Windows-1252 encoding?

I tried using iconv to convert the file to UTF-8 first, but that did not work either.

doc = open("e.html").read()

soup = BeautifulSoup(doc)

soup.findAll('p')

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)

When I open the file without running it through iconv, I get the same error.

Full traceback:

>>> soup.findAll('p')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)
2 answers

Try something like this:

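# Read the raw bytes ("rb"), so this works on both Python 2 and Python 3
doc = open("e.html", "rb").read()

# Decode the bytes explicitly as Windows-1252 (cp1252)
doc = doc.decode('cp1252')

soup = BeautifulSoup(doc)

soup.findAll('p')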
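
As a rough illustration of why the explicit decode step matters, here is a minimal Python 3 sketch using only the standard library. The sample markup is made up for the example (it is not the asker's e.html), and the variable names are just placeholders.

```python
# Hypothetical sample content, encoded the way a Windows-1252 file stores it.
raw = "<p>M\u00fcnchen</p>".encode("cp1252")

# The byte 0xfc means "ü" in Windows-1252, so decoding as cp1252 succeeds.
text = raw.decode("cp1252")

# The same byte is not valid UTF-8, so decoding as UTF-8 raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ascii() keeps the output printable even on an ASCII-only terminal.
print(ascii(raw))
print(ascii(text))
print(utf8_ok)
```

Once the bytes are decoded into a proper text string, the parser no longer has to guess the encoding.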

I was getting a similar error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 723617:

In my case, opening the file with an explicit encoding fixed it:

page = open("page.html", encoding="windows-1252")

soup = BeautifulSoup(page.read(), "html.parser")
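
To see the effect of the encoding argument in isolation, here is a small Python 3 sketch that uses only the standard library. It writes a throwaway file standing in for page.html (the file contents and temporary path are assumptions made for the example) and reads it back both ways.

```python
import os
import tempfile

# Hypothetical stand-in for page.html, written as Windows-1252 bytes.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "wb") as f:
    f.write("<p>caf\u00e9</p>".encode("windows-1252"))

# With the correct encoding, the 0xe9 byte is decoded as "é".
with open(path, encoding="windows-1252") as page:
    html = page.read()

# Reading the same file as UTF-8 fails on that byte.
try:
    with open(path, encoding="utf-8") as page:
        page.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# ascii() keeps the output printable even on an ASCII-only terminal.
print(ascii(html))
print(utf8_failed)
```

The decoded string in html is what would then be passed to BeautifulSoup, as in the two lines above.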