Why doesn't Python display this text correctly? (UTF-8 decoding problem)

import urllib.request as u

zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)

page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)

For some reason my code prints the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know that the sequence \xe2\x80\x99 is the UTF-8 encoding of the ’ character, but I cannot figure out how to get Python to replace it with a plain '. I tried to decode the string, but it is already a Unicode string, and the replace call above does not change anything. Any tips on what I'm doing wrong?

2 answers

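con.read() returns bytes, not text. Calling str() on a bytes object does not decode it; it returns the object's printable representation, with non-ASCII bytes written out as escape sequences. (That is why the output contains the literal characters \xe2\x80\x99, backslashes included, rather than a single character.) bytes in Python 3 plays the role that str played in Python 2: it is a sequence of raw bytes. str in Python 3 corresponds to unicode in Python 2; it is a sequence of Unicode characters. So to turn bytes into str you have to decode them with the right encoding, which here is utf-8.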
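The difference can be reproduced without a network call. The following is a minimal sketch; the byte string raw is a hand-written stand-in for what con.read() returns, not the actual page.

```python
# Stand-in for con.read(): UTF-8 bytes containing U+2019 (right single quote).
raw = b"IN-09: Indiana\xe2\x80\x99s 9th"

# str() on bytes returns the repr: the text now holds literal backslash escapes.
as_repr = str(raw)

# decode() interprets the bytes as UTF-8 and yields the real characters.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)

# On decoded text the apostrophe is one character and is easy to replace.
print(as_text.replace("\u2019", "'"))
```

This also shows why the replace call in the question has no effect: the three-character string "\xe2\x80\x99" never occurs in as_repr, which contains backslashes, letters and digits instead.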

So replace the str() call with bytes.decode; the rest of the code can stay the same, as in this session:

>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'

Note that this only works if the bytes really are encoded as 'utf-8'.
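If the encoding is not known in advance, one option is to read it from the Content-Type header and fall back to utf-8. This is a sketch under assumptions: decode_body is a hypothetical helper, and the headers object below is built by hand so the example runs offline. With urllib, con.headers is a subclass of email.message.Message and offers the same get_content_charset() method.

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes with the charset named in the headers, else `default`."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or default
    return raw.decode(charset)


# Hand-built stand-in for con.headers (assumption: a server that declares Latin-1).
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

text = decode_body(b"caf\xe9", headers)
print(text)
```

With a live response the call would be decode_body(con.read(), con.headers). A page can still declare a charset that does not match its bytes, so this is a best effort rather than a guarantee.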


newdistrict = district.encode("THE_INPUT_STRING_ENCODING").replace(b"\xe2\x80\x99", b"'").decode("THE_INPUT_STRING_ENCODING")

For example, if the input is utf-8,

newdistrict = district.encode("utf-8").replace(b"\xe2\x80\x99", b"'").decode("utf-8")

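Alternatively, work with unicode directly. To do that, put this at the top of your script

# -*- coding: utf-8 -*-

decode the page as utf-8

page = con.read().decode('utf-8')

and then replace the character itself:

newdistrict = district.replace(u"YOUR_UNICODE_STRING", "'")

for example

newdistrict = district.replace(u"דכעדחלגעדיל","'")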

http://docs.python.org/howto/unicode.html
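Putting the pieces together on already-decoded text, a minimal sketch might look like the following. extract_title and the sample page are hypothetical; in particular the " (watchdog.net)" suffix is only an assumption about what the 15 trailing characters dropped in the question's code were.

```python
def extract_title(page, suffix=""):
    """Return the text between <title> and </title>, minus an optional suffix."""
    start = page.find("<title>")
    end = page.find("</title>", start)
    if start == -1 or end == -1:
        return None  # no complete title element found
    title = page[start + len("<title>"):end]
    if suffix and title.endswith(suffix):
        title = title[:-len(suffix)]
    return title


# Hypothetical page, already decoded to str (U+2019 is the curly apostrophe).
sample = "<html><head><title>IN-09: Indiana\u2019s 9th (watchdog.net)</title></head></html>"

district = extract_title(sample, suffix=" (watchdog.net)")
newdistrict = district.replace("\u2019", "'")
print(newdistrict)
```

Using len("<title>") and an explicit suffix instead of the constants 7 and 15 keeps the slicing readable and avoids cutting into the title when the page layout changes.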
