Python, .format() and UTF-8

Question

My background is in Perl, but I'm giving Python plus BeautifulSoup a try for a new project.

In this example, I am trying to extract and present the link targets and link text contained in a single page. Here's the source:

table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))

All these explicit calls to .encode('utf-8') are my attempt to make this work, but they don't seem to help. It's likely that I don't fully understand how Python 2.7 handles Unicode strings.

Anyway. This works until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:

Traceback (most recent call last):
  File "./test2.py", line 30, in <module>
    line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)

Presumably .format(), even when applied to a Unicode string, is being silly and trying to do a .decode() operation. And since ASCII is the default, it uses that, and of course it cannot map U+2013 to an ASCII character, and therefore ...

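The options seem to be to remove the character or convert it to something else, but what I really want is simply to preserve it. Ultimately (this is just a small test case), I need to be able to present working, clickable links.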

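The BS3 documentation suggests changing the default encoding from ASCII to UTF-8, but judging from comments on similar questions, that looks like a bad idea.

Short of using Python 3.2 instead (which would mean no Django, which we are considering for part of this project), is there a way to make this work cleanly?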

+5

python unicode beautifulsoup

Matt McLeod, Jun 13 '12 at 3:07

Ned Batchelder · Accepted Answer · 2012-06-13T03:12:36+0000

First, note that your two code samples disagree on the text of the problematic line:

line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))

line_out = unicode(table_row.format(link_text, link_target))

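The first is the one from the traceback, so it is the one to look at. Assuming the rest of your first code sample is accurate, table_row is a byte string, because you took a unicode string and encoded it. Byte strings can't be encoded, so Python 2 implicitly converts table_row from a byte string to unicode by decoding it as ascii. Hence the error message, "UnicodeDecodeError from ascii".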
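The implicit step can be reproduced explicitly. The sketch below is an illustration only, not code from the question: it assumes a hypothetical URL containing U+2013 and performs by hand the ASCII decode that Python 2 attempts behind the scenes when .encode() is called on a byte string. In the traceback line, the same conversion is also applied to link_target, which had already been encoded and so holds the UTF-8 bytes for U+2013.

```python
# -*- coding: utf-8 -*-
# Sketch: perform explicitly the ASCII decode that Python 2 does
# implicitly before it can "encode" a byte string.

# Hypothetical URL containing U+2013 (EN DASH), already encoded to UTF-8 bytes.
link_target = u'http://example.com/a\u2013b'.encode('utf-8')

try:
    # Equivalent of the hidden step: bytes -> unicode using the ascii codec.
    link_target.decode('ascii')
except UnicodeDecodeError as exc:
    # The first byte of the encoded EN DASH (0xe2) is outside the ASCII range.
    error_message = str(exc)
    print(error_message)
```

The printed message names the ascii codec and byte 0xe2, matching the error in the question. The byte position differs because the sample URL here is shorter.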

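You need to decide which strings will be byte strings and which will be unicode strings, and be disciplined about it. I recommend keeping all text as Unicode strings as much as possible.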
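One way to apply that advice is sketched below, under the assumption that the link text and href are already Unicode strings when they reach the formatting step. The helper format_row and the sample values are hypothetical and not part of the original code. The template stays Unicode, nothing is encoded before formatting, and the result is encoded once at the point where bytes are actually needed.

```python
# -*- coding: utf-8 -*-
# Sketch: keep every piece of text as Unicode and encode once, on output.

# Unicode template: note there is no .encode() here.
table_row = u'<tr><td>{}</td><td>{}</td></tr>'


def format_row(link_text, link_target):
    """Hypothetical helper: both arguments are expected to be Unicode strings."""
    return table_row.format(link_text, link_target)


# Hypothetical sample values containing U+2013 (EN DASH).
line_out = format_row(u'Results 2011\u20132012', u'http://example.com/a\u2013b')

# Encode exactly once, at the boundary where bytes are required
# (for example, when writing to a file or a socket).
encoded_line = line_out.encode('utf-8')
```

Because no byte string ever takes part in the formatting, Python has no reason to attempt an implicit ASCII decode, and the U+2013 character is preserved in the output.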

Here is a presentation I gave at PyCon that explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?
