My background is in Perl, but I am giving Python plus BeautifulSoup a try for a new project.
In this example, I am trying to extract and present the link targets and link text contained in a single page. Here's the source:
table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))
All these explicit calls to .encode('utf-8') are my attempt to make this work, but they don't seem to help. Most likely I am misunderstanding something about how Python 2.7 handles Unicode strings.
Anyway. This works until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:
Traceback (most recent call last):
  File "./test2.py", line 30, in <module>
    line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)
Presumably .format(), even when applied to a Unicode string, is being silly and trying to do a .decode() operation. And since ASCII is the default, it uses that, and of course it cannot map U+2013 to an ASCII character, and therefore...
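One suggestion I have come across for BS3 is to change the default encoding from ASCII to UTF-8, but that looks like a bad idea.
Short of switching to Python 3.2 (which would rule out Django, which we are considering for part of this project), is there a way to make this work cleanly?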
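For reference, here is a minimal sketch of the failure mode in isolation, written so it should run on both Python 2.7 and Python 3. The URL and link text are made up for illustration (they are not from the page in question), and the variable names other than table_row, link_text, link_target and line_out are hypothetical. It pushes UTF-8 bytes containing U+2013 through the ASCII codec to trigger the same kind of 0xe2 decode error as in the traceback above, then builds the same row from text only and encodes once at output time.

```python
# Hypothetical sample data: a URL containing U+2013 (en dash).
link_text = u"Example link"
link_target = u"http://example.com/a\u2013b"

# Encoding to UTF-8 turns U+2013 into the bytes 0xe2 0x80 0x93.
encoded_target = link_target.encode("utf-8")

# Decoding those bytes with the ASCII codec fails on the first byte, 0xe2.
# On Python 2.7 the same decode happens implicitly when unicode() is called
# on a byte string without naming an encoding.
error_message = None
try:
    encoded_target.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Contrast: keep the template and the values as text, format them,
# and encode exactly once when the result is written out.
table_row = u"<tr><td>{}</td><td>{}</td></tr>"
line_out = table_row.format(link_text, link_target)
output_bytes = line_out.encode("utf-8")
```

The point of the contrast is only that mixing encoded byte strings with text is what invites the implicit ASCII decode; whether this matches what BeautifulSoup hands back in the original script is an assumption.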