Using Beautiful Soup with Accents and Different Characters

I'm using Beautiful Soup to pull medal winners from past Olympics. It trips up on accents in some event names and athlete names. I have seen similar problems posted online, but I'm new to Python and can't work out how to apply the fixes to my code.

If I print my soup, the accents look fine, but when I start parsing the soup (and writing it to a CSV file), the accented characters become garbled. 'Louis Perrée' becomes 'Louis PerrÃ©e'.

from BeautifulSoup import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.databaseolympics.com/sport/sportevent.htm?sp=FEN&enum=130')
html = response.read()
soup = BeautifulSoup(html)

g = open('fencing_medalists.csv','w')
t = soup.findAll("table", {'class' : 'pt8'})

for table in t:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            theText=str(td.find(text=True))
            #theText=str(td.find(text=True)).encode("utf-8")
            if theText!="None":
                g.write(theText)
            else: 
                g.write("")
            g.write(",")
        g.write("\n")

Many thanks for your help.

1 answer

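If you're dealing with unicode, always treat what you read from disk or from the network as a bag of bytes rather than as a string.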

The text that ends up in your CSV file is probably utf-8 encoded when it arrives, so it should be decoded first.

import codecs
# ...
content = response.read()
html = codecs.decode(content, 'utf-8')
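To see why the decode step matters, here is a minimal sketch that needs no network access. The sample name and the choice of Latin-1 as the "wrong" codec are assumptions made for illustration; the snippet is written to run on both Python 2.7 and Python 3.

```python
import codecs

# Bytes as they might arrive from the network: an accented name encoded as UTF-8.
# (Illustrative sample value, not read from the real page.)
raw = u"Louis Perr\u00e9e".encode("utf-8")

# Decoding with the matching codec recovers the original text.
good = codecs.decode(raw, "utf-8")

# Decoding the same bytes with a mismatched codec (Latin-1 here, as an
# assumption) turns the two-byte "e acute" into two separate characters.
bad = codecs.decode(raw, "latin-1")

print(repr(good))
print(repr(bad))
```

The garbling is not data loss: the bytes are intact, they are only being interpreted with the wrong codec.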

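You also need to encode your unicode text as utf-8 before writing it to the output file. Use codecs.open to open the output file and specify the encoding. It will handle the output encoding for you transparently.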

g = codecs.open('fencing_medalists.csv', 'wb', encoding='utf-8')
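As a quick check of what codecs.open does, the sketch below writes one unicode line to a temporary file and reads the raw bytes back. The file name and the sample row are hypothetical and chosen only for the demonstration.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file so the demo does not touch fencing_medalists.csv.
path = os.path.join(tempfile.mkdtemp(), "medalists_demo.csv")

# The file object accepts unicode text and encodes it to UTF-8 on write.
g = codecs.open(path, "wb", encoding="utf-8")
g.write(u"Louis Perr\u00e9e,\n")
g.close()

# Read the file back as raw bytes to see what actually landed on disk.
with open(path, "rb") as f:
    on_disk = f.read()

print(repr(on_disk))
```

The accented character is stored as the two bytes 0xC3 0xA9, which is its UTF-8 encoding, so any program that reads the file as UTF-8 will display it correctly.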

And make the following change to the code that writes the strings:

    theText = td.find(text=True)
    if theText is not None:
        g.write(unicode(theText))
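Putting the write loop together without Beautiful Soup, a rough sketch might look like the following. The helper write_rows, the text_type shim, and the sample cells are all hypothetical names and values introduced here; a row is assumed to be a list of unicode strings or None, standing in for what td.find(text=True) returns.

```python
import codecs
import os
import tempfile

# unicode on Python 2, str on Python 3 (compatibility shim for this sketch).
text_type = type(u"")


def write_rows(path, rows):
    """Write rows as comma-separated UTF-8 text; a None cell is left empty."""
    g = codecs.open(path, "wb", encoding="utf-8")
    try:
        for row in rows:
            for cell in row:
                # Compare against None directly instead of the string "None".
                if cell is not None:
                    g.write(text_type(cell))
                g.write(u",")
            g.write(u"\n")
    finally:
        g.close()


# Made-up sample cells, including one empty cell.
demo_path = os.path.join(tempfile.mkdtemp(), "rows_demo.csv")
write_rows(demo_path, [[u"Perr\u00e9e", None, u"\u00e9p\u00e9e"]])

with codecs.open(demo_path, "rb", encoding="utf-8") as f:
    written = f.read()

print(repr(written))
```

Like the loop in the question, this writes a trailing comma after the last cell and does no quoting, so a cell that itself contains a comma would need extra handling.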

Edit: BeautifulSoup probably does the Unicode decoding automatically, so you may be able to skip the codecs.decode step on the response.
