Suppose I read an HTML page and get a list of names, such as "Amiel, Henri-Frederic".
To get the list of names, I fetch and decode the HTML using the following code:
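import urllib

# t is assumed to be an HTML parser instance created earlier (not shown) that collects the names in t.data
f = urllib.urlopen("http://xxx.htm")
html = f.read()
html = html.decode('utf8')
t.feed(html)
t.close()
lista = t.data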
At this point, the lista variable contains a list of names, such as:
[u'Abatantuono, Diego ', ..., u'Amiel, Henri-Frédéric']
Now I would like to:
- put these names into a DataFrame;
- save the DataFrame to a CSV file;
- read the CSV file back into a DataFrame in Python.
For simplicity, I'll consider only the last name above to work through steps 1 through 3. I would use the following code:
import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'
name = name.encode('utf8')
array = [name]
df = pd.DataFrame({'Names': array})
df.to_csv('names')
uni = pd.read_csv('names')
uni
At this point, I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 67: invalid continuation byte
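To illustrate what this kind of message means, here is a minimal sketch using only the standard library. It assumes the same name as above and encodes it two ways; the variable names are made up for the illustration. It only reproduces the same type of error and does not show where in my pipeline the 0xe9 byte actually comes from.

```python
name = u'Amiel, Henri-Fr\xe9d\xe9ric'

utf8_bytes = name.encode('utf-8')      # each é becomes the two bytes 0xC3 0xA9
latin1_bytes = name.encode('latin-1')  # each é becomes the single byte 0xE9

# UTF-8 bytes decode back to the original text.
roundtrip_ok = utf8_bytes.decode('utf-8') == name

# Latin-1 bytes are not valid UTF-8: 0xE9 starts a multi-byte sequence,
# but the byte that follows ('d') is not a continuation byte.
try:
    latin1_bytes.decode('utf-8')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
```

So a lone 0xe9 byte suggests that, somewhere, text holding é as a single byte is being decoded as UTF-8.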
If I replace the last line of the code above with:
print uni
the DataFrame is displayed, but I don't think this is the right way to handle the problem.
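For reference, the three steps could be sketched as follows. This is a sketch, not a confirmed fix: it assumes the name is kept as text (no manual .encode() call) and that the encoding is passed explicitly to both to_csv and read_csv. The temporary path and the restored variable are hypothetical, used only for the illustration.

```python
import os
import tempfile

import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'

# Step 1: keep the name as text; do not encode it before building the frame.
df = pd.DataFrame({'Names': [name]})

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'names.csv')

# Step 2: write the CSV as UTF-8, without the index column.
df.to_csv(path, index=False, encoding='utf-8')

# Step 3: read it back, telling pandas the file is UTF-8.
uni = pd.read_csv(path, encoding='utf-8')
restored = uni['Names'].iloc[0]
```

If this round trip works, restored should compare equal to the original name.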