How to handle European decimal separators efficiently using the pandas read_csv function?

I use read_csv to read CSV files into pandas data frames. My CSV files contain a lot of decimal / floating point numbers. The numbers are encoded using the European decimal notation:

1.234.456,78

That is, '.' is used as the thousands separator and ',' as the decimal separator.

Pandas 0.8 provides a read_csv argument called "thousands" to set the thousands separator. Is there an additional argument for specifying the decimal separator? If not, what is the most efficient way to parse European-style decimals?

I am currently using string replacement, which I consider to be a significant performance hit. This is the code I use:

# Strip the '.' thousands separators, change the decimal separator
# from ',' to '.', then convert to float
f = lambda x: float(x.replace('.', '').replace(',', '.'))
df['MyColumn'] = df['MyColumn'].map(f)

Any help is appreciated.

2 answers

You can use the converters keyword argument of read_csv. Given /tmp/data.csv as follows:

"x","y"
"one","1.234,56"
"two","2.000,00"

You can do:

In [20]: pandas.read_csv('/tmp/data.csv', converters={'y': lambda x: float(x.replace('.','').replace(',','.'))})
Out[20]: 
     x        y
0  one  1234.56
1  two  2000.00
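
If several columns need the same treatment, the lambda can be pulled out into a named function and reused. The following is only a sketch: parse_european and the extra column "z" are made up for illustration, and the data is read from an in-memory string instead of /tmp/data.csv so that the snippet runs on its own.

```python
import io

import pandas as pd


def parse_european(text):
    """Convert a string such as '1.234,56' to the float 1234.56."""
    # Drop the '.' thousands separators first, then turn ',' into '.'
    return float(text.replace('.', '').replace(',', '.'))


# In-memory stand-in for /tmp/data.csv, with a hypothetical extra column "z"
csv_text = (
    '"x","y","z"\n'
    '"one","1.234,56","0,5"\n'
    '"two","2.000,00","10,25"\n'
)

# Apply the same converter to every column that holds European-style numbers
df = pd.read_csv(
    io.StringIO(csv_text),
    converters={'y': parse_european, 'z': parse_european},
)
print(df.dtypes)
```

Note that a converter is called once per cell in Python, so on large files this is likely to be slower than letting the parser handle the separators itself.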

For European-style numbers, use the thousands and decimal parameters of pandas.read_csv.

For instance:

pandas.read_csv('data.csv', thousands='.', decimal=',')

From the docs:

thousands :

str, optional. Thousands separator.

decimal :

str, default '.'. Character to recognize as decimal point (e.g. use ',' for European data).
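
As a quick check, here is a minimal sketch that applies both parameters to the sample data from the other answer. The data is read from an in-memory string rather than a file, which is an assumption made only to keep the snippet self-contained.

```python
import io

import pandas as pd

# Same rows as the sample /tmp/data.csv shown in the other answer
csv_text = (
    '"x","y"\n'
    '"one","1.234,56"\n'
    '"two","2.000,00"\n'
)

# '.' is treated as the thousands separator and ',' as the decimal point,
# so column "y" should come back as floats rather than strings
df = pd.read_csv(io.StringIO(csv_text), thousands='.', decimal=',')
print(df)
print(df.dtypes)
```

Because the conversion happens inside the parser, no per-cell Python function is needed.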
