Need for speed: Slow nested groupbys and applys in Pandas

I am performing a complex transformation on a DataFrame. I thought it would be quick for Pandas, but the only way I have managed to do it is by nesting a few groupbys and applys with lambda functions, and it is slow. It seems like the kind of operation for which there should be built-in, faster methods. At n_rows = 1000 it takes about 2 seconds, but I will be processing 10^7 rows, so this is far too slow. It is hard to explain what I am doing, so here is the code and a profile, followed by an explanation:

import pandas as pd
from numpy import array, arange
from numpy.random import randint

n_rows = 1000

d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping

f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame

q = d.groupby(grps).apply(h) #Slow



824984 function calls (816675 primitive calls) in 1.850 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
221770    0.105    0.000    0.105    0.000 {isinstance}
  7329    0.104    0.000    0.217    0.000 index.py:86(__new__)
  8309    0.089    0.000    0.423    0.000 series.py:430(__new__)
  5375    0.081    0.000    0.081    0.000 {method 'reduce' of 'numpy.ufunc' objects}
 34225    0.068    0.000    0.133    0.000 {method 'view' of 'numpy.ndarray' objects}
36780/36779    0.067    0.000    0.067    0.000 {numpy.core.multiarray.array}
  5349    0.065    0.000    0.567    0.000 series.py:709(_get_values)
 985/1    0.063    0.000    1.847    1.847 groupby.py:608(apply)
  5349    0.056    0.000    0.198    0.000 _methods.py:42(_mean)
  5358    0.050    0.000    0.232    0.000 index.py:332(__getitem__)
  8309    0.049    0.000    0.228    0.000 series.py:3299(_sanitize_array)
  9296    0.047    0.000    0.116    0.000 index.py:1341(__new__)
   984    0.039    0.000    0.092    0.000 algorithms.py:105(factorize)
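A profile in this format can be collected with the standard-library cProfile and pstats modules. The following is a minimal sketch, not necessarily how the listing above was produced: the smaller n_rows, the fixed RandomState seed and the variable names profiler, buf and report are assumptions made for illustration.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, an assumption for repeatability
n_rows = 200                    # kept small so the sketch runs quickly

d = pd.DataFrame(rng.randint(1, 10, (n_rows, 8)))        # raw data
dgs = np.array([3, 4, 1, 8, 9, 2, 3, 7, 10, 8])          # lookup values
grps = pd.cut(rng.randint(1, 5, n_rows), np.arange(1, 5))  # row grouping

f = lambda x: dgs[x.index].mean()              # works on a grouped Series
g = lambda x: x.groupby(x).apply(f)            # works on one row
h = lambda x: x.apply(g, axis=1).mean(axis=0)  # works on a grouped DataFrame

# Profile only the slow statement.
profiler = cProfile.Profile()
profiler.enable()
q = d.groupby(grps).apply(h)
profiler.disable()

# Sort by internal time (tottime) and keep the 15 most expensive entries.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(15)
report = buf.getvalue()
print(report)
```

Sorting by tottime matches the "Ordered by: internal time" header in the listing above.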

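In short, the rows of the DataFrame are first split into groups by grps. Within each row, the column positions are grouped by the value they hold, those positions are looked up in dgs, and the looked-up values are averaged. The per-row results are then averaged within each group of rows.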
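To make the innermost step concrete, here is a small sketch of what f and g compute for a single row. The row values are hypothetical, chosen only for illustration; dgs, f and the groupby call are taken from the question's code.

```python
import numpy as np
import pandas as pd

dgs = np.array([3, 4, 1, 8, 9, 2, 3, 7, 10, 8])

# One hypothetical row of raw data; its index holds the column positions 0..7.
row = pd.Series([5, 2, 5, 7, 2, 2, 9, 5])

# f receives the part of the row that holds one value. Its index is the set
# of column positions where that value occurs, which is used to index dgs.
f = lambda x: dgs[x.index].mean()

# Equivalent to g(row) in the question: group the row by its own values.
per_value = row.groupby(row).apply(f)
print(per_value)
```

For this row, the value 2 sits at positions 1, 4 and 5, so its entry is the mean of dgs[1], dgs[4] and dgs[5], that is (4 + 9 + 2) / 3 = 5.0. The value 7 occurs only at position 3, so its entry is dgs[3] = 8.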


Here is my attempt. The original code is repeated first for comparison:

import numpy as np  # needed for np.repeat in group_process below
import pandas as pd
from numpy import array, arange
from numpy.random import randint, seed

seed(42)
n_rows = 1000

d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping

f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame

print(d.groupby(grps).apply(h)) #Slow

### my code starts from here ###

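def group_process(df2):
    # Flatten the group to one long Series indexed by (row label, column label)
    s = df2.stack()
    # Repeat the first n_columns entries of dgs once per row, in the same order as s
    v = np.repeat(dgs[None, :df2.shape[1]], df2.shape[0], axis=0).ravel()
    # Mean of the dgs entries per (row, value) pair, then mean per value across rows
    per_row = pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean()
    # Originally .mean(level=1); the level argument was removed in pandas 2.0
    return per_row.groupby(level=1).mean()

print(d.groupby(grps).apply(group_process))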

Output:

               1         2         3         4         5         6         7  \
(1, 2]  4.621575  4.625887  4.775235  4.954321  4.566441  4.568111  4.835664   
(2, 3]  4.446347  4.138528  4.862613  4.800538  4.582721  4.595890  4.794183   
(3, 4]  4.776144  4.510119  4.391729  4.392262  4.930556  4.695776  4.630068   

               8         9  
(1, 2]  4.246085  4.520384  
(2, 3]  5.237360  4.418934  
(3, 4]  4.829167  4.681548  

[3 rows x 9 columns]
               1         2         3         4         5         6         7  \
(1, 2]  4.621575  4.625887  4.775235  4.954321  4.566441  4.568111  4.835664   
(2, 3]  4.446347  4.138528  4.862613  4.800538  4.582721  4.595890  4.794183   
(3, 4]  4.776144  4.510119  4.391729  4.392262  4.930556  4.695776  4.630068   

               8         9  
(1, 2]  4.246085  4.520384  
(2, 3]  5.237360  4.418934  
(3, 4]  4.829167  4.681548  

[3 rows x 9 columns]
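The two printed tables are identical. A check along the following lines can confirm that programmatically and give a rough timing comparison. This is a sketch under stated assumptions: the function names slow and fast, the smaller n_rows and the explicit loop over groups (used so that the comparison does not depend on how groupby.apply assembles its result) are illustrative choices, not part of the original code.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
n_rows = 300  # small on purpose; the nested version is slow

d = pd.DataFrame(rng.randint(1, 10, (n_rows, 8)))
dgs = np.array([3, 4, 1, 8, 9, 2, 3, 7, 10, 8])
grps = pd.cut(rng.randint(1, 5, n_rows), np.arange(1, 5))

def slow(df):
    # Nested groupby/apply, as in the question (h applied to one group).
    f = lambda x: dgs[x.index].mean()
    g = lambda x: x.groupby(x).apply(f)
    return df.apply(g, axis=1).mean(axis=0)

def fast(df):
    # Stack-based version, as in group_process above.
    s = df.stack()
    v = np.repeat(dgs[None, :df.shape[1]], df.shape[0], axis=0).ravel()
    per_row = pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean()
    return per_row.groupby(level=1).mean()

groups = list(d.groupby(grps, observed=True))

start = time.perf_counter()
res_slow = {key: slow(sub).sort_index() for key, sub in groups}
t_slow = time.perf_counter() - start

start = time.perf_counter()
res_fast = {key: fast(sub).sort_index() for key, sub in groups}
t_fast = time.perf_counter() - start

# Both versions should give the same values for the same labels in every group.
all_match = all(
    list(res_slow[key].index) == list(res_fast[key].index)
    and np.allclose(res_slow[key].values, res_fast[key].values)
    for key, _ in groups
)
print("results match:", all_match)
print("nested: %.3f s, stacked: %.3f s" % (t_slow, t_fast))
```

The measured times depend on the machine and the pandas version, so they are printed rather than assumed.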

70 , , 10 ** 7 .
