What is equivalent to R match () for python Pandas / numpy?

Question

What is equivalent to R match () for python Pandas / numpy?

I am a user of R and I cannot understand the equivalent of pandas matching (). I need to use this function to iterate over a bunch of files, grab a key piece of information and merge it back into the current data structure on the "url". In R, I would do something like this:

logActions <- read.csv("data/logactions.csv")
logActions$class <- NA

files = dir("data/textContentClassified/")
for( i in 1:length(files)){
    tmp <- read.csv(files[i])
    logActions$class[match(logActions$url, tmp$url)] <- 
            tmp$class[match(tmp$url, logActions$url)]
}

I don’t think I can use merge () or join (), since every time I will rewrite logActions $ class. I cannot use update () or comb_first () since it does not have the necessary indexing capabilities. I also tried to create a match () function based on this SO post , but I can't figure out how to make it work with DataFrame objects. I apologize if I miss something obvious.

python, - match() pandas:

from pandas import *
left = DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = NaN
right1 = DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = DataFrame({'url': ['bar.com'], 'class': [ 1]})

# Doesn't work:
left.join(right1, on='url')
merge(left, right, on='url')

# Also doesn't work the way I need it to:
left = left.combine_first(right1)
left = left.combine_first(right2)
left 

# Also does something funky and doesn't really work the way match() does:
left = left.set_index('url', drop=False)
right1 = right1.set_index('url', drop=False)
right2 = right2.set_index('url', drop=False)

left = left.combine_first(right1)
left = left.combine_first(right2)
left

:

    url  action  class
0   foo.com  0   0
1   foo.com  1   0
2   bar.com  0   1

, , .

+5

python merge pandas r match

Solomon 06 . '13 21:28

3

pandas.match, , R match.

+9

Wes McKinney 07 . '13 18:35

, :

#read in df containing actions in chunks:
tp = read_csv('/data/logactions.csv', 
  quoting=csv.QUOTE_NONNUMERIC,
  iterator=True, chunksize=1000, 
  encoding='utf-8', skipinitialspace=True,
  error_bad_lines=False)
df = concat([chunk for chunk in tp], ignore_index=True)

# set classes to NaN
df["klass"] = NaN
df = df[notnull(df['url'])]
df = df.reset_index(drop=True)

# iterate over text files, match, grab klass
startdate = date(2013, 1, 1)
enddate = date(2013, 1, 26) 
d = startdate

while d <= enddate:
    dstring = d.isoformat()
    print dstring

    # Read in each file w/ classifications in chunks
    tp = read_csv('/data/textContentClassified/content{dstring}classfied.tsv'.format(**locals()), 
        sep = ',', quoting=csv.QUOTE_NONNUMERIC,
        iterator=True, chunksize=1000, 
        encoding='utf-8', skipinitialspace=True,
        error_bad_lines=False)
    thisdatedf = concat([chunk for chunk in tp], ignore_index=True)
    thisdatedf=thisdatedf.drop_duplicates(['url'])
    thisdatedf=thisdatedf.reset_index(drop=True)

    thisdatedf = thisdatedf[notnull(thisdatedf['url'])]
    df["klass"] = df.klass.combine_first(thisdatedf.set_index('url').klass[df.url].reset_index(drop=True))

    # Now iterate
    d = d + timedelta(days=1)

0

Solomon 07 . '13 2:42

HYRY · Accepted Answer · 2013-04-06T22:36:59+0000

Edit

URL- , class url, URL- .

from pandas import *
left = DataFrame({'url': ['foo.com', 'bar.com', 'foo.com', 'tmp', 'foo.com'], 'action': [0, 1, 0, 2, 4]})
left["klass"] = NaN
right1 = DataFrame({'url': ['foo.com', 'tmp'], 'klass': [10, 20]})
right2 = DataFrame({'url': ['bar.com'], 'klass': [30]})

left["klass"] = left.klass.combine_first(right1.set_index('url').klass[left.url].reset_index(drop=True))
left["klass"] = left.klass.combine_first(right2.set_index('url').klass[left.url].reset_index(drop=True))

print left

, ?

import pandas as pd
left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = NaN
right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [ 1]})

pd.merge(left.drop("class", axis=1), pd.concat([right1, right2]), on="url")

:

   action      url  class
0       0  foo.com      0
1       1  foo.com      0
2       0  bar.com      1

NaN, .

What is equivalent to R match () for python Pandas / numpy?

More articles: