Find and combine duplicate rows in data.frame, but ignore column order

Question


I have a data.frame with 1000 rows and 3 columns. It contains a large number of duplicates, so I used plyr to combine the repeated rows and add a count for each combination, as described in this thread.

Here is an example of what I have (I still have the original data.frame with all the duplicates if I need to start from there):

   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15

However, the order of the columns does not matter. I just want to know how many rows have the same three entries in any order. How can I combine rows that contain the same entries, ignoring order? In this example, I would like to combine rows 1 and 5, as well as rows 3 and 4.

+5

r duplicates dataframe plyr

jdfinch3 Jun 09 '12 at 6:24


2

Sort the names within each row, then aggregate with ddply:

First, recreate the data:

dat <- "   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15"

x <- read.table(text=dat, header=TRUE)

Make a copy:

xx <- x

Use apply to sort the names within each row, then sum the totals for each combination:

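# Sort the three name columns within each row (column 4 is total)
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
# Sum the numeric column for each distinct combination of names
ddply(xx, .(name1, name2, name3), numcolwise(sum))
#   name1 name2 name3 total
# 1   Bob Frank   Joe    20
# 2   Bob  Fred   Sam    45
# 3 Frank   Sam   Tom    35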
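
For readers who would rather not load plyr, the same sort-then-aggregate idea can be sketched in base R. This is an illustrative variant rather than part of the answer above: it rebuilds the question's sample data inline, and the names name_cols, sorted_names and res are introduced here only for the example.

```r
# Sample data from the question, rebuilt inline so the block is self-contained
x <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE
)

name_cols <- c("name1", "name2", "name3")

# Sort the three names within each row. apply() returns one column per
# input row, so transpose to get back one row per record.
sorted_names <- t(apply(x[, name_cols], 1, sort))

xx <- x
xx[, name_cols] <- sorted_names

# Sum total over each distinct (sorted) combination of names
res <- aggregate(total ~ name1 + name2 + name3, data = xx, FUN = sum)
res
```

On the sample data this should give three rows, with rows 1 and 5 of the input merged and rows 3 and 4 merged.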

+4

Andrie Jun 09 '12 at 7:05

Tim P · Accepted Answer · 2012-06-09T07:03:14+0000

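One option is to build a key for each row from its sorted names, for example "Bob~Fred~Sam" for rows 1 and 5. Rows that share a key are duplicates.

Assuming your data frame is called dd: add a lookup column (the sorted names pasted together), sum total by lookup, then keep one row per key and attach the summed total: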

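# Build an order-independent key from the sorted names in each row
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))
# Sum total for each key
tab1 <- tapply(dd$total, dd$lookup, sum)
# Keep the first row for each key and attach the summed total
ee <- dd[match(unique(dd$lookup), dd$lookup), ]
ee$newtotal <- as.numeric(tab1)[match(ee$lookup, names(tab1))]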

ee now holds one row per unique combination of names, with the combined total in newtotal.
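
The key-building step can be checked on the question's sample data. The sketch below is only an illustration under that assumption: it rebuilds the five sample rows as dd and prints the keys and the per-key sums.

```r
# Sample data from the question, rebuilt inline so the block is self-contained
dd <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE
)

# Order-independent key: sort the names in each row and paste them together
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))

# Rows 1 and 5 share a key, as do rows 3 and 4
dd$lookup

# Sum total per key; tapply() returns a vector named by key
tab1 <- tapply(dd$total, dd$lookup, sum)
tab1
```

The "~" separator is safe here because none of the names contain it; if the real data could contain "~", a different separator would be needed to avoid two distinct combinations producing the same key.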

(Edit, in response to the OP:) To get a clean data frame with only the original columns:

outdf <- with(ee, data.frame(name1, name2, name3,
                             total = newtotal, stringsAsFactors = FALSE))

Here the total column is filled from newtotal.
