A data frame set based on unordered column pairs

I have a dataset that looks something like this:

     id1  id2   size
1   5400 5505      7
2   5033 5458      1
3   5452 2873     24
4   5452 5213      2
5   5452 4242     26
6   4823 4823      4
7   5505 5400     11

Where id1and id2are the only nodes in the graph, and sizeis the value assigned to the directional edge connecting them from id1to id2. This data set is quite large (just over 2 million rows). What I would like to do is summarize a size column grouped by unordered node pairs from id1and id2. For example, in the first line we have id1=5400and id2=5505. There is another line in the data frame, where id1=5505and id2=5400. In grouped data, the sum of the size columns for these two rows will be added to one row. So I want to summarize the data in which I group the (unordered) set (id1, id2). I found a way to do this usingapplywith a custom function that checks the inverse of a pair of columns in a complete dataset, but it is painfully slow. Does anyone know a way to do this differently, perhaps with plyror with something in the base packages that would be more efficient?

+5
source share
3 answers

One way is to create additional columns with pmaxand pminfrom id1and id2as follows. I am using a data.tablesolution here .

require(data.table)
DT <- data.table(DF)
# Following mnel suggestion, g1, g2 could be used directly in by
# and it could be even shortened by using `id1` and id2` as their names
DT.OUT <- DT[, list(size=sum(size)), 
        by=list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
#     id1  id2 size
# 1: 5400 5505   18
# 2: 5033 5458    1
# 3: 5452 2873   24
# 4: 5452 5213    2
# 5: 5452 4242   26
# 6: 4823 4823    4
+8
source

alternative method:

R> library(igraph)
R> DF
   id1  id2 size
1 5400 5505    7
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
7 5505 5400   11
R> g  <- graph.data.frame(DF, directed=F)
R> g  <- simplify(g, edge.attr.comb="sum", remove.loops=FALSE)
R> DF <- get.data.frame(g)
R> DF
   id1  id2 size
1 5400 5505   18
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
+3
source

My method with function aggregate{stats}:

> df
   id1  id2 size
1 5400 5505    7
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
7 5505 5400   11

> df[1:2] <- t(apply(df[1:2], 1, sort))
> aggregate(size ~ id1 + id2, data=df, FUN=sum)

   id1  id2 size
1 4823 4823    4
2 2873 5452   24
3 4242 5452   26
4 5213 5452    2
5 5033 5458    1
6 5400 5505   18
0
source

All Articles