Partitioning a list of character strings

Question

Here is my problem. I have a data set with 200 thousand rows.

Each row corresponds to a test conducted on a subject.
Subjects have an unequal number of tests.
Each test has a date.

I want to assign an index to each test. For instance, the first test of subject 1 will be 1 and the second test of subject 1 will be 2; the first test of subject 2 will be 1, and so on.

My strategy is to get a list of unique Subject identifiers and use lapply to subset the data set into a list of data frames, one per unique Subject identifier, so that each subject has its own data frame of tests. I could then sort each subject's data frame by date and assign an index to each test.

However, when I do this on a 200k x 32 data frame, my laptop (i5, Sandy Bridge, 4GB RAM) quickly runs out of memory.

I have 2 questions:

Is there a better way to do this?
If there is not, my only way around the memory limit is to split my unique SubjectID list into smaller sets (say 1000 SubjectIDs each), apply each set to the data set, and at the end merge the resulting lists together. In that case, how do I write a function that splits my SubjectID list into a given number of sections? For example, BreakPartition(Dataset, 5) would split the data set into 5 equal sections.

Here is the code to create some dummy data:

UniqueSubjectID <- sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse =""))
UniqueSubjectID <- subset(UniqueSubjectID, !duplicated(UniqueSubjectID))
Dataset <- data.frame(SubID = sample(sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse ="")),5000, replace = TRUE))
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), 5000, replace = TRUE)
Dataset <- cbind(Dataset, Dates)

r data.table plyr subset

Jackejr May 16 '12 at 7:59

2 answers

This looks like a job for the plyr package. I would add the index like this:

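library(plyr)
system.time(new_dat <- ddply(Dataset, .(SubID), function(dum) {
    # Dates are "%d.%m.%Y" strings, so parse them to sort chronologically
    dum <- dum[order(as.Date(as.character(dum$Dates), format = "%d.%m.%Y")), ]
    mutate(dum, index = seq_len(nrow(dum)))
  }))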

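This splits the data set by SubID and adds an index within each SubID, so the result has all rows of a SubID grouped together. Your example took about 2 seconds on my machine. I am not sure how ddply scales to your data size, but it is worth a try. If it is not fast enough, take a look at data.table, which is worth comparing against ddply for this kind of grouped operation.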
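For the data.table route mentioned above, a minimal sketch might look like the following. It assumes the data.table package is installed and uses a small made-up data frame (the name `toy`, its values, and the `ParsedDate` column are illustrative, not from the question). The dates are parsed first so that the sort is chronological, and the index is then built per SubID.

```r
library(data.table)

# Hypothetical toy data in the same shape as the question's Dataset
toy <- data.frame(
  SubID = c("b", "a", "b", "a", "b"),
  Dates = c("03.02.2010", "15.01.2010", "01.01.2010", "02.01.2010", "20.01.2010"),
  stringsAsFactors = FALSE
)

DT <- as.data.table(toy)
# Parse the "%d.%m.%Y" strings so sorting is chronological, not alphabetical
DT[, ParsedDate := as.Date(Dates, format = "%d.%m.%Y")]
setorder(DT, SubID, ParsedDate)
# .N is the number of rows in the current group, so this numbers tests 1..N per subject
DT[, index := seq_len(.N), by = SubID]
print(DT)
```

Because `:=` adds the column by reference, no per-subject copies of the data are created, which is the part that tends to matter on a machine with little RAM.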

Paul Hiemstra May 16 '12 at 8:39

leif · Accepted Answer · 2012-05-16T08:41:30+0000

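I would guess that the splitting/lapply step is what uses up the memory, so consider a more vectorized approach. Starting from a slightly modified version of your example code: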

n <- 200000
UniqueSubjectID <- replicate(500, paste(letters[sample(26, 5, replace=TRUE)], collapse =""))
UniqueSubjectID <- unique(UniqueSubjectID)
Dataset <- data.frame(SubID = sample(UniqueSubjectID , n, replace = TRUE))
Dataset$Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), n, replace = TRUE)

Now sort the data by subject ID and date, compute the run lengths of the IDs, and build the index from them:

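# Sort by subject, then chronologically (Dates are "%d.%m.%Y" strings)
Dataset <- Dataset[order(Dataset$SubID, as.Date(Dataset$Dates, format = "%d.%m.%Y")), ]
# Run lengths of consecutive identical IDs give the number of tests per subject
ids.rle <- rle(as.character(Dataset$SubID))
# sequence() yields 1..n for each run length
Dataset$SubIndex <- sequence(ids.rle$lengths)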

"SubIndex" "Dataset" . 4- Core 2 Duo.
