Applying the lm function to different data ranges and individual groups using data.table

How can I perform linear regression with a different data interval for each group in a data.table? I am currently using plyr, but it becomes very slow on large datasets. Any help speeding this up is greatly appreciated.

I have a data table that contains 10 CO2 readings per day over 10 days, for 10 plots and 3 fences. The days fall into different time periods, as described below.

I would like to perform a linear regression to determine the rate of change of CO2 for each combination of fence, plot and day, using a different calculation interval for each period. Period 1 should regress CO2 over samples 1-5, period 2 over samples 1-7, and period 3 over samples 1-9.

library(data.table)

CO2 <- rep(runif(10, 350, 359), 300)        # 10 days, 10 plots, 3 fences
count <- rep(1:10, 300)                     # 10 days, 10 plots, 3 fences
DOY <- rep(rep(152:161, each = 10), 30)     # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each = 1000)              # 10 days, 10 measurements, 10 plots
plot <- rep(rep(1:10, each = 100), 3)       # 10 days, 10 measurements, 3 fences
flux <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
# Days 152-155 are period 1, 156-157 period 2, 158-161 period 3
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)

I expect a result that gives me the R2 value and the slope of the fitted line for each plot, fence and DOY.

The data I provided is a small subsample; my real data has 1 * 10^6 rows. The following works, but it is slow:

library(plyr)

# Fit CO2 ~ count on the count window that applies to the group's period
model <- function(df) {
  lm(CO2 ~ count,
     data = subset(df, ifelse(df$period == 1, count > 1 & count < 5,
                       ifelse(df$period == 2, count > 1 & count < 7,
                                              count > 1 & count < 9))))
}

model_flux <- dlply(flux, .(fence, plot, DOY), model)

rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")
1 answer

Here is a data.table approach:

library(data.table)
flux <- as.data.table(flux)
setkey(flux, count)
# Flag the rows that fall inside the count window for their period
flux[, include := (period == 1 & count %in% 2:4) |
                  (period == 2 & count %in% 2:6) |
                  (period == 3 & count %in% 2:8)]
flux.subset <- flux[(include), ]
setkey(flux.subset, fence, plot, DOY)

# Fit one regression and return the coefficients and R-squared as a list
model <- function(df) {
  fit <- lm(CO2 ~ count, data = df)
  return(list(intercept = coef(fit)[1],
              slope = coef(fit)[2],
              rsquare = summary(fit)$r.squared))
}
# Apply model() to each fence/plot/DOY group
coefs_flux <- flux.subset[, model(.SD), by = "fence,plot,DOY"]
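
The include flag above hard-codes the count window for each period. If the windows might change, one possible variation is to keep them in a small lookup table and attach them with an update join. This is only a sketch: the `windows` table and its `lo` and `hi` columns are hypothetical names introduced here, and the data is rebuilt with random CO2 values in the same shape as the question's example.

```r
library(data.table)

# Rebuild example data with the same dimensions as in the question
# (CO2 values are random and purely illustrative).
set.seed(1)
flux <- data.table(
  CO2   = runif(3000, 350, 359),
  count = rep(1:10, 300),
  DOY   = rep(rep(152:161, each = 10), 30),
  fence = rep(1:3, each = 1000),
  plot  = rep(rep(1:10, each = 100), 3)
)
flux[, period := ifelse(DOY <= 155, 1, ifelse(DOY < 158, 2, 3))]

# Hypothetical lookup table: first and last count to keep in each period.
windows <- data.table(period = c(1, 2, 3), lo = 2, hi = c(4, 6, 8))

# Update join: copy the bounds onto every row of flux by period.
flux[windows, on = "period", c("lo", "hi") := .(i.lo, i.hi)]

# Filter once, before any model is fitted.
flux.subset <- flux[count >= lo & count <= hi]
```

The resulting `flux.subset` can be passed to the grouped `model(.SD)` call in the same way as before.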

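The difference from the plyr version is that model(...) here only fits the regression: the rows are filtered once, up front, instead of inside every call made by dlply(...).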
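
Since only the intercept, slope and R-squared of a one-predictor fit are needed, another option is to skip lm(...) and compute them per group from the closed-form expressions for simple linear regression (slope = cov(x, y) / var(x), R-squared = cor(x, y)^2). The following is a minimal sketch on made-up data shaped like flux.subset; the names `dt` and `coefs_fast` are illustrative, and it has not been timed against the lm-based version.

```r
library(data.table)

# Small synthetic data set shaped like flux.subset (values are made up).
set.seed(42)
dt <- CJ(fence = 1:3, plot = 1:10, DOY = 152:161, count = 2:8)
dt[, CO2 := 350 + 0.5 * count + rnorm(.N)]

# Closed-form simple regression per group; no lm object is built.
coefs_fast <- dt[, {
  slope <- cov(count, CO2) / var(count)
  list(intercept = mean(CO2) - slope * mean(count),
       slope     = slope,
       rsquare   = cor(count, CO2)^2)
}, by = .(fence, plot, DOY)]
```

These formulas apply only to a model with a single predictor and an intercept, such as CO2 ~ count; anything more complex still needs lm(...) or a comparable fitting routine.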
