Subsetting data up to the first occurrence of a value in R

Question

I am trying to subset my data so that it keeps rows only up to the first occurrence of a value. I am working with panel data that tracks workers' careers, and I want to subset it so that each person's rows are kept only up to the point where they first become a Boss.

id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1

Therefore, I would like the data to look like this:

id  year    name    job   job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1

This seems like basic censoring, but it is important for my analysis ...! Any help would be appreciated.

+3

r

song0089 Feb 07 '14 at 5:14

4 answers

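The sqldf package could do the job.

library(sqldf)
# Find the first year in which each person is a Boss
miny <- sqldf("select id, min(year) as year from df where job='Boss' group by id")
# Keep each person's rows up to and including that year
sqldf("select df.* from df join miny on (df.id=miny.id and df.year<=miny.year)")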

+2

Vyga Feb 07 '14 at 5:48

A dplyr-based solution, using lag() and cumall():

df <- read.table(header = TRUE, text = "
id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1
", stringsAsFactors = FALSE)

library(dplyr)

# Use mutate to see the values of the new variables
df %>% 
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), cumall(last_job != "Boss"))

# Use filter to see the results
df %>% 
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))

We use lag() to find out what job each person held in the previous year, and then use cumall() to keep all the rows up to and including the first instance of Boss. If the data has not already been sorted by year, you can use lag(job, order_by = year) so that lag() uses the year value rather than the row order to determine which row was the "last".
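
To make the lag-then-accumulate logic concrete, here is a minimal sketch for a single person's job history, written in base R so that it runs without dplyr. The shifted vector and the cumsum() test are stand-ins chosen for illustration, not dplyr's own implementation, and the sketch assumes the rows are already sorted by year and contain no missing values.

```r
# Job history for one person (id 2 in the example data), assumed sorted by year
job <- c("Manager", "Boss", "Manager", "Boss")

# Stand-in for lag(job, default = ""): shift the vector down by one position
last_job <- c("", head(job, -1))

# Stand-in for cumall(last_job != "Boss"): TRUE until the previous job is
# "Boss" for the first time, FALSE from that row onwards
keep <- cumsum(last_job == "Boss") == 0

# Inspect the intermediate columns side by side
print(data.frame(job, last_job, keep))

# Rows that survive the filter: everything up to and including the first Boss
job[keep]
```

Because the test looks at the previous row's job rather than the current one, the first Boss row itself is kept and everything after it is dropped.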

+2

hadley Feb 07 '14 at 13:41

If your data is stored in a data frame named df:

library(plyr)
ddply(.data=df, .variables=c("name"), .fun=function(x) {
  i <- which(x$job == "Boss")[1]
  if (!is.na(i)) x[1:i, ] # omit lifelong managers 
})
#   id year name     job job2
# 1  1 1990  Bon Manager    0
# 2  1 1991  Bon Manager    0
# 3  1 1992  Bon Manager    0
# 4  1 1993  Bon    Boss    1
# 5  2 1990 Jane Manager    0
# 6  2 1991 Jane    Boss    1

+1

lukeA Feb 07 '14 at 5:36

thelatemail · Accepted Answer · 2014-02-07T05:47:40+0000

Base R solution:

do.call(
  rbind,
  by(dat,dat$name,function(x) {
    if ("Boss" %in% x$job) x[1:min(which(x$job=="Boss")),]
  })
)

#       id year name     job job2
#Bon.1   1 1990  Bon Manager    0
#Bon.2   1 1991  Bon Manager    0
#Bon.3   1 1992  Bon Manager    0
#Bon.4   1 1993  Bon    Boss    1
#Jane.6  2 1990 Jane Manager    0
#Jane.7  2 1991 Jane    Boss    1

Alternative base R solution:

dat$keep <- with(dat, 
             ave(job=="Boss",name,FUN=function(x) if(1 %in% x) cumsum(x) else 2) 
            )
with(dat, dat[keep==0 | (job=="Boss" & keep==1),] )

#  id year name     job job2 keep
#1  1 1990  Bon Manager    0    0
#2  1 1991  Bon Manager    0    0
#3  1 1992  Bon Manager    0    0
#4  1 1993  Bon    Boss    1    1
#6  2 1990 Jane Manager    0    0
#7  2 1991 Jane    Boss    1    1

And a data.table solution:

library(data.table)
dat <- as.data.table(dat)
dat[,if("Boss" %in% job) .SD[1:min(which(job=="Boss"))],by=name]

#   name id year     job job2
#1:  Bon  1 1990 Manager    0
#2:  Bon  1 1991 Manager    0
#3:  Bon  1 1992 Manager    0
#4:  Bon  1 1993    Boss    1
#5: Jane  2 1990 Manager    0
#6: Jane  2 1991    Boss    1
