R: selecting rows from a data frame based on a set of values of interest appearing in specific columns

I have a large data file of visits to the doctor. I want to select only those rows in which at least one of the 11 diagnostic codes listed belongs to a specific set of diagnostic codes that interest me.

The data frame has 18 columns and 39019 rows. I'm interested in the diagnostic codes in columns 6:16. Below is sample data for only these 11 diagnostic columns (to protect identifiable information):

diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
786   272   401   782    250  91912  530    NA    NA    NA     NA   
845   530   338   311    NA    NA    NA     NA    NA    NA     NA

Here is the code I tried to use:

mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451, 460:466, 480:486, 490:493, 496, 786)
y = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x) sum((any(x !=NA %in% mydiag))))
y = as.data.frame(y)

As you can see, of the two sample rows I provided, I would like to keep the first but discard the second, because it contains none of the codes I want. The code I presented does not work: I get a vector of 39,019 values that are all 1. I assume the apply statement is somehow being evaluated as a single boolean, yet I know that not every row contains a code of interest, so I would expect a mix of 1s and 0s.

Is there a better way to accomplish this row selection task?

2 answers

I think you are over-complicating things with the !=NA test. Since NA does not appear in mydiag, you can drop it completely. So your expression can become:

goodRows <- apply(dat, 1, function(x) any(x %in% mydiag))
dat[goodRows,]
#---------------
  diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
1   786   272   401   782   250 91912   530    NA    NA     NA     NA
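For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two sample rows from the question. It assumes the data frame holds only the 11 diagnostic columns and is named dat, as in the snippet above; the construction via rbind() is my own and not part of the original data.

```r
# Codes of interest, copied from the question
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

# Rebuild the two sample rows (assumption: only the diag columns are present)
dat <- data.frame(rbind(
  c(786, 272, 401, 782, 250, 91912, 530, NA, NA, NA, NA),
  c(845, 530, 338, 311, NA, NA, NA, NA, NA, NA, NA)
))
names(dat) <- paste0("diag", 1:11)

# One logical value per row: TRUE if any code in that row is in mydiag
goodRows <- apply(dat, 1, function(x) any(x %in% mydiag))

# Keep only the matching rows
kept <- dat[goodRows, ]
print(kept)
```

With the sample data, only the first row should be kept, since 786 and 401 are both in mydiag.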

The problem arises from your function function(x) sum((any(x !=NA %in% mydiag)))

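x != NA should be !is.na(x), because comparing anything with NA just returns NA. Even with that change it would not do what you want: you would be checking whether the resulting logical values, rather than the codes themselves, are in mydiag. Instead, you would need to remove the NA values first and then check whether the remaining codes are in mydiag: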

x[!is.na(x)] %in% mydiag

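However, you do not need to remove the NA values at all. NA is not in mydiag, so wherever x is NA you simply get FALSE from x %in% mydiag. The function can therefore be just: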

function(x){any(x %in% mydiag)}
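A quick sketch to check this at the console, using the non-missing codes from the second sample row plus one NA (the vector x here is just an illustration). My reading of why the original call returned 1 everywhere is operator precedence: %in% binds more tightly than !=, so x != NA %in% mydiag is evaluated as x != (NA %in% mydiag), that is, x != FALSE.

```r
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

# Illustrative row: no code of interest, one missing value
x <- c(845, 530, 338, 311, NA)

# NA is not a member of mydiag
na_in_set <- NA %in% mydiag

# Comparing with NA never gives TRUE or FALSE, only NA
cmp_na <- x != NA

# Parsed as x != (NA %in% mydiag), i.e. x != FALSE: TRUE for every nonzero code
original <- any(x != NA %in% mydiag)

# The simplified test: NA entries just yield FALSE
simple <- any(x %in% mydiag)

print(list(na_in_set = na_in_set, cmp_na = cmp_na,
           original = original, simple = simple))
```

So the original expression flags this row even though none of its codes are in mydiag, while the simplified one does not.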

Putting it together, use the result to index the data frame:

# Logical vector: TRUE for each row that has at least one code in mydiag
id = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x){any(x %in% mydiag)})
# Just grab those rows
y <- dt[id, ]
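Since the question asks about a better way, here is a possible alternative that avoids apply() and works column by column instead. This is only a sketch: the two-row dt and its visit_id column are made up for illustration, standing in for the real 18-column data.

```r
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

# Hypothetical stand-in for the real data: an id column plus diag1..diag11
dt <- data.frame(rbind(
  c(786, 272, 401, 782, 250, 91912, 530, NA, NA, NA, NA),
  c(845, 530, 338, 311, NA, NA, NA, NA, NA, NA, NA)
))
names(dt) <- paste0("diag", 1:11)
dt <- cbind(visit_id = 1:2, dt)

diag_cols <- paste0("diag", 1:11)

# Test each column against mydiag, then OR the per-column results together
hits <- lapply(dt[diag_cols], `%in%`, mydiag)
id <- Reduce(`|`, hits)

# Keep all columns of the matching rows
y <- dt[id, ]
print(y)
```

This does 11 vectorised %in% calls rather than one function call per row, and it skips the matrix conversion that apply() performs. Whether that matters for about 39,000 rows is something you would have to time yourself.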