I have a large data file about visits to the doctor. I want to select only those rows in which at least one of the 11 diagnostic codes is found in a specific set of diagnostic codes that interest me.
The data frame has 18 columns and 39,019 rows. I'm interested in the diagnostic codes in columns 6:16. Below is sample data for only these 11 diagnostic columns (the other columns are omitted to protect identifiable information):
diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
786 272 401 782 250 91912 530 NA NA NA NA
845 530 338 311 NA NA NA NA NA NA NA
Here is the code I tried to use:
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451, 460:466, 480:486, 490:493, 496, 786)
y = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x) sum((any(x !=NA %in% mydiag))))
y = as.data.frame(y)
As you can see from the two sample rows, I would like to keep the first row and drop the second, because it contains none of the codes I want. The code above does not work: I get a vector of 39,019 values, all equal to 1. I assume the expression inside apply is being evaluated as a boolean somehow, but I know that not every row contains a code of interest, so I would expect a mix of 1s and 0s.
Is there a better way to accomplish this row selection task?
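For reference, here is a minimal sketch of the behaviour I am after, using only the two sample rows. It assumes dt is a plain data frame (not a data.table) and that the codes are stored as numbers; the names diag_cols, keep and selected are just illustrative. As far as I can tell, the all-1 result comes from operator precedence: %in% binds more tightly than !=, so x != NA %in% mydiag is read as x != (NA %in% mydiag), i.e. x != FALSE, which is TRUE for every non-zero code. Since %in% already returns FALSE for NA, dropping the NA comparison seems to be enough:

```r
# The two sample rows from above, rebuilt as a data frame (illustrative only)
dt <- data.frame(
  diag1 = c(786, 845), diag2 = c(272, 530), diag3 = c(401, 338),
  diag4 = c(782, 311), diag5 = c(250, NA),  diag6 = c(91912, NA),
  diag7 = c(530, NA),  diag8 = NA, diag9 = NA, diag10 = NA, diag11 = NA
)

mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445,
            451, 460:466, 480:486, 490:493, 496, 786)

diag_cols <- paste0("diag", 1:11)

# For each row: TRUE if at least one diagnostic code is in mydiag.
# NA entries never match, because mydiag itself contains no NA.
keep <- apply(dt[, diag_cols], 1, function(x) any(x %in% mydiag))

# Keep only the matching rows (all columns)
selected <- dt[keep, ]
```

On the sample data this keeps the first row and drops the second. I have not checked how well it scales to the full 39,019 rows, so a faster or more idiomatic alternative would still be welcome.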