R: cleanup outliers for each column in the data frame using quantiles of 0.05 and 0.95

Question

I'm new to R. I want to clean up outliers and scale every column to the range 0 to 1 before feeding the sample to a random forest.

g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

If I do a simple scaling from 0 to 1, the result will be:

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0

So my idea is to replace every value in a column that is greater than the 0.95 quantile with the largest value that does not exceed that quantile, and likewise to replace every value below the 0.05 quantile with the smallest value that is not below it.

So the vector after outlier replacement, before scaling, would be (the first and last values have been replaced):

g<-c(70,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,40)

and scaled:

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0
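The cutoffs behind this example can be checked directly. The following is only a minimal sketch, assuming R's default quantile definition (type 7, linear interpolation); the names `cutoffs` and `outside` are illustrative and not part of the original question.

```r
g <- c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)

# quantile() defaults to type 7, which interpolates between order statistics
cutoffs <- quantile(g, c(0.05, 0.95))
print(cutoffs)

# Positions of the values that fall outside the two cutoffs
outside <- which(g < cutoffs[1] | g > cutoffs[2])
print(g[outside])
```

Under that default the cutoffs should come out as 34 and 256, so only the first value (1000) and the last value (10) are flagged, which matches the two replacements shown above.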

I need this for every column of a data frame, so I imagine the implementation in R would look something like this:

> apply(c, 2, function(x) x[x > quantile(x, 0.95)] <- max(x[x, ... max without the quantile(x, 0.95))

Can anyone help?


Example data:

a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)

b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

c<-cbind(a,b)

c<-as.data.frame(c)


Rainer · asked 12 Mar '11 at 10:08


hadley · Answer 1 · 2011-03-12T16:05:39+0000

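Keep in mind that clipping at the 0.05 and 0.95 quantiles alters up to about 10% of the values in every column, whether or not they are really outliers!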

Sacha Epskamp · Answer 2 · 2011-03-12T10:37:54+0000

In R, a function along these lines should do it:

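foo <- function(x)
{
    # 5% and 95% cutoffs for this column
    quant <- quantile(x,c(0.05,0.95))
    # pull values below the lower cutoff up to the smallest value at or above it
    x[x < quant[1]] <- min(x[x >= quant[1]])
    # pull values above the upper cutoff down to the largest value at or below it
    x[x > quant[2]] <- max(x[x <= quant[2]])
    # rescale to [0, 1] and round to one decimal place
    return(round((x - min(x))/abs(max(x) - min(x)),1))
}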

Then apply it to every column with sapply:

sapply(c,foo)
       a   b
 [1,] 1.0 1.0
 [2,] 0.7 0.7
 [3,] 0.3 0.3
 [4,] 0.7 0.7
 [5,] 0.3 0.3
 [6,] 0.0 0.0
 [7,] 0.3 0.3
 [8,] 0.7 0.7
 [9,] 1.0 1.0
[10,] 0.7 0.7
[11,] 0.0 0.0
[12,] 1.0 1.0
[13,] 0.3 0.3
[14,] 0.7 0.7
[15,] 0.3 0.3
[16,] 1.0 1.0
[17,] 0.0 0.0
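Note that `sapply` returns a matrix here. If a data frame is needed downstream, one possible variant is sketched below. It is not part of the original answer: the names `df`, `clip_and_scale` and `scaled` are hypothetical, rounding is dropped, and missing values are assumed to be something you want left untouched.

```r
df <- data.frame(
  a = c(100, 6, 5, 6, 5, 4, 5, 6, 7, 6, 4, 7, 5, 6, 5, 7, 1),
  b = c(1000, 60, 50, 60, 50, 40, 50, 60, 70, 60, 40, 70, 50, 60, 50, 70, 10)
)

# Hypothetical helper: same clipping rule as foo(), without rounding
clip_and_scale <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs, na.rm = TRUE)
  # which() drops NA comparisons, so missing values are left as they are
  x[which(x < q[1])] <- min(x[which(x >= q[1])])
  x[which(x > q[2])] <- max(x[which(x <= q[2])])
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Assigning into scaled[] keeps the data.frame class and column names
scaled <- df
scaled[] <- lapply(df, clip_and_scale)
str(scaled)
```

Rounded to one decimal place, both columns should match the `sapply` output above. A column in which all values are equal would lead to a division by zero, so it would need separate handling.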
