ggplot2 density plot: filtering by number of observations

Is it possible to filter out subsets of data that have a small number of observations in a ggplot2 call?

For example, take the following chart: qplot(price,data=diamonds,geom="density",colour=cut)

Density plot

The plot is a little busy, and I would like to exclude the values of cut with a small number of observations, i.e.

> xtabs(~cut,diamonds)
cut
     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

the levels Fair and Good of the factor cut.

I would like a solution that works on an arbitrary data set and, if possible, can select groups not only by a threshold number of observations but also, for example, by taking the top 3.

4 answers
library(ggplot2)
library(plyr)  # provides count(), arrange(), desc() and .()

ggplot(subset(diamonds, cut %in% arrange(count(diamonds, .(cut)), desc(freq))[1:3,]$cut),
  aes(price, colour=cut)) + 
  geom_density() + facet_grid(~cut)
  • count counts the occurrences of each value in a data.frame.
  • arrange orders the data.frame by the specified column.
  • desc sorts in descending order.
  • Finally, the data is subset to the rows whose cut is among the top 3, using %in%.
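
The steps in the one-liner above can also be spelled out one at a time. The following is a minimal sketch in base R rather than plyr, run on a small made-up data frame; the names toy, counts, top3 and kept are illustrative and not part of the original answer.

```r
# Toy data standing in for diamonds: a grouping column with uneven counts.
toy <- data.frame(
  grp   = rep(c("a", "b", "c", "d", "e"), times = c(2, 5, 9, 12, 20)),
  value = seq_len(48)
)

# Step 1: count observations per group (the role of count() in the answer).
counts <- as.data.frame(table(grp = toy$grp), responseName = "freq")

# Step 2: order by descending frequency (the role of arrange() and desc()).
counts <- counts[order(counts$freq, decreasing = TRUE), ]

# Step 3: take the three most frequent groups.
top3 <- as.character(counts$grp[1:3])

# Step 4: keep only the rows whose group is among those three.
kept <- subset(toy, grp %in% top3)
```

Assuming ggplot2 is loaded, kept could then be passed to ggplot() in place of the subset(...) expression.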


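firstx <- function (category, data, x = 1:3) {
  # Evaluate the unquoted column name inside data, so firstx(cut, diamonds)
  # picks up diamonds$cut rather than the base function cut()
  values <- eval(substitute(category), data, parent.frame())
  tab <- table(values)

  # Names of the x most frequent categories
  names(tab)[order(tab, decreasing = TRUE)[x]]
}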

#Then use subset to subset the data and droplevels to drop unused levels
#so they don't clutter the legend.
ggplot(droplevels(subset(diamonds, cut %in% firstx(cut, diamonds))), 
       aes(price, color = cut)) + geom_density()


This seems to call for writing your own subsetting function, perhaps something like this:

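mySubset <- function(dat,largestK=3,thresh=NULL){
   tbl <- sort(table(dat))
   if (is.null(thresh)){
      # keep the largestK most frequent values
      return(dat %in% tail(names(tbl),largestK))
   }
   else{
      # keep values observed at least thresh times
      return(dat %in% names(tbl)[tbl >= thresh])
   }
}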

This can be used in the ggplot call as follows:

ggplot(diamonds[mySubset(diamonds$cut),],...)

This code does not drop unused factor levels, so keep that in mind. I usually leave categorical variables as characters for this reason, unless I absolutely need them to be ordered.
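
To illustrate the point about factor levels, here is a small sketch on made-up data (toy, kept and kept_clean are hypothetical names). It shows that logical subsetting leaves the original levels in place and that droplevels() removes the ones that no longer occur.

```r
# A factor column with three levels, standing in for diamonds$cut.
toy <- data.frame(
  grp   = factor(rep(c("a", "b", "c"), times = c(1, 4, 6))),
  value = seq_len(11)
)

# Logical subsetting keeps every original level, even ones with no rows left.
kept <- toy[toy$grp %in% c("b", "c"), ]
levels(kept$grp)   # still includes "a"
table(kept$grp)    # "a" is listed with a count of 0

# droplevels() removes the levels that no longer occur in the data.
kept_clean <- droplevels(kept)
levels(kept_clean$grp)
```

With a character column instead of a factor there are no levels to carry over, which is the trade-off described above.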

## Top 3 cuts
tmp <- names(sort(summary(diamonds$cut), decreasing = TRUE))[1:3]
## Use %in% rather than ==, which would recycle tmp and drop most rows
tmp <- droplevels(subset(diamonds, cut %in% tmp))
ggplot(tmp, aes(price, color=cut)) + geom_density()


But have you considered faceting by cut instead?

ggplot(diamonds, aes(price, color=cut)) + geom_density() + facet_grid(~cut)

