Generate data where the cell counts are random but the row sums stay the same

I need to create a bunch of fake data sets in which the sum of the two variables in each row is the same as in my real data, but the counts for each variable are random. Here's the setup:

>df
    X.1  X.2
1   145   30
2    55   73   

The first row sums to 175 and the second to 128. What I'm looking for is a way to create a data frame (or a collection of data frames) like the following:

>df.2
    X.1  X.2
1   100   75
2    90   38

In df.2 the cell counts have changed, but each row still sums to the same total. The actual data contains hundreds of rows, but only two variables, if that helps. I tried to figure out how to do this with sample(), but had no luck. Any suggestions?

Thanks!

+5

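EDIT 2

  • pass expected as an argument to the function

For each row, rmultinom draws the cell counts, using the row's expected counts as the probabilities and the row sum as the size; t turns each draw back into a row.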

replicates <- 10
expected <- data.frame(X1  = c(100,90,30),X2 = c(75,28,120))
##    X1  X2
## 1 100  75
## 2  90  28
## 3  30 120
data_samples <- lapply(seq(replicates), function(i, expected){
   # create a list of expected cell counts (list element = row of expected)
  .list <- lapply(apply(expected,1,list),unlist)
   # sample from these expected cell counts and recombine into a data.frame
   as.data.frame(do.call(rbind,lapply(.list, function(.x) t(rmultinom(n = 1, prob = .x,  size = sum(.x) )))))
   }, expected = expected)

This returns a list of data.frames:

data_samples[[1]]
##    X1  X2
## 1 104  71
## 2  84  34
## 3  19 131


data_samples[[5]]
##   X1  X2
## 1 88  87
## 2 92  26
## 3 27 123
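A quick way to confirm that every replicate preserves the row totals is to compare rowSums against the expected table. The following is only a sketch, not part of the answer above: it redraws the samples with a shorter helper (sample_rows is a hypothetical name introduced here) and checks each replicate.

```r
set.seed(1)  # only to make this sketch reproducible
expected <- data.frame(X1 = c(100, 90, 30), X2 = c(75, 28, 120))

# hypothetical helper: one multinomial draw per row of `expected`
sample_rows <- function(expected) {
  m <- as.matrix(expected)
  # each draw uses the row's counts as weights and the row sum as size
  out <- t(apply(m, 1, function(x) rmultinom(1, size = sum(x), prob = x)))
  colnames(out) <- colnames(m)
  as.data.frame(out)
}

data_samples <- replicate(10, sample_rows(expected), simplify = FALSE)

# TRUE if every replicate has the same row sums as `expected`
all(vapply(data_samples,
           function(d) all(rowSums(d) == rowSums(expected)),
           logical(1)))
```

Because rmultinom always allocates exactly size items, the check holds for any seed.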
+5

Perhaps you want r2dtable?

> r2dtable(2, c(175,128), c(190, 113))
[[1]]
     [,1] [,2]
[1,]  108   67
[2,]   82   46

[[2]]
     [,1] [,2]
[1,]  114   61
[2,]   76   52
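Note that r2dtable holds the column totals fixed as well as the row totals, so it fits only if both margins should be preserved. As a sketch (the column names X.1 and X.2 are taken from the question; the seed is arbitrary), the matrices can be turned into data frames and both margins checked:

```r
set.seed(42)  # arbitrary seed, only for reproducibility
tabs <- r2dtable(3, r = c(175, 128), c = c(190, 113))

# convert each 2x2 matrix into a data frame with the question's column names
dfs <- lapply(tabs, function(m) {
  d <- as.data.frame(m)
  names(d) <- c("X.1", "X.2")
  d
})

sapply(dfs, rowSums)  # one column per replicate; rows sum to 175 and 128
sapply(dfs, colSums)  # one column per replicate; columns sum to 190 and 113
```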

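Alternatively, as in @mnel's answer, you can use rmultinom, but let its n argument generate all the replicates for a row in one call. The result is an array in which the third dimension indexes the replicates.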

n <- 10
e <- cbind(X1  = c(100,90,30),X2 = c(75,28,120))
aperm(array(sapply(1:nrow(e), function(i) 
        rmultinom(n, rowSums(e)[i], (e/rowSums(e))[i,])),
      dim=c(ncol(e),n,nrow(e))), c(3,1,2))
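To make the shape of that result concrete, here is a small sketch that stores the array (res is a name introduced here for illustration) and pulls out individual replicates:

```r
set.seed(1)  # only for reproducibility
n <- 10
e <- cbind(X1 = c(100, 90, 30), X2 = c(75, 28, 120))
res <- aperm(array(sapply(1:nrow(e), function(i)
          rmultinom(n, rowSums(e)[i], (e / rowSums(e))[i, ])),
        dim = c(ncol(e), n, nrow(e))), c(3, 1, 2))

dim(res)                 # rows x columns x replicates
res[, , 1]               # the first simulated table
apply(res, 3, rowSums)   # one column per replicate, each equal to rowSums(e)
```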
+6

Test data:

test <- data.frame(X.1=c(145,55),X.2=c(30,73))

Using sample:

t(sapply(
        rowSums(test),
        function(x) {
                one <- sample(1:x,1)
                two <- (x - one)
                result <- data.frame(one,two)
                names(result) <- names(test)
                return(result)
                }
         )
)

Result:

     X.1 X.2
[1,] 20  155
[2,] 127 1  

And again:

     X.1 X.2
[1,] 111 64 
[2,] 94  34 
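If a plain data frame per replicate is more convenient than the transposed sapply result, the same idea can be wrapped in a small function. This is a sketch under the same assumption as the code above (the first cell is drawn uniformly from 1 to the row total); split_rows is a hypothetical name.

```r
set.seed(1)  # only for reproducibility
test <- data.frame(X.1 = c(145, 55), X.2 = c(30, 73))

# hypothetical helper: uniform split of each row total into two cells
split_rows <- function(df) {
  totals <- rowSums(df)
  # sample.int(x, 1) draws one value from 1..x, like sample(1:x, 1)
  first <- vapply(totals, function(x) sample.int(x, 1), integer(1))
  out <- data.frame(first, totals - first)
  names(out) <- names(df)
  out
}

fake <- replicate(5, split_rows(test), simplify = FALSE)
fake[[1]]
```

Unlike the multinomial draws, a uniform split ignores the original proportions, so the fake cells can land far from the real ones, as in the 20 / 155 row above.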

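Alternatively, use jitter to perturb the first column by up to 20 in either direction, then subtract it from the row sum to get the second: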

t(apply(
        test,
        1,
        function(x) {
                rsum <- sum(x)
                one <- round(jitter(x[1],20,20),0)
                two <- (rsum - one)
                result <- c(one,two)
                names(result) <- names(test)
                return(result)
                }
    )
)

Result (two separate runs):

     X.1 X.2
[1,] 160  15
[2,]  47  81

     X.1 X.2
[1,] 127  48
[2,]  64  64
+2

If you have a total sample size of, say, n = 40 spread over 4 cells in 2 columns, then the call would be:

rmultinom(2, size = 40/4, prob = c(0.5,0.5))
     [,1] [,2]
[1,]    6    3
[2,]    4    7

If you want a function that delivers this kind of result with a given probability for each row, then:

 my_mat_rand <- function(tot, coln, probs){
     rmultinom(coln, size = tot/length(probs), prob = probs) }

> my_mat_rand(tot=40, coln=2, probs  = c(0.5,0.5))
     [,1] [,2]
[1,]   11   10
[2,]    9   10
> my_mat_rand(40, 2, probs  = c(0.5,0.5))
     [,1] [,2]
[1,]    8   13
[2,]   12    7

If you want the probabilities to be "random" as well, use runif to set the first element of the probs vector and 1 minus that value for the second.
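As a sketch of that last suggestion (p is a name introduced here; my_mat_rand is repeated so the block runs on its own):

```r
set.seed(1)  # only for reproducibility
my_mat_rand <- function(tot, coln, probs) {
  rmultinom(coln, size = tot / length(probs), prob = probs)
}

p <- runif(1)  # random probability for the first row
m <- my_mat_rand(tot = 40, coln = 2, probs = c(p, 1 - p))
m
colSums(m)     # each column sums to tot / length(probs) = 20
```

Note that it is the column totals that are fixed here; use t(m) if the fixed totals should run along the rows, as in the question.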

0