Seed and clusterApply - how to choose a specific run?

I am running k-means on a large data set (636,688 rows x 7 columns) and have therefore turned to parallelization. My results need to be reproducible, which I can achieve with clusterSetRNGStream from the parallel package. Here is an example using the Boston data set from the MASS package:

library(parallel)
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, iseed = 1234)
clusterEvalQ(cl, library(MASS))
results <- clusterApply(cl, rep(25, 4), function(nstart) kmeans(Boston, 4, nstart = nstart))
check.results <- sapply(results, function(result) result$size)
stopCluster(cl)

Each column of check.results gives the number of observations in each cluster for one run of the k-means algorithm. My check.results looks like this:

     [,1] [,2] [,3] [,4]
[1,]   38  268  102  102
[2,]  268   98   98   38
[3,]   98  102   38  268
[4,]  102   38  268   98

If I change results to use rep(25, 2) instead of rep(25, 4), I get:

     [,1] [,2]
[1,]   38  268
[2,]  268   98
[3,]   98  102
[4,]  102   38

Perfect: these are the same as the first two columns of the result with 4 runs, so the output is reproducible.

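But suppose I want to reproduce only one specific run, say the 4th or the 3rd, without repeating the others. How can I do that? Can it be done through the iseed argument of clusterSetRNGStream?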

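clusterSetRNGStream sets the seed on each worker, not for each task. The results are therefore reproducible only when the same tasks are executed by the same workers in the same order, as happens with clusterApply. To reproduce an individual run, the seed has to be set per task rather than per worker.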
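The per-worker behaviour of clusterSetRNGStream can be seen in a small experiment. This is only a sketch: it assumes a two-worker cluster and replaces kmeans with a single runif draw so that the effect is easy to read.

```r
library(parallel)

cl <- makeCluster(2)  # assumption: two workers are enough to show the effect

# Seed the workers, then run four tasks that each draw one number.
clusterSetRNGStream(cl, iseed = 1234)
all_four <- unlist(clusterApply(cl, 1:4, function(i) runif(1)))

# Reseed identically, but submit only "task 3".
clusterSetRNGStream(cl, iseed = 1234)
third_only <- unlist(clusterApply(cl, 3, function(i) runif(1)))

stopCluster(cl)

# The lone task is sent to worker 1 and consumes that worker's first draw,
# so it reproduces task 1 of the full run rather than task 3.
matches_first <- identical(third_only, all_four[1])
matches_third <- identical(third_only, all_four[3])
```

Here matches_first should be TRUE and matches_third FALSE, which is why reseeding the cluster alone cannot single out a later run.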

First, a function is needed that generates one seed per task. This can be done with nextRNGSubStream:

library(parallel)
# This is based on the clusterSetRNGStream function from
# the parallel package, copyrighted by The R Core Team
getseeds <- function(ntasks, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}

Next, create the cluster and, in place of calling clusterSetRNGStream, set the RNG kind on each worker to "L'Ecuyer-CMRG":

cl <- makeCluster(detectCores())
clusterEvalQ(cl, { library(MASS); RNGkind("L'Ecuyer-CMRG") })

The worker function then assigns the task's seed to ".Random.seed" in the global environment before calling kmeans:

worker <- function(nstart, seed, centers=4) {
  assign(".Random.seed", seed, envir=.GlobalEnv)
  kmeans(Boston, centers, nstart = nstart)
}
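The seeding mechanism used by worker can be tried in a single session, without a cluster. The following is a minimal sketch; draw is a hypothetical helper that stands in for worker and calls runif in place of kmeans. Note that it switches the RNG kind of the current session.

```r
library(parallel)  # provides nextRNGSubStream

RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
seed_a <- .Random.seed               # seed for a first task
seed_b <- nextRNGSubStream(seed_a)   # seed for a second task

# Hypothetical helper: install a seed in the global environment,
# then draw three uniform numbers from it.
draw <- function(seed) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  runif(3)
}

same_seed_same_draws  <- identical(draw(seed_b), draw(seed_b))
other_seed_same_draws <- identical(draw(seed_a), draw(seed_b))
```

Installing the same seed replays the same draws (same_seed_same_draws is TRUE), while the two sub-streams give different numbers (other_seed_same_draws is FALSE).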

Since each task now takes both an nstart and a seed, use clusterMap instead of clusterApply:

n <- 4
nstarts <- rep(25, n)
seeds <- getseeds(n, 1234)
results <- clusterMap(cl, worker, nstarts, seeds)

To re-run a specific task, for example the fourth, select it by index:

itasks <- c(4)
results <- clusterMap(cl, worker, nstarts[itasks], seeds[itasks])
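As a check, the re-run can be compared with the corresponding task from the full run. This sketch repeats getseeds and worker so that it runs on its own, and assumes a two-worker cluster; the names full and rerun are only illustrative.

```r
library(parallel)

# Per-task seeds, as in getseeds above.
getseeds <- function(ntasks, iseed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(iseed)
  seeds <- vector("list", ntasks)
  seeds[[1]] <- .Random.seed
  for (i in seq_len(ntasks - 1)) {
    seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
  }
  seeds
}

# Install the task's seed, then run kmeans on Boston.
worker <- function(nstart, seed, centers = 4) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  kmeans(Boston, centers, nstart = nstart)
}

cl <- makeCluster(2)  # assumption: any number of workers should do
invisible(clusterEvalQ(cl, { library(MASS); RNGkind("L'Ecuyer-CMRG") }))

nstarts <- rep(25, 4)
seeds <- getseeds(4, 1234)

full  <- clusterMap(cl, worker, nstarts, seeds)        # all four tasks
rerun <- clusterMap(cl, worker, nstarts[4], seeds[4])  # task 4 only

stopCluster(cl)

# The re-run should give the same clustering as task 4 of the full run.
same_sizes   <- identical(full[[4]]$size, rerun[[1]]$size)
same_cluster <- identical(full[[4]]$cluster, rerun[[1]]$cluster)
```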

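Note that this approach also works when clusterMap is called with .scheduling="dynamic", where tasks are not assigned to workers in a fixed order, which is not the case with clusterSetRNGStream.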
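The effect of per-task seeds under dynamic scheduling can be checked with a cheap stand-in task. This is a sketch under assumptions: task is a hypothetical function that draws two uniform numbers in place of kmeans, and the cluster has two workers.

```r
library(parallel)

# Build one sub-stream seed per task, in the same way as getseeds.
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
n <- 6
seeds <- vector("list", n)
seeds[[1]] <- .Random.seed
for (i in seq_len(n - 1)) {
  seeds[[i + 1]] <- nextRNGSubStream(seeds[[i]])
}

# Hypothetical task: install its own seed, then draw two numbers.
task <- function(seed) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  runif(2)
}

cl <- makeCluster(2)
static  <- clusterMap(cl, task, seeds)
dynamic <- clusterMap(cl, task, seeds, .scheduling = "dynamic")
stopCluster(cl)

# Each task carries its own seed, so the worker that runs it does not matter.
same_results <- identical(static, dynamic)
```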


Also, clusterMap accepts a MoreArgs argument, which can be used to pass centers to worker:

results <- clusterMap(cl, worker, nstarts, seeds, MoreArgs=list(centers=5))