Run cvb in mahout 0.8

Mahout 0.8-SNAPSHOT currently includes a collapsed variational Bayes (cvb) implementation for topic modeling and has removed the Latent Dirichlet Allocation (lda) implementation, because cvb parallelizes much better. Unfortunately, the only documentation on running an example and producing meaningful output covers lda.

So I want to:

  • preprocess some texts correctly
  • run the cvb0_local version of cvb
  • check the results by looking at the top n words of each generated topic
2 answers

Here are the Mahout commands I had to run in a Linux shell to do this. $MAHOUT_HOME points to my mahout/bin folder.

$MAHOUT_HOME/mahout seqdirectory \
    -i path/to/directory/with/texts \
    -o out/sequenced

$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
    -o out/sparseVectors \
    --namedVector \
    -wt tf

$MAHOUT_HOME/mahout rowid \
    -i out/sparseVectors/tf-vectors/ \
    -o out/matrix

$MAHOUT_HOME/mahout cvb0_local \
    -i out/matrix/matrix \
    -d out/sparseVectors/dictionary.file-0 \
    -a 0.5 \
    -top 4 -do out/cvb/do_out \
    -to out/cvb/to_out
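The four steps above can be collected into one script. This is only a minimal sketch: `run_step`, `run_pipeline` and the `DRY_RUN` variable are hypothetical helpers of mine, not part of Mahout, and the flags are copied unchanged from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: chain the preprocessing and cvb0_local steps shown above.
# run_step, run_pipeline and DRY_RUN are hypothetical helpers, not Mahout features.

# Print the command when DRY_RUN=1, otherwise run it via $MAHOUT_HOME/mahout.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "mahout $*"
    else
        "$MAHOUT_HOME/mahout" "$@"
    fi
}

# $1 = directory with the input texts, $2 = output directory.
# The steps are joined with && so the chain stops at the first failure.
run_pipeline() {
    local texts="$1" out="$2"
    run_step seqdirectory -i "$texts" -o "$out/sequenced" &&
    run_step seq2sparse -i "$out/sequenced" -o "$out/sparseVectors" \
        --namedVector -wt tf &&
    run_step rowid -i "$out/sparseVectors/tf-vectors/" -o "$out/matrix" &&
    run_step cvb0_local -i "$out/matrix/matrix" \
        -d "$out/sparseVectors/dictionary.file-0" \
        -a 0.5 -top 4 -do "$out/cvb/do_out" -to "$out/cvb/to_out"
}
```

Calling `DRY_RUN=1 run_pipeline path/to/directory/with/texts out` only prints the four commands, which is a cheap way to check the paths before starting a real run.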

Then, to print the top 10 words of each topic:

$MAHOUT_HOME/mahout vectordump \
    -i out/cvb/to_out \
    --dictionary out/sparseVectors/dictionary.file-0 \
    --dictionaryType sequencefile \
    --vectorSize 10 \
    -sort out/cvb/to_out
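Since the goal was the top n words rather than exactly 10, the dump can be wrapped so that n is a parameter. A rough sketch, assuming `--vectorSize` is what limits the number of entries printed per topic (as the value 10 above suggests); `dump_topics` and `DRY_RUN` are hypothetical names, not part of Mahout.

```shell
# Sketch: print the top N words per topic.
# dump_topics and DRY_RUN are hypothetical helpers, not part of Mahout.
dump_topics() {
    # $1 = number of words per topic (defaults to 10, as in the command above)
    set -- vectordump \
        -i out/cvb/to_out \
        --dictionary out/sparseVectors/dictionary.file-0 \
        --dictionaryType sequencefile \
        --vectorSize "${1:-10}" \
        -sort out/cvb/to_out
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "mahout $*"      # show the command only
    else
        "$MAHOUT_HOME/mahout" "$@"     # actually run vectordump
    fi
}
```

For example, `dump_topics 25` would request 25 entries per topic with the same input and dictionary paths.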

Thanks, JoKnopp.

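I got this error: Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

The fix is to set the "maxIterations" option explicitly: --maxIterations (-m) maxIterations

I used -m 20, and it worked.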

See: https://issues.apache.org/jira/browse/MAHOUT-1141
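For reference, this is roughly how the cvb0_local call looks with the iteration count passed explicitly. It is a sketch only: `run_cvb` and `DRY_RUN` are hypothetical helpers, not part of Mahout, and every flag other than `-m` is taken unchanged from the first answer.

```shell
# Sketch: cvb0_local with an explicit maxIterations value.
# run_cvb and DRY_RUN are hypothetical helpers, not part of Mahout.
run_cvb() {
    # $1 = maximum number of iterations (defaults to 20, the value used above)
    set -- cvb0_local \
        -i out/matrix/matrix \
        -d out/sparseVectors/dictionary.file-0 \
        -a 0.5 \
        -top 4 -do out/cvb/do_out \
        -to out/cvb/to_out \
        -m "${1:-20}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "mahout $*"      # show the command only
    else
        "$MAHOUT_HOME/mahout" "$@"     # actually run cvb0_local
    fi
}
```

`DRY_RUN=1 run_cvb` prints the command with `-m 20` appended; `run_cvb 50` would run it with 50 iterations instead.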
