Does just changing the CSV header of the Apache Mahout input create a different model?

I am trying to run the Mahout classifier example (donut.csv), but I found that simply renaming a column in the header row, and changing the corresponding predictor name in the classifier command, leads to a different model. This makes no sense.

First, get donut.csv:

mahout cat donut.csv |tail -40 > donut0.csv

(the tail is needed because mahout cat prints some initial info lines)

Then train on donut0.csv with the following command (as suggested in the book "Mahout in Action"):

mahout trainlogistic --input donut0.csv \
--output ./model \
--target color --categories 2 \
--predictors x y a b c  --types numeric \
--features 20 --passes 100 --rate 50

This gives the following output:

color ~ 7.068*Intercept Term + 0.581*a + -1.369*b + -25.059*c + 0.581*x + 2.319*y
      Intercept Term 7.06759
                   a 0.58123
                   b -1.36893
                   c -25.05945
                   x 0.58123
                   y 2.31879
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -1.368933989     0.000000000     0.000000000     0.000000000     0.000000000     0.581234210     0.000000000     0.000000000     7.067587159     0.000000000     0.000000000     0.000000000     2.318786209     0.000000000   -25.059452292 
12/04/27 09:29:21 INFO driver.MahoutDriver: Program took 789 ms (Minutes: 0.01315)

But if you just rename the "x" column in the header to "xa", and change the corresponding predictor name in the command, the output model changes completely.

$ head -3 donut4.csv 
xa,y,shape,color,k,k0,xx,xy,yy,a,b,c,bias
0.923307513352484,0.0135197141207755,21,20,4,8,0.852496764213146,0.0124828536260896,0.000182782669907495,0.923406490600458,0.0778750292332978,0.644866125183976,1
0.711011884035543,0.909141522599384,22,20,3,9,0.505537899239772,0.64641042683833,0.826538308114327,1.15415605849213,0.953966686673604,0.46035073663368,1



mahout trainlogistic --input donut4.csv \
--output ./model \
--target color --categories 2 \
--predictors xa y a b c  --types numeric \
--features 20 --passes 100 --rate 50



color ~ 6.380*Intercept Term + -1.913*a + -0.577*b + -23.236*c + 2.647*xa + 3.009*y
      Intercept Term 6.38017
                   a -1.91308
                   b -0.57676
                   c -23.23552
                  xa 2.64657
                   y 3.00925
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -0.576759549     0.000000000     0.000000000     2.646572912     0.000000000    -1.913075634     0.000000000     0.000000000     6.380173126     0.000000000     0.000000000     0.000000000     3.009245162     0.000000000   -23.235521029 
12/04/27 10:21:10 INFO driver.MahoutDriver: Program took 728 ms (Minutes: 0.012133333333333333)






Look at the 20 Newsgroups example, org.apache.mahout.classifier.sgd.TrainNewsGroups, which builds each feature vector like this:

Vector v = helper.encodeFeatureVector(file, actual, leakType, overallCounts);

NewsgroupHelper declares its encoder as follows:

    private final FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");

In your case the hashed feature vector has 20 slots (--features 20), while there are 5 predictors (--predictors xa y a b c).

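In NewsgroupHelper, each word is added to the vector with encoder.addToVector(word, Math.log1p(words.count(word)), v);. In your case the "words" are the 5 predictor names, so when one of them becomes "xa", it is hashed differently.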

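In org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder, the slot is chosen by int n = hashForProbe(originalForm, data.size(), name, i);, where originalForm is the word, data.size() is the size of the vector, name is the name of the encoder, and i is the probe index.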
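To make the hashing step concrete, here is a minimal, self-contained sketch of hashed feature encoding. It does not use Mahout: the `slot` function is a hypothetical stand-in for `hashForProbe` (Mahout's real hash function is different), and the encoder name "body" is borrowed from the snippet above purely for illustration. The point is only that the slot is a function of the name, so a renamed predictor can land in a different slot, and two names that hash to the same slot have their values summed into it.

```java
import java.nio.charset.StandardCharsets;

class HashedSlotSketch {
    // Hypothetical stand-in for the encoder's hash; NOT Mahout's actual function.
    // Maps (encoder name, word) to a slot in [0, numFeatures).
    static int slot(String encoderName, String word, int numFeatures) {
        int h = 17;
        for (byte b : (encoderName + ":" + word).getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return Math.floorMod(h, numFeatures);
    }

    // Adds each value into the slot chosen by hashing its name.
    // Names that collide end up sharing (summing into) one slot.
    static double[] encode(String[] names, double[] values, int numFeatures) {
        double[] v = new double[numFeatures];
        for (int i = 0; i < names.length; i++) {
            v[slot("body", names[i], numFeatures)] += values[i];
        }
        return v;
    }

    public static void main(String[] args) {
        int numFeatures = 20; // same size as --features 20
        for (String name : new String[] {"x", "xa", "y", "a", "b", "c"}) {
            System.out.println(name + " -> slot " + slot("body", name, numFeatures));
        }
    }
}
```

With only 20 slots, collisions between a handful of names are quite possible, and which names collide depends entirely on how they are spelled.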

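TL;DR: changing "x" to "xa" changes the hash of the predictor name, which changes its slot in the feature vector, which changes the model.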
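The two weight vectors printed in the question appear consistent with this. The sketch below copies those 20 values and counts the non-zero slots. This reading is an inference from the printed output, not something Mahout reports. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class NonZeroWeights {
    // Weight vector printed for the model trained with predictor "x".
    static final double[] MODEL_X = {
        0, 0, 0, 0, 0, -1.368933989, 0, 0, 0, 0,
        0.581234210, 0, 0, 7.067587159, 0, 0, 0, 2.318786209, 0, -25.059452292
    };
    // Weight vector printed for the model trained with predictor "xa".
    static final double[] MODEL_XA = {
        0, 0, 0, 0, 0, -0.576759549, 0, 0, 2.646572912, 0,
        -1.913075634, 0, 0, 6.380173126, 0, 0, 0, 3.009245162, 0, -23.235521029
    };

    // Number of slots that carry a non-zero weight.
    static int countNonZero(double[] w) {
        int n = 0;
        for (double d : w) {
            if (d != 0.0) n++;
        }
        return n;
    }

    // Slots that are zero in 'before' but non-zero in 'after'.
    static List<Integer> newSlots(double[] before, double[] after) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < after.length; i++) {
            if (before[i] == 0.0 && after[i] != 0.0) out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(countNonZero(MODEL_X));       // prints 5
        System.out.println(countNonZero(MODEL_XA));      // prints 6
        System.out.println(newSlots(MODEL_X, MODEL_XA)); // prints [8]
    }
}
```

Each model has six terms (five predictors plus the intercept), yet the first vector has only five non-zero slots, and in that model "a" and "x" are reported with the same coefficient, 0.58123. That is what you would expect if "a" and "x" hashed to the same slot. After the rename, "xa" gets a slot of its own (index 8), so the optimizer is fitting a different set of free parameters.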
