Why is my scikit-learn HashingVectorizer giving me floats with binary=True set?

I am trying to use scikit-learn's Bernoulli Naive Bayes classifier. I had the classifier working fine on a small dataset using CountVectorizer, but ran into trouble when I tried to use HashingVectorizer to work with a larger dataset. Keeping all other parameters (training documents, test documents, classifier and feature extractor settings) constant and just switching from CountVectorizer to HashingVectorizer, my classifier always spat out the same label for all documents.

I wrote the following script to investigate what differs between the two feature extractors:

from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

cv = CountVectorizer(binary=True, decode_error='ignore')
h = HashingVectorizer(binary=True, decode_error='ignore')

with open('moby_dick.txt') as fp:
    doc = fp.read()

cv_result = cv.fit_transform([doc])
h_result = h.transform([doc])

# Single-argument print() behaves the same under Python 2 and Python 3
print(cv_result)
print(repr(cv_result))
print(h_result)
print(repr(h_result))

(where "moby_dick.txt" is the Project Gutenberg copy of Moby Dick)

Results (condensed):

  (0, 17319)    1
  (0, 17320)    1
  (0, 17321)    1
<1x17322 sparse matrix of type '<type 'numpy.int64'>'
    with 17322 stored elements in Compressed Sparse Column format>

  (0, 1048456)  0.00763203138591
  (0, 1048503)  0.00763203138591
  (0, 1048519)  0.00763203138591
<1x1048576 sparse matrix of type '<type 'numpy.float64'>'
    with 17168 stored elements in Compressed Sparse Row format>

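As you can see, CountVectorizer in binary mode returns an integer 1 for every feature (we expect to see only 1s, since there is just one document); HashingVectorizer, on the other hand, returns floats (all identical within this document). I suspect my problem comes from passing these floats on to BernoulliNB.

Ideally, I would like to get the same binary-format data from HashingVectorizer that I get from CountVectorizer; failing that, I could use the binarize parameter of BernoulliNB if I knew a sensible threshold for this data, but I am not sure what these floats represent (they are clearly not token counts, since they are all the same and less than 1).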

Under its default settings, HashingVectorizer normalizes each feature vector to unit Euclidean length:

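>>> import numpy as np
>>> import scipy.linalg
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"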
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027,  0.        ,  0.        ,  0.        ,  0.57735027,
         0.        , -0.57735027,  0.        ]])
>>> scipy.linalg.norm(np.abs(X.toarray()))
1.0

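Setting binary=True only postpones this normalization until after the features have been binarized, i.e. after all nonzero values have been set to one. You also have to set norm=None to turn normalization off: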

>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5,  0. ,  0. ,  0. ,  0.5,  0.5,  0.5,  0. ]])
>>> scipy.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.]])
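
The 0.5 values in the binary=True output above follow from the normalization arithmetic: four stored entries, each set to 1 and then divided by sqrt(4). The same arithmetic appears to account for the 0.00763203138591 in the question, which is 1/sqrt(17168) for the 17168 stored elements reported there. A minimal sketch to check this, assuming the default norm="l2" (the variable names are mine, not part of scikit-learn):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

text = "foo bar baz quux bla"

# binary=True with the default norm="l2": stored values are set to 1 first,
# then the row is rescaled to unit Euclidean length.
X = HashingVectorizer(n_features=8, binary=True).transform([text])

n_stored = X.data.size                 # number of stored entries in the row
expected = 1.0 / np.sqrt(n_stored)     # value each entry should take
values_match = np.allclose(X.data, expected)
print(n_stored, expected, values_match)

# Same arithmetic for the Moby Dick row in the question (17168 stored elements).
moby_value = 1.0 / np.sqrt(17168)
print(moby_value)
```

So the floats are not counts or weights of individual tokens; they only encode how many distinct hashed features the document has.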

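This is also why it returns float arrays: normalization requires them. The vectorizer could be rigged to return another dtype, but that would require a conversion inside the transform method, and probably another one back to float in the next estimator.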
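
For completeness, here is a rough sketch of feeding unnormalized binary hashed features into BernoulliNB. The toy documents and labels are invented for illustration only, and this just shows the 0/1 features going in; I am not claiming it reproduces or resolves the single-label behaviour described in the question.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels, made up purely for illustration.
train_docs = [
    "the whale surfaced near the ship",
    "the whale dived under the ship",
    "stock prices fell sharply today",
    "stock markets rallied today",
]
train_labels = ["sea", "sea", "finance", "finance"]

# binary=True sets every stored value to 1; norm=None skips the rescaling,
# so the matrix holds only 1.0 entries (still stored as floats).
vectorizer = HashingVectorizer(binary=True, norm=None, decode_error="ignore")
X_train = vectorizer.transform(train_docs)

clf = BernoulliNB()
clf.fit(X_train, train_labels)

# HashingVectorizer is stateless, so the same object transforms new documents.
X_test = vectorizer.transform(["the whale followed the ship", "stock prices rallied"])
predictions = clf.predict(X_test)
print(predictions)
```

If integer output is wanted for inspection, calling astype(int) on the resulting sparse matrix is one option, at the cost of the extra conversion mentioned above.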
