Using sparse matrices / online learning in Naive Bayes (Python, scikit)

I am trying to run Naive Bayes on a data set that has over 6,000,000 records, each with 150k features. I tried to implement the code from the following link: Implementing Bag-of-Words Naive-Bayes classifier in NLTK

The problem (as I understand it) is that when I run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I paired the rows with an OrderedDict as labels):

Traceback (most recent call last):
  File "skitest.py", line 96, in <module>
    classif.train(add_label(matr, labels))
  File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
    for f in fs.iterkeys():
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
    return _cs_matrix.__getattr__(self, attr)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
    raise AttributeError, attr + " not found"
AttributeError: iterkeys not found

My question is: is there a way to avoid using sparse matrices by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use efficiently in this case instead of dok_matrix? Or am I missing something obvious?

Thanks for your time. :)

EDIT, 6th:

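I got past the iterkeys error, so the code at least runs. It is still too slow, though, even on a data set of only 32k records. Here is what I have at the moment: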

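import numpy as np
from collections import OrderedDict  # Python 2.7+; on 2.6 use the ordereddict backport
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.classify.scikitlearn import SklearnClassifier

# Feature matrix: one row per record, one column per feature
matr = dok_matrix((6000000, 150000), dtype=np.float32)
labels = OrderedDict()

#collect the data into the matrix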

pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets-foldsize)] 

classif.train(add_label(matr[:(lentweets-foldsize),0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets-foldsize)]
data = np.array(classif.batch_classify(readrow))

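The problem may be that each row that is taken does not exploit the sparsity of the vector and instead goes through all 150k entries. As a follow-up, does anyone know how to use this Naive Bayes with sparse matrices, or is there another way to optimize the code above?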

Answer:

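Use scikit-learn directly and let the library build the sparse feature matrix for you. Skip the NLTK wrapper, which is not a good fit for data sets of this size. (*)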

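If your documents are stored as text files, you can hand the file names to TfidfVectorizer, which builds a sparse matrix from them: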

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)

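X is now a CSR matrix, a format that Naive Bayes accepts, and you can fit the classifier on it together with a target vector y (one class label per document, for example a Python list or a NumPy array):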

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
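
If the features are already in a dok_matrix, as in the question, converting to CSR before fitting should be enough to skip the NLTK wrapper entirely. Below is a minimal sketch of that idea; the toy counts, the 4 x 6 shape and the labels are made up for illustration and are not the asker's data:

```python
import numpy as np
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 4 documents, 6 features (term counts).
# Documents 0-1 belong to class 0, documents 2-3 to class 1.
counts = dok_matrix((4, 6), dtype=np.float32)
counts[0, 0] = 2
counts[0, 1] = 1
counts[1, 0] = 1
counts[1, 1] = 1
counts[1, 2] = 1
counts[2, 3] = 1
counts[2, 4] = 2
counts[3, 3] = 1
counts[3, 5] = 2
y = np.array([0, 0, 1, 1])

# DOK is convenient for incremental construction; CSR is the
# row-oriented format that is efficient for fitting and predicting.
X = counts.tocsr()

nb = MultinomialNB()
nb.fit(X, y)

# Classify a new sparse row that only uses features 0 and 1.
new_doc = dok_matrix((1, 6), dtype=np.float32)
new_doc[0, 0] = 1
new_doc[0, 1] = 1
pred = nb.predict(new_doc.tocsr())
```

The whole matrix is passed in one call, so no per-row conversion (getrow(...).todok()) is needed.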

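If this turns out to be too large for memory (for example, because of the vocabulary that TfidfVectorizer keeps), look at the HashingVectorizer and the partial_fit API for mini-batch learning. You will need scikit-learn 0.14 for this to work.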
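
A rough sketch of what the mini-batch approach could look like is below. It assumes a recent scikit-learn release, where HashingVectorizer takes alternate_sign (older releases such as 0.14 used a non_negative flag instead). The iter_minibatches generator and its example texts are hypothetical stand-ins for however you stream your own data from disk:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no vocabulary is stored, so memory use stays flat.
# alternate_sign=False keeps all feature values non-negative, which
# MultinomialNB requires.
vect = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)


def iter_minibatches():
    """Hypothetical data source yielding (texts, labels) chunks."""
    yield ["cheap pills buy now", "meeting at noon tomorrow"], [1, 0]
    yield ["buy cheap watches now", "lunch meeting moved to friday"], [1, 0]


# partial_fit needs the full set of class labels up front, because a
# single mini-batch may not contain every class.
classes = np.array([0, 1])

nb = MultinomialNB()
for texts, labels in iter_minibatches():
    X_batch = vect.transform(texts)  # sparse CSR matrix, no fit required
    nb.partial_fit(X_batch, labels, classes=classes)

pred = nb.predict(vect.transform(["buy cheap pills"]))
```

Only one mini-batch is held in memory at a time, so the batch size can be tuned to whatever fits.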

(*) , . , . scikit-learn, , , .
