I am trying to run Naive Bayes on a data set that has over 6,000,000 records, each with 150k features. I tried to implement the code from the following link:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
The problem (as I understand it) is that when I run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I paired the rows with an OrderedDict as labels):
Traceback (most recent call last):
  File "skitest.py", line 96, in <module>
    classif.train(add_label(matr, labels))
  File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
    for f in fs.iterkeys():
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
    return _cs_matrix.__getattr__(self, attr)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
    raise AttributeError, attr + " not found"
AttributeError: iterkeys not found
My question is: is there a way to either avoid using a sparse matrix by training the classifier record by record (online), or is there a sparse matrix format that I could use efficiently instead of dok_matrix in this case? Or am I missing something obvious?
Thanks for any time. :)
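To make the "online" option above concrete: Naive Bayes only needs running counts, so in principle it can be trained one record at a time without ever building the full 6,000,000 x 150,000 matrix. Below is a minimal plain-Python sketch of that idea, not the NLTK or scikit-learn implementation. The class name `OnlineMultinomialNB`, its `partial_fit`/`predict` methods, and the toy `stream` are all hypothetical and exist only for illustration; each record is assumed to be a dict mapping feature index to count. (Recent scikit-learn versions also expose a `partial_fit` method on `MultinomialNB`, which may be the more practical route if it is available in the installed version.)

```python
import math
from collections import defaultdict


class OnlineMultinomialNB(object):
    """Hypothetical multinomial Naive Bayes that learns one sparse record at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts = defaultdict(int)    # label -> number of records seen
        self.feature_counts = defaultdict(lambda: defaultdict(float))  # label -> feature -> total
        self.class_totals = defaultdict(float)  # label -> sum of all feature values
        self.vocabulary = set()                 # every feature index seen so far

    def partial_fit(self, features, label):
        """Update the counts with one record given as {feature_index: value}."""
        self.class_counts[label] += 1
        for index, value in features.items():
            self.feature_counts[label][index] += value
            self.class_totals[label] += value
            self.vocabulary.add(index)

    def predict(self, features):
        """Return the label with the highest log posterior for one record."""
        n_records = sum(self.class_counts.values())
        n_features = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = math.log(float(count) / n_records)          # log prior
            denominator = self.class_totals[label] + self.alpha * n_features
            for index, value in features.items():
                numerator = self.feature_counts[label].get(index, 0.0) + self.alpha
                score += value * math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy stream standing in for the real records.
stream = [
    ({0: 2.0, 3: 1.0}, "pos"),
    ({1: 1.0, 2: 3.0}, "neg"),
    ({0: 1.0, 3: 2.0}, "pos"),
    ({2: 2.0}, "neg"),
]

model = OnlineMultinomialNB()
for features, label in stream:
    model.partial_fit(features, label)   # each update touches a single record

prediction = model.predict({0: 1.0, 3: 1.0})
```

Because only the non-zero entries of each record are touched, memory use grows with the number of distinct (label, feature) pairs seen rather than with records x features. A pure-Python loop like this would likely still be slow at 6,000,000 records, so it is meant to show the shape of the approach rather than a tuned solution.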
EDIT, 6th place:
iterkeys, . , 32 . . :
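import numpy as np
from numpy import float32
from collections import OrderedDict
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.classify.scikitlearn import SklearnClassifier

# lentweets and foldsize are defined elsewhere (not shown)
matr = dok_matrix((6000000, 150000), dtype=float32)
labels = OrderedDict()
pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets-foldsize)]
classif.train(add_label(matr[:(lentweets-foldsize),0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets-foldsize)]
data = np.array(classif.batch_classify(readrow))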
, , 150k. , - , Naive Bayes , ?