I am starting to use scikit-learn to do some NLP. I have already used some classifiers from NLTK, and now I want to try the ones implemented in scikit-learn.
My data is mainly sentences, and I extract features from some words of these sentences to perform a classification task. Most of my features are nominal: part of speech (POS) of the word, word to the left, POS of the word to the left, word to the right, POS of the word to the right, syntactic path of relations from one word to another, etc.
When I ran some experiments with the NLTK classifiers (Decision Tree, Naive Bayes), the feature set was just a dictionary with the corresponding values for the features: the nominal values. For example: [{"postag": "noun", "wleft": "home", "path": "VPNPNP", ...}, ...]. I just had to pass this to the classifiers, and they did their job.
This is part of the code used:
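from nltk.classify import DecisionTreeClassifier  # assumed import: the NLTK decision tree

def train_classifier(self):
    # Method of the surrounding class; self.reader supplies the training instances.
    if self.reader is None:
        raise ValueError("No reader was provided for accessing training instances.")
    argcands = self.get_argcands(self.reader)
    # Build (feature dict, label) pairs; every non-NULL label is collapsed to "ARG".
    training_argcands = []
    for argcand in argcands:
        if argcand["info"]["label"] == "NULL":
            training_argcands.append((self.extract_features(argcand), "NULL"))
        else:
            training_argcands.append((self.extract_features(argcand), "ARG"))
    self.classifier = DecisionTreeClassifier.train(training_argcands)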
Here is an example of one of the feature sets:
[({'phrase': u'np', 'punct_right': 'NULL', 'phrase_left-sibling': 'NULL', 'subcat': 'fcl=np np vp np pu', 'pred_lemma': u'revelar', 'phrase_right-sibling': u'np', 'partial_path': 'vp fcl', 'first_word-postag': 'Bras\xc3\xadlia PROP', 'last_word-postag': 'Bras\xc3\xadlia PROP', 'phrase_parent': u'fcl', 'pred_context_right': u'um', 'pred_form': u'revela', 'punct_left': 'NULL', 'path': 'vp\xc2\xa1fcl!np', 'position': 0, 'pred_context_left_postag': u'ADV', 'voice': 0, 'pred_context_right_postag': u'ART', 'pred_context_left': u'hoje'}, 'NULL')]
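As I mentioned earlier, most features are nominal (they have a string value).
Now I want to try the classifiers in scikit-learn. As far as I understand, this kind of feature set is not accepted by the algorithms implemented in sklearn, since all feature values must be numeric and must be in an array or matrix. So I transformed the "original" feature sets using DictVectorizer. However, when I pass the transformed vectors to the classifiers, I get the following errors (the first from the decision tree, the second from naive Bayes):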
Traceback (most recent call last):
  .....
    self.classifier.fit(train_argcands_feats,new_train_argcands_target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 458, in fit
    X = np.asarray(X, dtype=DTYPE, order='F')
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
Traceback (most recent call last):
  ....
    self.classifier.fit(train_argcands_feats,new_train_argcands_target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 156, in fit
    n_samples, n_features = X.shape
ValueError: need more than 0 values to unpack
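For reference, the DictVectorizer step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from my project: the two feature dicts are made up (hypothetical values shaped like the real ones), and it only shows how string-valued features end up as indicator columns in a sparse matrix.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical feature dicts, shaped like the ones shown earlier.
samples = [
    {"postag": "noun", "wleft": "home", "position": 0},
    {"postag": "verb", "wleft": "the", "position": 1},
]

vectorizer = DictVectorizer()          # sparse=True is the default
X = vectorizer.fit_transform(samples)  # SciPy sparse matrix, one row per dict

# Each string-valued feature becomes one indicator column per "name=value" pair;
# a numeric feature such as "position" keeps a single column with its value.
print(vectorizer.feature_names_)
print(X.toarray())  # densified only because this toy example is tiny
```

With nominal features that take many distinct values (words, paths), the number of columns grows with the number of distinct "name=value" pairs seen during fitting.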
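I get these errors when I just use DictVectorizer(). However, if I use DictVectorizer(sparse=False), I get an error even before the code reaches the training step: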
Traceback (most recent call last):
    train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 123, in fit_transform
    return self.transform(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 212, in transform
    Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
ValueError: array is too big.
Because of this error, it seems clear that a sparse representation has to be used.
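The "array is too big" error is consistent with how a dense matrix scales: it needs one cell per (sample, column) pair, while the sparse matrix stores only the non-zero entries. A small sketch with made-up data (the sample count and the feature values are hypothetical, chosen so that every sample introduces new "wleft" and "path" values):

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: each sample has its own "wleft" and "path" value,
# so the vocabulary grows with the number of samples.
samples = [
    {"wleft": "w%d" % i, "path": "p%d" % i, "postag": "noun"}
    for i in range(1000)
]

X = DictVectorizer(sparse=True).fit_transform(samples)

n_rows, n_cols = X.shape
dense_cells = n_rows * n_cols  # cells a dense array (sparse=False) would allocate
stored_cells = X.nnz           # entries the sparse matrix actually stores

print(n_rows, n_cols, dense_cells, stored_cells)
```

In this toy case the dense array would need about two million cells for only three thousand stored values, and the gap widens as the number of samples and distinct values grows.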
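So, the question is: how can I transform my nominal features so that I can use the classification algorithms provided by scikit-learn?
Thanks in advance for any help.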
UPDATE
I also tried to use the NLTK wrapper for scikit-learn. I only changed the line that creates the classifier:
self.classifier = SklearnClassifier(DecisionTreeClassifier())
Then, when I call the "train" method, I get the following:
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 100, in train
    X = self._convert(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 109, in _convert
    return self._featuresets_to_coo(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 126, in _featuresets_to_coo
    values.append(self._dtype(v))
ValueError: could not convert string to float: np
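So, apparently, the wrapper cannot build the sparse matrix because the features are nominal. I am back to the original problem.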
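The last frame of that traceback casts each feature value with `self._dtype(v)`. Assuming `_dtype` is a float type, as the error message suggests, the failure can be reproduced in isolation. The helper `to_float_or_none` and the feature dict below are hypothetical and exist only for this illustration:

```python
# Hypothetical feature dict with one nominal and one numeric value.
features = {"phrase": "np", "position": 0}


def to_float_or_none(value):
    """Mimic a plain float cast; return None when the value is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


# The numeric value converts; the nominal string "np" cannot be cast.
converted = {name: to_float_or_none(value) for name, value in features.items()}
print(converted)
```

Any value that is a word, a tag or a path will hit the same cast failure, which matches what I see with my feature sets.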