I would like to use the first step of a scikit-learn pipeline to generate a toy data set so that I can evaluate the performance of my analysis. The simplest example I could come up with looks like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn import cluster

class FeatureGenerator(TransformerMixin):

    def __init__(self, num_features=None):
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, **transform_params):
        return np.array(
            range(self.num_features*self.num_features)
        ).reshape(self.num_features,
                  self.num_features)

    def get_params(self, deep=True):
        return {"num_features": self.num_features}

    def set_params(self, **parameters):
        self.num_features = parameters["num_features"]
        return self
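As a side note, the hand-written get_params and set_params methods could probably be inherited instead. The following is only a minimal sketch, assuming scikit-learn's BaseEstimator is available and written in Python 3 syntax; the class name GeneratedFeatures is a hypothetical stand-in and not part of my original code:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GeneratedFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical variant of FeatureGenerator that inherits its parameter handling."""

    def __init__(self, num_features=None):
        # BaseEstimator discovers parameters from the __init__ signature,
        # so the argument is stored under an attribute of the same name.
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; the data is generated in transform().
        return self

    def transform(self, X, **transform_params):
        # Ignore X and build a num_features x num_features toy matrix.
        n = self.num_features
        return np.arange(n * n).reshape(n, n)


gen = GeneratedFeatures(num_features=3)
params = gen.get_params()        # inherited from BaseEstimator
gen.set_params(num_features=4)   # inherited as well; returns the estimator
toy = gen.transform(None)
```

The behaviour should match the class above, with the difference that parameter handling is delegated to the base class.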
In action, the FeatureGenerator transformer would be called like this, for example:
pipeline = Pipeline([
    ('pick_features', FeatureGenerator(100)),
    ('kmeans', cluster.KMeans())
])
pipeline = pipeline.fit(None)
classes = pipeline.predict(None)
print classes
It gets tricky as soon as I try to run a grid search over this pipeline:
parameter_sets = {
    'pick_features__num_features' : [10,20,30],
    'kmeans__n_clusters' : [2,3,4]
}
pipeline = Pipeline([
    ('pick_features', FeatureGenerator()),
    ('kmeans', cluster.KMeans())
])
g_search_estimator = GridSearchCV(pipeline, parameter_sets)
g_search_estimator.fit(None,None)
The grid search takes samples and labels as input and is not as robust as the pipeline, which does not complain about None as an input parameter:
TypeError: Expected sequence or array-like, got <type 'NoneType'>
This makes sense, because the grid search needs to split the data set into different CV folds.
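To illustrate why the split needs a concrete data set, here is a simplified stand-in for k-fold index splitting. It is not scikit-learn's actual implementation, and kfold_indices is a hypothetical helper; it only shows that the number of samples must be known before any fold can be formed, which is where None fails:

```python
def kfold_indices(X, n_splits=3):
    """Yield (train, test) index lists for a simplified, unshuffled k-fold split."""
    n_samples = len(X)  # raises TypeError when X is None
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Seven samples split into three folds.
folds = list(kfold_indices(list("abcdefg"), n_splits=3))

# Passing None fails as soon as the sample count is requested.
try:
    list(kfold_indices(None))
    none_fails = False
except TypeError:
    none_fails = True
```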
Question: is there a way to run GridSearch without passing X and y? Or is there a GridSearch (without cross-validation)? Or some alternative to GridSearchCV?