I am using scikit-learn to cluster text documents, with the CountVectorizer, TfidfTransformer, and MiniBatchKMeans classes. New text documents are added to the system all the time, so I need to use these fitted objects to transform each new text and predict its cluster. My question is: how should I store this on disk? Should I simply serialize (for example, pickle) the vectorizer, transformer, and kmeans objects? Or should I save only the data? If so, how do I load it back into the vectorizer, transformer, and kmeans objects?
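To make the question concrete, here is a minimal sketch of the "save the fitted objects" option as I understand it. The toy corpus, `model_path`, and the `predict_cluster` helper are placeholders I made up for illustration. It assumes `joblib` is available (it is installed as a scikit-learn dependency) and that all three fitted objects are dumped together into one file. I am not sure this is the recommended approach.

```python
import os
import tempfile

import joblib
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
]

# Fit the three steps once on the existing documents.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

counts = vectorizer.fit_transform(docs)
tfidf = transformer.fit_transform(counts)
kmeans.fit(tfidf)

# Persist the fitted objects together (placeholder path in a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), "text_clusterer.joblib")
joblib.dump((vectorizer, transformer, kmeans), model_path)

# Later, e.g. in another process: load the objects and reuse them as-is.
vectorizer2, transformer2, kmeans2 = joblib.load(model_path)


def predict_cluster(texts):
    """Transform new texts with the restored objects and predict clusters."""
    return kmeans2.predict(transformer2.transform(vectorizer2.transform(texts)))


new_docs = ["my cat chased the dog"]
restored_labels = predict_cluster(new_docs)

# The restored objects should agree with the ones still in memory.
original_labels = kmeans.predict(
    transformer.transform(vectorizer.transform(new_docs))
)
```

In this sketch the vocabulary is fixed at fit time, so words in new documents that were not seen during fitting are ignored by `transform`. As far as I know, the file should also be loaded with the same scikit-learn version that wrote it.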
Any help would be greatly appreciated.