Save Tf-Idf data

Question

I want to save the TF-IDF matrix so that I don't have to recompute it every time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?

Some context: I am using k-means clustering to provide document recommendations. Since new documents are added frequently, I would like to store the TF-IDF values for the documents so that I can recompute the clusters.

+3

python scikit-learn machine-learning pickle

pnsilva Jun 19 '12 at 13:50

1 answer

ogrisel · Accepted Answer · 2012-06-20T14:04:58+0000

Pickling (especially with joblib.dump) is fine for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
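As a minimal sketch of that short-term caching, assuming joblib and scikit-learn are installed: the toy documents, the dictionary layout and the file name tfidf_cache.joblib are made up for illustration and are not part of the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Hypothetical cache location; any writable path works.
cache_path = os.path.join(tempfile.mkdtemp(), "tfidf_cache.joblib")

# Persist the fitted vectorizer together with the matrix it produced.
joblib.dump({"vectorizer": vectorizer, "tfidf": tfidf}, cache_path)

# Later, in another session running the same scikit-learn version:
cached = joblib.load(cache_path)
restored_vectorizer = cached["vectorizer"]
restored_tfidf = cached["tfidf"]

# New documents are projected into the same feature space without refitting.
new_rows = restored_vectorizer.transform(["a cat and a dog"])
```

Note that transform reuses the IDF weights learned from the original corpus, so the vectorizer still has to be refit from time to time as the collection grows.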

However, the pickle format depends on the class definitions of the models, which can change from one version of scikit-learn to another.

I would recommend writing your own implementation-independent persistence format if you plan to keep the model for a long time and want to be able to load it in future versions of scikit-learn.
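One possible shape for such an implementation-independent format, sketched under assumptions: the fitted state of the vectorizer is exported as plain JSON using its vocabulary_ and idf_ attributes. The file name tfidf_state.json and the key names are hypothetical, and the constructor parameters (vectorizer.get_params()) would need to be stored too if they differ from the defaults.

```python
import json
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Reduce the fitted state to plain Python types that any future code can read.
state = {
    # term -> column index in the TF-IDF matrix
    "vocabulary": {term: int(col) for term, col in vectorizer.vocabulary_.items()},
    # IDF weight per column, in column order
    "idf": [float(weight) for weight in vectorizer.idf_],
}

# Hypothetical file location.
state_path = os.path.join(tempfile.mkdtemp(), "tfidf_state.json")
with open(state_path, "w", encoding="utf-8") as fh:
    json.dump(state, fh)

# Reading it back needs only the standard library, not scikit-learn.
with open(state_path, encoding="utf-8") as fh:
    loaded = json.load(fh)
```

Because the file holds only a term-to-column mapping and a list of weights, it stays readable even if the TfidfVectorizer class changes in a later release.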

I would also recommend using the HDF5 file format (as used by PyTables, for instance) or another database system that has some support for storing numerical arrays efficiently.

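Since the TF-IDF matrix is large and sparse, you may also want to look at the CSR and COO data structures of scipy.sparse to find an efficient way to store it.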
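To illustrate the scipy.sparse CSR layout, here is a rough sketch that stores a sparse matrix as its three underlying arrays plus its shape, using only NumPy and SciPy. The small hand-written matrix stands in for a real TF-IDF matrix, and the file name tfidf_csr.npz is hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Stand-in for a TF-IDF matrix: 3 documents x 5 terms, mostly zeros.
tfidf = sparse.csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.7, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.86],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]))

# Hypothetical file location.
path = os.path.join(tempfile.mkdtemp(), "tfidf_csr.npz")

# A CSR matrix is fully described by three flat arrays and its shape.
np.savez(
    path,
    data=tfidf.data,        # non-zero values
    indices=tfidf.indices,  # column index of each non-zero value
    indptr=tfidf.indptr,    # where each row starts inside data/indices
    shape=np.array(tfidf.shape),
)

# Rebuild the matrix from the stored arrays.
with np.load(path) as arrays:
    restored = sparse.csr_matrix(
        (arrays["data"], arrays["indices"], arrays["indptr"]),
        shape=tuple(int(n) for n in arrays["shape"]),
    )
```

The same arrays could just as well be written as HDF5 datasets or database columns. Recent SciPy releases also provide scipy.sparse.save_npz and load_npz as a shortcut for this layout.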
