Saving TF-IDF data

I want to save the TF-IDF matrix so that I don't have to recompute it every time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?

Some context: I am using k-means clustering to provide document recommendations. Since new documents are added frequently, I would like to store the TF-IDF values for the documents so that I can recompute the clusters.

1 answer

Pickling (especially with joblib.dump) is useful for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
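As a minimal sketch of the short-term approach, the fitted vectorizer and the matrix can be dumped together with joblib. The toy corpus, the file name and the dictionary keys are assumptions made for illustration only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical data, for illustration only).
docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # scipy.sparse matrix, one row per document

# Persist the fitted vectorizer and the matrix in one file.
path = os.path.join(tempfile.mkdtemp(), "tfidf.joblib")
joblib.dump({"vectorizer": vectorizer, "tfidf": tfidf}, path)

# Later, in an environment with the same scikit-learn version:
saved = joblib.load(path)
new_rows = saved["vectorizer"].transform(["a new cat document"])
```

The reloaded vectorizer reuses the stored vocabulary and IDF weights, so new documents are projected into the same feature space without refitting.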

However, the pickle format depends on the class definitions of the models, which can change from one version of scikit-learn to another.

I would recommend writing your own implementation-independent persistence format if you plan to keep the model for a long time and need to load it in future versions of scikit-learn.
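One possible shape for such a format, sketched here under the assumption that the vectorizer uses its default settings (default tokenizer, smoothed IDF, L2 normalization): store only the vocabulary and the IDF weights as JSON, then rebuild the weighting from plain term counts. The file name and the JSON layout are hypothetical.

```python
import json
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]

vectorizer = TfidfVectorizer()  # default settings assumed throughout
expected = vectorizer.fit_transform(docs)

# Save plain data only: term -> column index, and one IDF weight per column.
state = {
    "vocabulary": {term: int(idx) for term, idx in vectorizer.vocabulary_.items()},
    "idf": vectorizer.idf_.tolist(),
}
path = os.path.join(tempfile.mkdtemp(), "tfidf_state.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(state, f)

# Reload without unpickling any scikit-learn object.
with open(path, encoding="utf-8") as f:
    state = json.load(f)

# Rebuild TF-IDF: raw counts, scaled per column by IDF, then L2-normalized per row.
counter = CountVectorizer(vocabulary=state["vocabulary"])
counts = counter.transform(docs)
idf = np.asarray(state["idf"])
rebuilt = normalize(counts @ sp.diags(idf), norm="l2")
```

If the vectorizer was configured with a custom tokenizer, n-gram range or normalization, those settings would have to be saved alongside the vocabulary and applied again on reload.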

I would also recommend using the HDF5 file format (as used by PyTables, for example) or another database system that has some support for storing numerical arrays efficiently.
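A rough sketch of the HDF5 route, using h5py rather than PyTables (that substitution, the group and dataset names, and the random stand-in matrix are all assumptions): a CSR matrix is fully described by three dense arrays plus its shape, and each of those maps directly onto an HDF5 dataset or attribute.

```python
import os
import tempfile

import h5py
import scipy.sparse as sp

# Stand-in for a TF-IDF matrix; any CSR matrix is handled the same way.
tfidf = sp.random(100, 500, density=0.02, format="csr", random_state=0)

path = os.path.join(tempfile.mkdtemp(), "tfidf.h5")

# Write the three CSR component arrays and the shape.
with h5py.File(path, "w") as f:
    group = f.create_group("tfidf")  # hypothetical group name
    group.create_dataset("data", data=tfidf.data, compression="gzip")
    group.create_dataset("indices", data=tfidf.indices, compression="gzip")
    group.create_dataset("indptr", data=tfidf.indptr, compression="gzip")
    group.attrs["shape"] = tfidf.shape

# Read the components back and reassemble the sparse matrix.
with h5py.File(path, "r") as f:
    group = f["tfidf"]
    shape = tuple(int(n) for n in group.attrs["shape"])
    restored = sp.csr_matrix(
        (group["data"][:], group["indices"][:], group["indptr"][:]),
        shape=shape,
    )
```

Because the file contains only numeric arrays, it does not depend on any scikit-learn class definition and can be read from other languages as well.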

Note also that the TF-IDF matrix is a scipy.sparse matrix, in CSR or COO format.
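Since the matrix is a scipy.sparse object, one simple option is scipy's own sparse file format. The sketch below assumes a toy corpus and a hypothetical file name; it uses scipy.sparse.save_npz and load_npz to write and read the matrix without pickling.

```python
import os
import tempfile

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
tfidf = TfidfVectorizer().fit_transform(docs)

path = os.path.join(tempfile.mkdtemp(), "tfidf.npz")
sp.save_npz(path, tfidf)    # stores the sparse components in a compressed .npz file
restored = sp.load_npz(path)  # comes back in the same sparse format it was saved in
```

This only covers the matrix itself. The vocabulary and IDF weights of the vectorizer have to be stored separately if new documents need to be transformed later.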
