I'm using Gensim to compute the similarity between two documents. For some reason, the tfidf[corpus] line returns an empty list, and I'm not sure why.
from gensim import corpora, models

# WikiDoc and sorted_links are defined elsewhere (not shown).
articles = []
for x in range(0, 25):
    articles.append(str(WikiDoc(sorted_links[0]).jsonify()['text']))

# Lowercase each article and split it on whitespace.
texts = [document.lower().split() for document in articles]
print(texts)

# Build the dictionary, save it, and load it back.
articles_dict = corpora.Dictionary(texts)
articles_dict.save('./articles.dict')
articles_dict = corpora.Dictionary.load('./articles.dict')

# Convert to bag-of-words and round-trip through Matrix Market format.
corpus = [articles_dict.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./articles.mm', corpus)
corpus = corpora.MmCorpus('./articles.mm')

tfidf = models.TfidfModel(corpus)

# Transform a single query document.
one_doc = WikiDoc('SpongeBob')
one_doc_bow = articles_dict.doc2bow(one_doc.jsonify()['text'].lower().split())
print(tfidf[one_doc_bow])
top = tfidf[one_doc_bow]

# Transform the whole corpus.
corpus_tfidf = tfidf[corpus]
When I print the dictionary, I get: Dictionary(2204 unique tokens). When I print the MmCorpus, I get: MmCorpus(25 documents, 2204 features, 55100 non-zero entries). But tfidf[corpus] yields []. Can anyone diagnose my problem? Many thanks!
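A sanity check on the counts above, offered as a rough sketch rather than a confirmed diagnosis: 25 documents * 2204 tokens = 55100 non-zero entries, which suggests every token occurs in every document. The snippet below is plain Python, not Gensim itself. It assumes an IDF of log2(N / df) and a small cutoff below which weights are dropped, both intended to mirror Gensim's defaults. It skips normalization, and the helper names idf and tfidf_vector are hypothetical.

```python
import math


def idf(total_docs, doc_freq):
    """Inverse document frequency as log2(N / df) (assumed default weighting)."""
    return math.log(float(total_docs) / doc_freq, 2)


def tfidf_vector(bow, doc_freqs, total_docs, eps=1e-12):
    """Weight a bag-of-words vector and drop entries whose weight is ~0."""
    weighted = [(term_id, tf * idf(total_docs, doc_freqs[term_id]))
                for term_id, tf in bow]
    return [(term_id, w) for term_id, w in weighted if abs(w) > eps]


total_docs = 25
bow = [(0, 3), (1, 1), (2, 7)]  # (token id, count) pairs, made up for illustration

# Every token appears in all 25 documents, so each IDF is log2(25 / 25) = 0.
identical_result = tfidf_vector(bow, {0: 25, 1: 25, 2: 25}, total_docs)
print(identical_result)  # []

# If token 2 appeared in only 5 of the 25 documents, it would keep a weight.
varied_result = tfidf_vector(bow, {0: 25, 1: 25, 2: 5}, total_docs)
print(varied_result)
```

Under these assumptions, a corpus in which all documents share the same tokens produces an empty TF-IDF vector for any document, while a token that is missing from some documents survives.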