How to predict a continuous value (time) from text documents?

I have about 3,000 text documents, each associated with the length of time for which the document was "interesting." For example, document 1 contains 300 lines of text and held interest for 5.5 days, while another document with 40 lines of text stayed interesting for 6.7 days, and so on.

The challenge now is to predict the duration of interest (which is a continuous value) based on textual content.

I have two ideas to solve the problem:

  • Build a model of similar documents with a tool such as http://radimrehurek.com/gensim/simserver.html . When a new document arrives, find the 10 most similar past documents, compute the average of their durations, and take that value as the forecast for the new document's duration of interest.
  • Bin the documents into duration categories (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the duration category from the text content.

The advantage of idea No. 1 is that I could also calculate the standard deviation of my prediction, whereas with idea No. 2 it is less clear to me how I could compute a comparable measure of uncertainty. I am also unsure which categories to choose in order to get the best results from the classifier.
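To make idea No. 2 concrete, the binning step might look like the sketch below. The bin edges loosely follow the example categories above (widened so that fractional durations such as 5.5 days always fall into a bin), and the names UPPER_EDGES, LABELS and duration_category are made up for illustration.

```python
from bisect import bisect_left

# Upper edges of the bins, in days (assumed; adapted from the example categories).
UPPER_EDGES = [1, 2, 5, 10]
LABELS = ["up to 1 day", "1-2 days", "2-5 days", "5-10 days", "over 10 days"]

def duration_category(days):
    """Map a continuous duration in days to a category label.

    A duration equal to an edge falls into the lower bin,
    e.g. exactly 2 days -> "1-2 days".
    """
    return LABELS[bisect_left(UPPER_EDGES, days)]

# The two durations mentioned in the question.
for days in (5.5, 6.7):
    print(days, "->", duration_category(days))
```

A classifier trained on such labels that can output class probabilities gives a distribution over the categories, which is one possible way to express uncertainty for idea No. 2.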

Pointers to suitable libraries (Java or Python) would be appreciated.

+5

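Approach (1) is called k-nearest neighbors regression, and it is a perfectly valid method. There are many other approaches to regression as well, for example plain linear regression using the documents' terms as features.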

Here is a skeleton script that fits a linear regression model using scikit-learn (*):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor

# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents)         # documents: list of filenames

# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...

# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)

That's it. If you now have new documents whose duration of interest you want to predict, do

Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)

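This is a simple linear regression, so take it as a starting point; interest duration may well not be a linear function of a text's content.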
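For comparison, approach (1) from the question (average the durations of the most similar past documents and report their spread) can be sketched without any library. This is only an illustration under simple assumptions: plain word counts instead of tf-idf weights, cosine similarity as the similarity measure, and made-up toy data; the helper names vectorize, cosine and knn_predict are hypothetical.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev

def vectorize(text):
    """Bag-of-words term counts for one document (lowercased word tokens)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train_texts, train_durations, new_text, k=10):
    """Mean and standard deviation of the durations of the k most similar documents."""
    query = vectorize(new_text)
    scored = [(cosine(query, vectorize(text)), duration)
              for text, duration in zip(train_texts, train_durations)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    neighbours = [duration for _, duration in scored[:k]]
    return mean(neighbours), pstdev(neighbours)

# Toy data, invented for this example: texts and their durations in days.
train_texts = [
    "stock market crash hits banks",
    "banks report record losses after market crash",
    "local team wins football cup",
    "football fans celebrate cup victory",
]
train_durations = [6.0, 7.0, 2.0, 3.0]

prediction, spread = knn_predict(
    train_texts, train_durations, "market crash worries banks", k=2)
print(prediction, spread)
```

In scikit-learn, KNeighborsRegressor covers the averaging step on top of the tf-idf matrix built above; the standard deviation of the neighbours' durations would still have to be computed separately.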

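(*) Just about any decent machine learning toolkit has linear regression models, so the same idea carries over to other libraries.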

+3

One way to turn the text into features is to use n-grams (see http://en.wikipedia.org/wiki/N-gram).

Next, you need a similarity function, sim(doc1, doc2).

Ideally, sim() would mirror the values you are trying to predict, so that:

sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.
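The relation above can be read as a target: a text-based sim() is useful to the extent that it tracks 1.0 - |score(doc1) - score(doc2)|. A rough sketch of how one might measure that follows, assuming scores scaled to [0, 1], cosine similarity over word counts as the candidate sim(), and invented toy data; text_sim, target_sim and pearson are illustrative names, not part of any library.

```python
import math
import re
from collections import Counter
from itertools import combinations

def text_sim(doc1, doc2):
    """Candidate sim(): cosine similarity over lowercased word counts."""
    a = Counter(re.findall(r"[a-z0-9']+", doc1.lower()))
    b = Counter(re.findall(r"[a-z0-9']+", doc2.lower()))
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def target_sim(score1, score2):
    """Ideal similarity from the formula: 1.0 - |score(doc1) - score(doc2)|."""
    return 1.0 - abs(score1 - score2)

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

# Toy data, invented for this example: (text, score scaled to [0, 1]).
docs = [
    ("market crash hits banks", 0.9),
    ("banks fear market crash", 0.8),
    ("team wins football cup", 0.2),
    ("football cup final tonight", 0.1),
]

pairs = list(combinations(docs, 2))
observed = [text_sim(d1[0], d2[0]) for d1, d2 in pairs]
ideal = [target_sim(d1[1], d2[1]) for d1, d2 in pairs]
agreement = pearson(observed, ideal)
print(round(agreement, 3))
```

A value of agreement close to 1 on held-out pairs would suggest the chosen features and similarity measure carry information about the score; a value near 0 would suggest trying other features.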

That is, the similarity between documents should correlate with how close their scores are.

For term weighting, look at tf-idf.

Clustering the documents is another technique worth looking into.

To get started, search for: N-gram, tf-idf, Python.

(IMHO, 3000 - , - ).

+1
