I have about 3,000 text documents, each associated with the length of time for which the document was "interesting." For example, document 1 contains 300 lines of text and stayed interesting for 5.5 days, while another document with 40 lines of text stayed interesting for 6.7 days, and so on.
The challenge now is to predict the duration of interest (which is a continuous value) based on textual content.
I have two ideas to solve the problem:
- Build a similarity model of the documents with a tool such as http://radimrehurek.com/gensim/simserver.html. When a new document arrives, find the 10 most similar past documents, compute the average of their durations, and use that value as the forecast for the new document's duration of interest.
- Put the documents into duration categories (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the duration category from the text content.
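To make idea 1 concrete, here is a minimal sketch of the "average over the most similar documents" step. It is not the gensim simserver API: it assumes a plain bag-of-words cosine similarity as a stand-in for whatever similarity model is actually used, and the names `bag_of_words`, `cosine_similarity`, `predict_duration` and the toy corpus are all made up for illustration.

```python
import math
import re
import statistics
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real document representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_duration(new_text, corpus, k=10):
    """Return (mean, std. dev.) of the durations of the k most similar documents.

    corpus is a list of (text, duration_in_days) pairs.
    """
    query = bag_of_words(new_text)
    # Rank past documents by similarity to the new one, most similar first.
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query, bag_of_words(item[0])),
        reverse=True,
    )
    durations = [duration for _, duration in ranked[:k]]
    mean = statistics.mean(durations)
    spread = statistics.pstdev(durations)  # population std. dev. of the neighbours
    return mean, spread


# Hypothetical toy data, only to show the call.
toy_corpus = [
    ("stock market crash news", 5.5),
    ("market rally stock prices", 6.7),
    ("football match result", 1.0),
    ("football cup final score", 2.0),
]
print(predict_duration("stock market update", toy_corpus, k=2))
```

The standard deviation here only describes how much the neighbours' durations disagree with each other; whether that is a good measure of prediction uncertainty is part of what I am asking.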
The advantage of idea 1 is that I could also compute the standard deviation of my prediction, whereas with idea 2 it is less clear to me how to compute a comparable measure of the prediction's uncertainty. I also don't know which categories to choose in order to get the best results from the classifier.
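For idea 2, assigning the categories could look roughly like the sketch below. The bin edges are just the example ones from my list, not a recommendation. Because the durations are continuous (5.5 days falls between "3-5" and "6-10"), I am assuming half-open intervals here; `BIN_UPPER_BOUNDS`, `BIN_LABELS` and `duration_category` are illustrative names.

```python
import bisect

# Inclusive upper bounds, in days, of the example categories (an assumption).
BIN_UPPER_BOUNDS = [1, 2, 5, 10]
BIN_LABELS = ["up to 1 day", "1-2 days", "2-5 days", "5-10 days", "over 10 days"]


def duration_category(days):
    """Map a continuous duration in days to its category label."""
    # bisect_left returns the index of the first bound that is >= days,
    # so a value equal to a bound stays in the lower category.
    return BIN_LABELS[bisect.bisect_left(BIN_UPPER_BOUNDS, days)]


print(duration_category(5.5))
print(duration_category(6.7))
```

With these edges, both example documents from above (5.5 and 6.7 days) end up in the same category, which illustrates why the choice of categories matters.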
, , , , , ? , ? , , , . , ( Java Python), .