A few weeks ago I asked some questions about text processing. I was a bit confused then, and still am, but now I know what I want to do.
Situation: I have many downloaded pages with HTML content. Some of them may be text from a blog, for example. They are not structured and come from different sites.
What I want to do: I will split the text on whitespace, and I want to classify each word, or group of words, into predefined categories such as name, number, phone, email, URL, date, money, temperature, etc.
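To make the goal concrete, here is a rough sketch of the kind of per-word output I mean. It uses only plain regular expressions from the standard library, not any NLP library. The class name `TokenTypeSketch`, the category labels, and the deliberately loose patterns are all my own assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenTypeSketch {
    // Hypothetical categories with intentionally simple patterns.
    // LinkedHashMap keeps insertion order, so the first matching rule wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("EMAIL", Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+"));
        RULES.put("URL", Pattern.compile("(https?://|www\\.)\\S+"));
        RULES.put("MONEY", Pattern.compile("\\$\\d+([.,]\\d+)*"));
        RULES.put("DATE", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}"));
        RULES.put("NUMBER", Pattern.compile("[+-]?\\d+([.,]\\d+)*"));
    }

    // Returns the label of the first rule that matches the whole word,
    // or "UNKNOWN" if no rule matches.
    public static String classify(String word) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(word).matches()) {
                return rule.getKey();
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        String text = "Mail john@example.com or see http://example.com "
                + "before 12/05/2011 it costs $20 for 3 people";
        // Split on runs of whitespace and label each word separately.
        for (String word : text.trim().split("\\s+")) {
            System.out.println(word + " -> " + classify(word));
        }
    }
}
```

Rules like these may be enough for rigid formats such as emails, URLs, or numbers, but they cannot recognize names, and they look at one word at a time, so a phone number or date spread over several words would be missed. That is the part where I assume a trained classifier is needed.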
What I know: I know the concepts of (or have at least heard about) natural language processing, Named Entity Recognition, POS tagging, Naive Bayes, HMMs, training, and many other things used for classification. But there are several NLP libraries with different classifiers and ways to do this, and I don't know which one to use or what to do.
WHAT I NEED: I need sample code for a classifier, an NLP library, or whatever else, that can classify each word of a text individually rather than the text as a whole. Something like this:
// This is pseudo-code for what I want, not an implementation
classifier.trainFromFile("file-with-train-words.txt");
String[] words = text.split(" ");
for (String word : words) {
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}
Can anybody help me? I am confused by the various APIs, classifiers, and algorithms.