How do I classify a word of a text into categories such as name, number, money, date, etc.?

A few weeks ago I asked some questions about text processing, but I was a bit confused. I still am, but now I know what I want to do.

Situation: I have many downloaded pages with HTML content. Some of them, for example, may be text from a blog. They are not structured and come from different sites.

What I want to do: I will split all the words on spaces, and I want to classify each of them, or a group of them, into predefined categories such as name, number, phone, email, URL, date, money, temperature, etc.

What I know: I know the concepts of, or have at least heard about, Natural Language Processing, Named Entity Recognizer, POS tagging, Naive Bayes, HMM, training a classifier, and so on, but there are several NLP libraries with different classifiers and different ways to do this, and I don't know which to use or what to do.

WHAT I NEED: I need some sample code for a classifier, NLP or otherwise, that can classify each word of the text individually rather than the text as a whole. Something like this:

// This is pseudo-code for what I want, not an implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}

Can anybody help me? I am confused by various APIs, classifiers and algorithms.

0
4 answers

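You could try Apache OpenNLP.

For Portuguese, you can train a name finder model on the Amazonia Corpus. The entity types it covers include person, place, abstract, numeric, and ArtProd, among others.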

  • Download OpenNLP and the Amazonia Corpus, then copy amazonia.ad into the apache-opennlp-1.5.1-incubating folder.

  • Run TokenNameFinderConverter to convert the Amazonia corpus to the OpenNLP format:

    bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
    
  • Train the model (set -encoding to the encoding of corpus.txt):

    bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
    
  • Run it from the command line (enter one sentence at a time, with the tokens separated by spaces):

    $ bin/opennlp TokenNameFinder pt-ner.bin 
    Loading Token Name Finder model ... done (1,112s)
    Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
    Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
    
  • Or use it through the API:

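    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;
    
    // load the trained model; declared outside the try block so it stays in scope
    TokenNameFinderModel model = null;
    InputStream modelIn = new FileInputStream("pt-ner.bin");
    
    try {
      model = new TokenNameFinderModel(modelIn);
    }
    catch (IOException e) {
      e.printStackTrace();
    }
    finally {
      try {
        modelIn.close();
      }
      catch (IOException e) {
        // nothing useful to do if closing fails
      }
    }
    
    // load the name finder
    NameFinderME nameFinder = new NameFinderME(model);
    
    // pass the token array to the name finder
    String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};
    
    // the Span objects show the start and end of each name, also the type
    Span[] nameSpans = nameFinder.find(toks);
    
    // print each span as type [start, end); the end index is exclusive
    for (Span span : nameSpans) {
      System.out.println(span.getType() + " [" + span.getStart() + ", " + span.getEnd() + ")");
    }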
    
  • To evaluate the model, run 10-fold cross-validation (only available in 1.5.2-INCUBATOR, which for now means building from SVN):

    bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
    
  • To improve the results, look at Custom Feature Generation in the OpenNLP documentation.
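
If you want to post-process the tagged text that the command-line tool prints, a small helper can pull the spans back out. This is only a sketch under my own assumptions: the class name TaggedOutputParser and the "type=text" result format are made up for illustration and are not part of OpenNLP. It relies on nothing but the <START:type> ... <END> markup visible in the sample output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class.
public class TaggedOutputParser {
    // Matches "<START:type> some tokens <END>" as printed by the name finder tool.
    private static final Pattern ENTITY =
        Pattern.compile("<START:([^>]+)>\\s+(.*?)\\s+<END>");

    /** Returns one "type=text" string per tagged span, in order of appearance. */
    public static List<String> parse(String taggedLine) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(taggedLine);
        while (m.find()) {
            // group 1 is the entity type, group 2 the tokens inside the span
            entities.add(m.group(1) + "=" + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        // ASCII-only excerpt of the sample output shown earlier
        String line = "moro no <START:place> Brasil <END> . "
            + "<START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> "
            + "e tenho <START:numeric> 50 anos <END> .";
        for (String entity : parse(line)) {
            System.out.println(entity);
        }
    }
}
```

Inside a Java program you would normally read the Span objects from the API directly; a parser like this is only useful when all you have is the tool's text output.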

+5

You can use a Named Entity Recognizer (NER) for this. Stanford Core NLP provides one through its ner annotator, and there is also the standalone Stanford NER.

For example:

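    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    // creates a StanfordCoreNLP object with NER; ner needs the annotators listed before it
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // annotate the whole text (the `text` variable from the question) so NER sees each word in context
    Annotation document = new Annotation(text);
    pipeline.annotate(document);

    // print every token followed by its entity tag
    for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + " " + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
    }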
+2



Or, to use its famous name, go object-oriented. That is, to detect currency, check for a dollar sign at the beginning or end of the word, then check for attached non-numeric characters, which would indicate an error.
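
A minimal sketch of that currency rule in Java, using a regular expression. The class name CurrencyRule, the method isMoney, and the accepted number format (plain digits, optional comma thousands separators, optional decimals) are my own assumptions for illustration; adjust them to the amounts your pages actually contain.

```java
import java.util.regex.Pattern;

// Hypothetical rule class, written only to illustrate the check described above.
public class CurrencyRule {
    // Assumed number format: digits, optional ",ddd" groups, optional decimals.
    private static final String AMOUNT = "\\d+(,\\d{3})*(\\.\\d+)?";

    // Dollar sign at the beginning OR at the end, with nothing but the amount attached.
    private static final Pattern MONEY =
        Pattern.compile("\\$" + AMOUNT + "|" + AMOUNT + "\\$");

    /** True only if the whole word is a dollar amount; stray characters reject it. */
    public static boolean isMoney(String word) {
        return MONEY.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"$50", "50$", "$1,250.50", "$5o", "50"};
        for (String word : words) {
            System.out.println(word + " -> " + isMoney(word));
        }
    }
}
```

Because matches() has to consume the whole word, anything with attached non-numeric characters, such as "$5o", is rejected, and so is a bare number without a dollar sign.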

You have to write down what you already do in your head. It is not that difficult if you follow the rules. There are three golden rules in robotics/AI:

  • Analyze it.
  • Simplify it.
  • Digitize it.

This way you can talk to computers.
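
To show how the hand-written approach scales to several types, here is a rough sketch of a word-by-word classifier built from an ordered list of rules. Everything in it is an assumption made for illustration: the class name RuleClassifier, the three category names, and the deliberately simple regular expressions, which will not cover every real-world email address, URL, or number.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical rule-based classifier; the rules are simplified on purpose.
public class RuleClassifier {
    // LinkedHashMap keeps insertion order, so rules are tried from top to bottom.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("email", Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+"));
        RULES.put("url", Pattern.compile("https?://\\S+"));
        RULES.put("number", Pattern.compile("[+-]?\\d+(\\.\\d+)?"));
    }

    /** Returns the name of the first rule matching the whole word, or "unknown". */
    public static String classify(String word) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(word).matches()) {
                return rule.getKey();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String text = "Write to joao@example.com or see http://example.com/page for 42 reasons";
        // same word-by-word loop as in the question's pseudo-code
        for (String word : text.split(" ")) {
            System.out.println(word + " -> " + classify(word));
        }
    }
}
```

Rules like these work for types with a rigid surface form. Names and other context-dependent entities are where a trained NER model is the better fit.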

0