Stop words and stemming in Java

I'm thinking about adding stop-word removal to my similarity program and then a stemmer (Porter 1 or 2, depending on which is easiest to implement).

I read my text from files as whole lines and save each one as a single long string. Say I have two example lines:

String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";

Now that I have these lines:

Stemming: Can I just run one of the stemming algorithms directly on the string, save the result as a String, and keep working with it the same way I did before adding stemming to the program, for example by calling one.stem(); or something like that?

Stop words: How does this work? Do I just use one.replaceAll("I", ""); or is there a specific way to do this? I want to keep working with the string and have the cleaned-up string ready before I run the similarity algorithms on it. The wiki doesn't say much.

I hope you can help me! Thank you.

Edit: This is for a school project where I am writing an article on the similarities between different algorithms, so I don't think I am allowed to use Lucene or other libraries that do this work for me. Plus, I would like to understand how this works before I start using libraries like Lucene and co. I hope this doesn't bother you too much. ^^

3 answers

Consider using Lucene for this. With Lucene 3.0, removing stop words and stemming looks something like this:

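import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndStem(String input) throws IOException {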
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}

Usage:

public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}

Output:

decid bui someth from shop
Nevertheless decidedli bought someth from shop

Yes, you can wrap a stemmer so that you can write something like:

String stemmedString = stemmer.stemAndRemoveStopwords(inputString, stopWordList);

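Internally, stemAndRemoveStopwords would:

  • create a StringBuilder to hold the output string
  • go through the words of the input string and, for each word:
    • look it up in stopWordList; if it is there, skip it
    • otherwise, stem it and append it to the StringBuilder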
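As a rough illustration of a method with that shape, here is a minimal sketch in plain Java. The class name StopWordStemmer, the third parameter, and the toy stemmer are assumptions made for this example; the toy stemmer only strips a trailing "ed" and is a placeholder for a real Porter implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

public class StopWordStemmer {

    // Hypothetical helper: drops stop words, stems the rest, joins with single spaces.
    public static String stemAndRemoveStopwords(String input,
                                                Set<String> stopWordList,
                                                UnaryOperator<String> stemmer) {
        StringBuilder output = new StringBuilder(input.length());
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty() || stopWordList.contains(word)) {
                continue; // Skip stop words (and the empty token produced by blank input).
            }
            if (output.length() > 0) {
                output.append(' ');
            }
            output.append(stemmer.apply(word));
        }
        return output.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "I", "the"));
        // Placeholder "stemmer": strips a trailing "ed" only. NOT the Porter algorithm.
        UnaryOperator<String> toyStemmer =
                w -> w.endsWith("ed") && w.length() > 4 ? w.substring(0, w.length() - 2) : w;
        System.out.println(stemAndRemoveStopwords(
                "I decided buy something from the shop.", stopWords, toyStemmer));
        // prints: decid buy something from shop.
    }
}
```

Using a HashSet for the stop words keeps each lookup cheap, and passing the stemmer in as a function means a real Porter stemmer can replace the placeholder later without touching the loop.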

You do not need to process the whole text at once. Just split it into words, apply your stop-word filter and stemming algorithm to each word, then rebuild the string using a StringBuilder:

StringBuilder builder = new StringBuilder(text.length());
String[] words = text.split("\\s+");
for (String word : words) {
    if (stopwordFilter.check(word)) { // Apply stopword filter (assumed: true means keep the word).
        word = stemmer.stem(word); // Apply stemming algorithm.
        if (builder.length() > 0) {
            builder.append(' '); // Separate words with a single space.
        }
        builder.append(word);
    }
}
text = builder.toString();
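One thing to watch with a plain whitespace split is that punctuation stays attached to the token, so "shop." and "shop" count as different words, and a case-sensitive stop list will miss "The". It also shows why one.replaceAll("I", "") from the question is risky: it removes every capital I, including those inside other words. Below is a small sketch of a normalizing step to run before the stop-word check. The names TokenNormalizer, normalize and removeStopWords are made up for this example, and lower-casing everything is an assumption that may not suit every similarity measure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenNormalizer {

    // Lower-cases a token and strips leading/trailing characters that are not letters or digits.
    public static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    // Returns the normalized tokens of text that are not stop words.
    // The stopWords set is expected to contain lower-case entries.
    public static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = normalize(raw);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "i", "the"));
        System.out.println(removeStopWords(
                "Nevertheless I decidedly bought something from a shop.", stopWords));
        // prints: [nevertheless, decidedly, bought, something, from, shop]

        // Substring replacement also hits letters inside words:
        System.out.println("I like Iceland".replaceAll("I", ""));
        // prints: " like celand" (without the quotes)
    }
}
```

Working on whole tokens rather than substrings is what keeps "Iceland" intact, and each surviving token can then be handed to the stemmer as in the loop above.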