Java PatternSyntaxException: Unmatched closing ')'

I need to remove all URLs found in Twitter posts. I have a file with 200,000 such messages, so speed is critical! I am using Java; here is an example of my code:

public String performStrip(){

    String tweet = this.getRawTweet();
    String urlPattern = "((https?|http)://(bit\\.ly|t\\.co|lnkd\\.in|tcrn\\.ch)\\S*)\\b";

    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(tweet);

    int i = 0;

    while (m.find()) {
        tweet = tweet.replaceAll(m.group(i),"").trim();
        i++;
    }

    return tweet;
}

This works great in the following cases:

http://t.co/nhWp9hldEH        -> (empty string)
http://t.co/nhWp9hldEH"       -> "
http://t.co/nhWp9hldEH)aaa"   -> aaa"
aaa(http://t.co/nhWp9hldEH"   -> aaa("
aaa(http://t.co/nhWp9hldEH)"  -> aaa()"

However, when I get to the following case:

http://t.co/nhWp9hldEH)aaa"

I get an error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 21

http://t.co/nhWp9hldEH)aa

at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.compile(Pattern.java:1669)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.replaceAll(String.java:2210)
at com.anturo.preprocess.url.UrlStripper.performStrip(UrlStripper.java:47)
at com.anturo.preprocess.testing.ReadIn.<init>(ReadIn.java:35)
at com.anturo.preprocess.testing.Main.main(Main.java:6)

I have already looked at several similar questions about this error, but so far none of the answers has worked. Hoping someone can help me here.

1 answer

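The problem is that the matched URL may contain special regex characters, as you can see. .replaceAll() treats its first argument as a regex, so the ')' in the matched text is read as an unbalanced closing parenthesis.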

Solution: use Pattern.quote(). Like so:

tweet = tweet.replaceAll(Pattern.quote(m.group(i)),"").trim();
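To make the effect of Pattern.quote() concrete, here is a minimal sketch. The class and method names (QuoteDemo, failsAsRegex, removeQuoted) are made up for illustration and are not part of the question's code; the sample URL is the one from the question.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class QuoteDemo {
    // Returns true if compiling `literal` as a regex throws PatternSyntaxException.
    static boolean failsAsRegex(String literal) {
        try {
            Pattern.compile(literal);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    // Removes every occurrence of `literal` from `text`; Pattern.quote() makes
    // replaceAll() treat the text as plain characters instead of regex syntax.
    static String removeQuoted(String text, String literal) {
        return text.replaceAll(Pattern.quote(literal), "").trim();
    }

    public static void main(String[] args) {
        String matched = "http://t.co/nhWp9hldEH)aaa";
        // The unquoted text is not a valid regex because of the stray ')'.
        System.out.println(failsAsRegex(matched));
        // Quoted, the same text is removed without an exception.
        System.out.println(removeQuoted("see http://t.co/nhWp9hldEH)aaa\"", matched));
    }
}
```

Note that this still compiles a new regex on every call, which may matter when processing 200,000 messages.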

Note: are you using JDK 1.5 or later? If so, there is a simpler option.

Alternative: use .replace():

tweet = tweet.replace(m.group(i), "").trim();

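Despite its name, and just like .replaceAll(), .replace() replaces all occurrences; the only difference is that it takes a string literal, not a regex, as the text to replace. See also .replaceFirst().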
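A small sketch of how the three String methods differ, assuming nothing beyond the standard library. ReplaceDemo and removeLiteral are hypothetical names used only for this example.

```java
class ReplaceDemo {
    // Literal replacement: every occurrence of `target` is removed, no regex involved.
    static String removeLiteral(String text, String target) {
        return text.replace(target, "");
    }

    public static void main(String[] args) {
        String s = "a.b.c";
        System.out.println(s.replace(".", ""));      // literal dot, prints abc
        System.out.println(s.replaceAll(".", ""));   // regex dot matches every character, prints an empty line
        System.out.println(s.replaceFirst(".", "")); // regex dot matches only the first character, prints .b.c

        // The ')' is harmless here because the target is never parsed as a regex.
        String tweet = "http://t.co/nhWp9hldEH)aaa and http://t.co/nhWp9hldEH)aaa";
        System.out.println(removeLiteral(tweet, "http://t.co/nhWp9hldEH)aaa").trim());
    }
}
```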

Final note: you also seem to be misusing .group()! It should be:

while (m.find())
    tweet = tweet.replace(m.group(), "").trim();

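There is no need for the variable i; m.group(i) returns what was captured by capturing group i of the regex, not the i-th match.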
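Putting the pieces together, here is a sketch of the method using the pattern from the question. The class name UrlStripSketch, the static method signatures, and the choice to compile the pattern once in a static field are my assumptions, not the asker's actual class. With the question's pattern, group 0 is the whole match, while groups 2 and 3 hold the scheme and the host, which is why incrementing i on every match removes the wrong text and eventually throws IndexOutOfBoundsException when i exceeds the group count.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlStripSketch {
    // Compiled once and reused for every tweet (an assumption made for speed).
    private static final Pattern URL = Pattern.compile(
            "((https?|http)://(bit\\.ly|t\\.co|lnkd\\.in|tcrn\\.ch)\\S*)\\b",
            Pattern.CASE_INSENSITIVE);

    // Removes every matched URL; replace() treats the match as a literal string.
    static String strip(String tweet) {
        Matcher m = URL.matcher(tweet);
        while (m.find()) {
            tweet = tweet.replace(m.group(), "").trim();
        }
        return tweet;
    }

    // Returns groups 0..groupCount() of the first match, or an empty array if none.
    static String[] groupsOfFirstMatch(String text) {
        Matcher m = URL.matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int g = 0; g <= m.groupCount(); g++) {
            groups[g] = m.group(g);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(strip("aaa(http://t.co/nhWp9hldEH)\"")); // prints aaa()"
        for (String g : groupsOfFirstMatch("http://t.co/nhWp9hldEH")) {
            System.out.println(g);
        }
    }
}
```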
