How to tell the difference between ' as used in a contraction and as a quotation marker

Question

I am trying to parse blocks of text and need a way to tell apart apostrophes used in different contexts: possession and contraction in one group, quotation marks in another.

e.g.

"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]

but

"He said 'hello there'" -> ["He", "said", "'hello there'"]

Checking for whitespace on either side won't help, since things like 'ello and cars' would be parsed as one end of a quote, and so would matching pairs of such apostrophes. I get the feeling there is no way to do this short of an outrageously complicated NLP solution, and that I will just have to ignore any apostrophe that does not occur mid-word, which would be unfortunate.

EDIT:

Since writing this, I have realized that it is impossible. Any regex-based parser would have to parse:

'ello there my mates' dog

in two different ways, and it could only do that with an understanding of the rest of the sentence. I think I will go for the inelegant solution of ignoring the least likely case and hoping that it is rare enough to cause only infrequent anomalies.
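The fallback described in the edit, counting an apostrophe as part of a word only when it sits between two word characters, might be sketched in Ruby along these lines. The method name `naive_words` and the sample strings are assumptions made for illustration; leading and trailing apostrophes are simply dropped, which is exactly the anomaly being accepted.

```ruby
# Hypothetical helper: keep an apostrophe only when it is mid-word.
def naive_words(text)
  # \w+ followed by any number of ('\w+) groups, so "I'm" stays whole,
  # while a leading or trailing apostrophe is left out of the token.
  text.scan(/\w+(?:'\w+)*/)
end

p naive_words("I'm the cars' owner")    # ["I'm", "the", "cars", "owner"]
p naive_words("He said 'hello there'")  # ["He", "said", "hello", "there"]
```

The possessive in cars' loses its apostrophe and the quoted phrase is split into separate words, so this only suits cases where neither distinction matters.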

+3

ruby regex parsing

JoMo, May 09 '12 at 21:38

3

Michael Kohl · Answer 1 · 2012-05-09T21:52:11+0000

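This works for your two examples. With s1 holding the contraction/possessive sentence and s2 the quoted one, a negative lookbehind on "I" rejects the first and captures the quote in the second: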

>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"

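Bear in mind that the lookbehind only excludes an apostrophe that directly follows "I", so this is brittle: other contractions will still be mistaken for an opening quote.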
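The session above does not show how s1 and s2 were defined. A runnable sketch, assuming they hold the two example sentences from the question, could look like this; the constant name `QUOTE` and the third string are additions for illustration.

```ruby
# Assumed inputs: the original definitions of s1 and s2 are not shown.
s1 = "I'm the cars' owner"     # contraction and possessive, no quotation
s2 = "He said 'hello there'"   # one single-quoted phrase

# Same pattern as in the session: an apostrophe not preceded by "I",
# then a run of non-apostrophe characters, then a closing apostrophe.
QUOTE = /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/

p s1 =~ QUOTE   # nil
p s2 =~ QUOTE   # 0
p $1            # "'hello there'"

# Assumed counter-example: a contraction that does not start with "I".
s3 = "don't say 'hi'"
s3 =~ QUOTE
p $1            # "'t say '"
```

The last case shows the limit of the lookbehind: the apostrophe in "don't" is taken as an opening quote because nothing rules it out.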

Finbarr · Answer 2 · 2012-05-10T04:35:24+0000

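(The text of this answer did not survive extraction. It was a short introduction followed by a four-item list; the only remaining fragment is the example word peoples', a plural possessive ending in an apostrophe.)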

Triynko · Answer 3 · 2012-10-17T17:58:29+0000

A simple two-phase process handles this.

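In step 1 of 2, start with this regular expression, which breaks the text into alternating runs of word and non-word characters.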

/(\w+)|(\W+)/gi

Store the matches in a list, like this (shown in AS3 rather than Ruby):

class MatchedWord
{
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;
    function MatchedWord( text:String, charIndex:int, isWord:Boolean )
    {
        this.text = text; this.charIndex = charIndex; this.isWord = isWord;
    }
}
var match:Object;
var matched_word:MatchedWord;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;
words_regex.lastIndex = 0; //this is where to start looking for matches, and is updated to the end of the last match each time exec is called
while ((match = words_regex.exec( original_text )) != null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) ); //match[0] is the entire match and match[1] is the first parenthetical group (if it is null, then the match is not a word and match[2] would be non-null)
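Since the question is tagged Ruby, here is a rough Ruby counterpart of that first step. It is a sketch rather than a translation the answer provides: the `Struct` fields use snake_case, and `split_words` is a name made up for this example.

```ruby
# Ruby stand-in for the MatchedWord class (field names adapted).
MatchedWord = Struct.new(:text, :char_index, :is_word, :is_contraction)

WORDS_REGEX = /(\w+)|(\W+)/

# Split text into alternating word / non-word runs, keeping each offset.
def split_words(original_text)
  matched_words = []
  original_text.scan(WORDS_REGEX) do
    m = Regexp.last_match # MatchData for the current match
    # m[1] is set only when the word alternative matched.
    matched_words << MatchedWord.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  matched_words
end

split_words("don't go").each { |w| p [w.text, w.char_index, w.is_word] }
# ["don", 0, true]
# ["'", 3, false]
# ["t", 4, true]
# [" ", 5, false]
# ["go", 6, true]
```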

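In step 2 of 2, iterate over the list of matches and, for each (trimmed, non-word) match, check whether it ENDS with an apostrophe. If it does, look at the next adjacent (word) match and see whether it is one of only 8 common contraction endings. Of all the two-part contractions I could think of, there are only 8 common endings.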

d
l
ll
m
re
s
t
ve

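Once you have found such a pair of matches, (non-word) = "'" and (word) = "d", include the preceding adjacent (word) match and concatenate the three matches to get the contraction.

With that process in mind, one modification is needed: extend the list of endings to cover contractions that begin with an apostrophe, such as "'twas" and "'tis". For those, do not concatenate the preceding adjacent (word) match, and look at the apostrophe match more closely to see whether it contains other non-word characters before the apostrophe (which is why the check is that it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, merge it with the next match; if it only ENDS with an apostrophe, strip the apostrophe off and merge it with the following match. Likewise, conditions that include the prior match should first check that the (trimmed, non-word) match EQUALS an apostrophe before merging, to avoid accidentally pulling in other non-word characters.

Another modification is to extend the list of 8 endings with endings that are whole words, as in "g'day" and "g'night". Again, it is a simple change involving a conditional check of the preceding (word) match: if it is "g", include it.

That process should capture the majority of contractions, and it is flexible enough to take new ones as you think of them.

The data structure would look like this.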

Condition(Ending, PreCondition)

where PreCondition is one of

"*", "!", or "<exact string>"

The final list of conditions would look like this:

new Condition("d","*"); //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");

Processing those conditions as described should cover all of these 86 contractions (and more):

'tis 'twas ... (the remainder of the list did not survive extraction)

As a side note, don't forget slang contractions that use no apostrophe at all, such as "gotta" > "got to" and "gonna" > "going to".
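Apostrophe-free slang of this kind cannot be found by the apostrophe scan, so a plain lookup table is one way to handle it. The sketch below is an assumption about how that might look in Ruby; `SLANG_CONTRACTIONS` and `expand_slang` are hypothetical names, and only the two pairs mentioned above are included.

```ruby
# Hypothetical lookup table; extend it with whatever pairs you need.
SLANG_CONTRACTIONS = {
  "gotta" => "got to",
  "gonna" => "going to"
}.freeze

# Replace whole-word, case-insensitive occurrences with their expansions.
# The original capitalization is not preserved in this sketch.
def expand_slang(text)
  pattern = /\b(?:#{SLANG_CONTRACTIONS.keys.join('|')})\b/i
  text.gsub(pattern) { |word| SLANG_CONTRACTIONS[word.downcase] }
end

p expand_slang("I gotta go, we're gonna be late")
# "I got to go, we're going to be late"
```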

Here is the final AS3 code. Overall, it takes fewer than 50 lines of code to parse the text into alternating groups of word and non-word characters and to identify and merge contractions. Simple. You could even add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below when a contraction is detected.

//Automatically merge known contractions
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"],
    ["ll","*"],
    ["m","*"],
    ["re","*"],
    ["s","*"],
    ["t","*"],
    ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];
for (var i:int = 0; i < matched_words.length - 1; i++) //not a typo, intentionally stopping at next to last index to avoid a condition check in the loop
{
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
    {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
        for each (var condition:Array in conditions)
        {
            if (StringUtils.trim( m_next.text ) == condition[0])
            {
                var pre_condition:String = condition[1];
                switch (pre_condition)
                {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //the match that followed has shifted into index i, so re-examine it on the next pass
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (apostrophe_text == "'")
                        {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        }
                        else
                        {   //strip apostrophe off end and merge with next item, nothing needs deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
                        if (m_prev != null && m_prev.text == pre_condition)
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //the match that followed has shifted into index i, so re-examine it on the next pass
                        }
                        break;
                }
            }
        }
    }
}
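For readers working in Ruby, the same two steps could be approximated as below. This is a simplified sketch, not a line-for-line port: `Token`, `CONDITIONS`, `tokenize` and `merge_contractions` are names chosen for this example, the apostrophe must sit directly before the next word (no trimming), and after a merge the same index is examined again so that chained contractions such as "I'd've" are joined.

```ruby
Token = Struct.new(:text, :char_index, :is_word, :is_contraction)

# [ending, precondition]: "*" merges with the previous word, "!" means the
# contraction starts at the apostrophe, any other string must equal the
# previous word exactly.
CONDITIONS = [
  %w[d *], %w[l *], %w[ll *], %w[m *], %w[re *], %w[s *], %w[t *], %w[ve *],
  %w[twas !], %w[tis !], %w[day g], %w[night g]
].freeze

def tokenize(text)
  tokens = []
  text.scan(/(\w+)|(\W+)/) do
    m = Regexp.last_match
    tokens << Token.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  tokens
end

def merge_contractions(tokens)
  i = 0
  while i < tokens.length - 1
    cur = tokens[i]
    nxt = tokens[i + 1]
    prev = i > 0 ? tokens[i - 1] : nil
    cond = CONDITIONS.find { |ending, _| ending == nxt.text }
    unless !cur.is_word && cur.text.end_with?("'") && cond
      i += 1
      next
    end
    pre = cond[1]
    if pre == "!"
      # 'tis / 'twas: move the apostrophe onto the following word and
      # keep whatever non-word text came before it.
      nxt.text = "'" + nxt.text
      nxt.char_index -= 1
      nxt.is_contraction = true
      cur.text = cur.text[0...-1]
      tokens.delete_at(i) if cur.text.empty?
      i += 1
    elsif cur.text == "'" && prev && prev.is_word &&
          (pre == "*" || prev.text == pre)
      # Join previous word, apostrophe and ending into one token.
      prev.text += cur.text + nxt.text
      prev.is_contraction = true
      tokens.slice!(i, 2)
      # i is left unchanged: a new token has shifted into this position.
    else
      i += 1
    end
  end
  tokens
end

p merge_contractions(tokenize("I'd've said 'tis g'day")).map(&:text)
# ["I'd've", " ", "said", " ", "'tis", " ", "g'day"]
```

A quoted phrase such as 'hello there' is left alone here only because "hello" is not in the list of endings; a quote that happens to open with one of those endings would still be misread, which is the ambiguity the question describes.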
