How to combine these two regular expressions into one?

I am writing a rudimentary lexer using regular expressions in JavaScript, and I have two regular expressions (one for single-quoted strings and one for double-quoted strings) that I want to combine into one. These are my two regular expressions (I added the characters ^ and $ for testing purposes):

var singleQuotedString = /^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$/gi;
var doubleQuotedString = /^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$/gi;

Now I tried to combine them into one regular expression as follows:

var string = /^(["'])(?:[^\1\\]|\\\1|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*\1$/gi;

However, when I test the input "Hello"World!", it returns true instead of false:

alert(string.test('"Hello"World!"')); //should return false as a double quoted string must escape double quote characters

I realized that the problem is in [^\1\\], which should match any character except \1 (which is either a single or a double quote, i.e. the string delimiter) and \\ (which is a backslash character).
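
The behavior described above can be reproduced in isolation. The following is a minimal sketch, assuming a non-Unicode regex (no u flag); the variable names classWithOne and broken are illustrative only. Inside a character class, \1 is not treated as a backreference but as a legacy octal escape for the character U+0001, so the class never excludes the delimiter.

```javascript
// Inside [...], \1 is read as the character U+0001, not as "whatever group 1 matched".
var classWithOne = /[\1]/;
console.log(classWithOne.test('\u0001')); // true
console.log(classWithOne.test('"'));      // false

// Reduced version of the combined pattern: the negated class still accepts quotes,
// so an unescaped delimiter inside the string slips through.
var broken = /^(["'])(?:[^\1\\]|\\\1)*\1$/;
console.log(broken.test('"Hello"World!"')); // true
```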

The regular expression correctly filters the backslash and matches the delimiters, but it does not filter the delimiter inside the string. Any help would be appreciated. Please note that I referred to Crockford's railroad diagrams when writing these regular expressions.

3 answers

You cannot reference a captured group inside a character class, as in (['"])[^\1\\]. Try something like this:

(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1

(you need to add a few more escapes, but you get my drift ...)

Brief explanation:

(['"])             # match a single or double quote and store it in group 1
(                  # start group 2
  (?!\1|\\).       #   if group 1 or a backslash isn't ahead, match any non-line break char
  |                #   OR
  \\[bnfrt]        #   match an escape sequence
  |                #   OR
  \\u[a-fA-F\d]{4} #   match a Unicode escape
  |                #   OR
  \\\1             #   match an escaped quote
)*                 # close group 2 and repeat it zero or more times
\1                 # match whatever group 1 matched
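
To check the pattern above against the input from the question, it can be anchored with ^ and $ and run on a few sample strings. This is only a sketch: the name quotedString and the sample inputs are illustrative and not part of the answer.

```javascript
// The pattern from the answer, anchored with ^ and $ for whole-string testing.
var quotedString = /^(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1$/;

console.log(quotedString.test('"Hello"World!"'));   // false: unescaped delimiter inside
console.log(quotedString.test('"Hello\\"World!"')); // true: the delimiter is escaped
console.log(quotedString.test("'it\\'s'"));         // true: works for single quotes too
console.log(quotedString.test('"it\'s"'));          // true: the other quote needs no escape
console.log(quotedString.test('"a\\\\b"'));         // false: \\ is not among the listed escapes
```

The last case shows why more escapes still have to be added: an escaped backslash and an escaped slash are accepted by the original two expressions but not by this shortened pattern.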

This should work too (raw regex).
If speed is a factor, this is the "unrolled loop" method, which is said to be the fastest for this kind of thing.

(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1

Expanded

(['"])            # Capture a quote
(?:
   (?!\\|\1).             # As many non-escape and non-quote chars as possible
)*

(?:                       
    \\                     # escape plus,
    (?:
        [\/bfnrt]          # /,b,f,n,r,t or u[0-9A-F]{4} or captured quote
      | u[0-9A-F]{4}
      | \1
    )
    (?:                
        (?!\\|\1).         # As many non-escape and non-quote chars as possible
    )*
)*

\1                # Captured quote
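
As a quick check, the unrolled pattern can be anchored and run against a few inputs. This is a sketch under assumptions: the name unrolled and the sample strings are illustrative, and the closing backreference is written as \1.

```javascript
// Unrolled form: plain characters, then zero or more (escape, plain characters) groups.
var unrolled = /^(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1$/;

console.log(unrolled.test('"Hello"World!"'));   // false: unescaped delimiter inside
console.log(unrolled.test('"Hello\\"World!"')); // true: escaped delimiter
console.log(unrolled.test('"tab\\there"'));     // true: \t is a listed escape
console.log(unrolled.test('"caf\\u00E9"'));     // true: \u followed by four hex digits
console.log(unrolled.test('"bad\\x"'));         // false: \x is not a listed escape
console.log(unrolled.test("'mixed\""));         // false: closing quote must equal the opening one
```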

You can also combine the two expressions with the alternation operator:

/(?:single-quoted-regex)|(?:double-quoted-regex)/

For example:

var string = /(?:^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$)|(?:^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$)/gi;

Finally, if you want to avoid code duplication, you can build this regular expression dynamically with the new RegExp constructor.

var quoted_string = function(delimiter){
    return ('^' + delimiter + '(?:[^' + delimiter + '\\]|\\' + delimiter + '|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*' + delimiter + '$').replace(/\\/g, '\\\\');
    // In the general case, consider using a regex-escaping function to avoid backslash hell.
};

var string = new RegExp( '(?:' + quoted_string("'") + ')|(?:' + quoted_string('"') + ')' , 'gi' );
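
One possible way to reduce the backslash doubling is String.raw, which keeps backslashes exactly as typed. The sketch below is an alternative, not the code from the answer: the names quotedStringSource and anyQuotedString are hypothetical, the helper assumes the delimiter has no special meaning in a regex (true for ' and "), and it requires an engine with template literals (ES2015 or later).

```javascript
// Hypothetical helper: returns the regex source for a string quoted with `delimiter`.
// With String.raw the text reads the same as it would inside a regex literal.
function quotedStringSource(delimiter) {
    var d = delimiter;
    return String.raw`^${d}(?:[^${d}\\]|\\${d}|\\\\|\\\/|\\[bfnrt]|\\u[0-9A-F]{4})*${d}$`;
}

// Join the two sources with alternation, as in the answer above.
var anyQuotedString = new RegExp(
    '(?:' + quotedStringSource("'") + ')|(?:' + quotedStringSource('"') + ')',
    'i'
);

console.log(anyQuotedString.test('"Hello"World!"'));   // false
console.log(anyQuotedString.test('"Hello\\"World!"')); // true
console.log(anyQuotedString.test("'C:\\\\dir'"));      // true: escaped backslash
```

The g flag is left out in this sketch: with g, test() advances lastIndex after a match, so repeated calls on the same regex object can return different results for the same input.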