Regular expression for a specific tag

I am working on a regular expression in a .NET project to get a specific tag. I would like to match the entire DIV tag and its contents:

<html>
   <head><title>Test</title></head>
   <body>
     <p>The first paragraph.</p>
     <div id='super_special'>
        <p>The Store paragraph</p>
     </div>
   </body>
</html>

The code:

    Regex re = new Regex("(<div id='super_special'>.*?</div>)", RegexOptions.Multiline);


    if (re.IsMatch(test))
        Console.WriteLine("it matches");
    else
        Console.WriteLine("no match");

I want to match this:

<div id="super_special">
   <p>Anything could go in here...doesn't matter.  Let get it all</p>
</div>

I thought the . should match all the characters, but it seems to have problems with carriage returns. What is my regex missing?

Thanks.

+2
11 answers

Out of the box, without special modifiers, most regular expression implementations do not let . match past the end of a line. Check the documentation of the regex engine you are using for a modifier that changes this.
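As a minimal sketch of the point above, here is the same situation reproduced in Python rather than .NET (an assumption made only so the snippet is easy to run; the sample string is abbreviated from the question). It suggests that a multiline flag does not help, because that flag only changes ^ and $, while a dot-matches-newline flag does. The .NET counterpart of re.DOTALL is RegexOptions.Singleline.

```python
import re

# Abbreviated version of the HTML from the question.
html = """<body>
  <div id='super_special'>
    <p>The Store paragraph</p>
  </div>
</body>"""

pattern = r"<div id='super_special'>.*?</div>"

# Without a flag, "." stops at every newline, so the multi-line div is not found.
default_match = re.search(pattern, html)

# re.MULTILINE only changes what ^ and $ mean; "." still does not cross newlines.
multiline_match = re.search(pattern, html, re.MULTILINE)

# re.DOTALL (alias re.S) lets "." match newlines as well.
dotall_match = re.search(pattern, html, re.DOTALL)

print(default_match)          # None
print(multiline_match)        # None
print(dotall_match.group(0))  # the whole div, including its line breaks
```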

I have one more tip: beware of greed! Traditionally, regex quantifiers are greedy, which means that a pattern written with .* would match something like this:

<div id="super_special">
  I'm the wanted div!
</div>
<div id="not_special">
  I'm not wanted, but I've been caught too :(
</div>

Look for a "non-greedy" modifier, so that the regex stops matching at the first </div>, not at the last one.

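Also, as others have said, consider using an HTML parser instead of a regex.

Edit: even a non-greedy regex will not work as expected if <div> elements are nested! Another reason to use an HTML parser.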
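The greed warning can be checked with a short experiment. This is only a sketch in Python (not the asker's .NET code), run against the two-div sample from this answer, with re.DOTALL assumed so that . can cross line breaks.

```python
import re

html = """<div id="super_special">
  I'm the wanted div!
</div>
<div id="not_special">
  I'm not wanted, but I've been caught too :(
</div>"""

# Greedy: ".*" runs to the end of the input and backtracks to the LAST </div>.
greedy = re.search(r'<div id="super_special">.*</div>', html, re.DOTALL)

# Non-greedy: ".*?" expands one character at a time and stops at the FIRST </div>.
lazy = re.search(r'<div id="super_special">.*?</div>', html, re.DOTALL)

print("not_special" in greedy.group(0))  # True
print("not_special" in lazy.group(0))    # False
```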

+1

, , : HTML HTML. . , .

HTML - . , , , , Regexp, , .

, , Regexp , . , /m.

: HTML. , - Regexp HTML, ...

+6

You need the modifier that makes . match newlines. For example, in a Perl regex it is s:

m{<div id="super_special">.*?</div>}s

+1

? .NET , , .

+1

In Python the relevant flag is re.S (DOTALL):

import re
re.compile('<div id="super_special">.*?</div>', re.S).sub('', your_html)

, "Single Line" "Multi Line" - .

DO NOT USE REGEXPS TO PARSE HTML. To work with HTML, use a parser such as Beautiful Soup.
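As a rough illustration of the parser route, the sketch below uses Python's standard-library html.parser instead of Beautiful Soup (a substitution made here so that nothing has to be installed). The class name DivExtractor and its attributes are hypothetical names chosen for this example. It keeps a depth counter so that nested <div> elements do not end the capture early; character references are decoded by the parser, so the output is not guaranteed to be byte-identical to the input.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the markup of the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # how many <div> levels deep we are inside the target
        self.parts = []     # pieces of markup collected so far
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Not inside the target yet: only react to the div we are looking for.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


html = """<div id="super_special">
   <div>Nothing</div>
   <p>Anything could go in here</p>
</div>
<div id="not_special">other</div>"""

parser = DivExtractor("super_special")
parser.feed(html)
parser.close()
result = "".join(parser.parts)
print(result)
```

With Beautiful Soup installed, the same lookup would be much shorter; the stdlib version is shown only to make the depth tracking visible.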

+1

By default, . does not match newlines. In .NET you can set the RegexOptions.Singleline option, or specify it inline in the pattern:

(?s)(<div id="super_special">.*?</div>)
+1

- . :

  • Java: Pattern.compile( "pattern", Pattern.MULTILINE);
  • Perl, Ruby: /pattern/m
  • VB: Regex.IsMatch(s, "pattern", RegexOptions.Multiline)

Be careful when using a regexp on XML/HTML, because XML/HTML elements can be nested, for example:

  <div id="super_special">
     <div>Nothing</div>
     <p>Anything could go in here...doesn't matter.  Let get it all</p>
  </div>

... it would match only this:

  <div id="super_special">
     <div>Nothing</div>

, , HTML- , (, ).
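The truncation shown in the nested example can be reproduced with a few lines. This is a sketch in Python (not .NET), using the nested sample from this answer and assuming the dot-matches-newline flag is already set, so the only remaining problem is the nesting.

```python
import re

html = """<div id="super_special">
   <div>Nothing</div>
   <p>Anything could go in here...doesn't matter.  Let get it all</p>
</div>"""

# The non-greedy pattern stops at the first </div>, which closes the INNER div.
match = re.search(r'<div id="super_special">.*?</div>', html, re.DOTALL)
print(match.group(0))
```

The printed text ends at the inner div's closing tag, so the <p> element and the outer closing tag are lost.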

+1

. () , \r \n. ., x ()

0

:. [\ r\n]. [\ r\n]

0

. , , </div> </div> , div, , .

, , HTML, , Microsoft , .NET., . .

0

Regular expressions alone are simply not powerful enough to solve your problem. You need something more powerful, such as a context-free grammar. See the Chomsky hierarchy on Wikipedia.

In other words (as mentioned earlier), do not use regex for parsing HTML.

0