What is the best regex for HTML parsing (although you shouldn't)? Is it perfect?

Question

Well, we all know that trying to parse HTML with regex summons the wrath of Cthulhu. Fair enough. And there are some excellent answers explaining why you shouldn't. I accept them and have repeatedly posted those links on questions myself.

But let me frame this question differently: suppose we have no option other than regex for parsing HTML. Why? It does not matter. Suppose that, for the moment, we developers want to ignore Tony the Pony and take our best shot at doing the impossible. If it eases your mind, treat the question as theoretical. Whatever floats your boat. Just entertain the idea of parsing HTML with regex, even though you shouldn't.

Here we see the claim that it cannot be done, at least not perfectly. But beneath it there is a very insightful comment from @NikiC:

This answer draws the correct conclusion ("It is a bad idea to parse HTML with regex") from incorrect arguments ("because HTML is not a regular language"). What most people mean nowadays when they say "regex" (PCRE) is perfectly capable of parsing not only context-free grammars (which is actually trivial) but also context-sensitive grammars (see <a3>).

True, you can do some incredibly powerful things with a modern regex, even if it ends up quite verbose. But many make this problem sound like the halting problem: you can try, but there will always be another case on which your solution breaks.

, , - 2-.

HTML?
- , ? , , ?
, ?

html regex html-parsing theory

Nick Miceli 21 . '12 20:12

Regexident · Accepted Answer · 2012-08-22T00:38:58+0000

, :

Regex HTML a . : " ".

. , 7 , . .

: , Regex, HTML. ? .

, , . , " " , , . , " why".

" " " ", : . 100% . , - , 100%. .

" " , , " ", regex. 100% , .

, , .

, , , , , , , . : . .

" " " " :

, , 6400+ .

" " , @([^\s]+) (?<=@)[^\s]+ ( ) 100%. , .

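A minimal sketch of how the two patterns quoted above behave in Python's `re` module. The sample string is invented for illustration, and reading the patterns as "extract the token that follows an `@`" is an assumption, not something the answer states here.

```python
import re

# Invented sample text (an assumption); the two patterns are the ones quoted above.
text = "thanks @alice and @bob_42 for the review"

# Variant 1: match the "@" and capture what follows in a group.
with_group = re.findall(r"@([^\s]+)", text)

# Variant 2: require a preceding "@" via lookbehind, so the match itself is the token.
with_lookbehind = re.findall(r"(?<=@)[^\s]+", text)

print(with_group)
print(with_lookbehind)
```

On this input both variants return the same tokens. Whether either one counts as "perfect" depends entirely on which inputs it is expected to handle.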
HTML?

, "": .

, ? , , ?

, " ?" " , !". QED

, ?

, , "none".

(, ,) HTML x|y|z|… x, y, z... ( ) HTML, . ( ), HTML ( , , ), < > (, , , , turing) .

Regex 3 ( ), HTML ( /) - -2 ( -). - . -n type- (nx), . HTML. - - . .

( "S → aSa", "S → aA, A → Sb, S → ε" ). HTML.

"S → aSa" ( ):

<div>
    <div>
        ...
    </div>
</div>
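To make the nesting problem concrete, here is a rough sketch in Python. It assumes well-formed `<div>` and `</div>` tags without attributes, and the sample string and the helper name `outer_div_content` are made up for illustration. A plain non-recursive pattern pairs the first opening tag with the first closing tag, whereas matching the outer element requires keeping track of depth, which is exactly what a finite pattern without recursion cannot do.

```python
import re

# Hypothetical sample: one <div> nested inside another.
html = "<div><div>inner</div>tail</div>"

# A non-recursive lazy pattern stops at the FIRST closing tag it finds.
naive = re.search(r"<div>(.*?)</div>", html, re.S)
naive_content = naive.group(1)


def outer_div_content(text):
    """Return the content of the outermost <div>, counting depth by hand."""
    depth = 0
    start = None
    for tag in re.finditer(r"<(/?)div>", text):
        if tag.group(1) == "":       # opening tag
            if depth == 0:
                start = tag.end()    # content begins after the outer <div>
            depth += 1
        else:                        # closing tag
            depth -= 1
            if depth == 0 and start is not None:
                return text[start:tag.start()]
    return None  # no balanced outer <div> found


print(naive_content)
print(outer_div_content(html))
```

The regex is still used here, but only as a tokenizer. The nesting itself is handled by the counter, which is the part a type-3 (regular) pattern has no way to express.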

, , HTML/XML . , ? HTML . . .

"S → aA, A → Sb, S → ε" ():

() <td> :

<table>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
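As a sketch only: assuming the table sample is meant to illustrate a constraint that spans rows (for example, every row having the same number of cells), the per-row count can be obtained in Python as below. The check for equal counts is an assumption about the intended point, and the approach only works because the rows in this sample are flat and never nest.

```python
import re

table = """\
<table>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
"""

# Grab the content of each row, then count the opening <td> tags inside it.
rows = re.findall(r"<tr>(.*?)</tr>", table, re.S)
cells_per_row = [len(re.findall(r"<td>", row)) for row in rows]

# Assumed constraint: all rows have the same number of cells.
same_width = len(set(cells_per_row)) <= 1

print(cells_per_row, same_width)
```

Note that the comparison across rows happens in ordinary code after the matching. Expressing "row two has as many cells as row one" inside a single pattern is a different and much harder problem.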

, : "X", "X", "Y",.

, . .

, , PCRE !: , - .

... , , :

, Regex

, . . .

( OP) , :

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(
?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]
|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) ... (6400+ chars)

- :

 address     =  mailbox                      ; one addressee
             /  group                        ; named list
 group       =  phrase ":" [#mailbox] ";"
 mailbox     =  addr-spec                    ; simple address
             /  phrase route-addr            ; name & addr-spec
 route-addr  =  "<" [route] addr-spec ">"
 route       =  1#("@" domain) ":"           ; path-relative
 addr-spec   =  local-part "@" domain        ; global address
 local-part  =  word *("." word)             ; uninterpreted
                                             ; case-preserved
 domain      =  sub-domain *("." sub-domain)
 sub-domain  =  domain-ref / domain-literal
 domain-ref  =  atom                         ; symbolic reference
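For comparison, a small sketch of how a few of these rules could be composed from named pieces in Python instead of being written as one monolithic pattern. The `ATOM` character class is copied from the long regex quoted earlier. Everything else is a simplifying assumption: a `word` is treated as a bare atom (no quoted strings), a `sub-domain` as a bare `domain-ref` (no domain literals), and comments and folding whitespace are ignored, so this is not an RFC 822 validator.

```python
import re

# Character class for an "atom", taken from the long regex quoted earlier.
ATOM = r'[^()<>@,;:\\".\[\] \000-\031]+'

# Simplifying assumptions (not part of the grammar excerpt):
#   word       -> atom only (quoted strings omitted)
#   sub-domain -> domain-ref only (domain literals omitted)
WORD = ATOM
LOCAL_PART = rf"{WORD}(?:\.{WORD})*"   # local-part = word *("." word)
DOMAIN = rf"{ATOM}(?:\.{ATOM})*"       # domain = sub-domain *("." sub-domain)

# addr-spec = local-part "@" domain
ADDR_SPEC = re.compile(rf"(?P<local>{LOCAL_PART})@(?P<domain>{DOMAIN})")

ok = ADDR_SPEC.fullmatch("john.doe@example.com")
bad = ADDR_SPEC.fullmatch("john..doe@example.com")

print(ok.group("local"), ok.group("domain"))
print(bad)
```

Each Python name mirrors one grammar rule, which keeps the structure readable even though the compiled result is still a single regular expression.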

, HTML, : " , ".
f * cktons .

. , , - , , , ?
