StackOverflowException workaround

I use HtmlAgilityPack to parse approximately 200,000 HTML documents.

I cannot predict the contents of these documents, and one of them makes my application fail with a StackOverflowException. The document contains this HTML:

<ol>
    <li><li><li><li><li><li>...
</ol>

It has approximately 10,000 <li> elements. Because of the way HtmlAgilityPack parses HTML, this causes a StackOverflowException.

Unfortunately, a StackOverflowException cannot be caught in .NET 2.0 and later.

I did consider setting a larger thread stack size, but that is a hack: it would make my program use a lot more memory (it runs about 50 threads for HTML processing, so every one of those threads would get the larger stack), and it would need manual tuning again the next time a similar document turns up.
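For reference, the larger-stack approach would look roughly like the sketch below. It relies on the Thread constructor that takes a maxStackSize argument. DeepRecursion, Depth and RunWithStack are hypothetical names, Depth is only a stand-in for a recursive parser, and the 16 MB figure is an assumption rather than a tuned value.

```csharp
using System;
using System.Threading;

public static class DeepRecursion
{
    // Hypothetical stand-in for a recursive parser:
    // it consumes one stack frame per nesting level.
    public static int Depth(int remaining)
    {
        return remaining == 0 ? 0 : 1 + Depth(remaining - 1);
    }

    // Runs `work` on a dedicated thread whose maximum stack size is
    // `stackBytes`, waits for it to finish, and returns its result.
    public static int RunWithStack(Func<int> work, int stackBytes)
    {
        int result = 0;
        var worker = new Thread(() => result = work(), stackBytes);
        worker.Start();
        worker.Join();
        return result;
    }
}
```

A call would then be DeepRecursion.RunWithStack(() => DeepRecursion.Depth(50000), 16 * 1024 * 1024). With about 50 worker threads, each one reserves its own enlarged stack, which is the memory cost described above.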

Are there any other workarounds I can use?

+5
2 answers

, HtmlAgilityPack, , . CodePlex, , . , " " , HtmlAgilityPack HTML-, , - HTML- w3wp.exe.

, . , , (, , , --).

<ol><li> . , , 2^21 , 2^22 - 4 "" ... .

+2

, , , , . hap...

http://www.codeplex.com/site/users/view/sjdirect (. 3/8/2012)

. ....

https://code.google.com/p/abot/issues/detail?id=77

... the patch adds HtmlDocument.OptionMaxNestedChildNodes, which can be set to prevent StackOverflowExceptions. Once the limit is exceeded, an ApplicationException is thrown instead.

HAP after the patch:

HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is what was added
string rawContent = GETTHECONTENTHERE; // placeholder: fetch the raw HTML here
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should end up here now
    hapDoc.LoadHtml(""); // fall back to an empty document
    _logger.Error(e);    // _logger is whatever logger your application uses
}
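The idea behind such a limit can be sketched in isolation. This is not HtmlAgilityPack's actual code: Node, DepthGuard, CountNodes and BuildChain are hypothetical names, and the exception type is picked purely for illustration. The point is that the recursion tracks its own depth and throws an ordinary, catchable exception long before the call stack runs out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tree node, standing in for a parsed HTML element.
public sealed class Node
{
    public List<Node> Children { get; } = new List<Node>();
}

public static class DepthGuard
{
    // Counts nodes recursively, but throws a catchable exception
    // as soon as the nesting depth exceeds maxDepth.
    public static int CountNodes(Node node, int maxDepth, int depth = 0)
    {
        if (depth > maxDepth)
            throw new InvalidOperationException(
                "Nesting deeper than " + maxDepth + " levels.");

        int total = 1;
        foreach (Node child in node.Children)
            total += CountNodes(child, maxDepth, depth + 1);
        return total;
    }

    // Builds a chain of `levels` nested nodes iteratively, imitating
    // <li><li><li>... without using any recursion itself.
    public static Node BuildChain(int levels)
    {
        var root = new Node();
        Node current = root;
        for (int i = 1; i < levels; i++)
        {
            var child = new Node();
            current.Children.Add(child);
            current = child;
        }
        return root;
    }
}
```

Calling DepthGuard.CountNodes(DepthGuard.BuildChain(10000), 1000) throws InvalidOperationException, which a normal try/catch can handle, whereas unguarded recursion over a deep enough chain would take the whole process down.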
+5
