Retrieving content between two HTML tags using the Html Agility Pack

We have an absolutely massive help document created in Word, and this was used to create an even more massive and fatal HTM document. Using C # and this library, I only want to capture and display one section of this file at any time in my application. Sections are divided as follows:

<!--logical section starts here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1>
</div>
 <div> Lots of unnecessary markup for simple formatting... </div>
 .....
<!--logical section ends here -->

<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1>
</div>

Logically speaking, it exists H1with the section name in the tag a. I want to select everything from the external containing the div, until I run into another H1and exclude that the div.

  • Each section name is in a tag <a>under H1, which has several children (about 6)
  • The logical section is marked with comments.

:

var startNode = helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(., '"+sectionName+"')]");
//go up one level from the a node to the h1 element
startNode=startNode.ParentNode;

//get the start index as the index of the div containing the h1 element
int startNodeIndex = startNode.ParentNode.ChildNodes.IndexOf(startNode);

//here I am not sure how to get the endNode location. 
var endNode =?;

int endNodeIndex = endNode.ParentNode.ChildNodes.IndexOf(endNode);

//select everything from the start index to the end index
var nodes = startNode.ParentNode.ChildNodes.Where((n, index) => index >= startNodeIndex && index <= endNodeIndex).Select(n => n);

, , node h1. .

+5
2

, , , H1 . , Where , H1. , div, , , .

private List<HtmlNode> GetSection(HtmlDocument helpDocument, string SectionName)
{
    HtmlNode startNode = helpDocument.DocumentNode.Descendants("div").Where(d => d.InnerText.Equals(SectionName, StringComparison.InvariantCultureIgnoreCase)).FirstOrDefault();
    if (startNode == null)
        return null; // section not found

    List<HtmlNode> section = new List<HtmlNode>();
    HtmlNode sibling = startNode.NextSibling;
    while (sibling != null && sibling.Descendants("h1").Count() <= 0)
    {
        section.Add(sibling);
        sibling = sibling.NextSibling;
    }

    return section;
}
+5

, , div h1-Tag? , .

helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(@name, '"+sectionName+"')]/ancestor::div");

SelectNodes Html. :

helpDocument.DocumentNode.SelectNodes("//h1/a[starts-with(@name,'_Toc')]/ancestor::div");

, , , , , , ​​ contains, name, .

0

All Articles