Read value from HTML node

I am new to XML / HTML parsing. I don't even know the right terms to search for duplicates properly.

I have this HTML file that looks like this:

<body id="s1" style="s1">
    <div xml:lang="uk">
        <p begin="00:00:00" end="00:00:29">
          <span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
        </p>

Now I need 00:00:00, 00:00:29, and I'm great!. I could read them as follows:

XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element)
        continue;

    if (reader.LocalName != "p")
        continue;

    var a = reader.GetAttribute(0);
    var b = reader.GetAttribute(1);

    if (reader.LocalName == "span")
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(reader);
        XmlNode elem = doc.DocumentElement.FirstChild;
        var c = elem.InnerText;
    }
}

I get the values in variables a, b, and c. But the HTML format changed slightly. The HTML now looks like this:

<body id="s1" style="s1">
  <div xml:lang="uk">
      <p begin="00:00:00" end="00:00:29">I'm great! </p>

In this scenario, how do I parse 00:00:00, 00:00:29, and I'm great!? I tried this:

XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element)
        continue;

    if (reader.LocalName != "p")
        continue;

    var a = reader.GetAttribute(0);
    var b = reader.GetAttribute(1);

    XmlDocument doc = new XmlDocument();
    doc.Load(reader);
    XmlNode elem = doc.DocumentElement.FirstChild;
    var c = elem.InnerText;
}

But I get this error on the line doc.Load(reader): "This document already has a 'DocumentElement' node." How should I read the values, and what causes the problem? I am using .NET 2.0.

+5
2 answers

First, HTML is not XML. The error "This document already has a 'DocumentElement' node." means that the loader found more than one root node, which HTML tolerates but XML does not.

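Parse the file as HTML. As far as I know, .NET does not ship with an HTML parser, so you need a parser that actually understands HTML rather than an XML reader.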

Edit:

To be clear, HTML and XML are not the same thing. They are related, though; see:

Relation between SGML, HTML and XML

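XML and HTML are distinct languages. XHTML is HTML reformulated as XML, so an XHTML document is also XML and an XML parser can read it. Ordinary HTML, including HTML5 in its usual syntax, is not guaranteed to be well-formed XML.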

In short: to parse HTML, use an HTML parser.
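
If the file does turn out to be well-formed XML, as the sample in the question appears to be, the error can probably be avoided without changing parsers. As far as I understand it, XmlDocument.Load(XmlReader) called on a reader that is already positioned on a node loads that node and its following siblings, so a second <p> becomes a second root element. The sketch below is only an illustration under that assumption: it limits the load to one <p> at a time with ReadSubtree. The names CueReader and ReadCues and the sample string are hypothetical, and the attribute names begin and end are taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class CueReader
{
    // Returns one { begin, end, text } triple per <p> element in the input.
    public static List<string[]> ReadCues(TextReader input)
    {
        List<string[]> cues = new List<string[]>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "p")
                    continue;

                // Read attributes by name rather than by position.
                string begin = reader.GetAttribute("begin");
                string end = reader.GetAttribute("end");

                // ReadSubtree exposes only the current <p> and its descendants,
                // so the loaded document has exactly one root element.
                XmlDocument doc = new XmlDocument();
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    doc.Load(subtree);
                }

                // InnerText concatenates all descendant text, so it covers both
                // <p>text</p> and <p><span>text</span></p>.
                string text = doc.DocumentElement.InnerText.Trim();
                cues.Add(new string[] { begin, end, text });
            }
        }
        return cues;
    }

    static void Main()
    {
        // Hypothetical sample covering both layouts from the question.
        string xml =
            "<body id=\"s1\" style=\"s1\"><div xml:lang=\"uk\">" +
            "<p begin=\"00:00:00\" end=\"00:00:29\"><span fontSize=\"18\">I'm great!</span></p>" +
            "<p begin=\"00:00:30\" end=\"00:00:45\">Second line </p>" +
            "</div></body>";

        foreach (string[] cue in ReadCues(new StringReader(xml)))
            Console.WriteLine("{0} {1} {2}", cue[0], cue[1], cue[2]);
    }
}
```

Because InnerText is taken from the <p> element itself rather than from its first child, the same loop handles the old layout with a <span> and the new layout without one. This only helps when the input is well-formed XML; for real-world HTML the advice above still applies.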

+6

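To parse HTML, you can download the page with WebClient (or use WebBrowser) and then work with it through the HTML DOM. With a reference to the Microsoft HTML Object Library (COM), it looks like this: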

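  // Requires a COM reference to the Microsoft HTML Object Library (mshtml)
  // and: using System; using System.IO; using System.Net; using mshtml;
  string html;
  using (WebClient webClient = new WebClient())
  using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
  using (StreamReader reader = new StreamReader(stream))
  {
    // Download the whole page as text.
    html = reader.ReadToEnd();
  }
  // Feed the markup to the MSHTML parser, then close the document stream.
  IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
  doc.write(html);
  doc.close();
  // Enumerate every element in the parsed HTML DOM.
  foreach (IHTMLElement el in doc.all)
    Console.WriteLine(el.tagName);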

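HTML is not XML: some tags are never closed (for example, <BR>), attribute values need not be quoted, and so on. Tools such as XSLT work on XML, and an HTML DOM is not an XML DOM: an XML node is not an HTML node. Do not use an XML parser for HTML.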

+3
