TEXT_NODE: returns ONLY text?

I use JavaScript to extract all the text from a DOM object. My algorithm walks the object and its descendants, and whenever a node is of type TEXT_NODE it accumulates that node's nodeValue.
For some strange reason, I also get things like:

#hdr-editions a { text-decoration:none; }
#cnn_hdr-editionS { text-align:left;clear:both; }
#cnn_hdr-editionS a { text-decoration:none;font-size:10px;top:7px;line-height:12px;font-weight:bold; }
#hdr-prompt-text b { display:inline-block;margin:0 0 0 20px; }
#hdr-editions li { padding:0 10px; }

How can I avoid this? Do I need to use something else? I want ONLY text.

+3
4 answers

By the looks of things, you are also collecting text from <style> elements. You might want to run a check for them:

var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };

if (element.tagName in ignore)
    continue;

You can add any other elements you want to ignore to the object map.
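
To show where such a check might sit, here is a minimal sketch of a complete loop. It is not the asker's code: collectText is a hypothetical name, a Set stands in for the object map, and the traversal uses an explicit stack. The nodeType values 1 (element) and 3 (text) are written out so the snippet also runs where the global Node object is not available.

```javascript
// Numeric nodeType values from the DOM standard, spelled out so the sketch
// also runs where the global Node object is unavailable.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Tags whose contents should not count as text (same list as the answer above).
const IGNORED_TAGS = new Set(["STYLE", "SCRIPT", "NOSCRIPT", "IFRAME", "OBJECT"]);

// collectText is a hypothetical helper name, not from the original post.
function collectText(root) {
    let text = "";
    const stack = [root];
    while (stack.length > 0) {
        const node = stack.pop();
        if (node.nodeType === TEXT_NODE) {
            text += node.nodeValue;
            continue;
        }
        if (node.nodeType === ELEMENT_NODE && IGNORED_TAGS.has(node.tagName)) {
            continue; // skip the element and everything below it
        }
        // Push children in reverse so they are popped in document order.
        const children = node.childNodes || [];
        for (let i = children.length - 1; i >= 0; i--) {
            stack.push(children[i]);
        }
    }
    return text;
}
```

In a browser this could be called as collectText(document.body). The comparison assumes an HTML document, where tagName is reported in upper case.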

+7

You want to skip style elements.

In your loop, you could do this:

if (element.tagName == 'STYLE') {
   continue;
}

The same goes for script, textarea, and so on.
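
If the loop visits text nodes directly rather than elements, the tag to test is that of an enclosing element. The following is only a sketch under that assumption: isInsideIgnoredTag and textFromNodes are hypothetical names, and nodes stands for whatever flat sequence of DOM nodes your loop already iterates over.

```javascript
// Numeric nodeType values from the DOM standard (element = 1, text = 3).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// The tags named in this answer; extend as needed.
const IGNORED_TAGS = new Set(["STYLE", "SCRIPT", "TEXTAREA"]);

// Hypothetical helper: true if any ancestor of node is an ignored element.
function isInsideIgnoredTag(node) {
    for (let p = node.parentNode; p; p = p.parentNode) {
        if (p.nodeType === ELEMENT_NODE && IGNORED_TAGS.has(p.tagName)) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: accumulate nodeValue of the text nodes worth keeping.
function textFromNodes(nodes) {
    let text = "";
    for (const node of nodes) {
        if (node.nodeType !== TEXT_NODE) continue;  // not text at all
        if (isInsideIgnoredTag(node)) continue;     // text inside style, script, ...
        text += node.nodeValue;
    }
    return text;
}
```

Walking up the ancestors costs a little more per text node than skipping the subtree during traversal, but it needs no change to how the nodes are collected.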

+1

Text nodes in the DOM also include the contents of <script> and <style> elements.

0


As the others have said, you need to skip tags such as STYLE and SCRIPT.

Say you have a DOM walker like this:

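function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    // Only elements have children worth descending into.
    if (domObject.nodeType != Node.ELEMENT_NODE) return;
    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback down
}

var textvalue = "";
// Start at an element: the document node itself would fail the ELEMENT_NODE check.
walker(document.documentElement, function(node) {
    if (node.nodeType == Node.TEXT_NODE)
        textvalue += node.nodeValue;
});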

In this case, if your walker encounters tags you know you do not want, you just need to skip that part of the tree. walker() would therefore need to be adapted like this:

var ignore = { "STYLE":0, "SCRIPT":0, "NOSCRIPT":0, "IFRAME":0, "OBJECT":0 };

function walker(domObject, extractorCallback) {
    if (domObject == null) return; // fail fast
    extractorCallback(domObject);
    if (domObject.nodeType != Node.ELEMENT_NODE) return;

    if (domObject.tagName in ignore) return; // <--- HERE

    var childs = domObject.childNodes;
    for (var i = 0; i < childs.length; i++)
        walker(childs[i], extractorCallback); // pass the callback down
}

This way, when the walker sees a tag you do not want, it skips that tag and all its children, so your extractor is never exposed to text nodes inside such tags.

0
