Parsing HTML into sentences - how to handle tables / lists / headings / etc?

Question

How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences?

Take this Wikipedia page, for example. There is/are:

After messing around with the Python NLTK, I want to test out all of these different corpus-annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include):

Word tokenization. The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
Sentence segmentation. As we saw in chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
Paragraph segmentation. Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
Part of speech. The syntactic category of each word in a document.
Syntactic structure. A tree structure showing the constituent structure of a sentence.
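Shallow semantics. Named entity and coreference annotations, semantic role labels.
Dialogue and discourse. Dialogue act tags, rhetorical structure.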

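Once a document is broken into sentences, it seems pretty straightforward from there. But how do you go about breaking down something like the HTML from that Wikipedia page? I am very familiar with HTML/XML parsers and with traversing the tree, and I have tried simply stripping the HTML tags to get plain text, but because punctuation is missing once the HTML is removed, NLTK does not parse things like table cells, or even lists, correctly.

Is there a best practice or strategy for parsing this kind of content with NLP? Or do you just have to write a parser by hand for each individual page?

Just looking for some pointers in the right direction, I really want to try NLTK out!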

+5

python html text-segmentation nlp nltk

Lance Pollard 30 . '12 20:20

4

alexis · Answer 1 · 2012-07-01T16:33:01+0000

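It sounds like you are stripping all the HTML and generating a flat document, which confuses the parser because the loose pieces end up stuck together. Since you are experienced with XML, I suggest mapping your input to a simple XML structure that keeps the pieces separate. You can make it as simple as you like, but you may want to retain some information. For example, it can be useful to flag titles, section headings, etc. as such. Once you have a workable XML tree that keeps the chunks separate, use XMLCorpusReader to import it into NLTK.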
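A minimal sketch of the XML-mapping idea, using only the standard library. The element names (`document`, `heading`, `paragraph`, `item`, `cell`) and the `BLOCK_TAGS` table are assumptions made for illustration, not something the answer specifies, and nested block elements are not handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical mapping from HTML block tags to element names in the simple XML.
BLOCK_TAGS = {
    'h1': 'heading', 'h2': 'heading', 'h3': 'heading',
    'p': 'paragraph', 'li': 'item', 'td': 'cell', 'th': 'cell',
}


class BlockCollector(HTMLParser):
    """Store the text of each mapped block as its own child of <document>."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element('document')
        self._current = None    # XML element currently receiving text
        self._open_tag = None   # HTML tag that opened it

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._current = ET.SubElement(self.root, BLOCK_TAGS[tag])
            self._current.text = ''
            self._open_tag = tag

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._open_tag:
            # Collapse runs of whitespace, then close the block.
            self._current.text = ' '.join(self._current.text.split())
            self._current = None
            self._open_tag = None

    def handle_data(self, data):
        # Text outside any mapped block is dropped.
        if self._current is not None:
            self._current.text += data


def html_to_simple_xml(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.root


sample = ('<h1>Fruit</h1><ul><li>Apples are <b>red</b></li>'
          '<li>Pears are green</li></ul>')
root = html_to_simple_xml(sample)
print(ET.tostring(root, encoding='unicode'))
```

Each list item and heading ends up in its own element, so the pieces no longer run together. The tree could then be written to a file with `ET.ElementTree(root).write(...)` and loaded through NLTK's XMLCorpusReader, as the answer suggests.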

ezio808 · Answer 2 · 2013-12-06T23:44:28+0000

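I had to write rules specific to the XML documents I was analyzing.

What I did was build a mapping from HTML tags to segments. The mapping was based on studying several documents/pages and working out what each HTML tag represents. Ex. <h1> is a phrase; <li> are paragraphs; <td> are tokens.

If you want to work with XML, you can represent the new mappings as tags. Ex. <h1> to <phrase>; <li> to <paragraph>; <td> to <token>

If you want to work with plain text, you can represent the mappings as a set of characters (for example, [PHRASESTART] [PHRASEEND]), much like POS or EOS labels.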
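The plain-text variant can be sketched as follows, again with the standard library only. The `SEGMENT_FOR_TAG` table is a hypothetical mapping (the `td` entry in particular is an assumption), and the marker names other than [PHRASESTART] and [PHRASEEND] are made up to match the same pattern.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-segment mapping; adjust it after studying your own pages.
SEGMENT_FOR_TAG = {'h1': 'PHRASE', 'li': 'PARAGRAPH', 'td': 'TOKEN'}


class MarkerWriter(HTMLParser):
    """Replace mapped HTML tags with [XSTART] / [XEND] markers in plain text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENT_FOR_TAG:
            self.parts.append('[%sSTART]' % SEGMENT_FOR_TAG[tag])

    def handle_endtag(self, tag):
        if tag in SEGMENT_FOR_TAG:
            self.parts.append('[%sEND]' % SEGMENT_FOR_TAG[tag])

    def handle_data(self, data):
        # Collapse whitespace and skip chunks that are only whitespace.
        text = ' '.join(data.split())
        if text:
            self.parts.append(text)


def mark_segments(html):
    writer = MarkerWriter()
    writer.feed(html)
    writer.close()
    return ' '.join(writer.parts)


marked = mark_segments(
    '<h1>Fruit</h1><ul><li>Apples are red</li></ul>'
    '<table><tr><td>42</td></tr></table>'
)
print(marked)
```

The markers give a downstream sentence splitter explicit boundaries where the stripped HTML would otherwise leave no punctuation.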

amirouche · Answer 3 · 2016-11-10T20:38:56+0000

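You can use a tool like python-goose, which aims to extract articles from HTML pages.

Otherwise, I wrote the following small program, which gives reasonably good results: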

import html5lib


with open('page.html') as f:
    # Build an lxml tree without the XHTML namespace, so tags read 'p', 'div', ...
    doc = html5lib.parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks that have only inlines
    as children.
    """
    # join all the strings and remove \n
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # replace multiple spaces with a single space
    out = ' '.join(out.split())
    return out


def extract_blocks(element):
    """Yield one line of text per block found under element.

    Named extract_blocks so that it does not shadow html5lib's parse().
    """
    # these elements can contain other blocks inside them
    if element.tag in ('div', 'li', 'a', 'body', 'ul'):
        if element.text is None or element.text.isspace():
            # no leading text of its own: descend into the children
            for child in element:
                yield from extract_blocks(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inlines
    elif element.tag in ('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
        yield sanitize(element)
    else:
        # everything else (tables, scripts, comments, ...) is skipped
        print('> ignored', element.tag)


# keep only the blocks longer than 80 characters
for block in extract_blocks(body):
    if len(block) > 80:
        print(block)

Blueoxile · Answer 4 · 2018-02-01T17:37:05+0000

As amirouche replied, python-goose might be a good option.

There is also HTML Sentence Tokenizer, a (new) library that aims to solve this exact problem. Its syntax is very simple. In one line, parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one), you can get the sentences of an HTML page stored in the array parsed_sentences.
