Parsing HTML into sentences - how to handle tables / lists / headings / etc.?

How do you go about parsing an HTML page with free text, lists, tables, headings, etc. into sentences?

Take this page on Wikipedia, for example; it contains all of these elements.

After messing around with Python's NLTK, I want to test out all of these different corpus annotation methods (from http://nltk.googlecode.com/svn/trunk/doc/book/ch11.html#deciding-which-layers-of-annotation-to-include):

  • Word tokenization. The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
  • Sentence segmentation. As we saw in chapter 3, sentence segmentation can be more difficult than it seems. Some corpora therefore use explicit annotations to mark sentence segmentation.
  • Paragraph segmentation. Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
  • Part of speech. The syntactic category of each word in a document.
  • Syntactic structure. A tree structure showing the constituent structure of a sentence.
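  • Shallow semantics. Named entity and coreference annotations, semantic role labels.
  • Dialogue and discourse: dialogue act tags, rhetorical structure.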

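Once you have broken a document into sentences, the rest seems pretty straightforward. But how do you go about breaking down something like the HTML of that Wikipedia page? I am very familiar with HTML and XML parsers and with traversing the tree, and I have tried simply stripping the HTML tags to get plain text, but because the punctuation is missing once the HTML is removed, NLTK does not parse things like table cells, or even lists, correctly.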
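To make the problem concrete, here is a minimal sketch using only the Python standard library. It contrasts naive tag stripping, where headings, list items and paragraphs fuse into one string, with collecting text per block element. The `BLOCK_TAGS` set, the `BlockText` class and `split_blocks` are hypothetical names introduced for this illustration, and the tag set is an assumption rather than a complete list.

```python
import re
from html.parser import HTMLParser

# Tags treated as block boundaries here; this set is an assumption for the
# sketch, not a complete list of HTML block elements.
BLOCK_TAGS = {"p", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockText(HTMLParser):
    """Collect text per block element instead of as one flat string."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished text blocks
        self._parts = []   # text fragments of the block being read

    def _flush(self):
        # Join the fragments, normalize whitespace, keep non-empty blocks.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html):
    parser = BlockText()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = "<h1>Cats</h1><ul><li>Small</li><li>Furry</li></ul><p>They purr.</p>"

# Naive stripping: the pieces run together with no boundary between them.
flat = re.sub(r"<[^>]+>", "", sample)
print(flat)

# Block-aware extraction: each heading, list item and paragraph stays separate.
print(split_blocks(sample))
```

Each block can then be handed to a sentence tokenizer on its own, so a missing full stop at the end of a table cell or list item no longer glues it to the next one.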

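Is there a best practice or strategy for parsing this kind of content with NLP? Or do you just have to write a parser by hand, specific to each individual page?

I am just looking for some pointers in the right direction, and I really want to try out NLTK!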

+5
4

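It sounds like you are stripping all the HTML and generating a flat document, which confuses the parser because the loose pieces end up stuck together. Since you are experienced with XML, I suggest mapping your input to a simple XML structure that keeps the pieces separate. You can make it as simple as you like, but you may want to retain some information; for example, it may be useful to flag titles, section headings, etc. as such. Once you have a workable XML tree that keeps the chunks separate, use XMLCorpusReader to import it into NLTK.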
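As a rough illustration of that suggestion, the sketch below copies the text of selected HTML elements into a flat XML tree using only the standard library. The `KEEP` mapping, the element names (`title`, `para`, `cell`, ...) and the helper `html_to_corpus_xml` are all assumptions made for the example, not something the answer prescribes. It does not handle kept elements nested inside one another (for instance a `<p>` inside an `<li>`).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element names in the simplified
# corpus XML; choose whatever distinctions matter for your analysis.
KEEP = {"h1": "title", "h2": "heading", "p": "para", "li": "item", "td": "cell"}


class ToCorpusXML(HTMLParser):
    """Copy the text of selected HTML elements into a flat XML document."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("doc")
        self._current = None  # XML element currently receiving text

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self._current = ET.SubElement(self.root, KEEP[tag])
            self._current.text = ""

    def handle_endtag(self, tag):
        if tag in KEEP and self._current is not None:
            # Normalize whitespace once the chunk is complete.
            self._current.text = " ".join(self._current.text.split())
            self._current = None

    def handle_data(self, data):
        # Text outside any kept element is dropped.
        if self._current is not None:
            self._current.text += data


def html_to_corpus_xml(html):
    parser = ToCorpusXML()
    parser.feed(html)
    parser.close()
    return parser.root


root = html_to_corpus_xml(
    "<h1>Cats</h1><p>Cats <b>purr</b>.</p>"
    "<table><tr><td>9</td><td>lives</td></tr></table>"
)
print(ET.tostring(root, encoding="unicode"))
```

The tree can be written to disk with `ET.ElementTree(root).write(...)` and the resulting file handed to NLTK's XMLCorpusReader, as the answer proposes.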

+1

I had to write rules specific to the XML documents I was analyzing.

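What I did was build a mapping from HTML tags to segments. The mapping was based on studying several docs/pages and determining what each HTML tag represents. Ex. <h1> is a phrase; <li> is a paragraph; <…> is a token.

If you want to work with XML, you can represent the new mappings as tags. Ex. <h1> to <phrase>; <li> to <paragraph>; <…> to <token>.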

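If you want to work with plain text, you can represent the mappings as sets of marker characters (for example, [PHRASESTART] and [PHRASEEND]), just like POS or EOS labels.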
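A minimal sketch of the plain-text variant follows. The `SEGMENTS` table and the `mark_segments` helper are hypothetical; only the `<li>` to paragraph pairing and the [PHRASESTART]/[PHRASEEND] marker style come from this answer, while the `<td>` entry in particular is a guess. Matching tags with a regular expression is a simplification that is adequate for a demonstration but not for arbitrary HTML.

```python
import re

# Assumed tag-to-segment mapping. The <td> entry is an illustrative guess;
# adjust the table after studying the pages you actually need to process.
SEGMENTS = {"h1": "PHRASE", "li": "PARAGRAPH", "td": "TOKEN"}

# Matches opening and closing tags, capturing the slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def mark_segments(html):
    """Replace mapped tags with [XSTART]/[XEND] markers and drop other tags."""

    def repl(match):
        closing, tag = match.group(1), match.group(2).lower()
        name = SEGMENTS.get(tag)
        if name is None:
            return " "  # unmapped tag: keep a space so words do not fuse
        return f" [{name}{'END' if closing else 'START'}] "

    # Collapse the whitespace introduced by the substitutions.
    return " ".join(TAG_RE.sub(repl, html).split())


print(mark_segments("<h1>Cats</h1><ul><li>Small <b>and</b> furry</li></ul>"))
```

A downstream tokenizer can then treat the bracketed markers as hard boundaries instead of relying on punctuation that the HTML never contained.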

+1

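You can use tools like python-goose, which aims to extract articles from HTML pages.

Otherwise, I wrote the following small program, which gives reasonably good results: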

import html5lib


# html5lib accepts raw bytes and detects the encoding itself
with open('page.html', 'rb') as f:
    doc = html5lib.parse(f.read(), treebuilder='lxml', namespaceHTMLElements=False)

html = doc.getroot()
body = html.xpath('//body')[0]


def sanitize(element):
    """Retrieve all the text contained in an element as a single line of
    text. This must be executed only on blocks that have only inlines
    as children
    """
    # join all the strings and remove \n
    out = ' '.join(element.itertext()).replace('\n', ' ')
    # replace multiple spaces with a single space
    out = ' '.join(out.split())
    return out


# named extract_blocks so it does not shadow html5lib's own parse()
def extract_blocks(element):
    # these elements can contain other blocks inside them
    if element.tag in ['div', 'li', 'a', 'body', 'ul']:
        if element.text is None or element.text.isspace():
            # no direct text: descend into the children
            for child in element:
                yield from extract_blocks(child)
        else:
            yield sanitize(element)
    # these elements are "guaranteed" to contain only inlines
    elif element.tag in ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
        yield sanitize(element)
    else:
        print('> ignored', element.tag)


# keep only blocks longer than 80 characters
for e in filter(lambda x: len(x) > 80, extract_blocks(body)):
    print(e)
0

As alexis replied, python-goose might be a good option.

There is also HTML Sentence Tokenizer, a (new) library that aims to solve this exact problem. Its syntax is very simple: with the single line parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one) you get the sentences of an HTML page stored in the array parsed_sentences.

0