Convert HTML list (<li>) to tabs (e.g. indentation)

Question

I have worked in dozens of languages, but I am new to Python.

This is my first (maybe second) question here, so be gentle ...

I am trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy / GNote notes to Zim) and have gotten stuck on converting lists.

For a 2-level unordered list like this ...

First level
- Second level

Tomboy / GNote uses something like ...

<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>

However, the Zim personal wiki wants it to be ...

* First level
  * Second level

... with leading tabs.

I have studied the functions of the regex module, re.sub(), re.match(), re.search(), etc., and discovered Python's handy ability to write repeated text as ...

 count * "text"

So it seems like there should be a way to do something like ...

 newnote = re.sub("<list>", LEVEL * "\t", oldnote)

Where LEVEL is the ordinal (count) of the <list> tags encountered in the note. It would be 0 for the first <list> encountered, 1 for the second, etc.

LEVEL would then be decremented each time a </list> is encountered.

<list-item> tags would be converted to an asterisk for the bullet (preceded by a newline where appropriate), and </list-item> tags would be dropped.

Finally ... the question ...

How do I get the value of LEVEL and use it as a tab multiplier?

python html regex tabs

DocSalvager Apr 15 '12 at 10:20

2 answers

Use Beautiful Soup to parse the markup and iterate over the tags.

from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')]  for list_tag in soup('list')]

Output : [[u'First level'], [u'Second level']]

Or, written as explicit loops:

for list_tag in soup('list'):
     for item in list_tag('list-item'):
         print item.text

The code above uses BeautifulSoup 3; with BeautifulSoup 4, change the import to:

from bs4 import BeautifulSoup

Rach Apr 15 '12 at 11:30

jadkik94 · Accepted Answer · 2012-04-15T12:19:19+0000

You really should use an XML parser for this, but to answer your question:

import re

def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"

a = a.replace("<list-item>", "* ")

for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)

a = a.replace("</list-item>", "")
a = a.replace("</list>", "")

print(a)
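
The loop above numbers each <list> in order of appearance, so LEVEL never drops back when a </list> closes. A minimal sketch of the increment/decrement idea from the question, written for Python 3 and using a replacement function with re.sub(). The names lists_to_tabs and replace_tag are illustrative, not from the original answer, and the sketch assumes the note contains only well-formed <list> and <list-item> tags:

```python
import re


def lists_to_tabs(note):
    """Convert Tomboy-style <list> markup to tab-indented '* ' bullets."""
    depth = 0  # number of currently open <list> tags

    def replace_tag(match):
        nonlocal depth
        tag = match.group(0)
        if tag == "<list>":
            depth += 1
            return ""
        if tag == "</list>":
            depth -= 1
            return ""
        if tag == "<list-item>":
            # The first level gets zero tabs, the second one tab, and so on.
            return "\n" + "\t" * (depth - 1) + "* "
        return ""  # </list-item> is simply dropped

    # re.sub calls replace_tag once per tag, left to right.
    converted = re.sub(r"</?list(?:-item)?>", replace_tag, note)
    return converted.lstrip("\n")


sample = ("<list><list-item>First level<list><list-item>Second level"
          "</list-item></list></list-item><list-item>First again"
          "</list-item></list>")
print(repr(lists_to_tabs(sample)))
```

Because the depth goes back down at each </list>, "First again" returns to the first level instead of being indented further.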

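That said, an XML parser is the better tool. Here is an example using xml.dom.minidom (it ships with Python, 2.7 at least, so there is nothing to install):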

import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType == xml.dom.minidom.Node.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))

Output:

* First level
    * Second level
    * Second level 2
        * Third level
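
The parseList function above expects the markup on a single line, with no whitespace between tags. If the note is pretty-printed, the indentation shows up as extra text nodes. A hedged variant for Python 3 that skips those nodes is sketched below. The names list_to_wiki and convert are made up for illustration, and the sketch assumes the only elements present are <list> and <list-item>:

```python
import xml.dom.minidom
from xml.dom import Node


def list_to_wiki(list_el, level=0):
    """Return Zim-style bullet lines for a <list> element and its nested lists."""
    lines = []
    for item in list_el.childNodes:
        # Skip whitespace text nodes that sit between <list-item> elements.
        if item.nodeType != Node.ELEMENT_NODE:
            continue
        for sub in item.childNodes:
            if sub.nodeType == Node.TEXT_NODE:
                text = sub.nodeValue.strip()
                if text:  # ignore indentation-only text
                    lines.append("\t" * level + "* " + text)
            elif sub.nodeType == Node.ELEMENT_NODE:
                # A nested <list>: recurse one level deeper.
                lines.extend(list_to_wiki(sub, level + 1))
    return lines


def convert(markup):
    doc = xml.dom.minidom.parseString(markup)
    return "\n".join(list_to_wiki(doc.documentElement))


pretty = """<list>
  <list-item>First level
    <list>
      <list-item>Second level</list-item>
    </list>
  </list-item>
</list>"""
print(repr(convert(pretty)))
```

Returning a list of lines and joining them at the end avoids the leading newline, and repr() makes the tab characters visible as \t.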
