Convert HTML list (<li>) to tabs (i.e. indentation)
I have worked in dozens of languages, but am new to Python.
This is my first (maybe second) question here, so be gentle ...
I am trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy / GNote notes to Zim) and am stuck on converting lists.
For a 2-level unordered list like this ...
- First level
  - Second level
Tomboy / GNote uses something like ...
<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>
However, the Zim personal wiki wants it to be ...
* First level
	* Second level
... with leading tabs.
I have studied the regex module functions re.sub(), re.match(), re.search(), etc. and discovered Python's nice ability to code repeated text as ...
count * "text"
So it seems like there should be a way to do something like ...
newnote = re.sub("<list>", LEVEL * "\t", oldnote)
Where LEVEL is the ordinal of the <list> encountered in the note. Thus it would be 0 for the first <list>, 1 for the second, etc.
LEVEL would then be decremented each time a </list> is encountered.
Each <list-item> would be converted to the asterisk for the bullet (preceded by a newline where appropriate) and each </list-item> dropped.
Finally ... the question ...
- How to get the LEVEL value and use it as a tab multiplier?
You really should use an XML parser for this, but to answer your question:
import re

def next_tag(s, tag):
    # Yield the index of every occurrence of tag in s
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
a = a.replace("<list-item>", "* ")
# Replace the n-th <list> with a newline followed by n tabs
for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)
a = a.replace("</list-item>", "")
a = a.replace("</list>", "")
print(a)
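Note that the loop above only counts opening <list> tags, so LEVEL never goes back down when a </list> closes. One way to get the decrement asked for in the question is to pass a function to re.sub() and keep the depth in a counter. This is only a minimal sketch: the names tomboy_lists_to_zim and replace_tag are made up for illustration, and it assumes the note contains well-formed, properly nested <list> / <list-item> tags.

```python
import re

def tomboy_lists_to_zim(note):
    """Turn <list>/<list-item> markup into tab-indented '* ' bullets."""
    # One-element list so the nested function can update the counter
    # (works on both Python 2 and Python 3).
    depth = [0]

    def replace_tag(match):
        tag = match.group(0)
        if tag == "<list>":
            depth[0] += 1       # entering a (possibly nested) list
            return ""
        if tag == "</list>":
            depth[0] -= 1       # leaving the list again
            return ""
        if tag == "<list-item>":
            # Items of the outermost list are at depth 1, so they get 0 tabs.
            return "\n" + "\t" * (depth[0] - 1) + "* "
        return ""               # </list-item> is simply dropped

    # Matches <list>, </list>, <list-item> and </list-item>
    return re.sub(r"</?list(?:-item)?>", replace_tag, note)

note = ("<list><list-item>First level<list><list-item>Second level"
        "</list-item></list></list-item>"
        "<list-item>Back at first level</list-item></list>")
print(tomboy_lists_to_zim(note))
```

Because the callback sees the tags in document order, the third item comes out with no tab again after the nested list has closed. Malformed or unbalanced markup would throw the counter off, which is one more reason to prefer a real parser.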
As already said, it is better to use an XML parser. Here is a version using xml.dom.minidom (part of the Python standard library (2.7 here)):
import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType == subitem.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))
Output:
* First level
	* Second level
	* Second level 2
		* Third level
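The same recursion can also be sketched with xml.etree.ElementTree, another standard-library parser. This is only an illustrative variant, not a replacement for the code above: the names list_to_zim and convert are hypothetical, and it assumes each <list-item> holds its text before any nested <list> (text placed after a nested list is ignored).

```python
import xml.etree.ElementTree as ET

def list_to_zim(list_el, level=0):
    """Return the bullet lines for one <list> element, recursing into sublists."""
    lines = []
    for item in list_el:
        if item.tag != "list-item":
            continue
        # item.text is the text that comes before the first child element
        lines.append("\t" * level + "* " + (item.text or "").strip())
        for sub in item:
            if sub.tag == "list":
                # A nested list is indented one tab deeper
                lines.extend(list_to_zim(sub, level + 1))
    return lines

def convert(markup):
    return "\n".join(list_to_zim(ET.fromstring(markup)))

a = ("<list><list-item>First level<list><list-item>Second level</list-item>"
     "<list-item>Second level 2<list><list-item>Third level</list-item></list>"
     "</list-item></list></list-item></list>")
print(convert(a))
```

Collecting the lines in a list and joining them at the end avoids the leading newline and keeps the indentation logic in one place.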
BeautifulSoup can also be used to parse this kind of markup.
from BeautifulSoup import BeautifulSoup

tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print([[item.text for item in list_tag('list-item')] for list_tag in soup('list')])
Output : [[u'First level'], [u'Second level']]
Or, written as plain loops:
for list_tag in soup('list'):
    for item in list_tag('list-item'):
        print(item.text)
The code above uses BeautifulSoup 3. With BeautifulSoup4, change the import to:
from bs4 import BeautifulSoup