Safely removing all html code from a string in python

Question


I have read many Q&As on how to remove all HTML code from a string using Python, but none of them was satisfying. I need a way to remove all tags, preserve or convert HTML entities, and work well with UTF-8 strings.

Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings. I built a simple parser with HTMLParser to get only the text, but I was losing the entities:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, name):
        self.data.append(name)

    def handle_entityref(self, ent):
        self.data.append(ent)

This gives me something like:

[u'Asia, sp ', u'cialiste du voyage', ...

The entity for the accented "e" in spécialiste is lost.
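
For comparison, here is a minimal sketch of a text extractor that keeps the entities. It assumes Python 3, where the module is html.parser and the parser can decode character references itself (the code above is Python 2). The names TextExtractor and strip_html are hypothetical, and skipping script and style content is an added assumption, not something the question asks for.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags and script/style bodies."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &eacute; or &#233; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)


print(strip_html("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>"))
```

The result is plain text, not safe HTML: if it is written back into a page it still has to be escaped.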

Regular expressions like the ones given as answers to similar questions will always miss some edge cases that were not considered.
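
To illustrate the kind of edge case meant here, the sketch below runs a typical tag-stripping pattern against an attribute value that contains ">". The pattern and the name naive_strip are only examples chosen for illustration, not taken from any particular answer.

```python
import re

# A commonly suggested pattern: anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def naive_strip(markup):
    return TAG_RE.sub("", markup)


# The ">" inside the attribute value ends the match too early,
# so part of the tag leaks into the output.
print(repr(naive_strip('<a title="a > b">link</a>')))

# Entities are not decoded either.
print(repr(naive_strip("sp&eacute;cialiste")))
```

The first call leaves a fragment of the attribute in the output, and the second returns the entity untouched.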

Is there a better way to do this?


python html security parsing utf-8

Arjuna Del Toso, 09 Apr '13 at 0:37

2 answers

Have you tried pyquery? Install it with easy_install or pip (pip install pyquery); for example:

from pyquery import PyQuery as jQ

dom = jQ("<html>...</html>")
print(dom("body").text())


pinkdawn, 09 Apr '13 at 1:58

Tim Heap · Accepted Answer · 2013-04-09T00:42:10+0000

Use bleach. It is an HTML sanitizing library built for untrusted input, and it can strip all tags from a string.
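
A minimal sketch of what this could look like, assuming bleach is installed (pip install bleach). bleach.clean with an empty tag set and strip=True removes markup rather than escaping it. The helper names strip_tags and to_plain_text are hypothetical, and the html.unescape step is an added assumption for callers who want plain text rather than HTML-safe output.

```python
import html

import bleach


def strip_tags(markup):
    # tags=set() allows no tags at all; strip=True drops disallowed
    # markup instead of escaping it. The result is still HTML-safe text.
    return bleach.clean(markup, tags=set(), strip=True)


def to_plain_text(markup):
    # Decode any entities left in the cleaned output. The result is plain
    # text and must be escaped again before being placed into HTML.
    return html.unescape(strip_tags(markup))


print(to_plain_text("<p>sp&eacute;cialiste du <b>voyage</b></p>"))
```

Use strip_tags when the result goes back into a page, and to_plain_text when it goes somewhere that is not HTML, such as a search index or a plain-text email.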

