I have read a lot of Q&A on how to remove all HTML code from a string using Python, but none of the answers satisfied me. I need a way to remove all tags, keep (or convert) HTML entities, and work well with UTF-8 strings.
Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings, so I built a simple parser with HTMLParser to extract only the text, but I was losing the entities:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, name):
        self.data.append(name)

    def handle_entityref(self, ent):
        self.data.append(ent)
This gives me something like:
[u'Asia, sp ', u'cialiste du voyage', ...
The entity for the accented "é" in spécialiste is lost.
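For illustration, a minimal sketch of the desired behaviour, assuming Python 3 (where the module is html.parser rather than HTMLParser). The names TextExtractor and strip_tags are hypothetical, and the sketch relies on the parser's convert_charrefs option to decode entities before the text reaches handle_data:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content of an HTML string (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True asks the parser to decode references such as
        # &eacute; or &#233; before passing the text to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of `markup` with tags dropped and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>"))
```

Note that this sketch also keeps the contents of script and style elements, since they reach handle_data like any other text; filtering them out would need extra handling.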
Using one of the many regular expressions posted as answers to similar questions does not help either: there are always some edge cases that were not considered.
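As a small illustration of such an edge case, here is a sketch with a deliberately naive pattern (NAIVE_TAG and strip_tags_regex are hypothetical names, not taken from any particular answer). A ">" inside a quoted attribute value ends the match too early, and entities are left untouched:

```python
import re

# Naive pattern: "<", then anything that is not ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]+>")


def strip_tags_regex(markup):
    """Remove everything the naive pattern considers a tag."""
    return NAIVE_TAG.sub("", markup)


# The ">" inside the title attribute cuts the first match short,
# so part of the attribute leaks into the result.
print(repr(strip_tags_regex('<a title="a > b">link</a>')))

# Entities are not decoded at all.
print(repr(strip_tags_regex("sp&eacute;cialiste")))
```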
So, is there a reliable way to do this?