BeautifulSoup: find an element by text with `find_all`, regardless of whether it contains child elements
For instance:
bs = BeautifulSoup("<html><a>sometext</a></html>")
print bs.find_all("a",text=re.compile(r"some"))
returns [<a>sometext</a>], but if the element being searched has a child, e.g. an img:
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print bs.find_all("a",text=re.compile(r"some"))
it returns []
Is there a way to use find_all to match the latter example?
1 answer
You will need to use a hybrid approach, because text= fails when an element has child elements as well as text.
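import re
from bs4 import BeautifulSoup
bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
reg = re.compile(r'some')
# search() finds the pattern anywhere in the text; match() would only match at the start
elements = [e for e in bs.find_all('a') if reg.search(e.text)]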
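If you would rather keep everything inside find_all, you can pass it a function instead of filtering afterwards. The following is a minimal sketch rather than a definitive solution: it assumes Python 3 and the built-in "html.parser" backend, and the names a_with_text, pattern and the sample markup are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup (illustrative): one <a> with text plus a child tag, one without the text.
html = "<html><a>sometext<img /></a><a>other</a></html>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"some")


def a_with_text(tag):
    """Return True for <a> tags whose combined text matches the pattern.

    get_text() joins every string in the subtree, so child tags such as
    <img> do not hide the text the way .string does.
    """
    return tag.name == "a" and pattern.search(tag.get_text()) is not None


# find_all accepts a function that takes a tag and returns True or False.
matches = soup.find_all(a_with_text)
print(matches)
```

Note that get_text() also includes text from nested tags, so this matches more broadly than text= does. That is usually what you want here, but it is worth checking against your own markup.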
Background
When BeautifulSoup searches for an element and text is a callable, it eventually calls:
self._matches(found.string, self.text)
In the two examples you provided, the .string property returns different things:
>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None
The .string property is defined as:
@property
def string(self):
    """Convenience property to get the single string within this tag.
    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string
In the second example the tag has two children, so .string returns None:
>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
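To see why filtering on the element's text works where text= does not, it can help to put .contents, .string and get_text() side by side. This is only a rough sketch, assuming Python 3 and the built-in "html.parser" backend; the variable names plain and nested are illustrative.

```python
from bs4 import BeautifulSoup

# The two <a> tags from the question: one with only text, one with text plus <img>.
plain = BeautifulSoup("<a>sometext</a>", "html.parser").a
nested = BeautifulSoup("<a>sometext<img /></a>", "html.parser").a

for tag in (plain, nested):
    # .string requires exactly one child; get_text() joins every string in the subtree.
    print(len(tag.contents), repr(tag.string), repr(tag.get_text()))
```

The second tag has two children, so .string gives None, while get_text() still returns the text, which is what the list comprehension in the answer relies on.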