
BeautifulSoup: find elements by text with `find_all`, regardless of whether they contain child elements

For instance,

bs = BeautifulSoup("<html><a>sometext</a></html>")
print bs.find_all("a",text=re.compile(r"some"))

returns [<a>sometext</a>], but if the element being searched has a child, e.g. an img,

bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
print bs.find_all("a",text=re.compile(r"some"))

it returns []

Is there a way to use find_all to match the latter example?

1 answer

You will need to use a hybrid approach, because text= fails when an element has child elements as well as text.

bs = BeautifulSoup("<html><a>sometext</a></html>")
reg = re.compile(r'some')
# search() matches anywhere in the text; match() would only match at the start
elements = [e for e in bs.find_all('a') if reg.search(e.text)]
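
The same idea can be wrapped in a small helper and run against the example with the img child. This is only a sketch in Python 3 syntax: the name find_all_by_text is made up for illustration, and "html.parser" is passed explicitly as an assumption about which parser to use.

```python
import re
from bs4 import BeautifulSoup


def find_all_by_text(soup, name, pattern):
    """Return tags called `name` whose combined text matches `pattern`.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    regex = re.compile(pattern)
    # get_text() joins the text of all descendants, so a child tag such as
    # <img/> does not hide the surrounding text.
    return [tag for tag in soup.find_all(name) if regex.search(tag.get_text())]


soup = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
matches = find_all_by_text(soup, "a", r"some")
print(matches)
```

Note that get_text() also includes text from nested tags, so an a tag whose matching text sits inside a child element would be returned as well.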

Background

When BeautifulSoup searches for an element and text is a callable, it ultimately calls:

self._matches(found.string, self.text)

In the two examples you provided, the .string property returns different things:

>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None

The .string property is defined as:

@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string

Printing the contents of both tags shows why None is returned in the second case:

>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
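
Since the text is still present in .contents as a direct child, another option is to test only the strings that sit directly inside the tag. The following is a rough sketch in Python 3 syntax: has_direct_text is a made-up name, the second a tag in the sample markup is added purely for contrast, and "html.parser" is an assumed parser choice.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def has_direct_text(tag, regex):
    """True if any string that is a direct child of `tag` matches `regex`.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    return any(
        isinstance(child, NavigableString) and regex.search(child)
        for child in tag.contents
    )


regex = re.compile(r"some")
soup = BeautifulSoup(
    "<html><a>sometext<img /></a><a><b>something nested</b></a></html>",
    "html.parser",
)
# Only the first <a> has matching text as a direct child; in the second
# <a> the text belongs to the nested <b>, so it is skipped.
direct = [tag for tag in soup.find_all("a") if has_direct_text(tag, regex)]
print(len(direct))
```

Unlike a check on e.text, this ignores text that belongs to nested tags, which may or may not be what you want.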