BeautifulSoup cannot find href in file using regex

I have an HTML file, for example:

<form action="/2811457/follow?gsid=3_5bce9b871484d3af90c89f37" method="post">
<div>
<a href="/2811457/follow?page=2&amp;gsid=3_5bce9b871484d3af90c89f37">next_page</a>
&nbsp;<input name="mp" type="hidden" value="3" />
<input type="text" name="page" size="2" style='-wap-input-format: "*N"' />
<input type="submit" value="jump" />&nbsp;1/3
</div>
</form>

How do I extract the href "/2811457/follow?page=2&gsid=3_5bce9b871484d3af90c89f37" of the next_page link?

This is only part of the HTML; I include it to make the question clear. When I use BeautifulSoup,

print soup.find('a',href=re.compile('follow?page'))

it returns None. Why? I am new to BeautifulSoup and have been reading the documentation, but I am still confused.

For now I use this ugly workaround:

    urls = soup.findAll('a', href=True)
    for url in urls:
        # url is a Tag, so test its href attribute as a plain string
        if 'follow?page' in url['href']:
            print url['href']

I need a cleaner and more elegant way.

1 answer

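You need to escape the question mark. In a regular expression, w? means zero or one w, so the pattern follow?page matches "followpage" or "follopage" but never the literal text "follow?page". Try the following: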

print soup.find('a', href = re.compile(r'.*follow\?page.*'))
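The difference is easy to check with the re module alone, without BeautifulSoup. The following is a minimal sketch that uses the href from the question as a plain string; the variable names (href, unescaped, escaped, auto) are chosen only for illustration.

```python
import re

# The href value from the question, as BeautifulSoup would see it
# after decoding &amp; to &.
href = "/2811457/follow?page=2&gsid=3_5bce9b871484d3af90c89f37"

# Unescaped: "w?" means "zero or one w", so this pattern can only match
# the text "followpage" or "follopage", never "follow?page".
unescaped = re.compile('follow?page')
print(unescaped.search(href))              # None

# Escaped: "\?" matches a literal question mark.
escaped = re.compile(r'follow\?page')
print(escaped.search(href) is not None)    # True

# re.escape builds the escaped pattern from a literal string.
auto = re.compile(re.escape('follow?page'))
print(auto.search(href) is not None)       # True
```

Since the BeautifulSoup documentation says a compiled pattern is applied with its search() method, the leading and trailing .* in the pattern above should be optional, and re.compile(r'follow\?page') is expected to find the same link.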