Regular expression pattern for content in HTML tags

I wrote a simple Python script that connects to a specific site and fetches all the links on it.

import urllib2
import re


request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
match = re.findall(r'<a href=".\w+.\d+">.+</a>', content)
if match:
    for i in match:
        print i + "\n"

else:
    print 'Not Found!'

Result:

<a href="/video/3878"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3878.jpg"  alt=
"avatar" /></a>

<a href="/video/3878">NodeZero Linux Review</a>

<a href="/video/3877"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3877.jpg"  alt=
"avatar" /></a>

<a href="/video/3877">Post Attack Uploading Shell in Real Time</a>

<a href="/video/3867"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3867.jpg"  alt=
"avatar" /></a>

<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>

<a href="/video/3866"><img class="corner iradius20  ishadow33" width="100" heigh
t="75" src="http://videothumbs.securitytube.net.s3.amazonaws.com/3866.jpg"  alt=
"avatar" /></a>
....
...
...

I am trying to get only the links that have a friendly name, for example <a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>.

My pattern: <a href=".\w+.\d+">.+</a>

2 answers

If you really want to use regular expressions instead of a proper parser, you can use capturing groups and refer to them later.

See http://docs.python.org/library/re.html

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed.

Try:

import re
import urllib2

request = urllib2.urlopen('http://www.securitytube.net/')
content = request.read()
# Group 1 captures the href value, group 2 the link text.
match = re.findall(r'<a href="(.*?)".*>(.*)</a>', content)
if match:
    for link, title in match:
        print "link %s -> %s" % (link, title)

This outputs:

link /video/3822 -> SecurityTube SpeakUp: Cloud Computing
link /video/3587 -> 
link /video/3587 -> Securitytube Speak Up: Antivirus Evasion attacks
link /video/3489 -> 
link /video/3489 -> SecurityTube SpeakUp: ThePirateBay LOSS
link /video/3375 -> 
link /video/3375 -> SecurityTube SpeakUp: .COM and .NET Domain Seizures
link /video/3130 -> 
link /video/3130 -> SecurityTube Speak Up: The MS12-020 Fiasco!
...etc
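
For comparison, here is a minimal sketch of the parser route that the answer above alludes to. It assumes Python 3, where the standard-library parser lives in html.parser (the code in this thread is Python 2), and it runs on a small inline sample instead of fetching the site. LinkCollector, extract_links, and sample are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> elements that contain visible text."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text:  # image-only links have no text and are skipped
                self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Inline sample modelled on the output shown in the question.
sample = (
    '<a href="/video/3878"><img class="corner iradius20  ishadow33" '
    'src="http://videothumbs.securitytube.net.s3.amazonaws.com/3878.jpg" '
    'alt="avatar" /></a>\n'
    '<a href="/video/3878">NodeZero Linux Review</a>\n'
    '<a href="/video/3877"><img class="corner iradius20  ishadow33" '
    'src="http://videothumbs.securitytube.net.s3.amazonaws.com/3877.jpg" '
    'alt="avatar" /></a>\n'
    '<a href="/video/3877">Post Attack Uploading Shell in Real Time</a>\n'
)

for link, title in extract_links(sample):
    print("link %s -> %s" % (link, title))
```

Because the parser tracks element boundaries itself, the thumbnail links never produce an empty title; they are simply dropped when no text is found inside the <a> element.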



Regular expressions are the wrong tool for parsing HTML. ;-)

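That said, for a simple scan that is not HTML-aware, your pattern has two problems:

  • .\w+.\d+ only matches paths of the form /video/3877. To match any href value, use "[^"]+" instead.
  • .+ is greedy, so it can run past the first </a> on a line. Use the non-greedy .+? instead.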
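
A minimal sketch of a tightened pattern along the lines of the two bullets above. The href part uses the "[^"]+" form from the first bullet. For the link text it uses [^<]+ rather than .+?, which is an assumption that goes beyond this answer: it rejects any link whose content starts with a tag, so the thumbnail links drop out and only the friendly-name links remain. LINK_RE and sample_html are hypothetical names, and the sample is inline rather than fetched from the site.

```python
import re

# Group 1: the href value (anything up to the closing quote).
# Group 2: the link text (anything up to the next tag, so <img ...> links fail).
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

# Inline sample modelled on the output shown in the question.
sample_html = (
    '<a href="/video/3867"><img class="corner iradius20  ishadow33" '
    'src="http://videothumbs.securitytube.net.s3.amazonaws.com/3867.jpg" '
    'alt="avatar" /></a>\n'
    '<a href="/video/3867">Using SQLMAP in Real Time (SQLinjection WEB)</a>\n'
    '<a href="/video/3878">NodeZero Linux Review</a>\n'
)

matches = LINK_RE.findall(sample_html)
for link, title in matches:
    print("link %s -> %s" % (link, title))
```

This still breaks on markup the pattern does not anticipate, such as extra attributes before or after href, single-quoted values, or nested tags inside the link text.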