Get all HTML links using lxml

Question

I want to find all the URLs and their link text in an HTML page using lxml.

I could parse the page and pick the links out by hand, but is there a simpler way to find all the links using lxml?

python lxml

sam Apr 30 '12 at 12:02

2 answers

kev · Answer 1 · 2012-04-30T12:08:44+0000

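from lxml.html import parse

# parse() accepts a filename, a file object, or a URL
dom = parse('http://www.google.com/').getroot()
# cssselect() needs the separate cssselect package installed
links = dom.cssselect('a')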
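
The question asks for each link's URL and its name, while the snippet above stops at the list of `a` elements. A minimal sketch of the remaining step, assuming the "name" means the visible link text: it uses XPath rather than `cssselect` so it runs without the extra package, and `SAMPLE` and `extract_links` are hypothetical names introduced only for illustration.

```python
from lxml import html

# Hypothetical sample markup, used only for illustration.
SAMPLE = """
<html><body>
  <a href="/about">About <b>us</b></a>
  <a href="http://example.com/">Example</a>
  <a name="top">No href here</a>
</body></html>
"""

def extract_links(markup):
    """Return (href, text) pairs for every <a> element that has an href."""
    root = html.fromstring(markup)
    pairs = []
    for a in root.xpath('//a[@href]'):
        # text_content() joins the text of the element and its descendants.
        pairs.append((a.get('href'), a.text_content().strip()))
    return pairs

for href, text in extract_links(SAMPLE):
    print(href, text)
```

Anchors without an `href` attribute are skipped by the `[@href]` predicate, so no pair has a `None` URL.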

lmokto · Answer 2 · 2014-01-23T19:06:18+0000

from lxml import cssselect, html

# Read the HTML file from disk (the path is a placeholder)
with open("/you/path/index.html", "r") as f:
    fileread = f.read()

dochtml = html.fromstring(fileread)

# Compile the CSS selector once, then collect the href of every anchor
select = cssselect.CSSSelector("a")
links = [el.get('href') for el in select(dochtml)]

# Print each link with its index
for n, link in enumerate(links):
    print(n, link)
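
The `href` values collected this way are returned as written in the page, so relative links stay relative. One possible way to get absolute URLs is lxml's `make_links_absolute()` together with `iterlinks()`. This is a sketch only: the sample markup, the base URL, and the `absolute_anchor_links` helper are assumptions made for the example.

```python
from lxml import html

# Hypothetical fragment and base URL, used only for illustration.
SAMPLE = ('<div><a href="/docs">Docs</a> <img src="logo.png"> '
          '<a href="http://example.org/x">X</a></div>')
BASE_URL = 'http://example.com/site/'

def absolute_anchor_links(markup, base_url):
    """Return (absolute_url, text) pairs for the <a href> links in markup."""
    root = html.fromstring(markup)
    # Rewrites every link in the tree relative to base_url, in place.
    root.make_links_absolute(base_url)
    result = []
    # iterlinks() yields (element, attribute, link, pos) for every link,
    # including images and stylesheets, so keep only <a href="...">.
    for element, attribute, link, pos in root.iterlinks():
        if element.tag == 'a' and attribute == 'href':
            result.append((link, element.text_content().strip()))
    return result

for url, text in absolute_anchor_links(SAMPLE, BASE_URL):
    print(url, text)
```

Because `iterlinks()` also reports `src` attributes and similar, the tag and attribute check is what restricts the result to anchors.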
