HTML Treebuilder XPath for link extraction

Question

I am writing a basic script that extracts all the links from a web page. It is written in Perl and uses the WWW::Mechanize and HTML::TreeBuilder::XPath modules, both of which I installed from CPAN.

I know this can easily be done with WWW::Mechanize alone, but I would like to learn how to do it using XPath.

The script should parse the entire web page, check the href attribute of each anchor tag, extract the link, and print it to the console or write it to a file. Note that I did not enable use strict in the script below, since I am only writing it to understand the concept of using XPath to traverse an HTML tree.

Here is the script:

#! /usr/bin/perl

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use warnings;

$url="https://example.com";

$mech=WWW::Mechanize->new();
$mech->get($url);

$tree=HTML::TreeBuilder::XPath->new();

$tree->parse($mech->content);

$nodes=$tree->findnodes(q{'//a'}); # line is modified later.

foreach $node($nodes)
{
    print $node->attr('href');
}

And this gives an error:

Can't locate object method "attr" via package "XML::XPathEngine::Literal" at pagegetter.pl line 23.

I changed the script as follows:

$nodes=$tree->findnodes(q{'//a/@href'});

while($node=$nodes->shift)
{
  print $node->attr('href');
}

Error:

Can't locate object method "shift" via package "XML::XPathEngine::Literal"

So I am unable to extract the href.

How should I iterate over $nodes to get each href?

html perl xpath html-tree

Neon Flash · Jul 31 '12 12:55

daxim · Accepted Answer · 2012-07-31T13:07:55+0000

Call findnodes in list context, and pass the XPath expression as a plain string:

# list context
my @nodes = $tree->findnodes(
    q{//a}       # just a string, not a string containing quotes
);

# iterate over array
for my $node (@nodes) {
    print $node->attr('href'), "\n";
}
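
For reference, here is a minimal self-contained sketch of both approaches from the question: reading href from the element nodes, and selecting the attribute nodes directly with //a/@href. The inline sample HTML and the helper name extract_links are assumptions made for illustration; in the original script the content would come from $mech->content. The //a[@href] predicate is also an addition, used here so that anchors without an href are skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for $mech->content.
my $html = <<'HTML';
<html><body>
  <a href="/first">First</a>
  <a name="anchor-only">No href here</a>
  <a href="https://example.com/second">Second</a>
</body></html>
HTML

# extract_links is a hypothetical helper name, not part of any module.
sub extract_links {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($content);

    # Element nodes: match only anchors that carry an href,
    # then read the attribute with attr().
    my @from_elements =
        map { $_->attr('href') } $tree->findnodes(q{//a[@href]});

    # Attribute nodes: findvalues() returns their string values,
    # so no attr() call is needed.
    my @from_attributes = $tree->findvalues(q{//a/@href});

    $tree->delete;    # release the parse tree

    return (\@from_elements, \@from_attributes);
}

my ($elements, $attributes) = extract_links($html);
print "$_\n" for @$elements;
```

In both calls the XPath expression is passed without inner quotes. Writing q{'//a'} makes the expression an XPath string literal, which is why the errors above mention XML::XPathEngine::Literal.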
