Scrapy HtmlXPathSelector

Question

I'm just trying out Scrapy and trying to get a basic spider to work. I know it's probably just something I'm missing, but I've tried everything I could think of.

The error I get is:

line 11, in JustASpider
    sites = hxs.select('//title/text()')
NameError: name 'hxs' is not defined

My code is very simple at the moment, but I still can't find where I went wrong. Thanks for any help!

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//title/text()')
        for site in sites:
            print site.extract()


SPIDER = JustASpider()

+5

scrapy

Keanan koppenhaver Sep 03 '12 at 10:39

9 answers

The code looks quite old. I recommend using this code instead:

from scrapy.spider import Spider
from scrapy.selector import Selector

class JustASpider(Spider):
    name = "googlespider"
    allowed_domains=["google.com"]
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print sites
        # for site in sites:  (I don't know why you would loop just to extract the text of the title element)
        #     print site.extract()

+5

pink bunny 04 . '15 6:28

I had a similar problem, NameError: name 'hxs' is not defined, and the cause was indentation: check how your IDE handles tabs and spaces.
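
The traceback in the question says the failing line ran "in JustASpider" rather than "in parse", which is what Python reports when a statement executes in the class body. Here is a minimal sketch of that situation without Scrapy. The class, the plain-string stand-in for the response, and the helper name are all made up for illustration, and it is written for Python 3.

```python
# Hypothetical spider source: the last line has slipped out of parse()
# and now sits at class-body level, where the local name hxs is unknown.
BROKEN_SOURCE = '''
class JustASpider(object):
    name = "demo"

    def parse(self, response):
        hxs = response
    sites = hxs.upper()
'''

# Same source with the last line indented back into parse().
FIXED_SOURCE = '''
class JustASpider(object):
    name = "demo"

    def parse(self, response):
        hxs = response
        sites = hxs.upper()
'''


def class_body_error(source):
    """Execute the source; return the NameError message, or None if it defines cleanly."""
    try:
        exec(compile(source, "<spider>", "exec"), {})
    except NameError as exc:
        return str(exc)
    return None


print(class_body_error(BROKEN_SOURCE))
print(class_body_error(FIXED_SOURCE))
```

The first source fails while the class is being defined, because hxs only exists inside parse(); the second defines cleanly. This only reproduces the symptom. It does not prove what caused it in the original file.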

+2

Rendy Bambang Junior 23 . '13 23:22

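The code looks like it was written for a very old version of Scrapy.

In current versions HtmlXPathSelector is deprecated. Use Selector instead: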

hxs = Selector(response)
sites = hxs.xpath('//title/text()')

+1

dimka665 14 . '14 5:14

Try deleting the *.pyc files and running the spider again.
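
If stale bytecode is the suspect, one way to clear the *.pyc files is a short script like this. It is only a sketch: remove_pyc is a made-up helper name, it assumes Python 3 (pathlib), and it deletes every *.pyc under the given directory, including those inside __pycache__ folders.

```python
import pathlib


def remove_pyc(root):
    """Delete every *.pyc file under root and return the list of removed paths."""
    removed = []
    # rglob walks the whole tree, so nested packages and __pycache__ are covered.
    for path in sorted(pathlib.Path(root).rglob("*.pyc")):
        path.unlink()
        removed.append(path)
    return removed
```

Calling remove_pyc(".") from the project directory clears the compiled files there; Python regenerates them on the next run.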

0

warvariuc 05 . '12 4:47

Try this: save the spider to a file, for example test.py, and run it with:

scrapy runspider <filename.py>

For example:

scrapy runspider test.py

0

Jezeel Muhammed 19 . '13 15:01

#!/usr/bin/env python

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

0

user3672836 21 . '14 19:20

Replace

from scrapy.selector import HtmlXPathSelector

with

from scrapy.selector import Selector

and use hxs = Selector(response).

0

neal 26 . '15 5:38

I use Scrapy with BeautifulSoup 4. It is an option if you don't have to use HtmlXPathSelector. Hope this helps!

import scrapy
from bs4 import BeautifulSoup
import Item

def parse(self, response):

    soup = BeautifulSoup(response.body,'html.parser')
    print 'Current url: %s' % response.url
    item = Item()
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            item['url'] = url
            yield scrapy.Request(url,callback=self.parse)
            yield item

0

sarc360 11 . '16 19:13

Keanan koppenhaver · Accepted Answer · 2012-09-10T16:27:07+0000

I removed the SPIDER call at the end and removed the for loop. There was only one title tag (as you would expect), and it seems that was throwing off the loop. The code I have working is as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//title/text()')
        final = titles.extract()
