Scrapy - analysis of all subpages of a specific domain

I would like to analyze kickstarter.com projects using Scrapy, but I cannot figure out how to make the spider discover projects that I do not explicitly list in start_urls. I have the first part of the Scrapy code working (I can extract the necessary information from a single page); I just cannot get it to do this for all projects under kickstarter.com/projects.

From what I have read, I believe this is possible by (1) following the links on the start page (kickstarter.com/projects), (2) following the links from one project page to another project, and (3) using the sitemap (which does not seem to work for kickstarter.com) to find the pages to parse.

I have spent hours trying each of these methods, but I am getting nowhere.

I started from the tutorial code and built on it.

Here is the part that works:

from scrapy import log
from scrapy.contrib.spiders import CrawlSpider   
from scrapy.selector import HtmlXPathSelector  

from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']    
    start_urls = ["http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic"]

    def parse(self, response):
        x = HtmlXPathSelector(response)

        item = kickstarteritem()
        item['url'] = response.url
        item['name'] = x.select("//div[@class='NS-project_-running_board']/h2[@id='title']/a/text()").extract()
        item['launched'] = x.select("//li[@class='posted']/text()").extract()
        item['ended'] = x.select("//li[@class='ends']/text()").extract()
        item['backers'] = x.select("//span[@class='count']/data[@data-format='number']/@data-value").extract()
        item['pledge'] = x.select("//div[@class='num']/@data-pledged").extract()
        item['goal'] = x.select("//div[@class='num']/@data-goal").extract()
        return item

1 answer

Since you are subclassing CrawlSpider, do not override parse. CrawlSpider's link-following logic lives in parse, and you really need that logic.

As for the crawling itself, that is what the rules class attribute is for. I have not tested it, but it should work:
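
To see why overriding parse breaks the crawl, here is a toy model in plain Python. It is not Scrapy's actual implementation: the classes ToyCrawlSpider, Broken and Fixed and the extract_links helper are hypothetical stand-ins that only mimic the pattern of a base class keeping its link-following logic in parse.

```python
class ToyCrawlSpider:
    """Hypothetical stand-in for CrawlSpider: parse() owns the link following."""

    def extract_links(self, page):
        # Pretend link extraction: a page is a dict with a 'links' list.
        return page.get("links", [])

    def parse(self, page):
        # Base-class behavior: emit every link found on the page,
        # then hand the page to the item callback.
        for link in self.extract_links(page):
            yield ("follow", link)
        yield from self.parse_item(page)

    def parse_item(self, page):
        # Default callback: produce nothing.
        return []


class Broken(ToyCrawlSpider):
    def parse(self, page):
        # Overriding parse() replaces the link-following logic entirely.
        yield ("item", page["url"])


class Fixed(ToyCrawlSpider):
    def parse_item(self, page):
        # A separately named callback leaves parse() intact.
        yield ("item", page["url"])


page = {"url": "/discover", "links": ["/projects/a", "/projects/b"]}
broken_out = list(Broken().parse(page))
fixed_out = list(Fixed().parse(page))
print(broken_out)  # only the item, no links to follow
print(fixed_out)   # both links, then the item
```

In this sketch the Broken spider never emits a link, which is roughly what happens when a CrawlSpider subclass defines its own parse: the start page gets scraped and the crawl stops there.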

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader

from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com/discover/recently-launched']

    rules = (
        # Follow pagination links on the listing pages.
        Rule(
            SgmlLinkExtractor(allow=r'\?page=\d+'),
            follow=True
        ),
        # Send every project page to parse_item.
        Rule(
            SgmlLinkExtractor(allow=r'/projects/'),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        # The loader evaluates each XPath against the response.
        loader = XPathItemLoader(item=kickstarteritem(), response=response)

        loader.add_value('url', response.url)
        loader.add_xpath('name', '//div[@class="NS-project_-running_board"]/h2[@id="title"]/a/text()')
        loader.add_xpath('launched', '//li[@class="posted"]/text()')
        loader.add_xpath('ended', '//li[@class="ends"]/text()')
        loader.add_xpath('backers', '//span[@class="count"]/data[@data-format="number"]/@data-value')
        loader.add_xpath('pledge', '//div[@class="num"]/@data-pledged')
        loader.add_xpath('goal', '//div[@class="num"]/@data-goal')

        yield loader.load_item()

The spider crawls the pages of recently launched projects.
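
If it helps to check the two allow patterns before running the spider, here is a small sketch that applies them to a few URLs with Python's re module. It assumes the link extractor treats each pattern as a regular expression searched anywhere in the URL, and that a link claimed by an earlier rule is not handed to a later one. The classify function is hypothetical, and so are the sample URLs apart from the project URL taken from the question.

```python
import re

# The same patterns as in the spider's rules, in the same order.
PAGINATION = re.compile(r'\?page=\d+')
PROJECT = re.compile(r'/projects/')


def classify(url):
    """Return which rule would pick up the URL under the assumptions above."""
    if PAGINATION.search(url):
        return "follow"
    if PROJECT.search(url):
        return "parse_item"
    return "ignore"


samples = [
    "http://www.kickstarter.com/discover/recently-launched?page=2",
    "http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic",
    "http://www.kickstarter.com/help",
]
results = [classify(url) for url in samples]
for url, result in zip(samples, results):
    print(result, url)
```

Note that r'/projects/' matches any URL containing that segment, so sub-pages of a project would also reach parse_item; a tighter pattern may be worth trying if that turns out to be a problem.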

Also, use yield instead of return. It is better for your spider callback to be a generator, because that lets you emit multiple items and requests without building a list to hold them.
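
The difference can be sketched without Scrapy at all. The two functions below are hypothetical, and plain dicts stand in for items; both produce the same results, but the generator version hands each one over as soon as it is ready instead of collecting them first.

```python
import inspect


def parse_with_return(rows):
    # Builds the whole list in memory before handing anything back.
    items = []
    for row in rows:
        items.append({"name": row})
    return items


def parse_with_yield(rows):
    # Hands back one item at a time; no intermediate list is needed.
    for row in rows:
        yield {"name": row}


rows = ["project-a", "project-b"]
returned = parse_with_return(rows)
yielded = parse_with_yield(rows)
is_generator = inspect.isgenerator(yielded)
collected = list(yielded)
print(is_generator, collected == returned)
```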
