Scrapy CrawlSpider: How to Access an Item at Different Parsing Levels

I am crawling a website (only two levels deep) and I want to scrape information from pages on both levels. The problem is that I want to fill in the fields of a single item with information from both levels. How do I do this?

I thought of keeping a list of items as an instance variable that would be available to all threads (since it is the same spider instance): parse_1 would fill in some fields, and parse_2 would have to look the item up before filling in the corresponding value. This approach seems cumbersome, and I'm still not sure how to make it work.

I think there should be a better way, perhaps by somehow passing the item to the callback. However, I don't know how to do that with Request(). Any ideas?

1 answer

From the Scrapy documentation:

In some cases, you might be interested in passing arguments to these callback functions so that you can get the arguments later in the second callback. You can use the Request.meta attribute for this.

Here is an example of how to pass an item using this mechanism, so that different fields are filled in from different pages:

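from scrapy import Request

# Both functions are spider methods; MyItem is your own Item class.
def parse_page1(self, response):
    # First level: create the item and fill in what this page provides.
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    # Attach the partially filled item so the next callback can read it.
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    # Second level: retrieve the same item and complete it.
    item = response.meta['item']
    item['other_url'] = response.url
    return item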

So basically you can scrape the first page and store its information in the item, then send the whole item along with the request for the second-level URL and end up with all the information in a single item.
