Writing to multiple files with Scrapy

I am scraping a site with Scrapy and would like to split the results into two parts. I usually call Scrapy as follows:

$ scrapy crawl articles -o articles.json
$ scrapy crawl authors  -o  authors.json

The two spiders are completely independent and do not communicate at all. This setup works for small websites, but larger sites have too many authors for me to crawl this way.

How can I have the articles spider tell the authors spider which pages to crawl, while keeping this two-file structure? Ideally, I would prefer not to write the author URLs to a file and then read it back with the other spider.

+5
2 answers

I ended up using command-line arguments for the authors spider:

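import json

import scrapy


class AuthorSpider(scrapy.Spider):
    ...

    def __init__(self, articles, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = []

        # `articles` is the path passed with `-a articles=...`; the file is
        # expected to hold one JSON object per line (JSON Lines).
        with open(articles) as f:
            for line in f:
                article = json.loads(line)
                self.start_urls.append(article['author_url'])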

Then I added the duplicates pipeline described in the Scrapy documentation:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Finally, I passed the articles JSON file to the spider on the command line:

$ scrapy crawl authors -o authors.json -a articles=articles.json
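
Because several articles can share one author, the same author URL may appear many times in the articles file. Here is a minimal sketch of collecting each URL only once before crawling. It assumes the file has one JSON object per line and that each record has an `author_url` field, as in the spider above; the helper name `load_author_urls` is hypothetical.

```python
import json


def load_author_urls(path):
    """Return the unique author URLs from a JSON Lines file, in first-seen order."""
    seen = set()
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            url = json.loads(line).get("author_url")
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

In the spider's `__init__`, `self.start_urls = load_author_urls(articles)` could then replace the loop.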

+1

You could use a Scrapy pipeline to write the items to separate JSON files.
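
An item pipeline can route items to different output files. Below is a minimal sketch, not a definitive implementation. It assumes each item carries a hypothetical `type` field with the value `article` or `author`; the class name and the output file names are also assumptions. Only the standard `open_spider`, `close_spider` and `process_item` pipeline methods are used.

```python
import json


class SplitByTypePipeline:
    """Write article and author items to separate JSON Lines files."""

    # Hypothetical mapping from an item's "type" field to an output file.
    paths = {"article": "articles.jl", "author": "authors.jl"}

    def open_spider(self, spider):
        # Open one output file per item type when the crawl starts.
        self.files = {
            kind: open(path, "w", encoding="utf-8")
            for kind, path in self.paths.items()
        }

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        record = dict(item)
        out = self.files.get(record.get("type"))
        if out is not None:
            # One JSON object per line; items of unknown type are not written.
            out.write(json.dumps(record) + "\n")
        return item
```

The pipeline would have to be enabled in the project's `ITEM_PIPELINES` setting, and the spider would be run without `-o`, since the pipeline writes the files itself.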

Another point: for very large data sets, JSON is not recommended; use JSON Lines instead.
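
The difference is that `json.load` parses a single JSON document into memory as a whole, while a JSON Lines file can be written and read one record at a time. Here is a minimal sketch using only the standard library; the helper names `append_record` and `iter_records` are hypothetical.

```python
import json


def append_record(path, record):
    """Append one record to the file as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def iter_records(path):
    """Yield records one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

With Scrapy, exporting to a file with the `.jl` extension, for example `-o articles.jl`, selects the JSON Lines exporter.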

0