I am new to Scrapy and, really, to Python. I am trying to write a scraper that retrieves the article title, link, and description from a web page, almost like an RSS feed, to help me with my dissertation. I wrote the following scraper, but when I run it and export the results as .txt, the output comes back empty. I believe I need to add an item loader, but I'm not sure.
Items.py
from scrapy.item import Item, Field

class NorthAfricaItem(Item):
    title = Field()
    link = Field()
    desc = Field()
Spider
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafricatutorial.items import NorthAfricaItem

class NorthAfricaItem(BaseSpider):
    name = "northafrica"
    allowed_domains = ["http://www.north-africa.com/"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
UPDATE
Thanks to Talvalin for the help. With that and some fiddling around, I was able to fix the problem. I had been using a stock script that I found on the Internet. However, once I used the shell, I was able to find the right tags to get what I needed. I ended up with:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafrica.items import NorthAfricaItem

class NorthAfricaSpider(BaseSpider):
    name = "northafrica"
    allowed_domains = ["http://www.north-africa.com/"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('//div[@class="short_holder"]/h2/a/text()').extract()
            item['link'] = site.select('//div[@class="short_holder"]/h2/a/@href').extract()
            item['desc'] = site.select('//span[@class="summary"]/text()').extract()
            items.append(item)
        return items
If anyone spots a mistake here, please let me know, but it works.
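One thing that may be worth checking in the final spider: inside the loop, the XPath expressions start with `//`, which Scrapy evaluates against the whole document rather than against the current `site`, so each item may end up holding every title, link, and summary on the page. Below is a minimal sketch of the alternative, selecting each article container once and then using paths relative to it. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors so that it runs on its own, and the `SAMPLE` markup and the `extract_items` helper are hypothetical; the real page's structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real page may be laid out differently.
SAMPLE = """
<ul>
  <li>
    <div class="short_holder">
      <h2><a href="/article-1.html">First headline</a></h2>
      <span class="summary">First summary.</span>
    </div>
  </li>
  <li>
    <div class="short_holder">
      <h2><a href="/article-2.html">Second headline</a></h2>
      <span class="summary">Second summary.</span>
    </div>
  </li>
</ul>
"""


def extract_items(markup):
    """Return one dict per article, using paths relative to each container."""
    root = ET.fromstring(markup)
    items = []
    # Select each article container once...
    for holder in root.findall(".//div[@class='short_holder']"):
        # ...then look only inside that container for its own fields.
        link = holder.find("h2/a")
        if link is None:
            continue  # skip containers without a headline link
        summary = holder.find("span[@class='summary']")
        items.append({
            "title": link.text,
            "link": link.get("href"),
            "desc": summary.text if summary is not None else None,
        })
    return items


items = extract_items(SAMPLE)
for item in items:
    print(item)
```

In the Scrapy spider, the same idea would mean looping over the article containers and starting the inner expressions with `.//` (or with no leading slashes) so they stay relative to the current node. Separately, Scrapy expects `allowed_domains` entries to be bare domain names such as `north-africa.com` rather than full URLs.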