Work with Scrapy Div Class

Question

I am new to Scrapy, and really to Python as well. I am trying to write a scraper that retrieves each article's title, link, and description from a web page, almost like an RSS feed, to help me with my dissertation. I wrote the following scraper, but when I run it and export the results as .txt, the output comes back empty. I believe I need to add an Item Loader, but I'm not sure.

Items.py

from scrapy.item import Item, Field

class NorthAfricaItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Spider

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafricatutorial.items import NorthAfricaItem

class NorthAfricaItem(BaseSpider):
   name = "northafrica"
   allowed_domains = ["http://www.north-africa.com/"]
   start_urls = [
       "http://www.north-africa.com/naj_news/news_na/index.1.html",
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//ul/li')
       items = []
       for site in sites:
           item = NorthAfricaItem()
           item['title'] = site.select('a/text()').extract()
           item['link'] = site.select('a/@href').extract()
           item['desc'] = site.select('text()').extract()
           items.append(item)
       return items

UPDATE

Thanks to Talvalin's help, plus some tinkering, I was able to fix this problem. I had been using a stock script that I found on the Internet. Once I tried the Scrapy shell, though, I was able to find the right tags to get what I needed. I ended up with:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafrica.items import NorthAfricaItem

class NorthAfricaSpider(BaseSpider):
   name = "northafrica"
   allowed_domains = ["http://www.north-africa.com/"]
   start_urls = [
       "http://www.north-africa.com/naj_news/news_na/index.1.html",
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//ul/li')
       items = []
       for site in sites:
           item = NorthAfricaItem()
           item['title'] = site.select('//div[@class="short_holder"]/h2/a/text()').extract()
           item['link'] = site.select('//div[@class="short_holder"]/h2/a/@href').extract()
           item['desc'] = site.select('//span[@class="summary"]/text()').extract()
           items.append(item)
       return items

If anyone sees a mistake here, let me know... but it works.
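One thing worth checking in the spider above: an XPath that starts with // is evaluated against the whole document, even when it is called on a nested selector, so every item may end up holding every title on the page. In Scrapy the usual remedy is to loop over the container elements and start the inner expressions with a dot (for example './/h2/a/text()'). The sketch below illustrates the same per-entry idea using only the standard library. The sample markup and the extract_items helper are made up for illustration; only the short_holder and summary class names come from the post, and the real page may nest them differently.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration; the real page layout is an assumption.
SAMPLE = """
<html><body>
  <div class="short_holder">
    <h2><a href="/a1.html">First title</a></h2>
    <span class="summary">First summary</span>
  </div>
  <div class="short_holder">
    <h2><a href="/a2.html">Second title</a></h2>
    <span class="summary">Second summary</span>
  </div>
</body></html>
"""


def extract_items(markup):
    """Return one dict per article block, using paths relative to that block."""
    root = ET.fromstring(markup.strip())
    items = []
    # Select each container once, then query *inside* it with relative paths.
    for holder in root.findall(".//div[@class='short_holder']"):
        link = holder.find("./h2/a")
        summary = holder.find("./span[@class='summary']")
        items.append({
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
            "desc": summary.text if summary is not None else None,
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE):
        print(entry)
```

Separately, allowed_domains is meant to hold bare domain names such as "north-africa.com" rather than full URLs.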

+5

python scrapy

Mike · Jan 24 '13 at 16:15

Talvalin · Accepted Answer · 2013-01-24T16:55:36+0000

When I ran your spider, it failed with the following error:

        exceptions.TypeError: 'NorthAfricaItem' object does not support item assignment

2013-01-24 16:43:35+0000 [northafrica] INFO: Closing spider (finished)

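As far as I can tell, the problem is that your spider class and your item class share the same name: NorthAfricaItem

When the spider creates a NorthAfricaItem and assigns fields to it (title, link, desc), the assignment fails, because inside the spider module the Spider class named NorthAfricaItem shadows the imported item class.

To fix this, rename the spider class to something like NorthAfricaSpider, and it should work.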
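A minimal sketch of why reusing the item's name for the spider class produces this TypeError, written in plain Python so it runs without Scrapy. The _DictItem class is a hypothetical stand-in for a Scrapy Item (something that supports item['field'] = value), and the parse methods are simplified to take no response.

```python
class _DictItem(dict):
    """Hypothetical stand-in for a Scrapy Item: supports item['field'] = value."""


# Plays the role of: from northafricatutorial.items import NorthAfricaItem
NorthAfricaItem = _DictItem


# Clashing version: the class statement rebinds the name NorthAfricaItem,
# so the imported item class is no longer reachable under that name.
class NorthAfricaItem(object):
    name = "northafrica"

    def parse(self):
        item = NorthAfricaItem()      # resolves to this spider class, not the item
        item["title"] = "some title"  # plain objects do not support item assignment
        return item


# Capture the error while the clashing binding is still in place.
try:
    NorthAfricaItem().parse()
    clash_message = None
except TypeError as exc:
    clash_message = str(exc)

# Fixed version: restore the item name and give the spider its own name.
NorthAfricaItem = _DictItem


class NorthAfricaSpider(object):
    name = "northafrica"

    def parse(self):
        item = NorthAfricaItem()      # now resolves to the item class
        item["title"] = "some title"
        return item


if __name__ == "__main__":
    print(clash_message)
    print(NorthAfricaSpider().parse())
```

The name is looked up when parse runs, not when the import happens, which is why the later class definition wins.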
