How do I scrape this Craigslist page into an array like this?

This is the page in question: http://phoenix.craigslist.org/cpg/

What I would like to do is create an array that looks like this:

Date (as shown by the h4 tag on this page) => in cell [0][0][0]
Link text => in cell [0][1][0]
Link href => in cell [0][1][1]

i.e. each row stores these three elements for one posting.
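
For concreteness, the layout described above could be written as a nested Ruby array. This is only a sketch of the target shape, filled in with one of the sample postings from this question; the variable name `rows` is made up for illustration.

```ruby
# Each row holds [[date], [link_text, link_href]], matching the indices above.
rows = [
  [
    ["Mon May 28"],
    ["Leads need follow up", "http://phoenix.craigslist.org/wvl/cpg/3043296202.html"]
  ]
]

puts rows[0][0][0]  # date
puts rows[0][1][0]  # link text
puts rows[0][1][1]  # link href
```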

What I did was just pull out all the h4 tags and save them in a hash, for example:

contents2[link[:date]] = content_page.css("h4").text

The problem is that a single cell ends up holding the text of every h4 tag on the page, whereas I would like one date per cell.

So, as an example:

0 => Mon May 28 - Leads need follow up - (Phoenix) - http://phoenix.craigslist.org/wvl/cpg/3043296202.html
1 => Mon May 28 - .Net/Java Developers - (phoenix) - http://phoenix.craigslist.org/cph/cpg/3043067349.html

Any thoughts on how I could approach this in code would be greatly appreciated.

+5

require 'rubygems'
require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://phoenix.craigslist.org/cpg/"))

# Postings start inside the second blockquote on the page
bq = doc.xpath('//blockquote')[1]

date  = nil         # Temp store of date of postings
posts = Array.new   # Store array of all postings here

# Loop through all blockquote children collecting data as we go along...
bq.children.each { |nod|
  # The date is stored in the h4 nodes. Grab it from there.
  date = nod.text if nod.name == "h4"

  # Skip nodes until we have a date
  next if !date

  # Skip nodes that are not p blocks. The p blocks contain the postings.
  next if nod.name != "p"

  # We have a p block. Extract posting data.
  link = nod.css('a').first['href']
  text = nod.text

  # Add new posting to array
  posts << [date, text, link]
}

# Output everything we just collected
posts.each { |p| puts p.join(" - ") }
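
The core idea in the loop above is to remember the most recent date and attach it to every posting that follows. Here is a minimal sketch of just that pattern without Nokogiri, assuming a made-up flat list of `[tag name, text, href]` entries in place of the blockquote's children; the `nodes` data and the example.com URLs are hypothetical.

```ruby
# Hypothetical flat list standing in for the blockquote's children:
# [tag name, text, href or nil]
nodes = [
  ["p",  "Stray posting before any date", "http://example.com/0.html"],
  ["h4", "Mon May 28", nil],
  ["p",  "Leads need follow up", "http://example.com/1.html"],
  ["br", "", nil],
  ["p",  ".Net/Java Developers", "http://example.com/2.html"],
  ["h4", "Sun May 27", nil],
  ["p",  "Older posting", "http://example.com/3.html"]
]

date  = nil   # most recent h4 text seen so far
posts = []    # collected [date, text, href] rows

nodes.each do |name, text, href|
  date = text if name == "h4"   # a new heading replaces the remembered date
  next unless date              # nothing to attach postings to yet
  next unless name == "p"       # only p entries are postings
  posts << [date, text, href]
end

posts.each { |post| puts post.join(" - ") }
```

Because the date is carried in a variable outside the loop, each row gets exactly one date rather than the concatenated text of every h4 on the page.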
+3

Alternatively, you can walk the whole document with traverse:

doc.traverse do |node|
  @date = node.text if node.name == 'h4'
  next unless @date
  break if node.text['next 100 postings']
  puts [@date, node.parent.text, node[:href]].join(' - ') if node.name == 'a'
end
+2
