Extract Background Images from Web Page / HTML + CSS Analysis

Question

I am building a link-sharing site with Ruby on Rails that lets users exchange links to web pages.

I would like to extract some representative images for each page (like on Facebook when you share a link).

I currently use the opengraph gem to parse the og:image meta tag first, and then use Nokogiri to parse the content of the page and collect the src attribute of every <img> tag. This gives good results (except for some decorative images, so I filter the results by size ...).

-

Now I would like to go further and analyze the CSS property background-image: the website logo is often displayed as the background of the <h1> or <a> tag.

I am thinking of the following process:

Parse the HTML document with a regex (something like /background(-image)?:.../) to find inline CSS
Extract the CSS style sheets with Nokogiri and parse these sheets with the same regular expression

... and then convert the URLs found to absolute URLs based on the document URL.
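
The regex and URL steps above could be sketched roughly as follows, using only Ruby's standard library. The constant `BACKGROUND_URL` and the method `background_image_urls` are names made up for this illustration, and the pattern is deliberately naive (it ignores CSS comments, `@import`, and multiple backgrounds in one declaration), so treat it as a starting point rather than a complete solution.

```ruby
require "uri"

# Matches url(...) inside a background or background-image declaration.
# Naive by design: no handling of comments, @import, or multiple backgrounds.
BACKGROUND_URL = /background(?:-image)?\s*:[^;}]*?url\(\s*["']?([^"')]+?)["']?\s*\)/i

# Returns the absolute image URLs found in `css`, resolved against `base_url`
# (the page URL for inline styles, the stylesheet URL for external sheets).
def background_image_urls(css, base_url)
  css.scan(BACKGROUND_URL).flatten.filter_map do |ref|
    ref = ref.strip
    next if ref.empty? || ref.start_with?("data:") # skip embedded images
    begin
      URI.join(base_url, ref).to_s
    rescue URI::Error
      nil # ignore references that are not valid URIs
    end
  end.uniq
end
```

The same method can be fed the text of `style` attributes and `<style>` elements (with the page URL as base) and the body of each linked stylesheet (with the stylesheet's own URL as base, since relative references in a stylesheet resolve against the stylesheet, not the page).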

-

My questions:

Do you think there is a more effective alternative?
Is there any library that can improve process performance?
Ideally, I would parse the HTML + CSS, build the CSS DOM, and read the properties of the relevant HTML tags (h1, a, ...).
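
Short of building a real CSS DOM, one rough approximation is to split a stylesheet into rules and keep only the background images whose selector ends in one of the tags of interest (h1, a). The sketch below assumes flat CSS without `@media` blocks or comments and does not compute the cascade; `RULE`, `targets_tag?`, and `background_rules_for` are hypothetical names for this example, and a dedicated CSS parser would be more robust.

```ruby
# Naive rule splitter: captures [selector_list, declaration_block] pairs.
# Assumption: flat CSS, no nested blocks, no comments, no braces in strings.
RULE = /([^{}]+)\{([^{}]*)\}/

# True when the last compound selector names one of `tags` (e.g. "#top h1").
def targets_tag?(selector, tags)
  last = selector.strip.split(/[\s>+~]+/).last.to_s
  tag = last[/\A[a-zA-Z][a-zA-Z0-9]*/]
  !tag.nil? && tags.include?(tag.downcase)
end

# Returns [selector, raw_url] pairs for rules that set a background image
# on one of the given tags. URLs are returned as written in the CSS.
def background_rules_for(css, tags = %w[h1 a])
  css.scan(RULE).each_with_object([]) do |(selectors, body), found|
    url = body[/background(?:-image)?\s*:[^;]*?url\(\s*["']?([^"')]+?)["']?\s*\)/i, 1]
    next unless url
    selectors.split(",").each do |selector|
      found << [selector.strip, url] if targets_tag?(selector, tags)
    end
  end
end
```

This only checks that a rule targets an h1 or a element somewhere; it does not verify that the page actually contains a matching element, which would require evaluating each selector against the parsed HTML document.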

html css ruby-on-rails web-scraping screen-scraping

Thomas Guillory 18 . '12 20:56

Paul McClean · Accepted Answer · 2012-04-20T13:49:06+0000

CSS - , , (, ), .

, , . //, "".

" " , , , ( ) : - ruby unix-?

, HTML-?

, , - .
