Extract Background Images from Web Page / HTML + CSS Analysis

I am building a link-sharing site with Ruby on Rails, where users exchange links to web pages.

I would like to extract some representative images for each page (like on Facebook when you share a link).

I currently use the opengraph gem to read the og:image meta tag first, and then use Nokogiri to parse the page content and collect the src attribute of every <img> tag. This gives good results (apart from some decorative images, so I filter the results by size ...).

-

Now I would like to go further and analyze the CSS property background-image: the website logo is often displayed as the background of the <h1> tag or of an <a> tag.

I am thinking of the following process:

  • Parse an HTML document with regex (something like /background(-image)?:.../) to find inline CSS

  • Extract CSS style sheets with Nokogiri and parse these sheets with the same regular expression

... and then convert the relative URLs found this way to absolute URLs, based on the document URL.
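The regex-and-resolve part of this process could look roughly like the sketch below. It uses only Ruby's standard library and assumes the CSS text (a style attribute value, a <style> body, or a downloaded stylesheet) has already been pulled out of the page, for example with Nokogiri. The names BACKGROUND_URL and background_image_urls are made up for this illustration, and the regular expression is only an approximation of real CSS parsing.

```ruby
require "uri"

# Matches the first url(...) inside a `background` or `background-image`
# declaration. This is an approximation: it ignores CSS comments and escapes,
# and it only sees the first image of a multi-image background.
BACKGROUND_URL = /background(?:-image)?\s*:[^;{}]*?url\(\s*["']?([^"')]+?)["']?\s*\)/i

# Returns the absolute URLs of the background images found in `css`.
# `base_url` is what relative references resolve against: the stylesheet's
# own URL for linked sheets, the page URL for inline CSS.
def background_image_urls(css, base_url)
  css.scan(BACKGROUND_URL).flatten.filter_map do |ref|
    ref = ref.strip
    next if ref.empty? || ref.start_with?("data:") # skip images inlined as data URIs
    begin
      URI.join(base_url, ref).to_s
    rescue URI::Error
      nil # drop references that are not valid URIs
    end
  end.uniq
end

# Example call with made-up input:
inline_css = "h1 a { background: #fff url('/img/logo.png') no-repeat; }"
puts background_image_urls(inline_css, "https://example.com/blog/post.html")
```

Note that for a linked stylesheet the base URL has to be the stylesheet's URL, not the page's, because relative url() references are resolved against the file that contains them.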

-

My questions:

  • Do you think there is a more effective alternative?

  • Is there any library that can improve process performance?

    , HTML + CSS, CSS DOM, HTML (h1, a,...) .


