Scrape a remote page and get the most relevant title or description for images with XPath

What I'm doing is essentially the same as the Tweet button or the Facebook Share / Like button: scraping a page and finding the most relevant title for a piece of data. The best example I can think of is when you are on the front page of a website with many articles and you click a Facebook Like button. It then picks up the correct information for the post closest to that Like button. Some sites have Open Graph tags, but some don't, and it still works.

Since this is done remotely, the only thing I control is which data I want to target. In this case, the data is images. Instead of extracting only the page's <title>, I want to somehow traverse the DOM in reverse from the starting point of each image and find the closest "title". The problem is that not all titles appear before the image. However, the likelihood of an image appearing after its title seems rather high in this case. With that said, I hope that it will work well for almost any site.
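The reverse traversal described above could be sketched with XPath's preceding:: axis, which walks backwards through the document from a context node. This is only a minimal sketch: the function name nearestPrecedingHeading and the sample markup are made up for illustration, and treating h1 to h4 as "titles" is an assumption.

```php
<?php
// Hypothetical helper: returns the text of the nearest heading that comes
// before the image in document order, or null when there is none.
function nearestPrecedingHeading(DOMXPath $xpath, DOMElement $img): ?string
{
    // On a reverse axis such as preceding::, [1] is the node closest to the image.
    $nodes = $xpath->query(
        'preceding::*[self::h1 or self::h2 or self::h3 or self::h4][1]',
        $img
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up sample page with two articles.
$html = '<html><body>'
      . '<h2>First article</h2><p><img src="a.jpg"></p>'
      . '<h2>Second article</h2><div><img src="b.jpg"></div>'
      . '</body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely well-formed
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$titles = [];
foreach ($xpath->query('//img[@src]') as $img) {
    $titles[$img->getAttribute('src')] = nearestPrecedingHeading($xpath, $img);
}
print_r($titles);
```

Note that preceding:: skips ancestors of the image, so a heading that wraps the image itself would not be found this way.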

Thoughts:

  • Find the "container" of the image, and then use the first block of text.
  • Find blocks of text in elements that contain specific classes ("description", "name") or elements (h1, h2, h3, h4).
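Both ideas can be combined in one rough sketch: climb from the image to its enclosing elements and, at each level, look for a heading or an element whose class contains "description" or "name". The helper name titleFromContainer, the $maxLevels limit and the sample markup are assumptions for illustration only.

```php
<?php
// Hypothetical helper: walk up from the image through its ancestors and return
// the first non-empty heading or "description"/"name"-classed block found inside one.
function titleFromContainer(DOMXPath $xpath, DOMElement $img, int $maxLevels = 3): ?string
{
    // Matches h1-h4, or any element whose class list contains "description" or "name".
    $candidates = './/*[self::h1 or self::h2 or self::h3 or self::h4'
        . ' or contains(concat(" ", normalize-space(@class), " "), " description ")'
        . ' or contains(concat(" ", normalize-space(@class), " "), " name ")]';

    $container = $img->parentNode;
    for ($level = 0; $level < $maxLevels && $container instanceof DOMElement; $level++) {
        foreach ($xpath->query($candidates, $container) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                return $text;
            }
        }
        $container = $container->parentNode; // nothing here, try one level up
    }
    return null;
}

// Made-up sample markup.
$html = '<html><body>'
      . '<div class="post"><div class="thumb"><img src="a.jpg"></div>'
      . '<p class="description">A red bicycle</p></div>'
      . '<div class="post"><h3>Blue boat</h3><a href="#"><img src="b.jpg"></a></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$found = [];
foreach ($xpath->query('//img[@src]') as $img) {
    $found[$img->getAttribute('src')] = titleFromContainer($xpath, $img);
}
print_r($found);
```

The level limit matters: climbing all the way to <body> would simply return the first heading on the page, which is rarely the one that belongs to the image.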

Title fallbacks:

  • Open Graph
  • <title>
  • ALT
  • META
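A fallback chain like this might look as follows. It is a sketch under assumptions: the sources are tried in the order listed (the list does not say whether that is the intended priority), "META" is taken to mean the meta description tag, and the names loadXPath and fallbackTitle are hypothetical.

```php
<?php
// Hypothetical helper: parse an HTML string and return an XPath object for it.
function loadXPath(string $html): DOMXPath
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    return new DOMXPath($dom);
}

// Hypothetical helper: return the first non-empty fallback, or null if all are empty.
function fallbackTitle(DOMXPath $xpath, DOMElement $img): ?string
{
    $candidates = [
        $xpath->evaluate('string(//meta[@property="og:title"]/@content)'), // Open Graph
        $xpath->evaluate('string(//title)'),                               // <title>
        $img->getAttribute('alt'),                                         // ALT
        $xpath->evaluate('string(//meta[@name="description"]/@content)'),  // META
    ];
    foreach ($candidates as $candidate) {
        $candidate = trim($candidate);
        if ($candidate !== '') {
            return $candidate;
        }
    }
    return null;
}

// Made-up pages: one with Open Graph data, one with only an alt attribute.
$withOg = loadXPath('<html><head><title>Page</title>'
    . '<meta property="og:title" content="OG headline"></head>'
    . '<body><img src="a.jpg" alt="Alt text"></body></html>');
$altOnly = loadXPath('<html><head></head>'
    . '<body><img src="c.jpg" alt="A cat"></body></html>');

$first  = fallbackTitle($withOg, $withOg->query('//img')->item(0));
$second = fallbackTitle($altOnly, $altOnly->query('//img')->item(0));
print_r([$first, $second]);
```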

Question: how would you approach this? Perhaps with DOMDocument and XPath?


DOMDocument with XPath seems like a reasonable way to do this. Something along these lines, in pseudocode:

i = 1  // XPath positions are 1-based

while ((//img)[i][@src])
  if ((//img)[i][@alt])
    title[i] = alt
  else if ((//img)[i][@description])
    title[i] = description
  else if ((//img)[i]/../p[1])
    title[i] = p
  else
    title[i] = (//title)

  i++
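The pseudocode above could be turned into runnable PHP roughly as follows, using the built-in DOMDocument and DOMXPath classes. The function name titlesForImages and the sample page are made up for illustration, and the "description" attribute is carried over from the pseudocode as-is even though it is not a standard HTML attribute.

```php
<?php
// Hypothetical helper: map each image src to the first available title,
// trying alt, then a "description" attribute, then a sibling <p>, then <title>.
function titlesForImages(string $html): array
{
    libxml_use_internal_errors(true); // suppress warnings for sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $pageTitle = trim($xpath->evaluate('string(//title)'));
    $result = [];

    foreach ($xpath->query('//img[@src]') as $img) {
        $alt         = trim($img->getAttribute('alt'));
        $description = trim($img->getAttribute('description'));
        // First <p> that shares the image's parent element.
        $paragraph   = trim($xpath->evaluate('string(../p[1])', $img));

        if ($alt !== '') {
            $title = $alt;
        } elseif ($description !== '') {
            $title = $description;
        } elseif ($paragraph !== '') {
            $title = $paragraph;
        } else {
            $title = $pageTitle;
        }
        $result[$img->getAttribute('src')] = $title;
    }
    return $result;
}

// Made-up sample page.
$sample = '<html><head><title>Gallery</title></head><body>'
        . '<div><img src="a.jpg" alt="Sunset"></div>'
        . '<div><img src="b.jpg"><p>Harbour at dawn</p></div>'
        . '<div><img src="c.jpg"></div>'
        . '</body></html>';

$result = titlesForImages($sample);
print_r($result);
```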

A helper function for running XPath queries:

function ph_DOM($html, $xpath = null)
{
    if (is_object($html) === true)
    {
        if (isset($xpath) === true)
        {
            $html = $html->xpath($xpath);
        }

        return $html;
    }

    else if (is_string($html) === true)
    {
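        $dom = new DOMDocument();

        // Collect libxml parse warnings internally instead of emitting them;
        // if that was already enabled, discard any errors left from earlier calls.
        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }

        // Note: ph() is not a PHP built-in. Judging by its name, the helper converts
        // multibyte characters to HTML entities; substitute your own equivalent.
        if ($dom->loadHTML(ph()->Text->Unicode->mb_html_entities($html)) === true)
        {
            // Convert to SimpleXML and recurse so the object branch runs the query.
            return ph_DOM(simplexml_import_dom($dom), $xpath);
        }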
    }

    return false;
}

Usage:

$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');

print_r(ph_DOM($html, '//img')); // gets all images
print_r(ph_DOM($html, '//img[@src]')); // gets all images that have a src
print_r(ph_DOM($html, '//img[@src]/..')); // gets the parent element of each image that has a src
print_r(ph_DOM($html, '//img[@src]/../..')); // gets the grandparent element, and so on...
print_r(ph_DOM($html, '//title')); // gets the title of the page