Scrape only x number of characters - how?

BACKGROUND

I have a website that indexes all Danish psychologists. My site provides contact information for all clinics, as well as user ratings.

Currently, I list 12,000 psychologists, of whom about 6,000 have a website. About 1,000 psychologists have visited my site and filled out their profile with additional “descriptive” information (for example, opening hours, prices, etc.).

I am trying to automatically scrape (with PHP and regex) the sites of those who did not provide these details, so that my community still gets some information about them.

I looked at about 150 random sites and came to the conclusion that more than 85% of them have valuable text following the word "Velkommen" (= welcome, in Danish). Precious!

QUESTIONS

# 1

In my script I would like to capture only approx. 360 characters and nothing more. Of course, the captured text should start at (and include) the word Velkommen. In addition, the match should not be case sensitive (although Velkommen is usually written with a capital V, it may appear in the middle of a sentence).

It should also be the last “velkommen” on the entire front page, since the word sometimes appears as a menu / navigation option, and then I would end up capturing navigation options instead.

# 2

Currently, my script stores the information in arrays and then in the database.

I don’t know how I should do this. What would be optimal for SEO?

  • Save the scraped text in MySQL and display it every time.
  • Display the same 360-character text each time [what follows Velkommen].

My current script:

$web = "http://www.psykologdorthelau.dk/";
$website = file_get_contents($web);

preg_match_all("/velkommen.+?/sim", $website, $information);

// THIS SHOULD SPECIFY THE VERY LAST 'VELKOMMEN' - it doesn't, I know :(
for ($i = 0; $i < count($information[0]); $i++) {
    preg_match_all("/Velkommen (.+?)\"/sim", $information[0][$i], $text, PREG_SET_ORDER);

    $psychologist[$i]['text'] = mysql_real_escape_string($text[0][1]);
}
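
A side note on the storage step: the mysql_* functions used above were deprecated in PHP 5.5 and removed in PHP 7.0. Below is a minimal sketch of the same "save the scraped text" step with PDO and a prepared statement. The table name, the column names and the sample text are made up for illustration, and an in-memory SQLite database (needs the pdo_sqlite extension) stands in for MySQL only so that the snippet runs on its own. With MySQL you would pass a mysql: DSN plus credentials to the PDO constructor instead.

```php
<?php
// Assumption: in-memory SQLite replaces MySQL so the sketch is self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table: one scraped snippet per website URL.
$pdo->exec('CREATE TABLE psychologist_text (url TEXT PRIMARY KEY, snippet TEXT)');

// Placeholders keep the scraped text out of the SQL string,
// so no manual escaping is needed.
$insert = $pdo->prepare(
    'INSERT INTO psychologist_text (url, snippet) VALUES (:url, :snippet)'
);
$insert->execute([
    ':url'     => 'http://www.psykologdorthelau.dk/',
    ':snippet' => "Velkommen til min praksis. It's \"quoted\" text.",
]);

// Read the snippet back, again through a prepared statement.
$select = $pdo->prepare('SELECT snippet FROM psychologist_text WHERE url = :url');
$select->execute([':url' => 'http://www.psykologdorthelau.dk/']);
$stored = $select->fetchColumn();

echo $stored, "\n";
```

Quotes in the scraped text survive the round trip unchanged, which is the job mysql_real_escape_string was doing in the script above.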


You can read the page in small chunks and stop as soon as you have enough characters:

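// Open a file stream (URLs only work when allow_url_fopen is enabled)
$handle = fopen("http://www.example.com/", "r");
if ($handle === false) {
    die("Could not open the URL");
}
// Fetch, for example, only 10 bytes each time we check
$chunkSize = 10;
$contents = "";
// Stop at the end of the stream or once 360 bytes have been collected
while (!feof($handle) && strlen($contents) < 360) {
    $buffer = fread($handle, $chunkSize);
    if ($buffer === false) {
        break; // read error
    }
    $contents .= $buffer;
}
$status = fclose($handle);

// your data is stored in $contents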
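
Note that the loop above stops after the first 360 bytes of the page, which on most sites is still markup rather than the welcome text. As a rough sketch of what the question asks for (the last case-insensitive "velkommen" and 360 characters from there), something like the following might work once the whole page has been fetched. The helper name extractAfterLastWord is made up, the input is assumed to be UTF-8, and the mbstring extension is required.

```php
<?php
// Hypothetical helper (not from the original posts): returns up to $length
// characters starting at the LAST case-insensitive occurrence of $word,
// or null when the word does not occur.
function extractAfterLastWord(string $html, string $word = 'velkommen', int $length = 360): ?string
{
    // Drop tags and decode entities so markup is not counted as text.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace; fall back to the raw text if the
    // input is not valid UTF-8 and preg_replace() returns null.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // mb_strripos() searches backwards and ignores case.
    $pos = mb_strripos($text, $word, 0, 'UTF-8');
    if ($pos === false) {
        return null;
    }

    // mb_substr() counts characters rather than bytes (matters for æ, ø, å).
    return mb_substr($text, $pos, $length, 'UTF-8');
}

$html = '<ul><li>Velkommen</li></ul><p>VELKOMMEN til min praksis i Aarhus.</p>';
echo extractAfterLastWord($html), "\n"; // prints: VELKOMMEN til min praksis i Aarhus.
```

The navigation entry is skipped because only the last occurrence is used. If a site repeats "velkommen" in its footer, this would pick the footer instead, so the result probably still needs a sanity check.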

To match the text that follows “velkommen”:

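// i = case-insensitive, s = let the dot match newlines as well
preg_replace_callback('/velkommen(.{0,360})/is',
  function ($matched) {
    // $matched[1] holds up to 360 bytes that follow "velkommen";
    // use it to perform further testing
    return $matched[0]; // return the match unchanged
  },
  $contents
);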

, . PHP 5.4. .
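
If only the last "velkommen" on the page should match, one possible variation is a negative lookahead that rejects every occurrence followed by another one. This is a sketch, not a tested production pattern: the sample $contents string is invented, and the page text is assumed to be valid UTF-8 (the u flag makes the limit count characters instead of bytes).

```php
<?php
// Assumption: sample text standing in for the fetched page.
$contents = 'Velkommen | Priser | Kontakt ... Velkommen til klinikken. Her kan du læse mere.';

// (?!.*velkommen) fails wherever another "velkommen" appears later on,
// so only the last occurrence can match.
// The word itself is 9 characters, so up to 351 more gives 360 in total.
// Flags: i = ignore case, s = dot matches newlines, u = UTF-8 mode.
$pattern = '/velkommen(?!.*velkommen).{0,351}/isu';

$snippet = preg_match($pattern, $contents, $m) ? $m[0] : null;

echo $snippet, "\n"; // prints: Velkommen til klinikken. Her kan du læse mere.
```

On very large pages the lookahead has to scan the rest of the document at each candidate, so plain string functions may be the cheaper choice there.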
