How to detect a modified web page?

In my application, I fetch web pages periodically using LWP. Is there any way to check whether a page has changed between two consecutive fetches, other than comparing the contents explicitly? Is there a signature (e.g. a CRC) generated at a lower protocol level that can be extracted and compared with older signatures to detect changes?

2 answers

There are two possible approaches. One is to compute a digest of the page, for example:

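use strict;
use warnings;

use Digest::MD5 'md5_hex';
use LWP::UserAgent;

# $url and $saved_digest (the digest from the previous fetch) are assumed
# to be defined elsewhere.
my $ua       = LWP::UserAgent->new;
my $response = $ua->get($url);
die $response->status_line unless $response->is_success;

# Digest the raw bytes; md5_hex dies on wide characters in decoded text.
my $digest = md5_hex( $response->content );

if ( $digest ne $saved_digest ) {
    # the page has changed.
}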
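
For the comparison to work across runs, the previous digest has to be stored somewhere. Here is a minimal sketch, assuming a plain file is good enough as storage. The helper name content_changed and the digest file are made up for illustration and are not part of LWP or Digest::MD5:

```perl
use strict;
use warnings;

use Digest::MD5 'md5_hex';

# Hypothetical helper: returns 1 if $bytes differs from what was seen on the
# previous call (recording the new digest in $digest_file), 0 otherwise.
sub content_changed {
    my ( $bytes, $digest_file ) = @_;

    my $digest = md5_hex($bytes);

    # Read the digest saved by the previous run, if there is one.
    my $saved = '';
    if ( open my $in, '<', $digest_file ) {
        $saved = <$in> // '';
        chomp $saved;
        close $in;
    }

    return 0 if $digest eq $saved;

    # Remember the new digest for the next run.
    open my $out, '>', $digest_file
        or die "Cannot write $digest_file: $!";
    print {$out} "$digest\n";
    close $out;

    return 1;
}
```

It would be called as content_changed( $response->content, $digest_file ). The very first call reports a change, because there is no saved digest yet.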

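The other is to use the HTTP ETag, if the server provides one for the requested resource. Save it, then send it in the If-None-Match header of subsequent requests. If the ETag still matches, the server responds with 304 Not Modified and no body. (The server has to support ETag.) See RFC2616 for details.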
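
The ETag round trip can be sketched with the HTTP::Request and HTTP::Response classes that LWP uses. The function names conditional_request and check_response are hypothetical, and where the saved ETag is stored between runs is left to the caller:

```perl
use strict;
use warnings;

use HTTP::Request;
use HTTP::Response;

# Build a GET that asks the server to skip the body if the ETag still matches.
# $etag is the value saved from an earlier response (undef on the first fetch).
sub conditional_request {
    my ( $url, $etag ) = @_;
    my $request = HTTP::Request->new( GET => $url );
    $request->header( 'If-None-Match' => $etag ) if defined $etag;
    return $request;
}

# Interpret the response: returns ( $changed, $etag_to_save ).
sub check_response {
    my ( $response, $old_etag ) = @_;

    # 304 Not Modified: keep the ETag we already have.
    return ( 0, $old_etag ) if $response->code == 304;

    die "Request failed: " . $response->status_line . "\n"
        unless $response->is_success;

    # Fresh content: remember the new ETag (undef if the server sent none).
    return ( 1, scalar $response->header('ETag') );
}
```

The request would be sent with $ua->request( conditional_request( $url, $saved_etag ) ) and the result passed to check_response.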

Not every server sends an ETag, though. If it is missing, the digest comparison is the fallback.


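You can also use the If-Modified-Since request header, but note the gotchas in the RFC. You send the header with a date. If the server honours it and the page has not changed since that date, it responds with 304 instead of sending the page again.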

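Keep in mind that a server is free to ignore the header or to report an inaccurate modification time, especially for dynamically generated pages. In those cases you still have to download the page and compare it yourself.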

With LWP you can set the header yourself:

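use HTTP::Request;
use LWP::UserAgent;

my $ua      = LWP::UserAgent->new;
my $request = HTTP::Request->new( GET => $url );

# $time must be an HTTP date string, e.g. from HTTP::Date::time2str( $epoch ).
$request->header( 'If-Modified-Since' => $time );

my $response = $ua->request( $request );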

Or with a request handler:

$ua->add_handler(
    request_send => sub {
        my ( $request, $ua, $h ) = @_;
        # ... look up time from local store (as an HTTP date string)
        $request->header( 'If-Modified-Since' => $time );
        return;    # return nothing so that LWP goes on to send the request
    }
);

LWP also has mirror, which does this for you:

$ua->mirror( $url, $filename );
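
mirror returns an HTTP::Response, so its status code tells you whether the local copy was refreshed. A small sketch, where the wrapper name mirror_changed is made up for illustration:

```perl
use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical wrapper around mirror(): returns 1 if $filename was refreshed,
# 0 if the server answered 304 Not Modified, and dies on any other failure.
sub mirror_changed {
    my ( $ua, $url, $filename ) = @_;

    # mirror() sends If-Modified-Since based on the local file's timestamp.
    my $response = $ua->mirror( $url, $filename );

    return 0 if $response->code == 304;
    return 1 if $response->is_success;
    die "Could not mirror $url: " . $response->status_line . "\n";
}
```

It would be called as mirror_changed( LWP::UserAgent->new, $url, $filename ). The same caveat applies as for If-Modified-Since in general: the result is only as reliable as the server's modification times.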