Is there a library that can do something like http://tool.motoricerca.info/similarity-analyzer.phtml?
The results include something called an "HTML fingerprint", a percentage value indicating how likely it is that two pages are structurally similar.
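
To make the question concrete, here is a rough sketch of the kind of comparison I have in mind. I don't know how that tool actually computes its fingerprint, so this is only an assumption: it reduces each page to its sequence of opening tag names using Python's standard `html.parser`, then compares the two sequences with `difflib.SequenceMatcher`. The names `TagCollector`, `tag_sequence`, and `structural_similarity` are made up for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records opening tag names in document order; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Return a 0-100 score for how similar the two tag sequences are.

    This is an assumed stand-in for an "HTML fingerprint", not the
    method used by the linked tool.
    """
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return 100.0 * matcher.ratio()


if __name__ == "__main__":
    page_a = "<html><body><div><p>Hello</p></div></body></html>"
    page_b = "<html><body><div><p>Something else</p></div></body></html>"
    page_c = "<html><body><ul><li>x</li><li>y</li></ul></body></html>"
    print(structural_similarity(page_a, page_b))  # same tags, different text
    print(structural_similarity(page_a, page_c))  # different structure
```

This ignores nesting depth, attributes, and CSS classes, so a library that handles those properly is what I'm really after.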