
Untitled

a guest
Apr 8th, 2011
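/**
 * Checks whether the rendered text of two URLs is more than $min_similarity percent similar.
 * Requires shell access on a Unix-like OS with w3m and wdiff installed.
 * @param string $url first url
 * @param string $other_url other url
 * @param int $min_similarity an integer between 0 and 100, as a percentage
 * @return bool true if the average similarity exceeds $min_similarity
 */
function urlEqualsUrl($url, $other_url, $min_similarity = 80) {

    $dir = "/tmp/equ_".substr(uniqid(),6);
    mkdir($dir);

    // escapeshellarg() stops quotes or shell metacharacters in a URL from being run as commands
    $cmd = "w3m -dump ".escapeshellarg($url)." 2>/dev/null > $dir/1.html;
    w3m -dump ".escapeshellarg($other_url)." 2>/dev/null > $dir/2.html;
    wdiff -nis $dir/1.html $dir/2.html | tail -2 | awk '{print $5}'";

    // the last two lines of wdiff -s are per-file statistics; field 5 is the "NN%" common figure
    $percentages = explode(PHP_EOL, trim((string) `$cmd`));

    // remove the temporary dumps and their directory
    array_map('unlink', glob("$dir/*.html"));
    rmdir($dir);

    // an (int) cast drops the trailing "%"; missing output counts as 0
    $percentage = ((int) ($percentages[0] ?? 0) + (int) ($percentages[1] ?? 0)) / 2;

    return $percentage > $min_similarity;

}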
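
urlEqualsUrl() relies on `tail -2 | awk '{print $5}'` to pull the "common" percentage out of the wdiff statistics. As a rough sketch, the same step could be done in PHP instead. Both helper names below are hypothetical and not part of the original paste, and the statistics line format shown in the comment is an assumption about wdiff's `-s` output that should be checked against the installed version.

```php
<?php
// Hypothetical helper, not part of the original paste.
// Extracts the "common" percentage from one wdiff statistics line, assumed
// to look like: "1.html: 10 words  8 80% common  1 10% deleted  1 10% changed".
function wdiffCommonPercent(string $statsLine): ?int {
    if (preg_match('/(\d+)%\s+common/', $statsLine, $m) === 1) {
        return (int) $m[1];
    }
    return null; // line did not match the assumed format
}

// Hypothetical helper: averages the "common" percentages of the lines
// that could be parsed, or returns null if none could.
function averageCommonPercent(array $statsLines): ?float {
    $values = array_filter(array_map('wdiffCommonPercent', $statsLines), 'is_int');
    if (count($values) === 0) {
        return null;
    }
    return array_sum($values) / count($values);
}
```

Matching on the word "common" rather than on a field position means a malformed or empty line yields null instead of being silently treated as a number.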