How To Extract All URLs From A Page Using PHP

Recently I needed a crawler script that would create a list of all pages on a single domain. As a part of that I wrote some functions that could download a page, extract all URLs from the HTML and turn them into absolute URLs (so that they themselves can be crawled later). Here’s the PHP code.

Extracting All Links From A Page
Here’s a function that will download the specified URL and extract all links from the HTML. It also translates relative URLs to absolute URLs, tries to remove repeated links and is overall a fine piece of code 🙂 Depending on your goal you may want to comment out some lines (e.g. the part that strips ‘#something’ (in-page links) from URLs).

function crawl_page($page_url, $domain) {

/* $page_url - page to extract links from, $domain -

crawl only this domain (and subdomains)

Returns an array of absolute URLs or false on failure.

*/

/* I'm using cURL to retrieve the page */

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $page_url);

curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

/* Spoof the User-Agent header value; just to be safe */

curl_setopt($ch, CURLOPT_USERAGENT,

'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

/* I set timeout values for the connection and download

because I don't want my script to get stuck

downloading huge files or trying to connect to

a nonresponsive server. These are optional. */

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

curl_setopt($ch, CURLOPT_TIMEOUT, 15);

/* This ensures 404 Not Found (and similar) will be

treated as errors */

curl_setopt($ch, CURLOPT_FAILONERROR, true);

/* This might/should help against accidentally

downloading mp3 files and such, but it

doesn't really work :/ */

$header[] = "Accept: text/html, text/*";

curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

/* Download the page */

$html = curl_exec($ch);

curl_close($ch);

if(!$html) return false;

/* Extract the BASE tag (if present) for

relative-to-absolute URL conversions later */

if(preg_match('/]+)[\'\" >]/i',$html, $matches)){

$base_url=$matches[1];

} else {

$base_url=$page_url;

}

$links=array();

$html = str_replace("\n", ' ', $html);

preg_match_all('/]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $html, $m);

/* this regexp is a combination of numerous

versions I saw online; should be good. */

foreach($m[2] as $url) {

$url=trim($url);

/* get rid of PHPSESSID, #linkname, & and javascript: */

$url=preg_replace(

array('/([\?&]PHPSESSID=\w+)$/i','/(#[^\/]*)$/i', '/&/','/^(javascript:.*)/i'),

array('','','&',''),

$url);

/* turn relative URLs into absolute URLs.

relative2absolute() is defined further down

below on this page. */

$url = relative2absolute($base_url, $url);

// check if in the same (sub-)$domain

if(preg_match("/^http[s]?:\/\/[^\/]*".str_replace('.', '\.', $domain)."/i", $url)) {

//save the URL

if(!in_array($url, $links)) $links[]=$url;

}

return $links;

}

How To Translate a Relative URL to an Absolute URL
This script is based on a function I found on the web with some small but significant changes.

 
  function relative2absolute($absolute, $relative) {
         $p = @parse_url($relative);
         if(!$p) {
             //$relative is a seriously malformed URL
             return false;
         }
         if(isset($p["scheme"])) return $relative;
  
         $parts=(parse_url($absolute));
  
         if(substr($relative,0,1)=='/') {
             $cparts = (explode("/", $relative));
             array_shift($cparts);
         } else {
             if(isset($parts['path'])){
                  $aparts=explode('/',$parts['path']);
                  array_pop($aparts);
                  $aparts=array_filter($aparts);
             } else {
                  $aparts=array();
             }
            $rparts = (explode("/", $relative));
            $cparts = array_merge($aparts, $rparts);
            foreach($cparts as $i => $part) {
                 if($part == '.') {
                     unset($cparts[$i]);
                 } else if($part == '..') {
                     unset($cparts[$i]);
                     unset($cparts[$i-1]);
                 }
             }
         }
         $path = implode("/", $cparts);
  
         $url = '';
         if($parts['scheme']) {
             $url = "$parts[scheme]://";
         }
         if(isset($parts['user'])) {
             $url .= $parts['user'];
             if(isset($parts['pass'])) {
                 $url .= ":".$parts['pass'];
             }
             $url .= "@";
         }
         if(isset($parts['host'])) {
             $url .= $parts['host']."/";
         }
         $url .= $path;
  
         return $url;
 }

Post Views: 1,438

How To Force File Download With PHP

Checking If Page Contains a Link In PHP

You may also like

Leave a Comment Cancel Reply