Create a PHP web crawler or scraper in 5 minutes
June 4, 2009
Credits: http://vision-media.ca/resources/php/create-a-php-web-crawler-or-scraper-5-minutes
Using PHP, we show you how to create an easily extensible web crawler in under 5 minutes that collects images and links.
The Crawler Framework
First we need to create the crawler class as follows:
<?php
class Crawler {
}
?>
Next we create methods to fetch a web page's markup and to parse it for the data we want to collect. Apart from the constructor, the only public methods are getMarkup() and get(). The parsing methods are meant for internal use by the crawler, but their visibility is set to protected rather than private, since you never know who will want to extend its functionality.
<?php
class Crawler {
    protected $markup = '';
    public function __construct($uri) {
    }
    public function getMarkup($uri) {
    }
    public function get($type) {
    }
    protected function _get_images() {
    }
    protected function _get_links() {
    }
}
?>
Fetching Site Markup
The constructor accepts a URI, so we can instantiate the class with new Crawler('http://vision-media.ca');. It then sets our $markup property using PHP's file_get_contents() function, which fetches the site's markup.
<?php
// These methods belong inside the Crawler class.
public function __construct($uri) {
    $this->markup = $this->getMarkup($uri);
}
public function getMarkup($uri) {
    return file_get_contents($uri);
}
?>
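Note that file_get_contents() returns FALSE and emits a warning when a request fails, so the constructor would then store FALSE in $markup. As a minimal sketch, assuming a hypothetical helper named fetch_markup() and an arbitrary 10-second timeout (neither is part of the original class), the fetch could be guarded like this:

```php
<?php
// Hypothetical helper, not part of the original Crawler class.
// Returns the markup as a string, or FALSE when the fetch fails.
function fetch_markup($uri, $timeout = 10) {
    // The 'http' options only apply to http:// and https:// URIs.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout),
    ));
    // Suppress the warning; failure is reported through the return value.
    $markup = @file_get_contents($uri, false, $context);
    return ($markup === false || $markup === '') ? false : $markup;
}
```

A caller can then test the result with === false before handing the markup to the parsing methods.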
Crawling The Markup For Data
Our get() method accepts a $type string, which is simply used to invoke another method that does the actual processing. As you can see below, we construct the method name as a string and make sure it exists before calling it, so developers can use the crawler simply by invoking $crawl->get('images');
We set visibility for _get_images() and _get_links() to protected so that developers will use our public get() method rather than getting confused and trying to invoke them directly.
Each protected data collection method simply uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to find all tags within the markup that match our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Each method returns the attribute text captured by the first group, or FALSE when nothing matches. Note that the image pattern only matches self-closing <img ... /> tags. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression
<?php
// These methods belong inside the Crawler class.
public function get($type) {
    // Build the method name, e.g. "images" becomes "_get_images".
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
        // call_user_method() no longer exists in PHP 7+; call the method directly.
        return $this->$method();
    }
}
protected function _get_images() {
    if (!empty($this->markup)) {
        preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
        return !empty($images[1]) ? $images[1] : FALSE;
    }
}
protected function _get_links() {
    if (!empty($this->markup)) {
        preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
        return !empty($links[1]) ? $links[1] : FALSE;
    }
}
?>
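Both methods return the raw attribute text captured by the first group (for example ` src="logo.png" alt="Logo" `), not the URLs themselves. As a rough sketch, a hypothetical helper named extract_attribute() (not part of the original article) could pull a single attribute value out of each captured string, assuming the value is quoted:

```php
<?php
// Hypothetical helper: returns the value of one quoted attribute from a raw
// attribute string such as ' src="logo.png" alt="Logo" ', or FALSE if absent.
function extract_attribute($attributes, $name) {
    $pattern = '/(?:^|\s)' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/i';
    if (preg_match($pattern, $attributes, $match)) {
        return $match[2];
    }
    return false;
}

// Sample captured strings, shaped like the output of get('images').
$images = array(' src="logo.png" alt="Logo" ', " alt='No source' ");
$sources = array();
foreach ($images as $attributes) {
    $src = extract_attribute($attributes, 'src');
    if ($src !== false) {
        $sources[] = $src;
    }
}
// $sources now holds array('logo.png')
```

The same helper works for links by asking for 'href' instead of 'src'. Unquoted attribute values are not handled by this pattern.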
Final PHP Web Crawler Code And Usage
<?php
class Crawler {
    protected $markup = '';
    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }
    public function getMarkup($uri) {
        return file_get_contents($uri);
    }
    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            // call_user_method() no longer exists in PHP 7+; call the method directly.
            return $this->$method();
        }
    }
    protected function _get_images() {
        if (!empty($this->markup)) {
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }
    protected function _get_links() {
        if (!empty($this->markup)) {
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
}

// Usage
$crawl = new Crawler('http://vision-media.ca');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
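Because get() builds the method name from its argument, a subclass can add a new data type simply by defining another _get_*() method. The sketch below assumes a hypothetical TitleCrawler subclass with a _get_titles() method, neither of which appears in the original article. A condensed stand-in for Crawler is included, and getMarkup() is overridden to return fixed markup, only so that the snippet runs on its own without a network request.

```php
<?php
// Condensed stand-in for the Crawler class (image and link parsing omitted).
class Crawler {
    protected $markup = '';
    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }
    public function getMarkup($uri) {
        return file_get_contents($uri);
    }
    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            return $this->$method();
        }
    }
}

// Hypothetical subclass adding a "titles" type.
class TitleCrawler extends Crawler {
    // Overridden for this demo only, so no HTTP request is made.
    public function getMarkup($uri) {
        return '<html><head><title>Example page</title></head></html>';
    }
    protected function _get_titles() {
        if (!empty($this->markup)) {
            preg_match_all('/<title>(.*?)<\/title>/is', $this->markup, $titles);
            return !empty($titles[1]) ? $titles[1] : false;
        }
        return false;
    }
}

$crawl = new TitleCrawler('http://vision-media.ca');
$titles = $crawl->get('titles'); // dispatched to _get_titles()
```

No change to get() is needed: method_exists() finds the new protected method on the subclass instance.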
add type for html
March 18, 2008
RemoveHandler .html .htm
AddType application/x-httpd-php .html .htm
After adding this to your .htaccess file, Apache processes .html and .htm files as PHP, so they can contain PHP code.
GET URL variables
March 12, 2008
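<?php
// Builds a query string from the current $_GET values, overridden by the
// key/value pairs passed as arguments: setUrlVariables('page', 2, 'sort', '').
function setUrlVariables() {
    $arg = array();
    $string = "?";
    $vars = $_GET;
    $args = func_get_args();
    // Arguments come in key, value pairs.
    for ($i = 0; $i + 1 < count($args); $i += 2)
        $arg[$args[$i]] = $args[$i + 1];
    foreach (array_keys($arg) as $key)
        $vars[$key] = $arg[$key];
    // Skip empty values, so passing "" removes a variable.
    foreach (array_keys($vars) as $key)
        if ($vars[$key] != "") $string .= $key . "=" . $vars[$key] . "&";
    // Append the session ID when it is not already in the URL.
    if (defined('SID') && SID != "" && empty($_GET["PHPSESSID"]))
        $string .= SID . "&";
    // Drop the trailing separator and escape the result for HTML output.
    return htmlspecialchars(substr($string, 0, -1));
}
echo setUrlVariables();
?>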
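For comparison, the same idea (merge overrides into the current query parameters and drop empty values) can be sketched with PHP's built-in http_build_query(), which also URL-encodes keys and values. The function name build_query_string() and its array-based signature are assumptions made for this illustration, and the session ID handling from the function above is left out:

```php
<?php
// Hypothetical alternative: $current is typically $_GET, and $overrides maps
// parameter names to new values (an empty string removes the parameter).
function build_query_string(array $current, array $overrides) {
    $vars = array_merge($current, $overrides);
    // Drop parameters whose value is an empty string or NULL.
    $vars = array_filter($vars, function ($value) {
        return $value !== '' && $value !== null;
    });
    if (empty($vars)) {
        return '';
    }
    // Escape for safe output inside an HTML attribute such as href.
    return htmlspecialchars('?' . http_build_query($vars));
}

echo build_query_string(
    array('page' => '2', 'sort' => 'name'),
    array('page' => '3', 'sort' => '')
);
// prints "?page=3"
```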
PHP get current page
March 10, 2008
<?php
$currentFile = $_SERVER["PHP_SELF"];
$parts = explode('/', $currentFile);
echo $parts[count($parts) - 1];
?>
This is equivalent to:
<?php
echo basename($_SERVER['PHP_SELF']);
?>
