Create a PHP web crawler or scraper in 5 minutes

Credits: http://vision-media.ca/resources/php/create-a-php-web-crawler-or-scraper-5-minutes

Using PHP, we show you how to create an easily extendable web crawler in under 5 minutes that collects images and links.

The Crawler Framework

First we need to create the crawler class as follows:

<?php

class Crawler {

}

?>

Next we create methods to fetch a web page's markup and to parse it for the data we want to collect. The only public methods will be getMarkup() and get(). The parsing methods are meant for the crawler's internal use, but their visibility is set to protected rather than private so that anyone who extends the class can build on them.

<?php

class Crawler {

    // Raw markup of the fetched page.
    protected $markup = '';

    public function __construct($uri) {
    }

    public function getMarkup($uri) {
    }

    public function get($type) {
    }

    protected function _get_images() {
    }

    protected function _get_links() {
    }

}

?>

Fetching Site Markup

The constructor accepts a URI, so we can instantiate the class with new Crawler('http://vision-media.ca'); It then sets our $markup property using PHP's file_get_contents() function, which fetches the site's markup (fetching a URL this way requires the allow_url_fopen setting to be enabled).

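<?php

// These methods belong inside the Crawler class.

public function __construct($uri) {
    $this->markup = $this->getMarkup($uri);
}

public function getMarkup($uri) {
    // Returns the markup as a string, or FALSE on failure.
    return file_get_contents($uri);
}

?>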
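file_get_contents() emits a warning and returns FALSE when a fetch fails, so it can help to wrap the call. The following is a minimal sketch, not part of the original class: fetch_markup() is a hypothetical helper, and the timeout and user agent values are placeholder assumptions. A local temporary file stands in for a real URI so the example runs without network access.

```php
<?php

/**
 * Hypothetical helper: fetch markup from a URI and return FALSE on failure
 * (or on an empty response) instead of emitting a warning.
 */
function fetch_markup($uri, $timeout = 10) {
    // The 'http' options only apply to http:// and https:// URIs.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeout,              // seconds (assumed value)
            'user_agent' => 'ExampleCrawler/0.1',  // placeholder value
        ),
    ));

    // Suppress the warning and rely on the return value instead.
    $markup = @file_get_contents($uri, false, $context);

    return ($markup === false || $markup === '') ? false : $markup;
}

// Usage: a temporary file stands in for a real page.
$file = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($file, '<p>Hello</p>');

var_dump(fetch_markup($file));               // the file's contents
var_dump(fetch_markup($file . '.missing'));  // FALSE, with no warning shown

unlink($file);
```

Inside the class, getMarkup() could delegate to a helper like this so that the parsing methods never receive anything but a string or FALSE.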

Crawling The Markup For Data

Our get() method accepts a $type string, which is used to invoke the method that does the actual processing. As you can see below, we build the method name as a string and check that it exists before calling it, so developers can collect data simply by invoking $crawl->get('images');

We set visibility for _get_images() and _get_links() to protected so that developers will use our public get() method rather than getting confused and trying to invoke them directly.

Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to collect the attributes of every tag in the markup that matches our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Note that the image pattern only matches self-closing tags such as <img ... />. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression

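<?php

// These methods belong inside the Crawler class.

public function get($type) {
    // Build the method name, e.g. "images" becomes "_get_images".
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
        // call_user_method() was removed in PHP 7, so call the method directly.
        return $this->$method();
    }
}

protected function _get_images() {
    if (!empty($this->markup)) {
        // $images[1] holds the attribute string of each matched <img ... /> tag.
        preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
        return !empty($images[1]) ? $images[1] : FALSE;
    }
}

protected function _get_links() {
    if (!empty($this->markup)) {
        // $links[1] holds the attribute string of each matched <a> tag.
        preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
        return !empty($links[1]) ? $links[1] : FALSE;
    }
}

?>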
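To make the return values concrete, here is a small sketch that runs the same two patterns against a sample string. The sample markup and the extract_attribute() helper are hypothetical and only serve as illustration. The patterns capture the raw attribute string of each tag, so a second step is needed to get at a single attribute such as href or src.

```php
<?php

// Hypothetical sample markup, used only for illustration.
$markup = '<p><a href="/about" class="nav">About</a> <img src="logo.png" alt="Logo" /></p>';

preg_match_all('/<img([^>]+)\/>/i', $markup, $images);
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $markup, $links);

// Index 1 holds the raw attribute strings; for links, index 2 holds the inner text.
print_r($images[1]);
print_r($links[1]);
print_r($links[2]);

/**
 * Hypothetical helper: pull one quoted attribute value out of an attribute
 * string, or return NULL when the attribute is not present.
 */
function extract_attribute($attributes, $name) {
    $pattern = '/\b' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/i';
    return preg_match($pattern, $attributes, $match) ? $match[2] : null;
}

echo extract_attribute($links[1][0], 'href'), "\n";
echo extract_attribute($images[1][0], 'src'), "\n";
```

Regular expressions are a quick fit for a five-minute crawler, but they only see text, so unusual markup (unquoted attributes, tags split across lines) may slip through.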

Final PHP Web Crawler Code And Usage

<?php

class Crawler {

    // Raw markup of the fetched page.
    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        // Returns the markup as a string, or FALSE on failure.
        return file_get_contents($uri);
    }

    public function get($type) {
        // Build the method name, e.g. "images" becomes "_get_images".
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            // call_user_method() was removed in PHP 7, so call the method directly.
            return $this->$method();
        }
    }

    protected function _get_images() {
        if (!empty($this->markup)) {
            // $images[1] holds the attribute string of each matched <img ... /> tag.
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }

    protected function _get_links() {
        if (!empty($this->markup)) {
            // $links[1] holds the attribute string of each matched <a> tag.
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }

}

$crawl = new Crawler('http://vision-media.ca');

$images = $crawl->get('images');

$links = $crawl->get('links');

?>
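Because get() builds the method name at runtime and the parsing methods are protected, a subclass can add a new data type without modifying get(). The sketch below is one possible way to do that: ScriptCrawler and _get_scripts() are hypothetical names, the base class is a condensed copy of the one above so the example runs on its own, and a local temporary file stands in for a real URI.

```php
<?php

// Condensed copy of the Crawler class, repeated so the example is self-contained.
class Crawler {

    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }

    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            return $this->$method();
        }
    }

}

// Hypothetical subclass: adds a "scripts" type without touching get().
class ScriptCrawler extends Crawler {

    protected function _get_scripts() {
        if (!empty($this->markup)) {
            // $scripts[1] holds the attribute string of each opening <script> tag.
            preg_match_all('/<script([^>]*)>/i', $this->markup, $scripts);
            return !empty($scripts[1]) ? $scripts[1] : FALSE;
        }
    }

}

// A local temporary file stands in for a real URI here.
$file = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($file, '<script src="app.js"></script><p>Hi</p>');

$crawl = new ScriptCrawler($file);

$scripts = $crawl->get('scripts');  // array of attribute strings
$videos = $crawl->get('videos');    // NULL, since no _get_videos() method exists

unlink($file);
```

Requesting an unknown type returns NULL rather than raising an error, so callers may want to distinguish NULL (unsupported type) from FALSE (nothing found).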
