Create a PHP web crawler or scraper in 5 minutes

Credits: http://vision-media.ca/resources/php/create-a-php-web-crawler-or-scraper-5-minutes

Using PHP, we show you how to build a small, easily extended web crawler in under 5 minutes that collects the images and links on a page.

The Crawler Framework

First we need to create the crawler class as follows:

<?php

class Crawler {

}

?>

Next we add methods to fetch a web page’s markup and to parse it for the data we want to collect. The only public methods are getMarkup() and get(). The parsing methods are meant for internal use by the crawler, but their visibility is set to protected rather than private, since you never know who will want to extend the class.

<?php

class Crawler {

  // Raw markup of the fetched page.
  protected $markup = '';

  public function __construct($uri) {
  }

  public function getMarkup($uri) {
  }

  public function get($type) {
  }

  protected function _get_images() {
  }

  protected function _get_links() {
  }

}

?>

Fetching Site Markup

The constructor accepts a URI, so we can instantiate the class with new Crawler('http://vision-media.ca'); It sets our $markup property using PHP’s file_get_contents() function, which fetches the site’s markup.

<?php

public function __construct($uri) {
  $this->markup = $this->getMarkup($uri);
}

public function getMarkup($uri) {
  return file_get_contents($uri);
}

?>
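
file_get_contents() returns FALSE and raises a warning when a request fails, so the constructor above would store FALSE as the markup without complaint. The following is a minimal sketch of a more defensive fetch. The function name fetch_markup(), the 10-second timeout and the user agent string are assumptions made for illustration and are not part of the original class. Fetching remote URIs this way also assumes allow_url_fopen is enabled in php.ini.

```php
<?php

/**
 * Fetch markup from a URI or local path, returning NULL on failure.
 * fetch_markup() is a hypothetical helper, not part of the Crawler class.
 */
function fetch_markup($uri, $timeout = 10) {
  // These options apply to http:// and https:// URIs only;
  // they are ignored when reading a local file.
  $context = stream_context_create(array(
    'http' => array(
      'timeout'    => $timeout,             // seconds (assumed value)
      'user_agent' => 'ExampleCrawler/0.1', // assumed identifier
    ),
  ));

  // Suppress the warning and inspect the return value instead.
  $markup = @file_get_contents($uri, false, $context);

  return $markup === false ? null : $markup;
}

// Usage with a temporary local file, so the sketch runs without network access.
$path = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($path, '<a href="/about">About</a>');

$found   = fetch_markup($path);                     // the file contents
$missing = fetch_markup($path . '.does-not-exist'); // NULL, nothing to read

unlink($path);
```

Returning NULL on failure lets the caller tell "nothing fetched" apart from an empty page, which the empty() checks in the crawler cannot do.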

Crawling The Markup For Data

Our get() method accepts a $type string, which is used to build the name of the method that does the actual processing. As you can see below, we construct the method name as a string and check that the method exists before calling it, so developers can use the crawler simply by invoking $crawl->get('images');

We set visibility for _get_images() and _get_links() to protected so that developers will use our public get() method rather than getting confused and trying to invoke them directly.

Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to find every tag in the markup that matches our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Each method returns the first capture group of every match, which is the tag’s attribute string, or FALSE if nothing matched. Note that the image pattern only matches self-closing tags that end in />. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression

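<?php

public function get($type) {
  // Build the method name, e.g. "_get_images" for the type "images".
  $method = "_get_{$type}";
  if (method_exists($this, $method)) {
    // call_user_method() was removed in PHP 7; call_user_func() replaces it.
    return call_user_func(array($this, $method));
  }
}

protected function _get_images() {
  if (!empty($this->markup)) {
    preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
    // $images[1] holds the attribute string of each matched tag.
    return !empty($images[1]) ? $images[1] : FALSE;
  }
}

protected function _get_links() {
  if (!empty($this->markup)) {
    preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
    // $links[1] holds the attribute string of each matched tag.
    return !empty($links[1]) ? $links[1] : FALSE;
  }
}

?>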
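
To make the return values concrete, the sketch below runs the same two patterns against a small hard-coded string and then pulls a single attribute out of each captured attribute string. The sample markup and the helper name attribute_value() are assumptions made for illustration. The helper only handles quoted attribute values.

```php
<?php

// Sample markup (assumed for illustration).
$markup = '<p><img src="/logo.png" alt="Logo" /> <a href="/about">About us</a></p>';

// The same patterns the crawler uses.
preg_match_all('/<img([^>]+)\/>/i', $markup, $images);
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $markup, $links);

// Index 1 holds the first capture group: the raw attribute string of each tag.
// Index 2 of $links holds the second capture group: the link text.

/**
 * Extract one quoted attribute value from an attribute string.
 * attribute_value() is a hypothetical helper, not part of the Crawler class.
 */
function attribute_value($attributes, $name) {
  $pattern = '/\b' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/i';
  return preg_match($pattern, $attributes, $match) ? $match[2] : null;
}

$src  = attribute_value($images[1][0], 'src');  // image address
$href = attribute_value($links[1][0], 'href');  // link target
$text = $links[2][0];                           // link text
```

Regular expressions are fragile against real-world HTML. If the patterns prove too strict, PHP’s DOM extension (DOMDocument) is one alternative worth considering.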

Final PHP Web Crawler Code And Usage

<?php

class Crawler {

  // Raw markup of the fetched page.
  protected $markup = '';

  public function __construct($uri) {
    $this->markup = $this->getMarkup($uri);
  }

  public function getMarkup($uri) {
    return file_get_contents($uri);
  }

  public function get($type) {
    // Build the method name, e.g. "_get_images" for the type "images".
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
      // call_user_method() was removed in PHP 7; call_user_func() replaces it.
      return call_user_func(array($this, $method));
    }
  }

  protected function _get_images() {
    if (!empty($this->markup)) {
      preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
      return !empty($images[1]) ? $images[1] : FALSE;
    }
  }

  protected function _get_links() {
    if (!empty($this->markup)) {
      preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
      return !empty($links[1]) ? $links[1] : FALSE;
    }
  }

}

$crawl = new Crawler('http://vision-media.ca');

$images = $crawl->get('images');

$links = $crawl->get('links');

?>
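
Because get() looks methods up by name, a subclass can add a new data type just by defining another protected _get_*() method. The sketch below illustrates this under stated assumptions: the ScriptCrawler class, its _get_scripts() method and its pattern are hypothetical. The Crawler shown here is a condensed stand-in for the class above that keeps only the constructor and get(), and calls the method directly with $this->$method(). A temporary local file stands in for a real URL.

```php
<?php

// Condensed stand-in for the article's class so this sketch runs on its own.
class Crawler {

  protected $markup = '';

  public function __construct($uri) {
    $this->markup = file_get_contents($uri);
  }

  public function get($type) {
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
      return $this->$method();
    }
  }

}

// Hypothetical subclass: adds a "scripts" type without touching get().
class ScriptCrawler extends Crawler {

  protected function _get_scripts() {
    // Capture the src value of every <script> tag that has one.
    preg_match_all('/<script[^>]*\bsrc\s*=\s*["\']([^"\']+)["\']/i', $this->markup, $scripts);
    return !empty($scripts[1]) ? $scripts[1] : FALSE;
  }

}

// Use a temporary local file in place of a real URL.
$path = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($path, '<script src="/js/app.js"></script><script src="/js/vendor.js"></script>');

$crawl   = new ScriptCrawler($path);
$scripts = $crawl->get('scripts'); // dispatched to _get_scripts()
$unknown = $crawl->get('videos');  // no _get_videos() method, so NULL

unlink($path);

// The results are plain arrays; dump them to see what was collected.
print_r($scripts);
```

Since get() returns arrays and prints nothing itself, a page that only calls it stays blank until the results are output, for example with print_r().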

47 thoughts on “Create a PHP web crawler or scraper in 5 minutes”

  1. Inside method get() there is a deprecated php function ‘call_user_method()’

    You can successfully replace this as follows:

    return call_user_method($method, $this);

    becomes:

    return call_user_func(array($this, $method));

  2. Good article to share. I am pretty new to PHP (as well as coding in general) and was wondering if you could help with some questions?

    Is the page supposed to be blank because a front-end needs to be added? I’m a little confused… @.@

    • You’ll need to dump out the information, because it’s returned as an array. Try print_r($links) and print_r($images). That will show you everything within the array, and you can modify your PHP to handle that.

      You’ll also want to update the call_user_method function, as it’s deprecated. Change that to call_user_func(array($this, $method)). That should get you going.

    • preg_match_all(‘/\(.*?)\/i’, $this->markup, $links);

      It is not working…..

      Warning: preg_match_all() [function.preg-match-all]: No ending delimiter ‘/’ found in C:\xampp\htdocs\crawler\index.php on line 97

      this is the warning when i used above RegEx….

  3. sibylsys, fopen does not always work on URLs. For it to work, allow_url_fopen must be enabled in php.ini, which may only be possible if you own the server or your host has enabled the option. It’s also possible to use cURL instead, but I don’t think the semantics will be the same as here.

  4. As stated above, function get has a deprecated function call.
    I went about replacing it as so:

    public function get($type) {
      $method = array($this, "_get_" . $type);
      if (method_exists($this, $method[1]))
        return call_user_func($method);
      return false;
    }

    $method is now an array. The first index contains the object we want to call and the second index contains the function name.
    We now check that the second index (1) exists as a method and if so we call it with call_user_func by passing $method array to the function.

    Happy scripting 😀

  5. Hey, I’m getting an error on line 20. Please help. I copied the same code and tried to run it, but I’m getting errors.

  16. Hi, I’ve tried this code and made some changes, but I still can’t understand this error. On the line $crawl = new Crawler(‘http://vision-media.ca’); it shows: Parse error: syntax error, unexpected ‘:’ in C:\wamp\www\crawler.php on line 61
