Create a PHP web crawler or scraper in 5 minutes
June 4, 2009
Credits: http://vision-media.ca/resources/php/create-a-php-web-crawler-or-scraper-5-minutes
Utilizing the PHP programming language we show you how to create an infinitely extendable web crawler in under 5 minutes, collecting images and links.
The Crawler Framework
First we need to create the crawler class as follows:
<?php
class Crawler {
}
?>
Next we add methods to fetch a web page's markup and to parse it for the data we want to collect. The only public methods are getMarkup() and get(). The parsing methods are meant for the crawler's internal use, but their visibility is set to protected rather than private, since you never know who will want to extend its functionality.
<?php
class Crawler {
    protected $markup = '';
    public function __construct($uri) {
    }
    public function getMarkup($uri) {
    }
    public function get($type) {
    }
    protected function _get_images() {
    }
    protected function _get_links() {
    }
}
?>
Fetching Site Markup
The constructor accepts a URI, so we can instantiate the class with new Crawler('http://vision-media.ca'); it then sets our $markup property using PHP's file_get_contents() function, which fetches the site's markup.
<?php
public function __construct($uri) {
    $this->markup = $this->getMarkup($uri);
}
public function getMarkup($uri) {
    return file_get_contents($uri);
}
?>
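file_get_contents() returns FALSE and raises a warning when a request fails, so a more defensive fetch may be worth considering. The following is only a sketch: fetchMarkup(), the timeout value, and the user agent string are hypothetical names and settings, not part of the original class.

```php
<?php
// Hypothetical helper, not part of the original Crawler class:
// fetch markup and return an empty string when the request fails.
function fetchMarkup($uri, $timeoutSeconds = 10) {
    // The 'http' options only take effect for http:// and https:// URIs.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleCrawler/0.1', // placeholder name
        ),
    ));
    // Suppress the warning and inspect the return value instead.
    $markup = @file_get_contents($uri, false, $context);
    return $markup === false ? '' : $markup;
}
```

Returning an empty string keeps the !empty($this->markup) checks in the parsing methods meaningful when a fetch fails.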
Crawling The Markup For Data
Our get() method accepts a $type string, which is used to invoke the method that actually does the processing. As you can see below, we construct the method name as a string and make sure it exists, so developers can use the crawler simply by invoking $crawl->get('images');
We set the visibility of _get_images() and _get_links() to protected so that developers use our public get() method rather than getting confused and trying to invoke them directly.
Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to return all tags in the markup that match our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Note that the image pattern only matches self-closing <img ... /> tags. For more information on regular expressions visit http://en.wikipedia.org/wiki/Regular_expression
<?php
public function get($type) {
    $method = "_get_{$type}";
    if (method_exists($this, $method)) {
        // call_user_method() was removed from PHP; call the method directly.
        return $this->$method();
    }
}
protected function _get_images() {
    if (!empty($this->markup)) {
        preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
        return !empty($images[1]) ? $images[1] : FALSE;
    }
}
protected function _get_links() {
    if (!empty($this->markup)) {
        preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
        return !empty($links[1]) ? $links[1] : FALSE;
    }
}
?>
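Regular expressions are quick to write but miss markup such as <img> tags without a closing slash. As an alternative technique (not the approach used in this post), the same data could be collected with PHP's DOM extension. The sketch below assumes the extension is enabled; collectAttribute() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: collect one attribute from every occurrence of a tag,
// using DOMDocument instead of regular expressions.
function collectAttribute($markup, $tag, $attribute) {
    $values = array();
    if (!is_string($markup) || $markup === '') {
        return $values;
    }
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    foreach ($dom->getElementsByTagName($tag) as $node) {
        if ($node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

// Sample markup with both a plain and a self-closing <img> tag.
$html = '<html><body><img src="a.png"><img src="b.png" /><a href="/x">X</a></body></html>';
print_r(collectAttribute($html, 'img', 'src'));
print_r(collectAttribute($html, 'a', 'href'));
```

This returns attribute values rather than raw attribute strings, so it would slot in as a different method rather than a drop-in replacement for _get_images().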
Final PHP Web Crawler Code And Usage
<?php
class Crawler {
    protected $markup = '';
    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }
    public function getMarkup($uri) {
        return file_get_contents($uri);
    }
    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            // call_user_method() was removed from PHP; call the method directly.
            return $this->$method();
        }
    }
    protected function _get_images() {
        if (!empty($this->markup)) {
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }
    protected function _get_links() {
        if (!empty($this->markup)) {
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
}
$crawl = new Crawler('http://vision-media.ca');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
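Note that get('links') returns the raw attribute string of each tag (for example ' href="/about" class="nav"'), not the URL itself. One possible way to pull out the URLs is sketched below; extractHrefs() and the sample attribute strings are hypothetical and only illustrate the idea.

```php
<?php
// Hypothetical helper: extract href values from the attribute strings
// returned by $crawl->get('links').
function extractHrefs(array $attributeStrings) {
    $hrefs = array();
    foreach ($attributeStrings as $attributes) {
        // Match href="..." or href='...', but not data-href or similar.
        if (preg_match('/(?<![\w-])href\s*=\s*(["\'])(.*?)\1/i', $attributes, $match)) {
            // Decode entities such as &amp; back into plain characters.
            $hrefs[] = html_entity_decode($match[2]);
        }
    }
    return $hrefs;
}

// Sample input shaped like the crawler's output.
$attributes = array(
    ' href="http://example.com/a" class="x"',
    " title='t' href='/b?x=1&amp;y=2'",
    ' name="top"',
);
print_r(extractHrefs($attributes));
```

Unquoted attribute values are not handled by this pattern, which is one more reason to treat regex-based parsing as a quick approximation.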
Fix PNG transparency in IE 6
July 29, 2008
Author: Peter Cooper
Link : http://codesnippets.joyent.com/user/therad/tag/javascript#post681
function correctPNG() // correctly handle PNG transparency in Win IE 5.5 & 6.
{
    var arVersion = navigator.appVersion.split("MSIE")
    var version = parseFloat(arVersion[1])
    if ((version >= 5.5) && (document.body.filters))
    {
        for(var i=0; i<document.images.length; i++)
        {
            var img = document.images[i]
            var imgName = img.src.toUpperCase()
            if (imgName.substring(imgName.length-3, imgName.length) == "PNG")
            {
                var imgID = (img.id) ? "id='" + img.id + "' " : ""
                var imgClass = (img.className) ? "class='" + img.className + "' " : ""
                var imgTitle = (img.title) ? "title='" + img.title + "' " : "title='" + img.alt + "' "
                var imgStyle = "display:inline-block;" + img.style.cssText
                if (img.align == "left") imgStyle = "float:left;" + imgStyle
                if (img.align == "right") imgStyle = "float:right;" + imgStyle
                if (img.parentElement.href) imgStyle = "cursor:hand;" + imgStyle
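                // Replace the image with a span that renders the PNG through IE's AlphaImageLoader filter.
                var strNewHTML = "<span " + imgID + imgClass + imgTitle
                    + " style=\"" + "width:" + img.width + "px; height:" + img.height + "px;" + imgStyle + ";"
                    + "filter:progid:DXImageTransform.Microsoft.AlphaImageLoader"
                    + "(src='" + img.src + "', sizingMethod='scale');\"></span>"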
                img.outerHTML = strNewHTML
                i = i-1
            }
        }
    }
}
window.attachEvent("onload", correctPNG);
get base link of a web page
July 17, 2008
echo "http://".$_SERVER['HTTP_HOST'].dirname($_SERVER['PHP_SELF']);
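The one-liner above assumes plain HTTP and leaves off the trailing slash. A slightly more general version might look like the sketch below; baseUrl() is a hypothetical helper, and the HTTPS check relies on the common (but server-dependent) convention that $_SERVER['HTTPS'] is non-empty and not 'off' on secure requests.

```php
<?php
// Hypothetical helper: build the base link from a $_SERVER-style array.
function baseUrl(array $server) {
    // Assumption: HTTPS is set to a non-empty value other than 'off' on secure requests.
    $https  = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $scheme = $https ? 'https' : 'http';
    // dirname() returns '/' (or '\' on Windows) for top-level scripts; normalize it.
    $dir = rtrim(str_replace('\\', '/', dirname($server['PHP_SELF'])), '/');
    return $scheme . '://' . $server['HTTP_HOST'] . $dir . '/';
}

// Sample values; in a page you would pass $_SERVER instead.
echo baseUrl(array('HTTP_HOST' => 'example.com', 'PHP_SELF' => '/blog/index.php'));
```

Passing the array in as a parameter, rather than reading $_SERVER inside the function, makes the helper easy to try from the command line.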
Install Flash Ubuntu Hardy Heron
June 1, 2008
Test Drive Flash Player 10 Beta in Ubuntu
Adobe has released the Flash Player 10 Beta simultaneously for Linux, Mac, and Windows. This version includes performance improvements, new 3D transformations, Adobe Pixel Bender filters, streaming video improvements, and new text layout capabilities.
Websites very likely won’t be taking advantage of these new features until the stable release is out, so Adobe has a page of demos you can try.
Two large complaints about Flash on Linux are fullscreen video and 64-bit support. Neither has been resolved in this release. Playback of fullscreen video (which causes low framerates and high CPU usage) seems to be only slightly improved, though I have found a general performance increase.

If you want to try Flash Player 10 you can download and install it yourself, but here are some terminal commands that you can copy and paste to get going quickly:
- Remove your existing Flash plugin, if you have one installed. This command will remove Flash 9 if you installed it from Ubuntu’s repository:
sudo apt-get remove flashplugin-nonfree
- Download and extract the Flash Player 10 Beta to your home directory:
wget -O - http://download.macromedia.com/pub/labs/flashplayer10/flashplayer10_install_linux_051508.tar.gz | tar xz -C ~
- The user plugins folder may not exist yet. Try to create it, and ignore any error if the directory already exists:
mkdir ~/.mozilla/plugins/
- Copy the Flash plugin to the Firefox plugins directory to install it:
cp ~/install_flash_player_10_linux/libflashplayer.so ~/.mozilla/plugins/libflashplayer.so
- Remove the directory that was downloaded (if you get a warning about deleting a write-protected file, press y and Enter to continue):
rm -r ~/install_flash_player_10_linux
Restart Firefox to enable the new plugin.
And here’s how to uninstall it:
- Remove the new plugin:
rm ~/.mozilla/plugins/libflashplayer.so
- Reinstall Flash 9 from the repositories (if you wish):
sudo apt-get install flashplugin-nonfree
[update] I’ve been using the Flash 10 plugin for over a week now, and the only issue I’ve had is the occasional website that thinks my version of Flash is too old.
UBUNTU HARDY HERON MANUAL
May 9, 2008
http://joeabiraad.com/linuxunix/installing-lamp-on-ubuntu-710-linuxapachemysqlphp/100
installing UBUNTU’s APACHE2 LOCAL WEBSERVER
May 9, 2008
Lately I’ve been using Ubuntu 7.10 for all my projects and daily work.
As a web developer I need LAMP on my machine, and now I will guide you through installing it on yours.
This guide is divided into 3 steps: installing and testing Apache, PHP, and finally MySQL.
Let's start with Apache:
1. Open the terminal (we will be using it through most of my guide) from Applications > Accessories > Terminal
2. Install apache2 using apt-get by typing the following:
sudo apt-get install apache2
Note that you should know the root password.
Now everything should be downloaded and installed automatically.
To start/stop apache2 write:
Your www folder should be in: /var/www/
If everything is OK, you should see an ordinary HTML page when you open http://localhost in Firefox.
Finished with Apache? Let's conquer PHP:
1. Also in terminal write:
or any php version you like
2. restart apache
That is it for PHP.
Want to test it? Just create an ordinary PHP page in /var/www/ and run it.
Example:
and write in it: <?php echo "Hello World"; ?>
Now run it by opening http://localhost/test.php in Firefox. You should see your "Hello World".
66% is done. Let's continue with installing MySQL:
1. Once again, in the terminal execute:
sudo apt-get install mysql-server
2. (optional) If you are running a server, you should probably bind your address by editing bind-address in /etc/mysql/my.cnf and replacing its value (127.0.0.1) with your IP address
3. Set your root password (although MySQL should ask you about that when installing)
4. Try running it
where xxx is your password.
Note: You can install phpMyAdmin for a graphical user interface to MySQL by executing:
sudo apt-get install phpmyadmin
5. Restart Apache for the last time
Congratulations, your LAMP system is installed and running.
Happy Coding
UPDATE:
Due to the large number of people emailing about installing and running phpMyAdmin, here is what to do:
The phpmyadmin configuration file will be installed in: /etc/phpmyadmin
Now you will have to edit the apache config file by typing
and include the following line:
Restart Apache
Another issue was making MySQL run with PHP 5.
First install these packages:
then edit php.ini and add this line if it isn't already there: "extension=mysql.so"
Restart Apache
Ubuntu 8.04 Rails Server Using Passenger
May 9, 2008
Requirements
This section will go over the simple requirements of the entire setup.
Hardware
Ubuntu 8.04 Server – This could be any of the following:
- Slicehost
- VMware
- Bare Metal Install
Software
- Apache 2.2.8
- MySQL/PostgreSQL/SQLite3
- Git
- Ruby
- Rubygems
- Rails
- Capistrano
- RSpec
- Ultrasphinx
- Passenger
Installation of Software
Before we install anything on this machine, we must update the server. This is very simple with Ubuntu: it takes two commands and you are all set. You only need to reboot the machine if a new kernel was installed.
sudo apt-get update
sudo apt-get dist-upgrade
Now that the machine is updated we must install some essential tools in order to build software on this server. Once we are done with the setup it would be a good idea to remove these tools to increase security on our server.
sudo apt-get install build-essential
Now we are all set with the preparation of the server and we can start installing the software we need to get going.
Web Server
For the web server I chose Apache 2 because of the new Passenger gem (mod_rails). This gem is great because it makes deploying new applications simple.
sudo apt-get install apache2 apache2-dev
Database Server
The database server you use is completely up to your preference. My recommendation is PostgreSQL, a very robust and fast database server that is rock solid. It does use a lot of resources, so for Slicehost it may not be the best choice. A strong contender for a slim and fast database on Slicehost is SQLite3. It is a wonderful database and should not be thrown out so quickly because of its lack of a client/server architecture.
For this tutorial I will install MySQL because of its popularity with the Rails community.
sudo apt-get install mysql-server
When prompted, enter a root password. Make it complex and write it down.
Version Control
Git is the sexiest version control system ever created. I will never look back to Subversion again. Now that Capistrano and Redmine both support Git, I have no reason to even think about those awful three letters.
Installing Git is yet another apt-get command away. Run the following command in the terminal of your new server.
sudo apt-get install git-core curl gitweb
gitweb is an optional web frontend for your repositories. I do not use it because I use GitNub, a RubyCocoa application for the Mac.
Once that finishes git is completely installed and ready to go.
Ruby
Installing Ruby on Ubuntu 8.04 is quite simple. Just another apt-get and you are all set... almost. Since the release of Ruby 1.9.0, distributions have been naming the current stable release of Ruby "ruby1.8". That being said, we will make a couple of symlinks.
Ruby 1.8.6
To install all the tools you will want on this server run the following command:
sudo apt-get install ruby1.8 ruby1.8-dev rdoc1.8 ri1.8 libopenssl-ruby1.8
Rubygems
I refuse to install Rubygems with apt-get. This is such a terrible idea in my opinion. There is no reason to install rubygems with a package manager because it can update itself. I will go over how to update rubygems later in this howto.
wget http://rubyforge.org/frs/download.php/35283/rubygems-1.1.1.tgz
tar -xzf rubygems-1.1.1.tgz
cd rubygems-1.1.1
sudo ruby1.8 setup.rb
Optional: Once the install is done, run the next three commands so that the gem, ruby, and irb commands work under their usual names.
sudo ln -s /usr/bin/gem1.8 /usr/bin/gem
sudo ln -s /usr/bin/ruby1.8 /usr/bin/ruby
sudo ln -s /usr/bin/irb1.8 /usr/bin/irb
Recommended Gems
Here is a list of recommended gems that should be installed once rubygems is installed. At the very least you must install rails and passenger.
sudo gem install rails
sudo gem install capistrano
sudo gem install rspec
sudo gem install ultrasphinx
sudo gem install passenger
add type for html
March 18, 2008
RemoveHandler .html .htm
AddType application/x-httpd-php .html .htm
After adding this to your .htaccess, .html and .htm files are processed as PHP.
GET URL variables
March 12, 2008
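<?php
// Build a query string from the current $_GET values, overridden by
// key/value pairs passed as arguments, e.g. setUrlVariables("page", 2).
function setUrlVariables() {
    $arg = array();
    $string = "?";
    $vars = $_GET;
    // Arguments arrive as alternating keys and values.
    for ($i = 0; $i + 1 < func_num_args(); $i += 2)
        $arg[func_get_arg($i)] = func_get_arg($i + 1);
    foreach (array_keys($arg) as $key)
        $vars[$key] = $arg[$key];
    // Empty values are skipped, so passing "" removes a variable.
    foreach (array_keys($vars) as $key)
        if ($vars[$key] != "") $string .= $key . "=" . $vars[$key] . "&";
    // Append the session ID when it is defined and not already in the URL.
    if (defined("SID") && SID != "" && empty($_GET["PHPSESSID"]))
        $string .= htmlspecialchars(SID) . "&";
    return htmlspecialchars(substr($string, 0, -1));
}
echo setUrlVariables();
?>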
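The function above concatenates keys and values without URL-encoding them. As a rough alternative (a different technique, not the code from this post), PHP's built-in http_build_query() can do the encoding. In the sketch below, buildQuery() is a hypothetical helper and session handling is deliberately left out.

```php
<?php
// Hypothetical helper: merge overrides into the current variables and
// return a query string, dropping variables whose value is empty.
function buildQuery(array $current, array $overrides) {
    // array_replace() keeps the original keys and lets overrides win.
    $vars = array_replace($current, $overrides);
    $vars = array_filter($vars, function ($value) {
        return $value !== '' && $value !== null;
    });
    // http_build_query() URL-encodes both keys and values.
    return count($vars) ? '?' . http_build_query($vars, '', '&') : '';
}

// Escape the result when printing it into HTML, as the original function does.
echo htmlspecialchars(buildQuery($_GET, array('page' => 2)));
```

Keeping the HTML escaping at the point of output, rather than inside the helper, means the same string can also be used in a redirect header.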
PHP get current page
March 10, 2008
<?php
$currentFile = $_SERVER["PHP_SELF"];
$parts = explode('/', $currentFile);
echo $parts[count($parts) - 1];
?>
The same result with basename():
<?php
echo basename($_SERVER['PHP_SELF']);
?>
Experimental Game
December 5, 2007
http://imonline.alpha-phi-epsilon.net/_labs/game_01.html
Hey guys... I don't know why I made this game; I made it out of boredom. The engine is not very thoroughly done, but it's worth a try.
Flash MySQL XML
November 21, 2007
//Figure 1.0 Main.fla
var theXML:XML = new XML();
theXML.ignoreWhite = true;
theXML.onLoad = function() {
    var nodes = this.firstChild.childNodes;
    for (var i = 0; i < nodes.length; i++) {
        theList.addItem(nodes[i].firstChild.nodeValue, i);
    }
}
theXML.load("http://www.yoursite.com/products.php");
//Figure 1.1 products.php
<?php
// The mysql_* functions were removed in PHP 7; mysqli is the direct replacement.
$link = mysqli_connect("localhost", "lee", "password", "brimelow_store");
$query = 'SELECT * FROM products';
$results = mysqli_query($link, $query);
echo "<?xml version=\"1.0\"?>\n";
echo "<products>\n";
while ($line = mysqli_fetch_assoc($results)) {
    // Escape the value so characters such as & and < keep the XML well-formed.
    echo "<item>" . htmlspecialchars($line["product"]) . "</item>\n";
}
echo "</products>\n";
mysqli_close($link);
?>
Flash Preloader
November 21, 2007
loadedBytes = _root.getBytesLoaded();
totalBytes = _root.getBytesTotal();
if (loadedBytes<totalBytes) {
percentage = ((loadedBytes/totalBytes)*100);
loader_graphic._xscale = percentage;
txt = percentage;
gotoAndPlay(1);
} else {
nextScene();
}
