The ultimate scraper class: Simple_Html_Dom
It doesn’t happen often that I find a script good enough to put on my gold shelf of top-priority scripts, but I recently came across one that is to scraping what jQuery is to JavaScript. Really, this script can be considered the ultimate scraper class!
This class is based on the HTML Parser for PHP 4 by Jose Solorzano and is now maintained by S.C. Chen, with contributions from:
- Yousuke Kumakura (Attribute filters)
- Vadim Voituk (Negative index support for the “find” method)
- Antcs (Constructor that automatically loads contents from either text or a file/URL)
This script consists of two classes:
- simple_html_dom_node
- simple_html_dom
but the primary class you’re going to work with is simple_html_dom.
Loading content is pretty straightforward:
// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');

// Load HTML from a URL
$html->load_file('http://www.google.com/');

// Load HTML from an HTML file
$html->load_file('test.htm');
So you can load the content to scrape from a string, a remote file, or a local file. Neat!
Now that the content is loaded, let’s see how we can find different elements:
Basic usage:
// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find the (N)th anchor, returns an element object or null if not found (zero-based)
$ret = $html->find('a', 0);

// Find all <div> elements whose id attribute equals foo
$ret = $html->find('div[id=foo]');

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all elements that have an id attribute
$ret = $html->find('[id]');
Advanced usage:
// Find all elements whose id is foo
$ret = $html->find('#foo');

// Find all elements whose class is foo
$ret = $html->find('.foo');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');
Using descendant selectors:
// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> elements whose class is hello
$es = $html->find('table.hello td');

// Find all <td> tags with attribute align=center in <table> tags
$es = $html->find('table td[align=center]');
Finding text and comments:
// Find all text blocks
$es = $html->find('text');

// Find all comment (<!--...-->) blocks
$es = $html->find('comment');
This is pretty simple so far, and you can see how close it is to jQuery!
Keep the following attribute filters close, because they give you more power to narrow down your selection:
[attribute]          Matches elements that have the specified attribute.
[attribute=value]    Matches elements that have the specified attribute with a certain value.
[attribute!=value]   Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]   Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]   Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]   Matches elements that have the specified attribute and it contains a certain value.
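To make the operators concrete, here is a minimal sketch in plain PHP of how each value filter compares an attribute. The function name attribute_matches is hypothetical and is not part of Simple_Html_Dom; the library's own matching may differ in details such as case handling or how absent attributes are treated, so read this as an illustration of the table rather than the actual implementation.

```php
<?php
// Hypothetical helper, not part of Simple_Html_Dom: applies one filter
// operator from the table to the value of an attribute that is present.
function attribute_matches($actual, $operator, $expected)
{
    switch ($operator) {
        case '=':
            return $actual === $expected;
        case '!=':
            return $actual !== $expected;
        case '^=':
            // Starts with: compare only the first strlen($expected) bytes.
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=':
            // Ends with: compare the tail of the value.
            return $expected === ''
                || substr($actual, -strlen($expected)) === $expected;
        case '*=':
            // Contains: the expected text appears anywhere in the value.
            return $expected === ''
                || strpos($actual, $expected) !== false;
        default:
            // Unknown operators never match in this sketch.
            return false;
    }
}

// Made-up attribute value for illustration.
$href = 'http://example.com/images/logo.png';
var_dump(attribute_matches($href, '^=', 'http://')); // starts-with check
var_dump(attribute_matches($href, '$=', '.png'));    // ends-with check
var_dump(attribute_matches($href, '*=', 'images'));  // contains check
var_dump(attribute_matches($href, '!=', 'my link')); // not-equal check
```

With the real class, the same conditions would be written as selectors, for example $html->find('a[href^=http://]') or $html->find('img[src$=.png]').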
How to access the HTML element’s attributes
Get, Set and Remove attributes
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;

// Set an attribute (for a non-value attribute such as checked or selected, set its value to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Determine whether an attribute exists
if(isset($e->href)) {
    echo 'href exists!';
}
Magic attributes
// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag;       // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"
Keep the following in mind; it will help!
$e->tag        // Read or write the tag name of element.
$e->outertext  // Read or write the outer HTML text of element.
$e->innertext  // Read or write the inner HTML text of element.
$e->plaintext  // Read or write the plain text of element.
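Since these properties are writable, a short sketch of modifying a document and serializing it again might look like the following. It assumes simple_html_dom.php sits next to the script and relies on the library's str_get_html() helper and save() method; the sample markup and variable names are made up for illustration.

```php
<?php
// Assumes simple_html_dom.php is in the same directory as this script.
require_once 'simple_html_dom.php';

// Sample markup invented for this illustration.
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Write an attribute on the second <div> (the index is zero-based).
$html->find('div', 1)->class = 'bar';

// Replace the inner HTML of the <div> whose id is "hello".
$html->find('div[id=hello]', 0)->innertext = 'foo';

// Serialize the modified DOM back to an HTML string and print it.
$result = $html->save();
echo $result;
```

The same pattern applies to outertext, which swaps out the element together with its own tag instead of only its contents.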
How to traverse the DOM tree
// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');
Keep the following in mind; it will help!
mixed   $e->children( [int $index] )  // Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent()                  // Returns the parent of element.
element $e->first_child()             // Returns the first child of element, or null if not found.
element $e->last_child()              // Returns the last child of element, or null if not found.
element $e->next_sibling()            // Returns the next sibling of element, or null if not found.
element $e->prev_sibling()            // Returns the previous sibling of element, or null if not found.
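The traversal methods are easier to follow on a tiny document. The sketch below assumes simple_html_dom.php is in the same directory; the list markup and the variable names are invented for illustration, and only methods from the list above are used.

```php
<?php
// Assumes simple_html_dom.php is in the same directory as this script.
require_once 'simple_html_dom.php';

// Sample markup invented for this illustration.
$html = str_get_html('<ul id="menu"><li>one</li><li>two</li><li>three</li></ul>');
$ul = $html->find('#menu', 0);

// Text of the first <li>.
$first = $ul->first_child()->plaintext;

// Text of the <li> at zero-based index 1.
$second = $ul->children(1)->plaintext;

// Step back one sibling from the last <li>.
$beforeLast = $ul->last_child()->prev_sibling()->plaintext;

// Go down to a child and back up to read the parent's id attribute.
$parentId = $ul->children(0)->parent()->id;

// Without an index, children() returns an array of all child elements.
$count = count($ul->children());

echo "$first, $second, $beforeLast, $parentId, $count\n";
```

Because first_child(), last_child() and the sibling methods return null when there is nothing to return, longer chains are worth checking step by step on real pages.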
Putting Simple_Html_Dom to the test
Now that we’ve seen how to use this class, let’s try some real-world examples!
Example #1: Scraping Digg.
include_once('../../simple_html_dom.php');

function scraping_digg() {
    // create HTML DOM
    $html = file_get_html('http://digg.com/');

    // collected stories; stays empty if nothing matches
    $ret = array();

    // get news block
    foreach($html->find('div.story-item') as $article) {
        $item = array();
        // get title
        $item['title'] = trim($article->find('h3.story-item-title', 0)->plaintext);
        // get title url
        $item['title_url'] = trim($article->find('h3.story-item-title', 0)->find('a',0)->href);
        // get source site
        $item['site_url'] = trim($article->find('a.story-item-source',0)->plaintext);
        // get details
        $item['details'] = trim($article->find('p a.story-item-teaser', 0)->plaintext);
        // get digg count
        $item['diggs'] = trim($article->find('span.digg-count span', 0)->plaintext);

        $ret[] = $item;
    }

    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}

// -----------------------------------------------------------------------------
// test it!

// "http://digg.com" will check user_agent header...
ini_set('user_agent', 'My-Application/2.5');

$ret = scraping_digg();

if ( ! empty($ret)) {
    echo '<ul>';
    $digg = 'http://digg.com';
    foreach($ret as $v) {
        echo '<li><p><a href="'.$digg.$v['title_url'].'">'.$v['title'].'</a></p>';
        echo '<p>'.$v['site_url'].$v['details'].'</p>';
        echo '<p>Diggs: '.$v['diggs'].'</p>';
        echo '</li>';
    }
    echo '</ul>';
} else {
    echo 'Could not scrape Digg!';
}
This works nicely and returns a neat list of all the topics from Digg’s front page.
Getting the “Opening This Week” movies from IMDB? No problemo!
require( 'simple_html_dom.php' );

function ScrappingIMDB($url) {
    // create HTML DOM
    $html = file_get_html($url);

    $data = array();
    foreach( $html->find('table.movies .movie') as $table ) {
        // skip rows that have no title link
        $_title = $table->find('a.title', 0);
        if ( ! empty($_title)) {
            $item = array();
            $item['title'] = $_title->innertext;
            $item['title_url'] = 'http://www.imdb.com'.$_title->href;
            $item['plot'] = $table->find('p.smallgap',1)->plaintext;
            array_push($data, $item);
        }
    }

    return $data;
}

$url = 'http://www.imdb.com/nowplaying/';
$entries = ScrappingIMDB($url);

if ( ! empty($entries)) {
    echo '<ul>';
    foreach( $entries as $data ) {
        echo '<li>';
        echo '<p><a href="'.$data['title_url'].'">'.$data['title'].'</a></p>';
        echo '<p>'.$data['plot'].'</p>';
        echo '</li>';
    }
    echo '</ul>';
} else {
    echo 'Could not scrape IMDB!';
}
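Both scrapers chain ->plaintext directly onto find($selector, 0), which, as the basic usage section notes, returns null when nothing is found, so a change in the site's markup would trigger an error. One possible guard is a tiny helper like the one below. node_text is a hypothetical name, not part of Simple_Html_Dom, and the stdClass object only stands in for a real element so that the snippet runs on its own.

```php
<?php
// Hypothetical helper, not part of Simple_Html_Dom.
// $node is whatever find($selector, 0) returned: an element object or null.
function node_text($node, $default = '')
{
    if (!is_object($node)) {
        // Nothing matched the selector, so fall back to the default.
        return $default;
    }
    return trim($node->plaintext);
}

// Stand-in for a found element; real code would pass the find() result.
$found = new stdClass();
$found->plaintext = "  Some headline \n";

echo node_text($found), "\n";       // trimmed text of the stand-in element
echo node_text(null, 'n/a'), "\n";  // fallback when nothing was found
```

In the Digg example this would turn the title line into $item['title'] = node_text($article->find('h3.story-item-title', 0)); so a missing element yields an empty string instead of an error.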
Now this is what I call a job well done, so if you ever find yourself in need of a reliable scraping script, remember this class; it’s a lifesaver!
You can find this class here.

Hi kos!
I must say that after using this class a bit I started to love it!!! Not long ago I wanted to get the list of shows published by DnrTV (http://www.dnrtv.com/archives.aspx) as they appear, and have them presented somewhere. So I thought I should try this class instead of going to the web site, viewing the source, copying the HTML, pasting it somewhere and starting to transform it using regex, XML, etc. I was very impressed to see how everything is done in just a few lines of code!!!
For those interested to see how this is done, click on this link: http://www.criss-dev.com/code/dnrtv-shows-scrapper.zip
Yup! This class is a gold mine and a must-have! I’m glad you find it useful!
I downloaded the archive and the output is XML. You should display the results in an HTML list with links to the videos instead; this way it will be much easier for you to view those movies.