I found a great tool for HTML DOM parsing called PHP Simple HTML DOM Parser, or Simplehtmldom. It works like jQuery. I tested it and it is very powerful: it makes it simple to find or replace specific DOM elements and their properties in HTML content. With Simplehtmldom, we can build custom extensions for Joomla, WordPress, Drupal, etc. that manipulate HTML content.
These are its features:
- An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
- Requires PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
Unfortunately, I got an error when using it in my own CMS. On certain pages, it made the script terminate for no apparent reason: there was no debugging tool and no error message was shown. After looking into the script, I found this method:
function load_file() {
$args = func_get_args();
$this->load(call_user_func_array('file_get_contents', $args), true);
// Per the simple_html_dom repositiry this is a planned upgrade to the codebase.
// Throw an error if we can't properly load the dom.
if (($error=error_get_last())!==null) {
$this->clear();
return false;
}
}
It uses the error_get_last() function to check whether loading the content with file_get_contents() failed. But error_get_last() returns the most recent PHP error from anywhere in the request, so any error or warning raised earlier by another script, function, or method also makes load_file() call clear(). The object is then empty, and a later call to find() causes a fatal error that terminates the script. If another part of the CMS sets display_errors to 0, the CMS terminates without showing any error. This is frustrating for a Simplehtmldom beginner.
For example, the code below terminates the script. If display_errors is not enabled (when many libraries are included, one of them may turn it off), we will not see any error message. The code echoes $undefinedvar, which was never defined, so PHP raises a notice (a warning in PHP 8). load_file() picks it up through error_get_last() and calls clear(), so the DOM is emptied and the find() call that follows fails with a fatal error.
echo $undefinedvar;
require_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://www.google.com');
$tmp=$html->find('div',0);
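The same effect can be reproduced without the library or a network call. The following is only a minimal sketch: the temporary file, the 'shd' prefix, and the warning text are made up for illustration. It shows that error_get_last() still reports an old, unrelated warning even though file_get_contents() itself succeeded, which is exactly the condition load_file() misreads as a failed load.

```php
<?php
// Create a small temporary file so the read below is guaranteed to succeed.
$tmp = tempnam(sys_get_temp_dir(), 'shd');
file_put_contents($tmp, '<div>hello</div>');

// An unrelated warning raised somewhere else; @ keeps it off the screen.
@trigger_error('unrelated warning from another script', E_USER_WARNING);

// The actual read works fine.
$contents = file_get_contents($tmp);
unlink($tmp);

$readOk = ($contents !== false);
$lastError = error_get_last();

var_dump($readOk);                 // bool(true): the file was read correctly
var_dump($lastError !== null);     // bool(true): the old warning is still reported
echo $lastError['message'], "\n";  // unrelated warning from another script
```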
I changed the method to solve the problem. Here is my replacement:
function load_file($url, $use_include_path = false, $context = null, $offset = 0, $maxLen = -1) {
    // $args = func_get_args(); // no longer needed
    // $offset defaults to 0 (start of file): since PHP 7.1 a negative offset counts from the end.
    // Only pass $maxLen when the caller set it; a negative length is rejected by file_get_contents().
    if ($maxLen > -1) {
        $contents = file_get_contents($url, $use_include_path, $context, $offset, $maxLen);
    } else {
        $contents = file_get_contents($url, $use_include_path, $context, $offset);
    }
    // Don't use error_get_last() here: it also catches errors raised outside this script.
    // For example, if another script has any error, this script would also be terminated.
    //if (($error=error_get_last())!==null) {
    // The check is integrated into load() instead.
    /*
    if (empty($contents)) {
        // The object may be loaded more than once; if not, there is nothing to clear.
        if (isset($this->root)) $this->clear();
        return false;
    }
    */
    $this->load($contents, true);
}
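The idea behind the replacement is to judge the load by what file_get_contents() returns, not by error_get_last(). Here is a small standalone sketch of that check. The function name example_read_contents() and the file names are hypothetical and not part of simple_html_dom; it is only meant to show the principle.

```php
<?php
// Hypothetical helper for illustration; it is not part of simple_html_dom.
function example_read_contents($path)
{
    // @ silences the warning file_get_contents() raises on failure;
    // the return value alone tells us whether the read worked.
    $contents = @file_get_contents($path);
    if ($contents === false || $contents === '') {
        return false;
    }
    return $contents;
}

// An unrelated, silenced warning raised before the read.
@trigger_error('unrelated warning', E_USER_WARNING);

// A file that exists and has content.
$tmp = tempnam(sys_get_temp_dir(), 'shd');
file_put_contents($tmp, '<p>ok</p>');
$found = example_read_contents($tmp);
unlink($tmp);

// A file that does not exist.
$missing = example_read_contents(sys_get_temp_dir() . '/example-no-such-file.html');

var_dump($found);    // string(9) "<p>ok</p>"
var_dump($missing);  // bool(false)
```

With a check like this, the earlier warning has no influence on the result: only a real read failure (or empty content) returns false.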
I also changed other methods (find(), load(), file_get_html(), etc.) to fix other problems and to add some features for compatibility. Here are my notes:
- Removed the script halt caused by outside errors/warnings.
- Tuned the checks and removed unused processing.
- Made load() and file_get_html() consistent.
- Added an option to hide/display PHP error messages. This is important when it is used in a template/presentation.
- Added a debugging class to avoid the $debugObject variable, which may conflict with other scripts.
- Added an option to enable/disable the quick way, to avoid conflicts with other scripts that have the same function names. Enabled by default.
You can download my tweaked version here, and the original file can be downloaded from here, hosted on SourceForge.net. Please see example.php in my download for the differences.
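To illustrate the kind of guard the last note describes, here is one common PHP pattern for avoiding function-name conflicts. This is only a sketch under my own assumptions: the constant EXAMPLE_DEFINE_SHORTCUTS and the function example_str_get_html() are hypothetical names, and the tweaked file may do it differently.

```php
<?php
// Hypothetical switch; another script can define it as false before including this file.
if (!defined('EXAMPLE_DEFINE_SHORTCUTS')) {
    define('EXAMPLE_DEFINE_SHORTCUTS', true); // enabled by default
}

// Define the shortcut only when enabled and when no other script already uses the name.
if (EXAMPLE_DEFINE_SHORTCUTS && !function_exists('example_str_get_html')) {
    function example_str_get_html($str)
    {
        // Stand-in body: a real shortcut would build and return a parser object.
        return trim($str);
    }
}

var_dump(function_exists('example_str_get_html')); // bool(true)
```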