I'm trying to use PHP Simple HTML DOM Parser to parse information from some sites; which sites and what information does not matter. It seems to have a huge memory problem. I managed to cut the HTML down to only 6 kB, but a script that finds some elements and saves them to a database uses as much as 700 MB of RAM and over 1 GB of virtual memory! I read somewhere that I should use ->clear() to free memory, but that does not seem to help.
I call str_get_html() once and then call ->find() five times, assigning each result to a variable:
$main_html = str_get_html($main_site);
$x = $main_html->find(...);
$y = $main_html->find(...);
etc.
I tried calling, for example, $y->clear() once I was done with $y, but I get the error PHP Fatal error: Call to a member function clear() on a non-object, even though $y exists and if($y) is true. Even foreach($y as $el) echo $el->plaintext; prints the plaintext of each element of $y.
From htop:
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
8839 username 20 0 1068M 638M 268 R 23.0 8.0 0:08.41 php myscript.php
What is wrong?
Simple test:
echo "(MEM:".memory_get_usage()."->";
$product = $p->find('a',0)->href;
echo memory_get_usage()."->";
unset($product);
$p->clear();
unset($p);
echo memory_get_usage().")";
The result is:
(MEM:11865648->11866192->11865936)
More readable form:
11865648->
11866192-> (+544 in total)
11865936 (+288 in total)
Of course, I can't call $product->clear(); it fails with PHP Fatal error: Call to a member function clear() on a non-object.
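The fatal error above is consistent with how the parser's return values are shaped: find() without an index returns a plain PHP array of nodes, and ->href returns a string, so neither has a clear() method. As a minimal sketch, a helper could clear whatever it is given only when that makes sense. The name clear_dom_value() and the FakeNode stand-in class are hypothetical and not part of the library; FakeNode exists only so the sketch runs without it.

```php
<?php
// Hypothetical helper, not part of PHP Simple HTML DOM Parser.
// Clears every object that has a clear() method, descending into arrays,
// and returns how many objects were cleared.
function clear_dom_value($value): int
{
    if (is_array($value)) {
        $count = 0;
        foreach ($value as $element) {
            $count += clear_dom_value($element);
        }
        return $count;
    }
    if (is_object($value) && method_exists($value, 'clear')) {
        $value->clear();
        return 1;
    }
    return 0; // strings, null, false: nothing to clear
}

// Stand-in for a parsed node so the sketch runs without the library.
final class FakeNode
{
    public $cleared = false;

    public function clear(): void
    {
        $this->cleared = true;
    }
}

$y = [new FakeNode(), new FakeNode()]; // same shape as find('...') with no index
$product = 'http://example.com/';      // same shape as find('a', 0)->href

echo clear_dom_value($y), "\n";        // prints 2
echo clear_dom_value($product), "\n";  // prints 0
```

With the real library, passing the result of ->find(...) to such a helper would clear each returned node instead of calling clear() on the array itself.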
There seem to be memory problems when str_get_html() (or a similar function that creates a simple_html_dom object) is called several times without clearing and destroying the previous object, especially when ->find() is used, since it creates an array of simple_html_dom_node objects. Even the FAQ on the author's site says to clear and destroy the previous simple_html_dom object before creating a new one, but sometimes that can't be done without additional code and memory.
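The FAQ advice is usually explained by circular references: a parsed node points to its parent while the parent points back to its children, so dropping the variable alone does not bring the reference count to zero. The sketch below illustrates this with a hypothetical DemoNode class (a stand-in, not the library's code) and WeakReference, which assumes PHP 7.4 or later.

```php
<?php
// Hypothetical stand-in for a parsed node; NOT the library's implementation.
final class DemoNode
{
    public ?DemoNode $parent = null;
    /** @var DemoNode[] */
    public array $children = [];

    public function addChild(DemoNode $child): void
    {
        $child->parent = $this;      // child -> parent
        $this->children[] = $child;  // parent -> child: a reference cycle
    }

    // Break the cycle, in the spirit of the library's clear().
    public function clear(): void
    {
        foreach ($this->children as $child) {
            $child->clear();
        }
        $this->children = [];
        $this->parent = null;
    }
}

function build_tree(): DemoNode
{
    $root = new DemoNode();
    $root->addChild(new DemoNode());
    return $root;
}

gc_disable(); // keep the cycle collector from running on its own during the demo

// Case 1: unset() alone. The child still points at the root, so it stays alive.
$root = build_tree();
$leaked = WeakReference::create($root);
unset($root);
var_dump($leaked->get() !== null); // bool(true)

// Case 2: clear() first, then unset(). The object is freed immediately.
$root = build_tree();
$freed = WeakReference::create($root);
$root->clear();
unset($root);
var_dump($freed->get() !== null); // bool(false)
```

PHP 5.3 and later include a cycle collector that can reclaim objects like the one in case 1, but only when it actually runs, so memory can still grow between runs if many documents are parsed without clear().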
That is why I wrote this function, which removes all traces of PHP Simple HTML DOM Parser from memory:
function clean_all(&$items, $leave = '') {
    foreach ($items as $id => $item) {
        // Skip the variable name(s) the caller asked to keep.
        if ($leave && in_array($id, (array) $leave)) continue;
        // Skip the self-reference that $GLOBALS contains.
        if ($id === 'GLOBALS') continue;
        if ($item instanceof simple_html_dom || $item instanceof simple_html_dom_node) {
            // Break the object's internal references, then drop the variable.
            $items[$id]->clear();
            unset($items[$id]);
        } else if (is_array($item)) {
            // An array such as the result of ->find(): only its first element is inspected.
            $first = reset($item);
            if ($first instanceof simple_html_dom || $first instanceof simple_html_dom_node) {
                unset($items[$id]);
            }
            unset($first);
        }
    }
}
Usage:
Clean ALL traces of PHP Simple HTML Dom Parser from memory: clean_all($GLOBALS);
Clean all traces of PHP Simple HTML Dom Parser from memory, except $myobj: clean_all($GLOBALS,'myobj');
Clean all traces of PHP Simple HTML Dom Parser from memory, except list of objects ($myobj1,$myobj2...): clean_all($GLOBALS,array('myobj1','myobj2'));
Hope it will help others too.
Generally I use it when I call str_get_html() more than once, like this:
$site = file_get_contents('http://google.com');
$site_html = str_get_html($site);
foreach ($site_html->find('a') as $a) {
    $site2 = file_get_contents($a->href);
    $site2_html = str_get_html($site2);
    echo $site2_html->find('p', 0)->plaintext;
}
clean_all($GLOBALS);
In this example I can't call $site_html->clear() before the foreach, because the loop would then fail. And because str_get_html() is called repeatedly without clearing the previous object, the earlier objects are never released, and clearing only at the end still leaks memory. That's why my function has to search the defined variables for simple_html_dom objects and clear them manually.
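One possible way to restructure such a loop, offered as a sketch and not as the author's method, is to copy the link targets out as plain strings first, so the outer document can be cleared before the loop, and then clear each inner document at the end of its own iteration. The function name first_paragraphs() and the $fetch and $parse parameters are hypothetical; they are there so the loop can be tried without network access. When $parse is omitted, the sketch assumes simple_html_dom.php has been loaded and falls back to str_get_html().

```php
<?php
// Sketch only. The default parser assumes simple_html_dom.php has been loaded.
// $fetch and $parse are hypothetical parameters, added so the loop can be
// exercised without network access or the library.
function first_paragraphs(string $startHtml, callable $fetch, ?callable $parse = null): array
{
    $parse = $parse ?? 'str_get_html';

    // Step 1: copy the hrefs out as plain strings, then release the outer document.
    $outer = $parse($startHtml);
    if (!$outer) {
        return [];
    }
    $links = [];
    foreach ($outer->find('a') as $a) {
        $href = (string) $a->href;
        if ($href !== '') {
            $links[] = $href;
        }
    }
    unset($a);       // drop the last node left over from the loop
    $outer->clear();
    unset($outer);

    // Step 2: parse each linked page, keep only a string, and clear the
    // document before the next iteration creates another one.
    $paragraphs = [];
    foreach ($links as $url) {
        $inner = $parse((string) $fetch($url));
        if (!$inner) {
            continue; // empty or unparsable page
        }
        $p = $inner->find('p', 0);
        $paragraphs[$url] = $p ? (string) $p->plaintext : null;
        unset($p);
        $inner->clear();
        unset($inner);
    }
    return $paragraphs;
}

// Possible usage with the real library and network access:
// $result = first_paragraphs(file_get_contents('http://google.com'), 'file_get_contents');
```

The design choice here is that only strings survive each step, so at most one parsed document is alive at any time.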
In my case I forked inside the foreach, and after a few steps the main PHP script was using about 100 MB of memory. With every fork the usage kept growing until it nearly killed my server. Of course, a PHP script frees its memory when it ends, but with 8 GB in use it took ages to finish.