
PHP Parallel processing for a Metasearch Engine

I have developed a metasearch engine, and one of the optimisations I would like to make is to process the search APIs in parallel. Imagine that results are retrieved from search engine A in 0.24 seconds, from SE B in 0.45 seconds, and from SE C in 0.5 seconds. With other overheads the metasearch engine can return aggregated results in about 1.5 seconds, which is viable. What I would like to do now is send those requests in parallel rather than in series, as at present, and get that time down to under a second.

I have investigated exec, forking and threading, and all of them, for various reasons, have failed. I have only spent a day or two on this, so I may have missed something. Ideally I would like to implement this on a WAMP stack on my development machine (localhost) and look at implementing it on a Linux web server thereafter. Any help appreciated.

Let's take a simple example: say we have two files we want to run simultaneously. File 1:

<?php
// file1.php
echo 'File 1 - Test 1'.PHP_EOL;
$sleep = mt_rand(1, 5);
echo 'Start Time: '.date("g:i:sa").PHP_EOL;
echo 'Sleep Time: '.$sleep.' seconds.'.PHP_EOL;
sleep($sleep);
echo 'Finish Time: '.date("g:i:sa").PHP_EOL;
?>

Now, imagine file two is the same... the idea is that, if they run in parallel, the start times in the command-line output should be the same, for example:

File 1 - Test 1
Start Time: 9:30:43am
Sleep Time: 4 seconds.
Finish Time: 9:30:47am

But whether I use exec, popen or whatever, I just cannot get this to work in PHP!

asked by Conor Ryan


2 Answers

I would use socket_select(). That way only the connection time is cumulative, because you can read from the sockets in parallel. This will give you a big performance boost.
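A minimal sketch of that idea, using stream_select() (the stream-level counterpart of socket_select()); the engine hostnames and query path are placeholders, not part of the original answer. Connections are opened in series, then all the slow responses are read in parallel:

<?php
// Open one connection per search engine (connect time is serial),
// then read the responses in parallel with stream_select().
$hosts = array('engine-a.example.com', 'engine-b.example.com', 'engine-c.example.com');

$streams = array();
foreach ($hosts as $host) {
    $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 5);
    if ($s === false) continue;                   // skip engines that are down
    fwrite($s, "GET /search?q=test HTTP/1.0\r\nHost: $host\r\n\r\n");
    stream_set_blocking($s, false);
    $streams[$host] = $s;
}

$res = array_fill_keys(array_keys($streams), '');
while ($streams) {
    $read = array_values($streams);
    $write = $except = null;
    if (stream_select($read, $write, $except, 5) === false) break;
    foreach ($read as $s) {
        $host = array_search($s, $streams, true); // map stream back to its engine
        $res[$host] .= fread($s, 8192);
        if (feof($s)) {                           // this engine has finished
            fclose($s);
            unset($streams[$host]);
        }
    }
}
// $res now holds one raw HTTP response per engine, ready for aggregation.
?>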

answered by hek2mgl


There is one viable approach: make a CLI PHP file that receives what it has to do as arguments and prints whatever result it produces, serialized.

In your main app you may popen() as many of these workers as you need, and then collect their outputs in a simple loop:

[edit] I used your worker example; I just had to chmod +x it and add a #!/usr/bin/php line on top:

#!/usr/bin/php
<?php
echo 'File 1 - Test 1'.PHP_EOL;
$sleep = mt_rand(1, 5);
echo 'Start Time: '.date("g:i:sa").PHP_EOL;
echo 'Sleep Time: '.$sleep.' seconds.'.PHP_EOL;
sleep($sleep);
echo 'Finish Time: '.date("g:i:sa").PHP_EOL;
?>

I also modified the run script a little bit - ex.php:

#!/usr/bin/php
<?php
// Start both workers at once; popen() returns immediately,
// so the two scripts run in parallel from this point on.
$pha = array();
$res = array();
$pha[1] = popen("./file1.php", "r");
$res[1] = '';
$pha[2] = popen("./file2.php", "r");
$res[2] = '';
// Collect the outputs one pipe at a time; the workers themselves
// keep running in parallel while we read.
foreach ($pha as $id => $ph) {
    while (!feof($ph)) {
        $res[$id] .= fread($ph, 8192);
    }
    pclose($ph);
}
echo $res[1].$res[2];

Here is the result when tested on the CLI (it's the same when ex.php is called from the web, but the paths to file1.php and file2.php have to be fixed):

$ time ./ex.php 
File 1 - Test 1
Start Time: 11:00:33am
Sleep Time: 3 seconds.
Finish Time: 11:00:36am
File 2 - Test 1
Start Time: 11:00:33am
Sleep Time: 4 seconds.
Finish Time: 11:00:37am

real  0m4.062s
user  0m0.040s
sys   0m0.036s

As seen in the result, one script takes 3 seconds to execute and the other takes 4, yet running in parallel they finish together after 4 seconds.

[end edit]

This way the slow operations run in parallel; only collecting the results is serial.

In total it will take (slowest worker time) + (time for collecting) to execute. Since the time for collecting and unserializing the results is negligible, you get all the data in the time of the slowest request.

As a side note, you may try the igbinary serializer, which is much faster than the built-in one.
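For illustration, assuming the igbinary PECL extension is installed, its functions are drop-in replacements for serialize()/unserialize():

<?php
// Worker side: emit the result in igbinary's compact binary format.
echo igbinary_serialize(array('engine' => 'A', 'hits' => 42));

// Parent side: decode what was read from that worker's pipe.
// $result = igbinary_unserialize($res[1]);
?>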

As noted in comments:

worker.php is executed outside of the web request, so you have to pass all of its state via arguments. Passing arguments raises escaping and security problems, so an inefficient but simple way to handle them is base64.
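A minimal sketch of that hand-off (the worker path and the argument layout are illustrative, not from the original answer):

<?php
// Parent side: serialize + base64 makes the argument shell-safe.
$args = base64_encode(serialize(array('query' => 'foo "bar" & baz')));
$ph   = popen('./worker.php ' . escapeshellarg($args), 'r');

// Worker side (inside worker.php):
// $args = unserialize(base64_decode($argv[1]));
// ... do the work ...
// echo serialize($result);   // the parent unserializes what it reads back
?>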

A major drawback of this approach is that it is not easy to debug.

It can be further improved by using stream_select() instead of blocking fread() calls, so that the outputs are also collected in parallel; a rough sketch follows.
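Applied to the collection loop above, the improvement could look like this (note that stream_select() on pipes is reliable on Linux but not on Windows):

<?php
// Same two workers, but now neither pipe blocks the other.
$pha = array(1 => popen('./file1.php', 'r'), 2 => popen('./file2.php', 'r'));
$res = array(1 => '', 2 => '');
while ($pha) {
    $read = array_values($pha);
    $write = $except = null;
    if (stream_select($read, $write, $except, 5) === false) break;
    foreach ($read as $ph) {
        $id = array_search($ph, $pha, true); // map pipe back to its worker
        $res[$id] .= fread($ph, 8192);
        if (feof($ph)) {                     // this worker has exited
            pclose($ph);
            unset($pha[$id]);
        }
    }
}
echo $res[1].$res[2];
?>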

answered by bbonev


