I'd be very grateful if you could compare the winning O’Rourke's Perl solution to Lundh's Python solution, as I don't know Perl good enough to understand what's going on there. More specifically I'd like to know what gave Perl version 3x advantage: algorithmic superiority, quality of C extensions, other factors?
Wide Finder: Results
The better regex implementation of perl is one part of the story. That can't explain however why the perl implementation scales better. The difference become bigger with more processors. For some reason the python implementation has an issue there.
Perl is heavily optimized for text processing. There are so many factors that it's hard to say what's the exact difference. Text is represented completely differently internally (utf-8 versus utf-16/utf-32) and the regular expression engines are completely different too. Python's regular expression engine is a custom one and not as much used as the perl one. There are very few developers working on it (I think it's largely unmaintained) in contrast to the Perl one which is basically the "core of the language".
After all Perl is the text processing language.
Many-core Engine (MCE) has been released for Perl. MCE does quite well at this, even when reading directly from disk with 8 workers (cold cache). Compare to wf_mmap. MCE follows a bank queuing model when reading input data. Look under the images folder for slides on it.
The source code is hosted at http://code.google.com/p/many-core-engine-perl/
The perl documentation can be read at https://metacpan.org/module/MCE
An implementation of the Wide Finder with MCE is provided under examples/tbray/
https://metacpan.org/source/MARIOROY/MCE-1.514/examples/tbray/
Enjoy MCE.
Script....:  baseline1  baseline2  wf_mce1  wf_mce2  wf_mce3  wf_mmap
Cold cache:      1.674      1.370    1.252    1.182    1.174    3.056
Warm cache:      1.236      0.923    0.277    0.106    0.098    0.092
The Perl implementation uses the mmap system call. What that call does is establish a pointer which to the process appears to be a normal segment of memory or buffer to the program. It maps the contents of a file to a region of memory. There are performances advantages of doing this vs normal file IO (read) - one is that there are no user-space library calls necessary to get access to the data, another is that there are often less copy operations necessary (eg: moving data between kernel and user space).
Perl's strings and regular expressions are 8-bit byte based (as opposed to utf16 for Java for example), so Perl's native 'character type' is the same encoding of the mmapped file.
When the regular expression engine then operates on the mmap backed variable, it is directly accessing the file data via the mamped memory region - without going through Perl's IO functions, or even libc's IO functions.
The mmap is probably largely responsible for the performance difference vs the Python version using the normal Python IO libraries - which additionally introduce the overhead of looking for line breaks.
The Perl program also supports a -J to parallelize the processing, where the oepen "-|" causes a fork() where the file handle in the parent is to the child's stdout. The child processes serialize their results to stdout and the parent de-serializes them to coordinate and summarize the results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With