
Performance of Jsoup vs regexes vs XPath for extracting content from HTML?

I know that, in general, HTML shouldn't be parsed with regexes.

But I want to run a performance test against a web application, and I know exactly what the HTML will look like, so I can use regexes to extract some data from the page source.

Since this is a performance test (using JMeter), I want the extraction to consume as few resources as possible on the master machine.

Which option is the least resource-intensive: XPath, regexes (Jakarta ORO) or Jsoup?

Andrei Botalov asked Oct 24 '25 19:10



1 Answer

As of JMeter 2.8, the answer is regexps, although this of course depends on the expressions you use. The regexp implementation in JMeter is fairly well optimized, and regexp post-processing is the main way to do correlation.
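
To illustrate the kind of extraction a regexp performs, here is a minimal sketch. It uses the JDK's java.util.regex as a stand-in (JMeter itself relies on Jakarta ORO), and the hidden "token" field, the class name and the method name are hypothetical, chosen only for the example. The pattern is compiled once and reused, and no document tree is built.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractSketch {
    // Compiled once and reused across calls; the markup being matched is a
    // hypothetical hidden form field, not something taken from a real page.
    private static final Pattern TOKEN =
        Pattern.compile("<input type=\"hidden\" name=\"token\" value=\"([^\"]*)\"");

    // Scans the raw page source and collects capture group 1 of every match.
    static List<String> extractTokens(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html =
            "<form><input type=\"hidden\" name=\"token\" value=\"abc123\"></form>";
        System.out.println(extractTokens(html)); // prints [abc123]
    }
}
```

This only works because the page layout is known in advance: any change in attribute order or quoting would make the pattern miss.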

As for Jsoup, it would require custom code, for example in a JSR223 PostProcessor.

JMeter 2.9 will introduce a new CSS/JQuery selector based Extractor with 2 possible underlying implementations:

  • JSOUP

  • Jodd Lagarto (CSSelly)

See:

  • https://issues.apache.org/bugzilla/show_bug.cgi?id=54259

Its performance will be lower than that of regexps because it builds a DOM document, but it makes the syntax much simpler for Test Plans that don't need to be ultra-optimised.

Finally, XPath also builds a DOM tree:

  • http://www.developer.com/xml/article.php/3397691/Does-StAX-Belong-in-Your-XML-Toolbox.htm

As a result, its memory and CPU cost is higher than that of regexps, particularly if you want to extract many elements. An enhancement has been created for this:

  • https://issues.apache.org/bugzilla/show_bug.cgi?id=53973
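
The DOM cost mentioned above can be seen in a minimal sketch that uses the JDK's own XML and XPath APIs. This is not JMeter's XPath Extractor code; the class name, the method name and the sample markup are hypothetical, and the input is assumed to be well-formed XHTML. The whole document is parsed into a tree before the expression is evaluated, which is where the extra memory and CPU go.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractSketch {
    // Parses the full document into a DOM tree, then evaluates //a/@href on it.
    static List<String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The complete tree is held in memory, even if only a few
            // attribute values are needed.
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (Exception e) {
            // Malformed input or parser configuration problems end up here.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><body><a href=\"/a\">A</a><a href=\"/b\">B</a></body></html>";
        System.out.println(extractLinks(page)); // prints [/a, /b]
    }
}
```

Real-world HTML is often not well-formed XML, so a strict parser like this one would reject it; the sketch is only meant to show the parse-then-query shape of the approach.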
UBIK LOAD PACK answered Oct 26 '25 09:10



