Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DOM vs SAX XML parsing for large files

Background:

I have a large OWL (Web Ontology Language) file (approximately 125MB or 1.5 million lines long) that I would like to parse into a set of tab delimited values. I have been researching about the SAX and DOM XML parsers, and found the following:

  • SAX allows for the document to be read node by node, so the whole document is not in memory.
  • DOM allows for the whole document to be placed in memory at once, but has a ridiculous amount of overhead.

SAX vs DOM for large files:

As far as I understand it,

  • If I use SAX, I would have to iterate through 1.5 millions lines of code, node by node.
  • If I use DOM, I would have a big overhead, but then the results would be returned rapidly.

Problem:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know any good parsers for JavaScript. I realize many are made for Java, but I am much more comfortable with JavaScript.

like image 350
Shrey Gupta Avatar asked Oct 16 '25 01:10

Shrey Gupta


1 Answers

Meet StAX

Just like SAX, StAX follows a Streaming programming model for parsing XML. But, it's a cross between DOM's bidirectional read/write support, its ease of use and SAX's CPU and memory efficiency.

SAX is read-only and does push parsing forcing you to handle events and errors right there and then while parsing the input. StAX on the other hand is a pull parser that lets the client call methods on the parser when needed. This also means that the application can read multiple XML files simultaneously.

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.

like image 132
Ravi K Thapliyal Avatar answered Oct 18 '25 17:10

Ravi K Thapliyal



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!