I have a large OWL (Web Ontology Language) file (approximately 125 MB, or 1.5 million lines) that I would like to parse into a set of tab-delimited values. I have been researching the SAX and DOM XML parsers, and found the following:
As far as I understand it,
I need to be able to use this parser multiple times on similar files of the same length.
Therefore, which parser should I use?
Bonus points: Does anyone know of any good parsers for JavaScript? I realize many are made for Java, but I am much more comfortable with JavaScript.
Just like SAX, StAX follows a streaming programming model for parsing XML. However, it is a cross between DOM's bidirectional read/write support and ease of use, and SAX's CPU and memory efficiency.
SAX is read-only and does push parsing, forcing you to handle events and errors right there and then while parsing the input. StAX, on the other hand, is a pull parser that lets the client call methods on the parser when needed. This also means that the application can read multiple XML files simultaneously.
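To make the pull model concrete for the OWL-to-TSV case in the question, here is a minimal sketch using the StAX cursor API (`XMLStreamReader`) from the JDK. The class name `OwlToTsv`, the method `writeTsv`, and the choice of columns (each named `owl:Class` with its `rdfs:label`) are assumptions for illustration only; the real columns depend on what the ontology contains. Rows are written as they are found, so memory use should stay roughly flat regardless of file size.

```java
import java.io.PrintWriter;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class OwlToTsv {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#";
    static final String OWL_NS = "http://www.w3.org/2002/07/owl#";

    /**
     * Hypothetical helper: streams the input and writes one
     * "classIRI<TAB>label" row per rdfs:label that is a direct child
     * of a named owl:Class. Returns the number of rows written.
     */
    static int writeTsv(Reader in, PrintWriter out) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            String currentClass = null; // IRI of the class we are inside, if any
            int depth = 0;              // element nesting below that class
            int rows = 0;
            // The application drives the loop: it asks for the next event
            // only when it is ready for it (pull), instead of receiving callbacks.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (currentClass == null) {
                        if (is(reader, OWL_NS, "Class")) {
                            // Anonymous classes have no rdf:about and are ignored here
                            currentClass = reader.getAttributeValue(RDF_NS, "about");
                            depth = 0;
                        }
                    } else if (depth == 0 && is(reader, RDFS_NS, "label")) {
                        // getElementText() also consumes the matching end tag
                        out.print(currentClass + "\t" + reader.getElementText() + "\n");
                        rows++;
                    } else {
                        depth++; // some other child element (possibly a nested class)
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT && currentClass != null) {
                    if (depth == 0) {
                        currentClass = null; // left the top-level class
                    } else {
                        depth--;
                    }
                }
            }
            reader.close();
            return rows;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML input", e);
        }
    }

    private static boolean is(XMLStreamReader r, String ns, String local) {
        return ns.equals(r.getNamespaceURI()) && local.equals(r.getLocalName());
    }
}
```

For a real file, the `Reader` could come from `Files.newBufferedReader(path)` and the `PrintWriter` could wrap a buffered file writer. Labels that themselves contain tabs or newlines would need escaping, which this sketch does not attempt.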
╔═════════════════════════════════════╦════════════════════════╦════════════════════════╦══════════════════════╦═══════════════════════════╗
║ JAXP API Property                   ║ StAX                   ║ SAX                    ║ DOM                  ║ TrAX                      ║
╠═════════════════════════════════════╬════════════════════════╬════════════════════════╬══════════════════════╬═══════════════════════════╣
║ API Style                           ║ Pull events; streaming ║ Push events; streaming ║ In memory tree based ║ XSLT Rule based templates ║
║ Ease of Use                         ║ High                   ║ Medium                 ║ High                 ║ Medium                    ║
║ XPath Capability                    ║ No                     ║ No                     ║ Yes                  ║ Yes                       ║
║ CPU and Memory Utilization          ║ Good                   ║ Good                   ║ Depends              ║ Depends                   ║
║ Forward Only                        ║ Yes                    ║ Yes                    ║ No                   ║ No                        ║
║ Reading                             ║ Yes                    ║ Yes                    ║ Yes                  ║ Yes                       ║
║ Writing                             ║ Yes                    ║ No                     ║ Yes                  ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD) ║ No                     ║ No                     ║ Yes                  ║ No                        ║
╚═════════════════════════════════════╩════════════════════════╩════════════════════════╩══════════════════════╩═══════════════════════════╝
Reference: Does StAX Belong in Your XML Toolbox?
StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.
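The point about not having to read the entire document can be sketched with the Event Iterator API (`XMLEventReader`). In this example the class `StaxSkip`, its methods, and the element names passed in are all hypothetical; the idea is only that the application may fast-forward past a subtree it does not care about and may stop as soon as it has what it needs. Note that the parser still has to scan the skipped markup, so the main saving is likely to come from stopping early and from not building objects for ignored content.

```java
import java.io.Reader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxSkip {
    /** Consumes events up to and including the end tag of the element just opened. */
    static void skipSubtree(XMLEventReader events) throws XMLStreamException {
        int depth = 1; // we are already inside one start element
        while (depth > 0 && events.hasNext()) {
            XMLEvent e = events.nextEvent();
            if (e.isStartElement()) {
                depth++;
            } else if (e.isEndElement()) {
                depth--;
            }
        }
    }

    /**
     * Returns the text of the first element named `wanted` that is not inside
     * an element named `skipped`, or null if there is none. Parsing stops at
     * the first match; the rest of the document is never read.
     */
    static String firstText(Reader in, String skipped, String wanted) {
        try {
            XMLEventReader events =
                    XMLInputFactory.newInstance().createXMLEventReader(in);
            try {
                while (events.hasNext()) {
                    XMLEvent e = events.nextEvent();
                    if (!e.isStartElement()) {
                        continue;
                    }
                    String name = e.asStartElement().getName().getLocalPart();
                    if (name.equals(skipped)) {
                        skipSubtree(events);            // ignore this whole branch
                    } else if (name.equals(wanted)) {
                        return events.getElementText(); // stop early
                    }
                }
                return null;
            } finally {
                events.close();
            }
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("Could not parse XML input", ex);
        }
    }
}
```

With SAX, by contrast, the handler keeps receiving callbacks until the parser finishes, unless it aborts by throwing an exception.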