Download: Tool Finds Best XML Parsers for Projects
When it comes to XML parsers, there's no shortage of choices: Apache's Xerces, GNU's JAXP, kXML and Sun's J2SE SDK parser are all available to developers as free downloads. You can also select from a few different XML-parsing APIs, not to mention JVMs (Java Virtual Machines).
So, for developers getting ready to XML-enable their legacy code, are there any road maps for navigating this thicket of parsing choices? Are some parsers better than others? What's the best JVM for XML parsing? Do some parsers work better with massive XML files than others?
Well, it turns out that someone has tried to answer these gnarly questions -- and yes, some parsers do work better than others. Developer Pankaj Kumar has authored a nifty little XML benchmarking tool called XPB4J to quantify XML parser performance. As a result of his testing, Kumar has also come up with a short checklist (see below) that developers can use to help them choose the right parsing technologies. Developers can download XPB4J free of charge from Kumar's website.
Does Your XML Parser Measure Up?
Kumar's XPB4J defines XML processing broadly: parsing, transformation, validation, encryption/decryption, custom access/manipulation, or any combination of these, applied to one or more XML files and/or byte streams. "Everyone was telling me that XML was bloated, so I wanted to check it out," Kumar said.
Indeed, Kumar discovered that performance differs among implementations of the four major types of XML processing APIs: SAX, DOM, XmlPull and XSLT. Kumar also measured how SAX-, DOM- and XSLT-based processors performed on different file sizes.
When choosing a parser, Kumar advised, performance and memory usage are just two of the criteria that should be considered. He also suggested that developers consider several other metrics, including:
- The disk and memory footprint of the parser;
- Support of namespaces and other XML features;
- Conformance to standards (such as JAXP);
- Validation support (whether it's DTD- or XML-schema-based); and
- Whether the parser comes with support.
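Several items on this checklist can be probed directly through JAXP. The following is a minimal sketch, not part of XPB4J: the class name `ParserCapabilities` and its `describe` method are made up for illustration, and the result depends on whichever parser JAXP finds on the classpath.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical helper (not from XPB4J): asks JAXP for a SAX parser with
// namespace and validation support and reports what it actually got.
final class ParserCapabilities {

    static String describe() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // checklist item: namespace support
        factory.setValidating(true);     // checklist item: DTD validation
        try {
            SAXParser parser = factory.newSAXParser();
            return parser.getClass().getName()
                    + " namespaceAware=" + parser.isNamespaceAware()
                    + " validating=" + parser.isValidating();
        } catch (ParserConfigurationException | SAXException e) {
            // A parser that cannot honor the requested settings fails here.
            return "unsupported configuration: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running the same check against each candidate parser shows which implementation JAXP picks up and whether it accepts the requested features. It says nothing about footprint or vendor support, which still have to be checked by hand.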
Inside the Testing -- Rating XML Parsers
Specifically, Kumar's XPB4J focuses on XML operations and consists of:
- Code to carry out certain XML processing tasks;
- Code and script to run the processing tasks and report performance measurements;
- XPB4J Framework to plug in code and scripts for any processing; and
- XPB4J (version .90), which also includes code for a specific processing activity, called XStat Processing, to collect certain statistics on an XML file.
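To give a feel for what "run the processing tasks and report performance measurements" involves, here is a rough sketch of a timing harness. It is not XPB4J's code or API: `MiniBench`, `averageNanos` and `usedHeapBytes` are hypothetical names, and the warm-up approach is just one common way to reduce JIT noise.

```java
// Hypothetical micro-harness (not XPB4J's framework): times a pluggable
// task and gives a rough view of heap usage.
final class MiniBench {

    /** Returns the average wall-clock nanoseconds per measured run. */
    static long averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // untimed runs give the JIT a chance to compile hot code
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / measuredRuns;
    }

    /** Approximate heap in use; gc() is only a hint, so treat this as rough. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }
}
```

A call such as `MiniBench.averageNanos(() -> parseOnce(file), 3, 10)` would time a parsing task, where `parseOnce` stands for whatever processing code is being measured. Comparing `usedHeapBytes()` before and after a run gives only a coarse memory figure.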
This software was used to benchmark many popular XML parsers and JVMs. The following summarizes Kumar's product-specific results:
Chief among his findings is that XML parser performance for each type varies by file size. For instance, he determined that while SAX-based parsers were the fastest on 100 KB, 1 MB and 10 MB XML files, the most noticeable difference was in terms of memory utilization -- especially with larger files. A SAX parser processed a 10 MB file in half the time a DOM parser needed, but when it came to memory, the SAX parser used 1/75 of what its DOM alternative consumed. The lesson, Kumar said, is "If you're dealing with very large XML files, something beyond 10 megabytes, stick with SAX."
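The memory gap follows from how the two APIs work: SAX hands the application a stream of events and keeps nothing, while DOM builds the entire document tree before the application sees any of it. The sketch below counts elements both ways using the standard JAXP classes. The class name `ElementCount` is invented for this illustration and the code is not taken from Kumar's XStat implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: the same statistic gathered through SAX and DOM.
final class ElementCount {

    /** SAX: events stream past; only the counter stays in memory. */
    static int withSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return count[0];
    }

    /** DOM: the whole tree is built in memory before anything is counted. */
    static int withDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("*").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

Both methods return the same number for the same input. The difference is that `withDom` holds every node of the document at once, so its memory use grows with file size, while `withSax` holds only a counter.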
He also found that the wrong choice of parser API can seriously impact the scalability of your system, and that even within a given API, "the footprint of the different parser implementations and their capabilities vary widely."
Also noteworthy was Kumar's finding that XSLT-based transformations don't scale well for large XML files, because these "need to create a DOM-like structure within memory," he said. "This should be an important consideration for people using XSL-based transformations to create Web pages from huge quantities of dynamic data," he added.
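For readers who have not used XSLT from Java, a transformation through the standard JAXP transformation API looks roughly like the sketch below. The class name `XsltSketch` is made up for this example. Even though the input is supplied as a stream, the processor typically builds its own in-memory model of the source document before applying templates, which is the scaling concern Kumar describes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: applies a stylesheet to a document, both given as strings.
final class XsltSketch {

    static String transform(String xml, String stylesheet) {
        try {
            // Compile the stylesheet into a Transformer.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            // The source is typically read into memory in full before output starts.
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }
}
```

For large inputs, one alternative is to do the heavy filtering in a SAX handler first and hand XSLT only the reduced document. Whether that pays off depends on the stylesheet and the data.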
As far as JVMs were concerned, IBM's JDK 1.3 JVM carried the day. Interestingly, the benchmarks found that JVM efficiency had the greatest impact on DOM processing, possibly because of DOM's heavy reliance on memory management operations.
To see how your favorite XML parser may have fared, check out Kumar's full report on Java XML processors at http://www.pankaj-k.net/xpb4j/docs/Measurements-May30/measurements-May30-2002.html
The XML Parser Test Methodology
Kumar used a 900 MHz Athlon machine with 512 MB of SDRAM running Windows 2000 to measure Java XStat parsing performance with a variety of JVMs, file sizes, parsers and parsing APIs. He parsed three different data sets, which were generated by searching for the string "Bill Gates" using the Google Web API. His conclusion? When it comes to SAX or DOM processing, Sun's J2SE SDK-1.4.0 comes with the fastest Java parser.
Different XPB4J implementations of XStat processing use different parsing APIs, such as SAX, DOM, JDOM, XmlPull, DOM4J and XSLT. The following versions (commercial and open source) were tested:
- Xerces Parser: Xerces-1.4.4 and Xerces-2.0.1 from Apache Software Foundation;
- GNU JAXP Processor: GNU JAXP 1.0 Beta1;
- Piccolo SAX Parser: Piccolo 1.02;
- XML Pull Parser: XPP3 and kXML;
- JDOM: JDOM Beta8;
- DOM4J: dom4j 1.3;
- Xalan Processor: Xalan-2.3.1 from Apache Software Foundation;
- Ant Software: Jakarta Ant 1.4 from Apache Software Foundation;
- Java Software: J2SE 1.3.1 and J2SE 1.4.0 from Sun Microsystems; JRockit 1.3 from BEA Systems; and JDK 1.3 from IBM.
Kumar's next great challenge will be to compare C and C++ parsers to the Java software he's already tested. The XPB4J developer said that he's not sure what to expect from these tests, and that arguments could be made predicting that any of the three languages could be the winner. XPB4J should be extended to test C and C++ parsers soon, he said.