Memory Efficient XML Processing not only with DOM

How can I efficiently parse large XML files that can be several GB in size? With SAX? Hmm, well, yes, you can! But the resulting code is somewhat ugly. If you prefer a more maintainable approach, you should definitely try Joost, which does not load the entire XML file into memory but is quite similar to XSLT.
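To make "somewhat ugly" concrete, here is a minimal sketch of a plain SAX handler that collects the id and text of every product element. The class name PlainSaxExample, the inline sample XML and the "id=text" output format are made up for illustration. Note how much manual state (depth counter, flag, text buffer) is needed even for this tiny task.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original post.
public class PlainSaxExample {

  /** Collects "id=text" for every /products/product using hand-written state. */
  public static class ProductHandler extends DefaultHandler {
    public final List<String> results = new ArrayList<String>();
    private int depth = 0;               // current nesting level
    private boolean inProduct = false;   // true while inside a <product>
    private String currentId;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
      depth++;
      // The default SAXParserFactory is not namespace aware, so use qName.
      if (depth == 2 && "product".equals(qName)) {
        inProduct = true;
        currentId = attrs.getValue("id");
        text.setLength(0);
      }
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
      if (inProduct)
        text.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
      if (depth == 2 && inProduct) {
        results.add(currentId + "=" + text.toString().trim());
        inProduct = false;
      }
      depth--;
    }
  }

  public static List<String> parse(String xml) {
    try {
      ProductHandler handler = new ProductHandler();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.results;
    } catch (Exception e) {
      throw new RuntimeException("parsing failed", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<products>"
        + "<product id=\"1\"> CONTENT1 </product>"
        + "<product id=\"2\"> CONTENT2 </product>"
        + "</products>";
    System.out.println(parse(xml));
  }
}
```

Every additional field you want to extract adds more flags and buffers to such a handler, which is why building a small DOM per product is attractive.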

But how can I do this with DOM, or even better dom4j, if I only have 50 MB of RAM or even less? Well, this is not always possible, but under some circumstances you can do it with a small helper class. Read on!

For example, suppose you have the XML file

<products>
  <product id="1"> CONTENT1 .. </product>
  <product id="2"> CONTENT2 .. </product>
  <product id="3"> CONTENT3 .. </product>
  ...
</products>

Then you can parse it product by product via:

final List<String> idList = new ArrayList<String>();
ContentHandler productHandler =
         new GenericXDOMHandler("/products/product") {
  @Override
  public void writeDocument(String localName, Element element)
        throws Exception {
    // use DOM here
    String id = element.getAttribute("id");
    idList.add(id);
  }
};
GenericXDOMHandler.execute(new File(inputFile), productHandler);

How does this work? Every time the SAX handler reaches a <product> element, it reads that product's subtree (which is quite small) into RAM and calls the writeDocument function once the element ends. Technically, we have registered a listener on all the product elements and are waiting for ‘events’ from our GenericXDOMHandler. The code was developed for my xvantage project but is also used in production code on big files:


import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * License: http://en.wikipedia.org/wiki/Public_domain
 * This software comes without WARRANTY about anything! Use it at your own risk!
 *
 * Reads an xml via sax and creates an Element object per document.
 *
 * @author Peter Karich, peathal 'at' yahoo 'dot' de
 */
public abstract class GenericXDOMHandler extends DefaultHandler {

 private Document factory;
 private Element current;
 private List<String> rootPath;
 private int depth = 0;

 public GenericXDOMHandler(String forEachDocument) {
  rootPath = new ArrayList<String>();
  for (String str : forEachDocument.split("/")) {
    str = str.trim();
    if (str.length() > 0)
      rootPath.add(str);
  }

  if (rootPath.size() < 2)
    throw new UnsupportedOperationException("forEachDocument"
       + " must have at least one sub element in it."
       + " E.g. /root/subPath but it was:" + rootPath);
 }

 @Override
 public void startDocument() throws SAXException {
  try {
    factory = DocumentBuilderFactory.newInstance().
         newDocumentBuilder().newDocument();
  } catch (Exception e) {
    throw new RuntimeException("can't get DOM factory", e);
  }
 }

 @Override
 public void startElement(String uri, String local,
      String qName, Attributes attrs) throws SAXException {

  // go further only if we add something to our sub tree (defined by rootPath)
  if (depth + 1 < rootPath.size()) {
    current = null;
    if (rootPath.get(depth).equals(local))
      depth++;

    return;
  } else if (depth + 1 == rootPath.size()) {
    if (!rootPath.get(depth).equals(local))
      return;
  }

  if (current == null) {
    // start a new subtree
    current = factory.createElement(local);
  } else {
    Element childElement = factory.createElement(local);
    current.appendChild(childElement);
    current = childElement;
  }

  depth++;

  // Add every attribute.
  for (int i = 0; i < attrs.getLength(); ++i) {
    String nsUri = attrs.getURI(i);
    String qname = attrs.getQName(i);
    String value = attrs.getValue(i);
    Attr attr = factory.createAttributeNS(nsUri, qname);
    attr.setValue(value);
    current.setAttributeNodeNS(attr);
  }
 }

 @Override
 public void endElement(String uri, String localName,
     String qName) throws SAXException {

  if (current == null)
    return;

  Node parent = current.getParentNode();

  // root of the subtree: merge adjacent text nodes
  if (parent == null)
    current.normalize();

  if (depth == rootPath.size()) {
    try {
      writeDocument(localName, current);
    } catch (Exception ex) {
      throw new RuntimeException("Exception"
        + " while writing one element of path:" + rootPath, ex);
    }
  }

  // climb up one level
  current = (Element) parent;
  depth--;
 }

 @Override
 public void characters(char buf[], int offset, int length)
       throws SAXException {
  if (current != null)
    current.appendChild(factory.createTextNode(
       new String(buf, offset, length)));
 }

 /** Called once for every element that matches the configured path. */
 public abstract void writeDocument(String localName, Element element)
     throws Exception;

 public static void execute(File inputFile,
     ContentHandler handler)
     throws SAXException, FileNotFoundException, IOException {

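   // close the stream even if parsing fails
   InputStream input = new FileInputStream(inputFile);
   try {
     execute(input, handler);
   } finally {
     input.close();
   }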
 }

 public static void execute(InputStream input,
     ContentHandler handler)
     throws SAXException, FileNotFoundException, IOException {

   XMLReader xr = XMLReaderFactory.createXMLReader();
   xr.setContentHandler(handler);
   InputSource iSource = new InputSource(new InputStreamReader(input, "UTF-8"));
   xr.parse(iSource);
 }
}

PS: It should be simple to adapt this class to your needs, e.g. to use dom4j instead of DOM. You could even register several paths instead of a single rootPath via a BindingTree. For an implementation of this, look at my xvantage project.
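As a rough illustration of registering several paths, here is a minimal sketch that only does the path matching and leaves out the DOM building. It is not the BindingTree from xvantage: the class MultiPathMatcher, its match method and the sample paths are hypothetical, and a plain Set of path strings stands in for a real tree structure.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the original post or of xvantage.
public class MultiPathMatcher extends DefaultHandler {

  private final Set<String> paths;                           // registered paths
  private final Deque<String> stack = new ArrayDeque<String>(); // full path per open element
  public final List<String> matches = new ArrayList<String>();

  public MultiPathMatcher(String... registeredPaths) {
    paths = new HashSet<String>(Arrays.asList(registeredPaths));
  }

  @Override
  public void startElement(String uri, String local, String qName, Attributes attrs) {
    // Build the absolute path of the element that just started.
    String parent = stack.isEmpty() ? "" : stack.peek();
    String path = parent + "/" + qName;
    stack.push(path);
    if (paths.contains(path))
      matches.add(path); // a real handler would start building a subtree here
  }

  @Override
  public void endElement(String uri, String local, String qName) {
    stack.pop();
  }

  public static List<String> match(String xml, String... registeredPaths) {
    try {
      MultiPathMatcher handler = new MultiPathMatcher(registeredPaths);
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.matches;
    } catch (Exception e) {
      throw new RuntimeException("parsing failed", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<shop><products><product/><product/></products>"
        + "<vendors><vendor/></vendors></shop>";
    System.out.println(match(xml, "/shop/products/product", "/shop/vendors/vendor"));
  }
}
```

Comparing full paths avoids matching an element with the right name at the wrong place in the tree, at the cost of one string per open element.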

PPS: If you want to evaluate XPath expressions in the writeDocument method, make sure that the ordinary XPath engine does not become a performance bottleneck, because the method can be called many times. In my case I had several thousand documents, but jaxen solved this problem!
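If you stay with the standard javax.xml.xpath API instead of jaxen, one thing worth trying is to compile the expression once and reuse it for every element. The following is only a sketch of that idea, not a measured fix: the class CompiledXPathExample, the name child element and the sample XML are made up, and for simplicity the demo evaluates against elements of a normally parsed document rather than inside writeDocument.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original post.
public class CompiledXPathExample {

  // Compiled once up front; compiling inside the per-element callback
  // would repeat the expression parsing for every product.
  private final XPathExpression nameExpr;

  public CompiledXPathExample() {
    try {
      nameExpr = XPathFactory.newInstance().newXPath().compile("name");
    } catch (XPathExpressionException e) {
      throw new IllegalStateException("invalid XPath expression", e);
    }
  }

  /** Stand-in for the work you would do inside writeDocument. */
  public String nameOf(Element product) {
    try {
      // Relative expression, evaluated with the product element as context.
      return nameExpr.evaluate(product);
    } catch (XPathExpressionException e) {
      throw new IllegalStateException("evaluation failed", e);
    }
  }

  public static List<String> names(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      CompiledXPathExample example = new CompiledXPathExample();
      NodeList products = doc.getElementsByTagName("product");
      List<String> result = new ArrayList<String>();
      for (int i = 0; i < products.getLength(); i++)
        result.add(example.nameOf((Element) products.item(i)));
      return result;
    } catch (Exception e) {
      throw new RuntimeException("parsing failed", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<products>"
        + "<product id=\"1\"><name>Apple</name></product>"
        + "<product id=\"2\"><name>Pear</name></product>"
        + "</products>";
    System.out.println(names(xml));
  }
}
```

Keep in mind that an XPathExpression is not thread-safe, so use one instance per handler if you parse several files in parallel.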

PPPS: If you want to handle writing and reading XML (‘XML serialization’) from Java classes, check this list out!

3 thoughts on “Memory Efficient XML Processing not only with DOM”

  1. Xvantage is in an alpha state, so I couldn’t recommend it for production use.

    But the listed code snippet (which was extracted from xvantage) works pretty well for us.