Raph Levien, 28 March 1999
Here are some thoughts regarding defining and implementing DOM interfaces for Gnome, as part of the world DOMination project.
Basically, the problem we are trying to solve is to make the DOM an attractive interface for Gnome applications to store their state. The primary cost is forcing applications to modify and access their state through a fairly narrow interface, along with increased memory requirements for the state itself. The primary advantage is that a unified framework should allow mixing of components within an application, as well as providing generic services for such things as remote access to state, loading and saving of state, undo, and many important parts of a plugin mechanism.
Currently, there is a gnome-dom codebase in Gnome CVS. I am concerned that this codebase and the interfaces it defines may be inadequate for the uses I am considering. The primary shortcomings are:
- Memory representation on the order of 10x the original XML document.
- No mechanism (currently) for notifying the application of changes within the tree. It should be noted that the W3C has a working draft of a Level 2 DOM specification that has support for adding listeners to events such as mutations of the tree.
Implementing a lightweight, yet fully standards compliant DOM seems to be challenging. Here I will outline some of my ideas about how it might be done.
Nodes and node handles
The DOM spec is largely organized around a Node data structure, with a number of traversal methods defined (parent, first child, next sibling, etc.). One principle of the DOM is that these Nodes are persistent across modifications of the tree. For example, a Node pointing to the first child of the second child of the root will still point to that same node after the first child of the root is deleted; the node is then the first child of the first child of the root.
When implementing a DOM that is accessible over CORBA IIOP, remote references to Nodes will always be through a "node handle" data structure. There is one specific challenge here: once it hands out a handle, the server does not know if or when the client "forgets" it. Thus, node handles can't be represented as simply as "1st child of 2nd child of root", as they would be invalidated by tree changes. How might node handles be implemented?
The first approach, and the one seemingly chosen by gnome-dom, is that node handles correspond to pointers to allocated memory. By storing links to parents, children, and siblings, all of the traversal methods can be trivially implemented. Further, these node handles easily survive changes to the tree.
However, this type of node handle is not adequate to fully implement the DOM. For example, the DOM provides a "getElementsByTagName" method that returns a NodeList of all descendant elements with the given tag name. This NodeList also tracks changes to the tree, i.e. if a node is deleted, the size of the NodeList decreases by one. It is not practical to use a pointer to allocated memory as a node handle for this NodeList; among other problems, the server would never know when this memory might be safely deallocated.
Thus, a second style of node handle comes into focus: a string description of a _procedure_ that might retrieve the value. For example, el.getElementsByTagName("img") might return a node handle that is the concatenation of el's node handle and the string "img". Then, when this node handle is queried for its length, it can traverse the tree starting at el and count the number of <img> descendants.
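The idea can be sketched in a few lines of C. This is a toy illustration, not the gnome-dom API: ToyNode, NodeListHandle, and node_list_length are hypothetical names, and a real implementation would concatenate handle strings rather than store pointers. The point is that the handle carries only the query, so its length is recomputed by traversal on demand and automatically tracks mutations, with no per-NodeList state on the server.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical toy tree; illustrative only. */
typedef struct ToyNode {
    const char *tag;
    struct ToyNode *first_child;
    struct ToyNode *next_sibling;
} ToyNode;

/* A "procedural" NodeList handle: the element the query was issued on,
 * plus the tag name to match.  No result set is ever materialized. */
typedef struct {
    ToyNode *el;
    const char *tag;    /* e.g. "img" */
} NodeListHandle;

static int count_descendants(const ToyNode *n, const char *tag)
{
    int count = 0;
    const ToyNode *c;
    for (c = n->first_child; c != NULL; c = c->next_sibling) {
        if (strcmp(c->tag, tag) == 0)
            count++;
        count += count_descendants(c, tag);
    }
    return count;
}

/* Querying the NodeList's length traverses the tree each time. */
int node_list_length(const NodeListHandle *h)
{
    return count_descendants(h->el, h->tag);
}
```

Because nothing is cached in the handle, deleting an `<img>` descendant after the handle has been given out simply makes the next length query return one less.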
My approach here is that this type of node handle can be extended to vanilla tree traversal as well, opening up the possibility of a very compact representation.
Here is one concrete proposal for a compact data type. I do not claim it is in any way optimal.
A top-level document consists of a number of chunks. A chunk contains some number of nodes. Further, chunks can hold references to other chunks. The overall structure is quite reminiscent of a B-tree. Further, tag and attribute names are interned in a global table and can be referred to by an index. (i.e. once a name is interned, it can never be deleted from the document representation).
Within a chunk, an element is represented as follows:

  4-byte byte count for the element
  4-byte unique node id
  4-byte tagName index
  4-byte byte count for the attributes
  [the attributes, sequentially]
  [the children, sequentially]
An attribute is represented as follows:

  4-byte byte count for the attribute
  4-byte attributeName index
  [string representing the attribute value]
References to other chunks and character data nodes are handled similarly. There are also a few spare bits we can use (using the Lisp trick of 4-byte alignment and using the bottom bits as tag info). These bits will distinguish between the various node types and also indicate whether an element is the last child of its parent.
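A minimal sketch of the tag-bit packing: with 4-byte alignment every byte count is a multiple of 4, so its bottom two bits are free. The particular bit assignments and function names here are illustrative assumptions, not part of any proposed format.

```c
#include <assert.h>
#include <stdint.h>

/* The "Lisp trick": byte counts are 4-byte aligned, so the bottom two
 * bits of the count word can carry tag info (node type, last-child flag,
 * etc.).  Which bits mean what is left open here. */
static uint32_t pack_word(uint32_t byte_count, unsigned tag_bits)
{
    assert((byte_count & 3u) == 0);  /* must be 4-byte aligned */
    assert(tag_bits < 4u);           /* only two spare bits */
    return byte_count | tag_bits;
}

static uint32_t byte_count_of(uint32_t word) { return word & ~3u; }
static unsigned tag_bits_of(uint32_t word)   { return word & 3u; }
```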
Conceptually, DOM traversal operations can be implemented by searching the entire document for the node id. This is, of course, hideously inefficient in the general case. Thus, I propose that the DOM implementation contains a cache that maps node ids to physical locations. Only when there is a miss in the cache does the tree need to be searched to find an id. The first child and next sibling operations can be done in constant time given a physical location - the result of these operations is also placed in the cache. Thus, a preorder traversal of the tree is likely to always hit in the cache, even if the cache is quite small.
Mutations of the tree also invalidate entries in the cache, as appropriate.
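The cache itself can be as simple as a small direct-mapped table. The following is a sketch under assumed names (NodeCache, cache_lookup, etc. are not gnome-dom API), with the physical location represented as a byte offset into the chunked document:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define CACHE_SLOTS 64

typedef struct {
    uint32_t node_id;
    size_t   offset;   /* physical location of the node */
    int      valid;
} CacheEntry;

/* Direct-mapped cache from node id to physical location. */
typedef struct {
    CacheEntry slots[CACHE_SLOTS];
} NodeCache;

/* Returns 1 on a hit; on a miss the caller searches the document. */
static int cache_lookup(const NodeCache *c, uint32_t id, size_t *offset)
{
    const CacheEntry *e = &c->slots[id % CACHE_SLOTS];
    if (e->valid && e->node_id == id) {
        *offset = e->offset;
        return 1;
    }
    return 0;
}

/* Called after a search, or after a constant-time first-child /
 * next-sibling step, so that subsequent lookups hit. */
static void cache_insert(NodeCache *c, uint32_t id, size_t offset)
{
    CacheEntry *e = &c->slots[id % CACHE_SLOTS];
    e->node_id = id;
    e->offset = offset;
    e->valid = 1;
}

/* Called when a mutation moves or deletes the node. */
static void cache_invalidate(NodeCache *c, uint32_t id)
{
    CacheEntry *e = &c->slots[id % CACHE_SLOTS];
    if (e->valid && e->node_id == id)
        e->valid = 0;
}
```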
Additionally, there are quite a few "nodes" in the DOM interface that don't correspond directly to a node in the physical representation. For example, you can request an Attr node of an element. The node handle for this node would be a concatenation of the node id of the element and the attrName id of the attribute. This node has a child which is (usually) a Text node containing the attribute value. The id of this node is the same as the node handle for the attribute, plus a flag indicating that it's the child.
I believe this representation satisfies the requirements of being reasonably compact, while also allowing a 100% standards-compliant DOM interface on the back-end, and can be implemented without memory leaks.
Thus, I suggest that the interface be designed to handle node-handle representations that include both the traditional node-pointer approach and these id-based node handles.
For the C language implementation of the DOM interface, a node handle should include both a pointer to the enclosing Document structure and also the various nodes, ids, and flags required. Memory for the node handle should be allocated on the client side. Thus, methods returning a node handle should have a pointer to the node handle structure passed in as a "result" argument.
The B-tree representation requires the following layouts for node handles:
Document * the enclosing document
int32 node_id the node id of the node itself
int32 attrName_id the attribute name for Attr nodes of elements
boolean to distinguish between the Attr node and its Text child
A node representation can fit Nodes into the following structure:
Document * the enclosing document
Node * a pointer to the node
However, NodeLists (the result of a getElementsByTagName invocation) require this layout:
Document * the enclosing document
Node * a pointer to the node
int32 tagName_id the id of the tagName (or maybe a pointer)
Thus, it seems to me that three pointers should be adequate to represent node handles (the extra bit for the boolean can be snuck in a number of different ways).
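One way the three layouts above can share a three-word, client-allocated structure is a union; the field and type names below are illustrative assumptions, not the gnome-dom API.

```c
#include <assert.h>
#include <stdint.h>

typedef struct Document Document;  /* opaque: the enclosing document */
typedef struct Node Node;          /* opaque: pointer-based node */

/* One possible node-handle layout covering all three cases.  The
 * boolean distinguishing an Attr node from its Text child could be
 * snuck into a spare bit of attr_name_id. */
typedef struct {
    Document *doc;                   /* the enclosing document */
    union {
        struct {                     /* B-tree (id-based) representation */
            uint32_t node_id;
            uint32_t attr_name_id;   /* for Attr nodes of elements */
        } id;
        struct {                     /* pointer-based representation */
            Node *node;
            uint32_t tag_name_id;    /* for NodeList handles */
        } ptr;
    } u;
} NodeHandle;
```

Since the memory is allocated on the client side, a method returning a node handle would take a `NodeHandle *` "result" argument rather than returning freshly allocated memory.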
I have a small bit of working code implementing some of these ideas. This code was written strictly for the purposes of exploring the XML DOM implementation. This code uses a simple node-based representation, but attempts to use interfaces that should allow for alternate representations.
One topic that's come up (thanks Havoc!) is the possibility of custom representations for parts of the tree. For example, if the data structure is an array of doubles, it makes sense to implement it as an array of doubles. It may be possible for people to write their own custom implementations of the DOM interface that internally use much more compact, streamlined representations of the data. Further, applications may use their own specialized access methods to get at this data (i.e. directly indexing the array to get a double, rather than using DOM traversal methods followed by an ASCII-to-double conversion of the stored text).
Because of the need for node handles to be persistent across mutation of the tree, among other considerations, making these implementations conform 100% to the DOM interface is probably not realistic. The way I'd suggest handling this is to allow a custom implementation to support only a subset of the DOM interface. When a DOM operation occurs that is not supported by the custom implementation, the DOM engine will transfer the data to its own generic implementation (preserving the node handles), and use the generic DOM operations from there on out.
It may be worth looking at "compromise" approaches to such custom representations. For example, if the array of doubles additionally stored a node id (a 50% increase in memory requirements), then the same cache technique mentioned above could be used, and insertions and deletions of the array would be possible even after node handles had been given out to clients.
I'm not sure to what extent these custom representations will be useful. However, it's probably worth not ruling them out by poor choices of interface now.
Much of the complexity and difficulty of the DOM approach is a consequence of trying to be standards compliant. It may be worth exploring a nonstandard approach, if for no other reason than to decide that the benefits of the standards outweigh their costs. In this section, I briefly outline what might be done, and the technical advantages of these approaches over the standards.
First, XML can be quite significantly simplified. James Clark has already defined a Canonical XML specification that shows what might be done. Canonical XML is probably a bit too extreme in its quest for simplicity, but I do think it would be possible to eliminate entities (other than the built-in entities needed to escape metacharacters), parameter entities, default attribute values, DTDs, notations, processing instructions, CDATA sections, and encodings other than UTF-8 without losing any significant functionality for applications. I believe a full-fledged and high performance parser can be written for this subset of XML in about 1000 lines of C. In fact, many "XML parsers" (including python's xmllib) implement just this subset or one very similar. There is no need to make the syntax incompatible with standard XML, so interoperability would still be preserved in most cases.
The key difficulty of the DOM, I think, is the requirement that node handles be persistent across tree mutation. An approach allowing much more lightweight DOM implementations would be to represent handles statelessly, probably as a sequence of navigation instructions from the root. Further, the standard DOM operations are probably too fine-grained in some cases (there is no need to have separate nodes for attributes and their values, imho).
Stateless node handles would allow for easy implementation of custom representations such as arrays of doubles. I don't think there are any serious technical disadvantages. Yes, node handles would potentially be a bit bigger (esp. for very deep trees), and yes, most DOM query operations would be proportional to depth rather than being implementable in constant time, but I think realistically most trees are shallow, so these are not serious concerns.
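A stateless handle can be sketched concretely as the sequence of child indices taken from the root, again using an illustrative toy tree (ToyNode, StatelessHandle, and resolve are assumed names). Resolution walks down from the root, so its cost is proportional to depth, and a path that no longer exists after a mutation simply fails to resolve:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative toy tree, not the gnome-dom API. */
typedef struct ToyNode {
    struct ToyNode *first_child;
    struct ToyNode *next_sibling;
} ToyNode;

/* A stateless node handle: a navigation path from the root.  A fixed
 * depth cap keeps the sketch simple. */
#define MAX_DEPTH 16
typedef struct {
    int depth;
    int path[MAX_DEPTH];   /* child index taken at each level */
} StatelessHandle;

/* Walk the path from the root; returns NULL if the path no longer
 * exists (e.g. after a mutation of the tree). */
static ToyNode *resolve(ToyNode *root, const StatelessHandle *h)
{
    ToyNode *n = root;
    int level, i;
    for (level = 0; level < h->depth; level++) {
        n = n->first_child;
        for (i = 0; i < h->path[level] && n != NULL; i++)
            n = n->next_sibling;
        if (n == NULL)
            return NULL;
    }
    return n;
}
```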
Apps would need to be organized a little differently to deal with stateless node handles. But I think this is a question of "different", not necessarily better or worse. An app that truly required a stateful node handle could fake it by using event listeners to keep track of tree mutations, and maintain the needed state on the client side of the CORBA interface.
One additional limitation of the DOM is that in a CORBA implementation, it would tend to be quite latency limited for the traversal of large trees, i.e. each network round trip would yield only one node of the tree. A workaround may be to use threads to issue multiple parallel DOM requests, but even this workaround is clearly not going to yield optimal performance.
Thus, a perfectly reasonable approach is to implement methods for retrieving and updating using simple flattened representations of big chunks of the tree at a time. It's worth noting that this could also be implemented as an extension to the standards-compliant DOM, i.e. we don't have to use a nonstandard base to get this advantage.
A decision to roll our own rather than use existing standards is not to be taken lightly. It does seem to me that many design decisions of the DOM were made in a totally different context (i.e. easy scripting of Web pages) than the one we're trying to apply it in (i.e. a framework for high performance applications written in C or one of its variants), so the extra costs are also not to be taken lightly. I don't think we'll truly be able to decide these issues without a working prototype.