Sunday 25 May 2008

Extending immutable data structures (part 2)

[This is part 2 of my series about extending immutable data structures in OCaml. You may want to go back and read part 1 first]

Previously I took this definition of an XML document tree from xml-light and asked how we could annotate nodes with extra data.

type xml =
| Element of (string * (string * string) list * xml list)
| PCData of string

If you "think outside the box", or in this case literally outside the structure, you might see that you can add annotations using a hash table. The idea is that we'll hash the addresses of nodes to the annotated data.

Let's have a concrete example to make this all a bit clearer. Start with this simple document tree:

<a href="http://camltastic.blogspot.com/">Camltastic!</a>

which is written in OCaml as:

let str = PCData "Camltastic!"
let doc = Element ("a",
["href", "http://camltastic.blogspot.com/"],
[str])

There are two nodes here (str and doc) and both have type xml.

A simple, and wrong, way to use a hash table to annotate these nodes is like this:

let h = Hashtbl.create 13
let add_annot node data = Hashtbl.add h node data
let get_annot node =
try Some (Hashtbl.find h node)
with Not_found -> None

It seems to work:

# add_annot doc "doc is annotated with this string" ;;
- : unit = ()
# get_annot doc ;;
- : string option = Some "doc is annotated with this string"
# get_annot str ;;
- : string option = None

But there are some subtle and not-so-subtle problems with this approach.

The first problem is that Hashtbl, by default, uses structural equality to hash nodes. If you had a document like <a>Link</a><a>Link</a> then there is no way to attach different data to the two links, because both links appear to be equal (they are equal, structurally). Attaching data to one link attaches the data to both.

The two other problems with this are garbage collection and copying. Garbage collection is defeated because the hash table maintains a permanent pointer to any annotated nodes. What we would like to happen is that if a node becomes garbage, the collector frees the node and any annotations on the node. In fact this never happens: the permanent pointer held by the hash table prevents any node from becoming garbage. The copying problem is that when the main program copies a node, the annotations aren't copied across.

Enter the Weak Hash Table



OCaml has a rarely-used but rather elegant module in the standard library called Weak. This module implements weak pointers. A weak pointer is like an ordinary pointer, but it doesn't "count" towards garbage collection. In other words if a value on the heap is only pointed at by weak pointers, then the garbage collector may free that value as if nothing was pointing to it.

A weak hash table uses weak pointers to avoid the garbage collection problem above. Instead of storing a permanent pointer to the keys or values, a weak hash table stores only a weak pointer, allowing the garbage collector to free nodes as normal.

It probably isn't immediately obvious, but there are 3 variants of the weak hash table, to do with whether the key, the value or both are weak. (The fourth "variant", where both key and value are strong, is just an ordinary Hashtbl). Thus the OCaml standard library "weak hash table" has only weak values. The Hweak library has weak keys and values. And the WeakMetadata library (weakMetadata.ml, weakMetadata.mli) is a version of Hweak modified by me which has only weak keys.

Which sort of weak hash table do we need? If you look at the Hashtbl implementation above you'll see that the key corresponds to the xml node and the value corresponds to the metadata attached to nodes. We assume that the main program holds onto a pointer to the xml tree until it is no longer needed, whereupon it should be collected. So we definitely need weak keys. How about the values (annotations)? The main program probably isn't holding any references to the annotations - it expects to annotate a node and then forget about the annotation until it is needed later. So values should not be weak (otherwise our annotation library will be rather "forgetful"). You'll therefore need to use my modified WeakMetadata library to get around the garbage collection problem.

Using WeakMetadata you can annotate immutable data structures, as this test program shows. Although we have solved the garbage collection problem, the other two problems I mentioned remain. Namely, it's still using structural equality, and if the main program copies a node, then the annotations don't get copied too.

However there are some rather more insidious problems that remain with this whole approach, which are down to how weak pointers are actually implemented inside the OCaml garbage collector. I wrote about them in this posting on the OCaml mailing list which also includes some analysis of the performance and memory usage of annotating XML document nodes.

Part 3 (next week) will look at what I hope is the "final word" on safely extending data structures.

4 comments:

Unknown said...

Nice post! Where's part 3?

Richard Jones said...

Just checking if anyone was listening!

The follow-up is written, but needs a lot of editorial work. After I get back, in 2 weeks, I'll post it.

Unknown said...

Hi Richard, could you please post part 3?

Richard Jones said...

Ah well, unfortunately part 3 doesn't work!

However I'm going to fix it.