Tuesday 20 May 2008

Extending immutable data structures (part 1)

Functional languages are famous because they let the programmer build complicated data structures effortlessly. It's very compelling to describe the type of any XML document in just three lines of code:

type xml =
| Element of (string * (string * string) list * xml list)
| PCData of string
And I won't recur (pun, intended) on the subject of how easy it is to write code to process these data structures.

However there is one thing which is difficult to do, and that is to meaningfully extend or annotate such data.

Let's say as a practical example I want to write a CSS renderer for XML documents. A first step to doing this is to annotate each node in the XML document with the list of styles that apply to it. (Styles are inherited, so natural recursive algorithms work quite well for this). The second step is to take what CSS calls "replaced elements" (elements like <IMG> that have an intrinsic size not known to CSS itself) and compute their sizes.

The xml type defined above is not extensible at all. In order to allow it to store styles and sizes for replaced elements we'd need to change the type definition to something like this:

type xml =
| Element of (string * (string * string) list * xml list
* style list * size option)
| PCData of string
Unfortunately this is no use to us because all our existing code that worked on the old xml type no longer works on this new type. That is no mere theoretical concern either, because the old xml type at the top wasn't just chosen at random. It is in fact the type used by Xml-Light, a useful, simplified XML parser and printer, and we cannot change this type without stopping using this library.

PXP (another XML library -- OCaml has several) has a polymorphic XML type, allowing you to attach precisely one arbitrary datum to each node. In terms of the xml type defined above, it looks similar to this:

type 'a xml =
| Element of ('a, ...
This is half-way useful, because one can now use the library functions (they assume polymorphism, so work fine), but has some serious shortcomings of its own. One is that you can't easily build up independent modules. If there is a PXP module which uses the 'a for its own purpose then one cannot extend the XML document any further. If you avoid that, you can attach multiple data by using a tuple, but each module that wants to attach data had better know about all the other modules in advance and agree on a single tuple type.

For the CSS styles / sizes example, we could also imagine adding a 'b styles with 'b instantiated as size when that is needed. You'd end up with document types like this:

unit xml # The basic document
unit styles xml # After annotating with styles (pass 1)
size styles xml # After annotating with sizes (pass 2)
But you can't write a function that just operates on "an XML document annotated with sizes", unless it "knows" about all the other annotations and the order in which they were applied. (Is the type size styles xml or styles size xml if we happened to swap the two passes around?)

In part 2 (next week) I'll look at some radically different ways to annotate and extend immutable data structures.

No comments: