Monday, 26 May 2008

Phantom types

In libvirt we have a concept of a connection to the hypervisor, and every libvirt operation happens across this connection. Connections are normally read/write, which means that you can carry out any operation, even destructive ones like shutting down a virtual machine. But since programs normally need to have root (or "elevated") privileges to do these destructive operations, we also introduced the concept of a read-only connection which can just be used for non-destructive things such as listing out virtual machines or monitoring CPU loads.

Read-only operations Read/write operations

get hypervisor type
get version
get hostname
get URI
get num CPUs
list VMs
list networks
get VM CPU stats

suspend VM
resume VM
shutdown VM
destroy VM
coredump VM
create VM
change CPU pinning

The question is, what happens if a program tries to call a destructive operation using just a read-only connection? The answer, naturally, is that libvirt throws a runtime error if you try this.

Wouldn't it be nicer if the compiler could find these sorts of errors? That way you could be sure that your compiled program was completely correct, and there wasn't some sort of dark corner of untested code which might fail in rare circumstances. For the OCaml bindings to libvirt that was exactly the sort of guarantee I wanted to make, and to do it I used a language feature which is often thought of as being confusing or highly expert (but in fact is neither of those) ... phantom types.

Before we can understand phantom types, we need to set up a little bit of background: firstly asking what sort of errors is it conceivable for the compiler to find, and secondly understanding a bit more about polymorphic types and variants (which are the toolbox we use to construct phantom types).

Compile time or run time?

To ask if phantom types are useful we need to know first if they are applicable to our problem. Phantom types find only compile time errors. Let's understand what a compile time error could be in libvirt:

let conn = Libvirt.Connect.connect_readonly ()
let dom = Libvirt.Domain.lookup_by_name conn "test"
Libvirt.Domain.destroy dom (* fail! *)

Clearly here the dom ("domain" = virtual machine) is derived from a read-only connection (conn), so the call to destroy must fail. It is also clear that the compiler can trivially know that conn / dom are read-only, because of the special connect_readonly call.

Contrast that with the following code:

printf "Connect read-only?" ;;
let readonly_flag = read_line () = "y"
let conn = readonly_flag
(* etc. *)

Now the compiler cannot possibly predict what the user will type in at runtime, so this is not a situation where the compiler or phantom types can help. (The call is one I just made up, it's not a real call precisely because of this problem).

So libvirt has two separate calls to create connections, connect_readonly which makes read-only connections and connect which makes read/write connections. The connections returned have different but compatible types, and because of this "compatibility" the connections can still be passed to the common non-destructive functions. Even more amazingly, the extra "is read/write" flag which is attached to the connection simply disappears at runtime, which means there is zero loss of efficiency when using phantom types (I'll explain a bit more about that below).

It's important to keep a very clear distinction in your head about what happens at compile time versus what happens at run time.

Mysterious polymorphic types

You may be familiar with polymorphic types such as 'a list, lists of any type, where 'a (pronounced alpha) stands for all types. Here's another example of a polymorphic type, a structure which contains a "free" field that can store any particular type:

type 'a t = { data : 'a }

We can make a structure containing string data for example. Notice its type:

# { data = "hello" } ;;
- : string t = {data = "hello"}

What is perhaps not so well known is that you can add extra polymorphism to any type you like, even when it doesn't seem to need it.

This just makes t an alias for the floating point type:

type t = float

But the following (legal) definition is much more mysterious:

type 'a t = float

What does it do, and how can you create anything of this type? The answer is that any float has this type, as you can prove very easily in the OCaml toplevel:

# (3.0 : unit t);;
- : unit t = 3.
# (10.4 : string t);;
- : string t = 10.4

Because the 'a isn't needed, it can be set to any type (unit and string in the examples above).

This isn't yet the "phantom" type though. It's tempting to think we could write a function which only works on string t (I'll call them "stringies"):

# let add_stringies (a : string t) (b : string t) = (a +. b : string t) ;;
val add_stringies : string t -> string t -> string t = <fun>

But in fact this function does not work!

# add_stringies (3.0 : unit t) 5.0 ;;
- : string t = 8.

This is because unit t and string t can be freely unified with each other because the compiler knows that both are really just floats:

# ((3.0 : unit t) : string t) ;;
- : string t = 3.

To prevent this "rogue" unification and allow us to write a function like add_stringies correctly, we have to hide the real type of t inside a module, like this:

module T : sig
type 'a t
val add_stringies : string t -> string t -> string t
end = struct
type 'a t = float
let add_stringies a b = a +. b

Module T is completely useless by the way, it's just setting the stage for ...

Phantom types

OCaml might have saved NASA from losing a $125 million space probe if only they'd had the following Length module which prevents programmers from mixing imperial and metric measurements:

module Length : sig
type 'a t
val meters : float -> [`Meters] t
val feet : float -> [`Feet] t
val (+.) : 'a t -> 'a t -> 'a t
val to_float : 'a t -> float
end = struct
type 'a t = float
external meters : float -> [`Meters] t = "%identity"
external feet : float -> [`Feet] t = "%identity"
let (+.) = (+.)
external to_float : 'a t -> float = "%identity"

open Length
open Printf

let () =
let m1 = meters 10. in
let m2 = meters 20. in
printf "10m + 20m = %g\n" (to_float (m1 +. m2));
let f1 = feet 40. in
let f2 = feet 50. in
printf "40ft + 50ft = %g\n" (to_float (f1 +. f2));
(*printf "10m + 50ft = %g\n" (to_float (m1 +. f2)) (* error *) *)

(Try compiling and running this, then try uncommenting the very last line and compiling it again).

Just for readability, we're using polymorphic variant types instead of the basic types (unit and string from the previous section). This allows us to substitute meaningful type names like [`Meters] t which I hope is obvious as to what it contains. It also means the error messages from the compiler will be easy to read - that's important because we are expecting to get a compiler error each time the programmer makes a mistake.

Now look at the functions we have:

Two functions (meters and feet) convert floating point numbers into meters or feet respectively. But the implementation of these functions is completely efficient. They're just the identity operation (which the compiler turns into a null operation). At compile time, the values have this extra type information. But at run time the overhead evaporates completely. At run time, these are just floats.

The addition function is constrained: 'a t -> 'a t -> 'a t. This means you can use it on two meter measurements, or two feet measurements, but you cannot mix meters and feet. Furthermore the return type is the same as the input types, so this safety cascades through all code.

Finally we need a way to extract and print the results. I've included a rather naive to_float function, but for better safety we'd probably want to define special print functions which ensure that the output indicates the correct type back to the user.

Let's go back to our read-only/-write connections from the beginning. Remember that simple non-destructive status functions are safe for anyone to call, but destructive functions must only be used when you have a read/write connection. To express this we will use a slightly different feature of polymorphic variants, that is being able to express open types using [>...]. A read-only connection will have type [`Readonly] t and a read/write connection will have type [`Readonly|`Readwrite] t which means that it's compatible with the read-only type but has the extra read/write ability.

Status functions have type : [>`Readonly] t -> ... because they work with "read-only or greater".

Here is our module:

module Connection : sig
type 'a t
val connect_readonly : unit -> [`Readonly] t
val connect : unit -> [`Readonly|`Readwrite] t
val status : [>`Readonly] t -> int
val destroy : [>`Readwrite] t -> unit
end = struct
type 'a t = int
let count = ref 0
let connect_readonly () = incr count; !count
let connect () = incr count; !count
let status c = c
let destroy c = ()

open Connection
open Printf

let () =
let conn = connect_readonly () in
printf "status = %d\n" (status conn);
(*destroy conn; (* error *) *)
let conn = connect () in
printf "status = %d\n" (status conn);
destroy conn

Again, after compiling this, try uncommenting the rogue call to destroy and notice that the error is caught by the compiler.

Here's another example, again it comes from actual deployed code. A "memory map" is a view into the memory of a virtual machine. When we first load a memory map we don't know anything about the endianness or word size of the virtual machine. We have to inspect the memory map first to determine endianness (little or big endian) and word size (32 or 64 bit pointers). Once we've determined both, we offer additional operations which work on the integers and pointers contained in the memory map.

So there are four "subtypes" (states?) of memory map, summarized in the diagram on the left.

Can we use phantom types to ensure that the transitions happen before specific calls are allowed? Yes we can!

We start by defining our phantom type. This is how it appears in the module signature (the mli file). The specifics of the implementation (from the ml file) aren't important here. Note that because there are two degrees of freedom (word size and endianness), there are two phantom types attached to t:

type ('a,'b) t

The basic of_file function makes a memory map from a file descriptor and base address. It returns a memory map with no word size or endianness, which is pretty clearly expressed in the return type:

val of_file : Unix.file_descr -> addr -> ([`NoWordsize], [`NoEndian]) t

Some functions work with any memory map, whether or not word size and endianness have been set. For example, the find function searches for strings in the memory map:

val find : ('a, 'b) t -> ?start:addr -> string -> addr option

A related function is find_align which finds strings that are aligned to the word size. This function cares that word size has been set, but not endianness, and its type is therefore:

val find_align : ([`Wordsize], 'b) t -> ?start:addr -> string -> addr option

The find_pointer function looks for pointers appearing in the memory map. Pointers have both endianness and word size implicitly, so this function can only be used when both have been set on the memory map. Its type is:

val find_pointer : ([`Wordsize], [`Endian]) t -> ?start:addr -> addr ->
addr option

Finally of course we need a way to supply the word size and endianness to the memory map. Detecting these is a difficult and drawn-out process, so our main code will set each one individually by calling the following two functions:

val set_wordsize : ([`NoWordsize], 'b) t -> wordsize ->
([`Wordsize], 'b) t
val set_endian : ('a, [`NoEndian]) t -> endian ->
('a, [`Endian]) t

The compiler guarantees here are quite strict. Although the programmer can call the functions in either order, they must call each function only once. And of course the programmer won't be allowed to call functions like find_pointer until they have called both set_wordsize and set_endian exactly once (although the order they call the functions doesn't matter).

Personally I think this is a pretty awesome language feature. I'm not sure if any other non-functional languages can make these sorts of guarantees at compile time. (Perhaps Eiffel?)

One interesting thing is that the implementation of the memory map still has runtime checks in those functions (such as find_pointer) which need it. But you can be sure that the runtime checks will never fail. An improvement to the code would be to write it so it doesn't need any runtime checks at all.

Runtime and phantom types

My final example with phantom types will be to explain how to add runtime checks back in. Some situations simply cannot be checked at compile time, yet don't represent errors. For example in mlvirsh users can open read-only or read/write connections to the hypervisor. We deal with this by adding a limited amount of dynamic typing back into the system:

type conn_t =
| No_connection
| RO of Libvirt.Connect.t
| RW of Libvirt.Connect.t

Of course we need to also add runtime checks into the program.

I will argue that this is an acceptable trade-off and still much better than languages which only do dynamic typing. We get the benefits of static types / early compiler errors when it's possible to check for them, but we can slip back to dynamic typing in the situations where it's necessary. Out of the entire suite of virt tools, just one (mlvirsh) needs runtime types.

Sunday, 25 May 2008

Extending immutable data structures (part 2)

[This is part 2 of my series about extending immutable data structures in OCaml. You may want to go back and read part 1 first]

Previously I took this definition of an XML document tree from xml-light and asked how we could annotate nodes with extra data.

type xml =
| Element of (string * (string * string) list * xml list)
| PCData of string

If you "think outside the box", or in this case literally outside the structure, you might see that you can add annotations using a hash table. The idea is that we'll hash the addresses of nodes to the annotated data.

Let's have a concrete example to make this all a bit clearer. Start with this simple document tree:

<a href="">Camltastic!</a>

which is written in OCaml as:

let str = PCData "Camltastic!"
let doc = Element ("a",
["href", ""],

There are two nodes here (str and doc) and both have type xml.

A simple, and wrong, way to use a hash table to annotate these nodes is like this:

let h = Hashtbl.create 13
let add_annot node data = Hashtbl.add h node data
let get_annot node =
try Some (Hashtbl.find h node)
with Not_found -> None

It seems to work:

# add_annot doc "doc is annotated with this string" ;;
- : unit = ()
# get_annot doc ;;
- : string option = Some "doc is annotated with this string"
# get_annot str ;;
- : string option = None

But there are some subtle and not-so-subtle problems with this approach.

The first problem is that Hashtbl, by default, uses structural equality to hash nodes. If you had a document like <a>Link</a><a>Link</a> then there is no way to attach different data to the two links, because both links appear to be equal (they are equal, structurally). Attaching data to one link attaches the data to both.

The two other problems with this are garbage collection and copying. Garbage collection is defeated because the hash table maintains a permanent pointer to any annotated nodes. What we would like to happen is that if a node becomes garbage, the collector frees the node and any annotations on the node. In fact this never happens: the permanent pointer held by the hash table prevents any node from becoming garbage. The copying problem is that when the main program copies a node, the annotations aren't copied across.

Enter the Weak Hash Table

OCaml has a rarely-used but rather elegant module in the standard library called Weak. This module implements weak pointers. A weak pointer is like an ordinary pointer, but it doesn't "count" towards garbage collection. In other words if a value on the heap is only pointed at by weak pointers, then the garbage collector may free that value as if nothing was pointing to it.

A weak hash table uses weak pointers to avoid the garbage collection problem above. Instead of storing a permanent pointer to the keys or values, a weak hash table stores only a weak pointer, allowing the garbage collector to free nodes as normal.

It probably isn't immediately obvious, but there are 3 variants of the weak hash table, to do with whether the key, the value or both are weak. (The fourth "variant", where both key and value are strong, is just an ordinary Hashtbl). Thus the OCaml standard library "weak hash table" has only weak values. The Hweak library has weak keys and values. And the WeakMetadata library (, weakMetadata.mli) is a version of Hweak modified by me which has only weak keys.

Which sort of weak hash table do we need? If you look at the Hashtbl implementation above you'll see that the key corresponds to the xml node and the value corresponds to the metadata attached to nodes. We assume that the main program holds onto a pointer to the xml tree until it is no longer needed, whereupon it should be collected. So we definitely need weak keys. How about the values (annotations)? The main program probably isn't holding any references to the annotations - it expects to annotate a node and then forget about the annotation until it is needed later. So values should not be weak (otherwise our annotation library will be rather "forgetful"). You'll therefore need to use my modified WeakMetadata library to get around the garbage collection problem.

Using WeakMetadata you can annotate immutable data structures, as this test program shows. Although we have solved the garbage collection problem, the other two problems I mentioned remain. Namely, it's still using structural equality, and if the main program copies a node, then the annotations don't get copied too.

However there are some rather more insidious problems that remain with this whole approach, which are down to how weak pointers are actually implemented inside the OCaml garbage collector. I wrote about them in this posting on the OCaml mailing list which also includes some analysis of the performance and memory usage of annotating XML document nodes.

Part 3 (next week) will look at what I hope is the "final word" on safely extending data structures.

Tuesday, 20 May 2008

Extending immutable data structures (part 1)

Functional languages are famous because they let the programmer build complicated data structures effortlessly. It's very compelling to describe the type of any XML document in just three lines of code:

type xml =
| Element of (string * (string * string) list * xml list)
| PCData of string
And I won't recur (pun, intended) on the subject of how easy it is to write code to process these data structures.

However there is one thing which is difficult to do, and that is to meaningfully extend or annotate such data.

Let's say as a practical example I want to write a CSS renderer for XML documents. A first step to doing this is to annotate each node in the XML document with the list of styles that apply to it. (Styles are inherited, so natural recursive algorithms work quite well for this). The second step is to take what CSS calls "replaced elements" (elements like <IMG> that have an intrinsic size not known to CSS itself) and compute their sizes.

The xml type defined above is not extensible at all. In order to allow it to store styles and sizes for replaced elements we'd need to change the type definition to something like this:

type xml =
| Element of (string * (string * string) list * xml list
* style list * size option)
| PCData of string
Unfortunately this is no use to us because all our existing code that worked on the old xml type no longer works on this new type. That is no mere theoretical concern either, because the old xml type at the top wasn't just chosen at random. It is in fact the type used by Xml-Light, a useful, simplified XML parser and printer, and we cannot change this type without stopping using this library.

PXP (another XML library -- OCaml has several) has a polymorphic XML type, allowing you to attach precisely one arbitrary datum to each node. In terms of the xml type defined above, it looks similar to this:

type 'a xml =
| Element of ('a, ...
This is half-way useful, because one can now use the library functions (they assume polymorphism, so work fine), but has some serious shortcomings of its own. One is that you can't easily build up independent modules. If there is a PXP module which uses the 'a for its own purpose then one cannot extend the XML document any further. If you avoid that, you can attach multiple data by using a tuple, but each module that wants to attach data had better know about all the other modules in advance and agree on a single tuple type.

For the CSS styles / sizes example, we could also imagine adding a 'b styles with 'b instantiated as size when that is needed. You'd end up with document types like this:

unit xml # The basic document
unit styles xml # After annotating with styles (pass 1)
size styles xml # After annotating with sizes (pass 2)
But you can't write a function that just operates on "an XML document annotated with sizes", unless it "knows" about all the other annotations and the order in which they were applied. (Is the type size styles xml or styles size xml if we happened to swap the two passes around?)

In part 2 (next week) I'll look at some radically different ways to annotate and extend immutable data structures.

Wednesday, 7 May 2008

Optimizing, memory allocation and loops

Unlike C compilers such as gcc, ocamlopt doesn't do very much in the way of optimization. Really it compiles the program directly down to machine code pretty much the way you write it.

Accidental memory allocation and loops are a particular challenge. Consider for example:

let n = 100_000
let fib = Array.make n 1 ;;
for i = 2 to n-1 do
let a, b = fib.(i-2), fib.(i-1) in
fib.(i) <- a + b
ocamlopt takes the code literally, and let a, b = ... allocates a tuple (a, b) before discarding it. Simply rewriting the problematic statement as:

let a = fib.(i-2) and b = fib.(i-1) in
makes the whole loop run in half the time.

The second problem with this loop is that we have lots of unnecessary references to the array fib. Each time we write fib.(i), the compiler has to emit code to calculate fib + i*sizeof(int), and unlike C, ocamlopt doesn't use code motion to simplify and share the repeated references.

We nearly halve the loop time again by accessing the array only once:

let a = ref 1 and b = ref 1 in
for i = 2 to n-1 do
let c = !a + !b in
a := !b; b := c; fib.(i) <- c
Overall these two simple optimizations reduce the total running time of this loop more than three-fold.

Monday, 5 May 2008

BBC Radio 4 interview with me on the 30th anniversary of spam

I was interviewed on Saturday by the BBC Radio 4 iPM programme, on the subject of the 30th anniversary of spam. For the next few days you will be able to download the whole program here (the segment on spam starts around 20'08). Or you can listen to a longer version here.

Persistent matches in pa_bitmatch

I'm quite pleased with the state of my pa_bitmatch language extension, which adds safe bitstring matching and construction to OCaml.

However to be truly useful as a general purpose binary parser it needs ways to import descriptions from C header files and build up reusable libraries of bitmatches.

Let me explain with a favorite example: Parsing Linux ext2 filesystem structures.

Many common binary formats are defined solely in terms of C structures in C header files, and a good example is the "struct ext2_super_block" type in the ext2_fs.h header file. In a previous iteration of bitmatch, which I called libunbin, I wrote a C header file parser which could import these directly into the XML format I was using at the time. In OCaml, the CIL library makes this quite simple.

In bitmatch at the moment we end up translating these structures by hand into OCaml code which looks like this:

bitmatch bits with
| { s_inodes_count : 32 : littleendian; (* Inodes count *)
s_blocks_count : 32 : littleendian; (* Blocks count *)
s_r_blocks_count : 32 : littleendian; (* Reserved blocks count *)
s_free_blocks_count : 32 : littleendian (* Free blocks count *)
} ->

(The real code is much much longer than the above extract).

That works, but there is no good way to save and reuse the above match statement. If there are two parts of the program which need to match ext2 superblocks, well we either need to duplicate the same matching code twice or else we need to isolate the matching code into an awkward OCaml library.

What we'd really like to do is to write:

let bitmatch ext2sb = { s_inodes_count : 32 : littleendian;
(* ... *) }

so that we can reuse this pattern in other code:

bitmatch bits with
| ext2sb ->

We could even grab the old libunbin CIL parser so we can turn C header files into these persistent matches.

Now the above is possible. Martin Jambon's Micmatch library does precisely this, but only within the same OCaml compilation unit. To make this truly useful for bitmatch we'd definitely need it working across compilation units, and that turns out to be considerably harder.

Let's look at how Martin's micmatch works first though. In micmatch, a regular expression can be named using:

RE digit = ['0'-'9']

Martin's implementation of this is to save the camlp4 abstract syntax tree (AST) in a hash table mapping from name ("digit") to AST. At the point of use later on in some match statement, the name is looked up and the AST is substituted. Thus:

match str with RE digit ->

just turns into the equivalent of:

match str with RE ['0'-'9'] ->

and is processed by micmatch in the normal way.

We can use the same approach within compilation units for bitmatch.

Making it work across compilation units means that either the AST or the whole hashtable is going to have to be saved into a file somewhere. One obvious choice would be the cmo file. In fact saving the AST into the cmo file is relatively simple: we just turn it into a string (using Marshal) and write out the string as a camlp4 substitution:

let bitmatch ext2sb = { ... }


let ext2sb = "<string containing marshalled representation of AST>"

This saves the string in the cmo file. Loading it from another file is a different kettle of fish. At the point where camlp4 is running all we have is the abstract syntax tree of the OCaml code. We haven't even looked up the identifiers to see if they exist or make sense, and we haven't opened the cmi files of other modules, nevermind the cmo files (which in fact never get opened during compilation). Thus the only hope would be for our camlp4 extension itself to perform all the library path searches, identify the right cmo file, load it and get the appropriate symbol. It is not even clear to me that we have the required information to do this, and even if we have, it's duplicating most of the work done by the linker, so would make the extension very complicated and liable to break if all sorts of assumptions don't hold.

If we don't save it into a cmo file, perhaps we can save the whole hash table into our own file (.bitmatches). This is tempting but has some subtle and not-so-subtle problems. We are still left with the "search path problem" assuming, that is, we want to be able to save these matches into external libraries. There is also the subtle problem that parallel builds break (both in that the file needs to be locked, and that we may not build in the right order so that matches could be used before they are defined). It is also unclear what would (or should) happen if a match is defined in more than one file.

It's fair to say that I don't have a good solution to this yet, which is why I've held off for the moment on adding persistent matches to pa_bitmatch. If no one suggests a good solution, then I may go with the shared file solution, together with lots of caveats and limitations. Of course if you see something that I'm missing in this analysis, please let me know.

Defaming Microsoft

The BBC write to inform me that they have deleted a comment a made on a BBC blog because it is (they say) potentially defamatory.

This is what I wrote originally about the MS-Yahoo deal (which sadly has been called off):

As a Linux fan, I honestly can't wait for this deal to go through.

Microsoft will eviscerate Yahoo, turning away all the open source fans who work there and lamely attempting to port all the server-side software to Windows (that worked out well for Hotmail). And they'll have $44bn less cash in the bank to bribe politicians and standards bodies.

Bring it on, please.

And this is the form letter from the BBC:

Thank you for contributing to a BBC Blog. Unfortunately we've had to remove your content below

Postings to BBC blogs will be removed if they appear to be potentially defamatory.

You can find out more about Defamation at