Monday 5 May 2008

Persistent matches in pa_bitmatch

I'm quite pleased with the state of my pa_bitmatch language extension, which adds safe bitstring matching and construction to OCaml.

However to be truly useful as a general purpose binary parser it needs ways to import descriptions from C header files and build up reusable libraries of bitmatches.

Let me explain with a favorite example: Parsing Linux ext2 filesystem structures.

Many common binary formats are defined solely in terms of C structures in C header files, and a good example is the "struct ext2_super_block" type in the ext2_fs.h header file. In a previous iteration of bitmatch, which I called libunbin, I wrote a C header file parser which could import these directly into the XML format I was using at the time. In OCaml, the CIL library makes this quite simple.

In bitmatch at the moment we end up translating these structures by hand into OCaml code which looks like this:

bitmatch bits with
| { s_inodes_count : 32 : littleendian; (* Inodes count *)
s_blocks_count : 32 : littleendian; (* Blocks count *)
s_r_blocks_count : 32 : littleendian; (* Reserved blocks count *)
s_free_blocks_count : 32 : littleendian (* Free blocks count *)
} ->

(The real code is much much longer than the above extract).

That works, but there is no good way to save and reuse the above match statement. If there are two parts of the program which need to match ext2 superblocks, well we either need to duplicate the same matching code twice or else we need to isolate the matching code into an awkward OCaml library.

What we'd really like to do is to write:

let bitmatch ext2sb = { s_inodes_count : 32 : littleendian;
(* ... *) }

so that we can reuse this pattern in other code:

bitmatch bits with
| ext2sb ->

We could even grab the old libunbin CIL parser so we can turn C header files into these persistent matches.

Now the above is possible. Martin Jambon's Micmatch library does precisely this, but only within the same OCaml compilation unit. To make this truly useful for bitmatch we'd definitely need it working across compilation units, and that turns out to be considerably harder.

Let's look at how Martin's micmatch works first though. In micmatch, a regular expression can be named using:

RE digit = ['0'-'9']

Martin's implementation of this is to save the camlp4 abstract syntax tree (AST) in a hash table mapping from name ("digit") to AST. At the point of use later on in some match statement, the name is looked up and the AST is substituted. Thus:

match str with RE digit ->

just turns into the equivalent of:

match str with RE ['0'-'9'] ->

and is processed by micmatch in the normal way.

We can use the same approach within compilation units for bitmatch.

Making it work across compilation units means that either the AST or the whole hashtable is going to have to be saved into a file somewhere. One obvious choice would be the cmo file. In fact saving the AST into the cmo file is relatively simple: we just turn it into a string (using Marshal) and write out the string as a camlp4 substitution:

let bitmatch ext2sb = { ... }

becomes:

let ext2sb = "<string containing marshalled representation of AST>"

This saves the string in the cmo file. Loading it from another file is a different kettle of fish. At the point where camlp4 is running all we have is the abstract syntax tree of the OCaml code. We haven't even looked up the identifiers to see if they exist or make sense, and we haven't opened the cmi files of other modules, nevermind the cmo files (which in fact never get opened during compilation). Thus the only hope would be for our camlp4 extension itself to perform all the library path searches, identify the right cmo file, load it and get the appropriate symbol. It is not even clear to me that we have the required information to do this, and even if we have, it's duplicating most of the work done by the linker, so would make the extension very complicated and liable to break if all sorts of assumptions don't hold.

If we don't save it into a cmo file, perhaps we can save the whole hash table into our own file (.bitmatches). This is tempting but has some subtle and not-so-subtle problems. We are still left with the "search path problem" assuming, that is, we want to be able to save these matches into external libraries. There is also the subtle problem that parallel builds break (both in that the file needs to be locked, and that we may not build in the right order so that matches could be used before they are defined). It is also unclear what would (or should) happen if a match is defined in more than one file.

It's fair to say that I don't have a good solution to this yet, which is why I've held off for the moment on adding persistent matches to pa_bitmatch. If no one suggests a good solution, then I may go with the shared file solution, together with lots of caveats and limitations. Of course if you see something that I'm missing in this analysis, please let me know.

No comments: