Camltastic!: bitmatch

Showing posts with label bitmatch. Show all posts

Sunday, 10 August 2008

virt-mem 0.2.9

I'm pleased to announce the latest alpha release of the virt-mem tools, version 0.2.9.

These are tools for system administrators which let you find things like kernel messages, process lists and network information of your guests.

For example:


  virt-uname
    'uname' command, shows OS version, architecture, etc.
  virt-dmesg
    'dmesg' command, shows kernel messages
  virt-ps
    'ps' command, shows process list

Nothing needs to be installed in the guest for this to work, and the tools are specifically designed to allow easy scripting and integration with databases and monitoring systems.

Source is available from the web page here:

http://et.redhat.com/~rjones/virt-mem/

The latest version (0.2.9) reworks the internals substantially so that we have direct access to basically any kernel structure, and this will allow us to quickly add the remaining features that people have asked for (memory usage information, lists of network interfaces and so on).

As usual, patches, feedback, suggestions etc. are very welcome!

Binaries are available for Fedora through this link:
https://bugzilla.redhat.com/show_bug.cgi?id=450713

Tuesday, 17 June 2008

virt-ps and the Red Hat Summit

Tomorrow (Wednesday) through to Friday is the Red Hat Summit in Boston. If you're coming, please make sure to see my talk with Dan Berrange on the "Virtualization Toolbox", or all the little, useful tools we've been writing to help you manage your virt systems. That talk is tomorrow, Wednesday 18th June, some time after 11am.

As I mentioned previously on this blog I'm working on deep inspection of the internals of running virtual machines, and dressing this up as familiar, easy to use command line tools, such as virt-df and virt-dmesg. I'll be talking a lot more about those tomorrow, so I don't want to spoil the surprises.

The real question is whether I'll get virt-ps (process listings) working today. Getting the process listing out of a stuck virtual machine is immensely useful to find out what's going on with the machine. For example, did it blow up because there are too many Apache processes? Or is some other daemon causing trouble? I had an initial implementation of this working, but it was rather slow and unsatisfactory because of the all the guessing and heuristics it had to do. In the meantime, I discovered that getting the Linux kernel version is quite easy, and once you know the kernel version you immediately reduce the amount of heuristics you need by a large factor. So the new implementation should be much faster.

Faster, but it doesn't work at the moment. Today is the final push on this - can I get virt-ps working in time for the demo tomorrow?

Friday, 13 June 2008

virt-df, virt-mem and bitmatch

One of the "self-evident truths" about virtualization is that in order to monitor aspects of the virtual machine such as free disk space / free memory / running processes / etc, you need to run some software inside the virtual machine. That's how the expensive proprietary virt systems such as VMWare work. The alternative to running monitoring software in the virtual machine would be to try to look at the disk image and memory image of the virtual machine and try to make sense of it, and that is usually regarded as an intractable, unimplementable nightmare.

That's not a truth any more.

A few months ago I proved with virt-df that it's perfectly possible to parse the disk image of a virtual machine to find out how much free disk space is left. And although in theory this could get inconsistent results, in practice it works well. Virt-df supports Linux, Windows NTFS, and complex partitioning schemes such as LVM as well as simple DOS-style partitions.

Now I'm proving with the virt-mem toolkit that we can snoop on the live memory image of running virtual machines and pull out such interesting details as the process list, the network configuration, the kernel messages (dmesg), and the kernel version strings.

So how is this possible?

Of course I'm writing these tools in OCaml, which allows me to express complicated ideas in a very few lines of code, yet is as fast as optimized C code. But OCaml alone isn't sufficient because all of the disk and memory images that we're parsing are made from binary structures and the base language doesn't handle binary structures very well. OCaml though does allow programmers to change and extend the base language using LISP-style macros -- these are language extensions that become one with the original language, but enable you to extend it in arbitrary and unexpected ways.

I have already extended OCaml by adding the entire, complete PostgreSQL SQL syntax through a macro, so this should be easy.

The key to being able to parse hundreds of different variations of the Linux task_struct (process table entry) so we can print process lists, was to look at a successful functional language which already supports bitstrings directly. Erlang is used in the telecom industry to parse binary network protocols and has a rich syntax for parsing and assembling bitstrings. I copied Erlang's syntax and using macros added it directly to OCaml. The result is the OCaml bitmatch project, hosted on Googlecode and with lots of examples and documentation.

Bitmatch is now a very advanced and powerful system for dealing with binary structures (more powerful than what the Erlang architype offered). As an example, how much code would you expect to write in order to take an ext3 partition, find the superblock and print out the number of free blocks from the superblock? (A poor man's 'df' command) You'll need to parse the Linux kernel header file <ext3_fs_sb.h> to get the fields, their offsets, sizes, endianness, etc. Then you'll need to load the superblock from disk and parse it using bit operators.

The answer in bitmatch is just 9 lines of code:


  (* bitmatch-import-c ext3.c > ext3.bmpp *)
  open bitmatch "ext3.bmpp"

  let () =
    let fd = Unix.openfile "/dev/sda1" [Unix.O_RDONLY] 0 in
    let bits = Bitmatch.bitstring_of_file_descr_max fd 4096 in

    bitmatch bits with
    | { :ext3_super_block } ->
        printf "free blocks = %ld\n" s_free_blocks_count
    | { _ } ->
        printf "/dev/sda1 is not an ext3 filesystem\n"

Monday, 5 May 2008

Persistent matches in pa_bitmatch

I'm quite pleased with the state of my pa_bitmatch language extension, which adds safe bitstring matching and construction to OCaml.

However to be truly useful as a general purpose binary parser it needs ways to import descriptions from C header files and build up reusable libraries of bitmatches.

Let me explain with a favorite example: Parsing Linux ext2 filesystem structures.

Many common binary formats are defined solely in terms of C structures in C header files, and a good example is the "struct ext2_super_block" type in the ext2_fs.h header file. In a previous iteration of bitmatch, which I called libunbin, I wrote a C header file parser which could import these directly into the XML format I was using at the time. In OCaml, the CIL library makes this quite simple.

In bitmatch at the moment we end up translating these structures by hand into OCaml code which looks like this:


bitmatch bits with
  | { s_inodes_count : 32 : littleendian;       (* Inodes count *)
      s_blocks_count : 32 : littleendian;       (* Blocks count *)
      s_r_blocks_count : 32 : littleendian;     (* Reserved blocks count *)
      s_free_blocks_count : 32 : littleendian   (* Free blocks count *)
    } ->

(The real code is much much longer than the above extract).

That works, but there is no good way to save and reuse the above match statement. If there are two parts of the program which need to match ext2 superblocks, well we either need to duplicate the same matching code twice or else we need to isolate the matching code into an awkward OCaml library.

What we'd really like to do is to write:


let bitmatch ext2sb = { s_inodes_count : 32 : littleendian;
                        (* ... *) }

so that we can reuse this pattern in other code:


bitmatch bits with
| ext2sb ->

We could even grab the old libunbin CIL parser so we can turn C header files into these persistent matches.

Now the above is possible. Martin Jambon's Micmatch library does precisely this, but only within the same OCaml compilation unit. To make this truly useful for bitmatch we'd definitely need it working across compilation units, and that turns out to be considerably harder.

Let's look at how Martin's micmatch works first though. In micmatch, a regular expression can be named using:


RE digit = ['0'-'9']

Martin's implementation of this is to save the camlp4 abstract syntax tree (AST) in a hash table mapping from name ("digit") to AST. At the point of use later on in some match statement, the name is looked up and the AST is substituted. Thus:


match str with RE digit ->

just turns into the equivalent of:


match str with RE ['0'-'9'] ->

and is processed by micmatch in the normal way.

We can use the same approach within compilation units for bitmatch.

Making it work across compilation units means that either the AST or the whole hashtable is going to have to be saved into a file somewhere. One obvious choice would be the cmo file. In fact saving the AST into the cmo file is relatively simple: we just turn it into a string (using Marshal) and write out the string as a camlp4 substitution:


let bitmatch ext2sb = { ... }

becomes:


let ext2sb = "<string containing marshalled representation of AST>"

This saves the string in the cmo file. Loading it from another file is a different kettle of fish. At the point where camlp4 is running all we have is the abstract syntax tree of the OCaml code. We haven't even looked up the identifiers to see if they exist or make sense, and we haven't opened the cmi files of other modules, nevermind the cmo files (which in fact never get opened during compilation). Thus the only hope would be for our camlp4 extension itself to perform all the library path searches, identify the right cmo file, load it and get the appropriate symbol. It is not even clear to me that we have the required information to do this, and even if we have, it's duplicating most of the work done by the linker, so would make the extension very complicated and liable to break if all sorts of assumptions don't hold.

If we don't save it into a cmo file, perhaps we can save the whole hash table into our own file (.bitmatches). This is tempting but has some subtle and not-so-subtle problems. We are still left with the "search path problem" assuming, that is, we want to be able to save these matches into external libraries. There is also the subtle problem that parallel builds break (both in that the file needs to be locked, and that we may not build in the right order so that matches could be used before they are defined). It is also unclear what would (or should) happen if a match is defined in more than one file.

It's fair to say that I don't have a good solution to this yet, which is why I've held off for the moment on adding persistent matches to pa_bitmatch. If no one suggests a good solution, then I may go with the shared file solution, together with lots of caveats and limitations. Of course if you see something that I'm missing in this analysis, please let me know.