Tuesday 30 December 2008

Destroying old hard drives

This is a simple, cheap method for destroying old hard drives, making the data unrecoverable against casual attackers and identity fraudsters (although probably not some hypothetical government agency with multi-million dollar resources).

For this you will need a stack of old hard drives:


An electric drill with a twist drill bit (suitable for going through metal), and most importantly some eye protection:


Line up the hard drives against the wall and drill straight through them. I didn't show it in this picture, but in fact I drilled through from the other (PCB) side to ensure that I went through the PCB but didn't go through any components that might explode:


Sunlight where there's not supposed to be sunlight!


For a few of the drives, mainly older ones, I couldn't get all the way through, but I got through to the platters, which is the important part:


Now you can see why eye protection is not optional. This old IBM SCSI-LVD drive had glass platters which shattered into tiny, sharp shards of metal-plated glass when the drill went through:


For extra assurance, I will soak the drives in a bucket of water for a few days before disposing of them:

Monday 8 December 2008

Stop spying on my encyclopedia reading

Recently the unaccountable UK "Internet Watch Foundation" added pages from Wikipedia to a secret list of censored pages.

I would like to make the point that no one should be prosecuted for reading an encyclopedia. Furthermore, no free, democratic society should tolerate authorities spying on people reading works of knowledge.

Let's together stop this spying now.

Here is a simple action that you can take right now, that won't cost you any time or money. When linking to any page on Wikipedia, use the secure URL:
https://secure.wikimedia.org/wikipedia/en/wiki/Main_Page
(Replace Main_Page with the name of the Wikipedia page as usual. You can also replace the language (en) or link to commons pages).

Sunday 7 December 2008

Fedora Rawhide has OCaml 3.11.0

Fedora Rawhide [the experimental/development version of Fedora] has been completely rebuilt with OCaml 3.11.0, and all library and application problems that were found have been patched.

List of packages: http://cocan.org/fedora#Package_status

Mailing list announcement

Slashdot groupthink

This may be the first time a comment of mine has been modded down to -1 on Slashdot. I'm questioning whether the inefficiency of glib outweighs the speed advantage of C. Very few of the replies get it. Perhaps this proves that people only read the first sentence of any posting ... tl;dr.

Wednesday 26 November 2008

Fedora 10 & OCaml

You can join in the general Fedora 10 fun here, but a quick note that Fedora 10 comes with stable OCaml 3.10.2 and 68 OCaml packages, making us the fastest, best supported functional language in Fedora.

Monday 24 November 2008

Common mistakes cross-compiling MinGW packages

Using the headers from /usr/include

The headers in /usr/include are for the native libraries installed on the system, and it's highly unlikely they will work for cross-compilation. By "won't work" I mean that types and structure fields could be different, resulting in a segfault.

The Fedora MinGW project takes two steps to avoid using native libraries by accident: Firstly GCC is configured so it looks in /usr/i686-pc-mingw32/sys-root/mingw/include and never looks in /usr/include (as long as you don't tell it to). Secondly we supply a replacement %{_mingw32_configure} RPM macro which sets PKG_CONFIG_PATH, so any pkg-config done will pick up the cross-compiled libraries' configuration instead of any native libraries' configuration.

$ PKG_CONFIG_PATH=/usr/i686-pc-mingw32/sys-root/mingw/lib/pkgconfig \
pkg-config --cflags glib-2.0
-mms-bitfields -I/usr/i686-pc-mingw32/sys-root/mingw/include/glib-2.0
-I/usr/i686-pc-mingw32/sys-root/mingw/lib/glib-2.0/include

One thing that can still go wrong is that you don't have the cross-compiled library installed and it then picks up the native library. For example, you missed a BuildRequires line. That mistake usually becomes evident when the program tries to link, because linking a cross-compiled Windows binary to a native Fedora library won't work.

Not setting --prefix

You likely don't want to install Windows binaries and libraries under /usr or /usr/local. For a start it's better to keep Windows things in one place, and the packaging guidelines have specified that place to be /usr/i686-pc-mingw32/sys-root/mingw. But mainly it's not a good idea to mix up native and cross-compiled libraries, which will cause all sorts of problems as in the point above.

If you use %{_mingw32_configure} in RPM specfiles, or the mingw32-configure command, then paths will be set correctly for you.

Not using a portability library

If you're writing the program yourself, or if you're doing the often difficult work of porting an existing application, use a portability library to help you. Which you choose is up to you and depends on many factors, but we would recommend that you look at these ones:

Writing your own build system

While it's fashionable to dislike autoconf and m4 macros, it is still by far the easiest way both to build your program on multiple systems and to cross-compile it. So use autotools or cmake, and definitely don't write your own build system. Discourage other projects from writing their own build systems too.

This really comes down to bitter experience. Every project we have had to port that has used its own build system has been far more of a headache than those that just used autoconf or cmake.

Running programs during the build process

When cross-compiling, it's always a mistake to run programs during essential build steps. The problem is that you can't be sure that binaries can be made to work in the build environment. For Windows binaries, there is some chance of running them under Wine, but Wine itself is incompatible with autobuild environments like mock and Koji. Furthermore Wine only works on x86 platforms, and it's not possible to use it at all when cross-compiling from other architectures like PPC.

Running programs during make test is normal and useful though.

Wednesday 19 November 2008

Egg & "Verified by Visa"

Message sent to Egg today about Verified by Visa:

Dear Sir/Madam,

I would like to permanently opt out of "Verified By Visa" when making purchases online. It just moves the liability on to me and the technical implementation of it is frankly crap. If not, I'll cancel my card (I expect you'll be happy about that) since it's no longer useful for purchases.

If however you are going to introduce some scheme which is really secure, such as a hardware token or one-time credit card numbers or authorization by SMS message, then let me know.


Update (2008-11-20) — a dull form reply from Egg:

The Secure online code service is supported by Verified by Visa and MasterCard Secure Code. It protects your card with a password, giving you added security when you shop online.

When you make purchases online with participating retailers, you'll be presented with a receipt at the end of the checkout process. The receipt includes details of your purchase, showing retailer name, purchase amount and date. You sign the receipt using your personal password and click 'Submit' to proceed with the purchase. Without your password the purchase can't be completed.

This is a system that's been put in place by Visa and MasterCard. It's to provide a more secure service, when making purchases online.

Unfortunately, this isn't something we can remove from your Egg Card.

Thanks for your message.

Emily Stirling
Internet Customer Services

Er yes, thanks for nothing Emily. You don't mention the idiotic implementation or the fact that they are passing liability over to their customers. I'm cancelling my credit card and looking for a secure alternative.

Update (2008-11-24) — I can't believe it, the fuckers cancelled my credit card.

LWN.net has an interview with us about MinGW Windows cross-compiler

Here is the article link if you are an LWN subscriber:

http://lwn.net/Articles/307732/

If you're not an LWN subscriber, you can use this free link to get to the article:

http://lwn.net/SubscriberLink/307732/0efc7b75c5696ae5/

Please consider subscribing to LWN!

Sunday 9 November 2008

OCaml Users Meeting, Feb 2009, Grenoble

Sylvain is already organizing the next OCaml Users Meeting on 4th Feb 2009 in Grenoble, France.

The last meeting (rubbish photo I took below) was a great success, and since so much has happened in the community this year, I expect this one will be even bigger and better.



Update: Sylvain's announcement and the official photo

Sunday 2 November 2008

malloc failures

I can't put a comment on Debarshi's post, so I'll answer here. Debarshi complains about this comment by the "inimitable" Jeff Johnson:

You have to look at the usage case, malloc returning NULL is a "can't happen" condition where an exit call is arguably justified.

Returning an error from library to application when malloc returns NULL assumes:

1) error return paths exist [...]
2) applications are prepared to do something meaningful with the error

Another problem is that only about 1 in 10 memory allocations in a typical C program are mallocs. The rest are stack-allocated variables, and those aren't usually checked at all. If any of your 9 out of 10 stack allocations fail, your whole program fails hard.

This is the correct way to deal with those 1 in 10 memory allocations that you can check — provide a custom abort function that the main program can override in the very rare case that they can do anything useful other than exit:

#include <stdlib.h>

/* Default abort handler; the main program may override it. */
void (*custom_abort) (void) = abort;

void
lib_set_custom_abort (void (*new_abort) (void))
{
  custom_abort = new_abort;
}

/* malloc wrapper: calls the abort handler instead of returning NULL. */
void *
lib_malloc (int n)
{
  void *data = malloc (n);
  if (data == NULL) custom_abort ();
  return data;
}
Note that the main program can use longjmp (or exceptions in some cases) to "return" back to a safe point in the program, such as a transaction checkpoint. If the main program uses pool allocators — about the only safe and sensible way to deal with C's programming model — then the program has a chance of recovering.

Really the answer is to use a sensible programming language though. Programming languages invented before C had safer, faster memory allocation, dealt with 10 out of 10 memory allocation errors, and provided a mechanism to recover correctly. Those languages are now 30 years more advanced. In 2008 we're having these silly arguments about how to deal with malloc failures. That's a failure of ourselves as programmers.
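
To make that concrete in OCaml (a minimal sketch, not part of the original discussion): an allocation failure surfaces as the standard Out_of_memory exception, so the program can unwind to a checkpoint instead of aborting.

(* Run one "transaction"; if any allocation inside it fails, unwind to
 * this checkpoint and let the caller decide whether to retry or give up. *)
let run_transaction (f : unit -> 'a) : 'a option =
  try Some (f ())
  with Out_of_memory -> None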

Thursday 16 October 2008

MinGW: It's got an icon and it works!

As you can see from the screenshot below, I addressed Nicu's complaint and added a simple icon to the virsh (virt shell) EXE file. Here's how to do that, again using all open source tools. We also fixed a rather embarrassing endianness bug in our XDR implementation, and so virsh/libvirt can talk to remote libvirtd servers.

Tuesday 14 October 2008

MinGW: Screenshots

Previously I showed you how to build software on Fedora which will run on Windows. You can run this software on Fedora using Wine, but it's also nice to know that it even runs on a real Windows machine. Here are some screenshots of the installer running under Windows XP. Remember that this was entirely created on a Fedora Linux system, using completely open source software:


The menubar across the top of the screen comes from virt-viewer. Windows is running virtualized.




Notice the desktop shortcuts for each application, added by nsiswrapper automatically. In future we'll actually want to disable some of these since they don't really make sense for command line applications.


Who's the "surfer dude"?

Sunday 12 October 2008

IDN use and abuse

JWZ blogged about the Unicode snowman. If you're running a proper browser, take a close look at the domain name:
☃.net

A brief, two sentence overview: For any domain name which begins with xn-- followed by some gobbledygook, certain clients like web browsers can interpret the gobbledygook as a Punycode representation of some Unicode string. So the snowman's real domain name is xn--n3h.net.

I was quite excited for a while since many of these Unicode dingbats and symbols are unregistered in combinations of two or more, but then I found that the killjoys at the IETF had put a stop to that with RFC 4690. So while the snowman registration can be continued, no new dingbats can be registered.

Nevertheless, we can still have fun abusing the simpler Chinese characters. For a laugh I registered 丄.com and 丿乀.com. These might not be active when you read this, and to be honest I'm not quite sure where I'll point them at the moment. The first looks like bottom, the symbol for non-terminating programs. Hmmm maybe that'd be good for some insightful blog about functional programming? The second is a total abuse of two characters together, but looks like the number 8 in Japanese (IETF rules forbid registering actual numbers, even non-Arabic ones).

Someone should do the world a favour and register 丅丨丅.com (xn--9gqa8h.com).

Update

Subdomains are of course not regulated by the IETF jobsworths. Here's another, prettier unicode snowman: http://☃.earthlingsoft.net/, and I can have http://☆☆☆.annexia.org/

Friday 10 October 2008

MinGW: Compile software for Windows without leaving your Fedora machine

For the last few weeks I've been focused on the Fedora MinGW project. This project gives Fedora users a compelling new feature: you can build your software for Windows, without ever needing to leave the Fedora / Linux environment. In fact you can do everything, up to and including creating a Windows installer for your customers, without once needing to touch Windows.

To demonstrate how this works, I'm going to show you how to port a simple application to Windows, using Fedora MinGW. The app I've chosen is virt-viewer, a graphical console viewer for virtual machines, written in C.

First we install the cross-compiler environment and any libraries that our program requires. (Until the MinGW packages are accepted into Fedora, you'll have to get them from our temporary yum repository.)

yum install mingw32-gcc mingw32-binutils \
mingw32-gtk2 mingw32-gtk-vnc mingw32-libvirt mingw32-libxml2 \
mingw32-nsis mingw32-nsiswrapper

With software such as virt-viewer that is based on the standard autoconf "configure" script, the cross-compiling step is simple. You just have to do:

./configure --host=i686-pc-mingw32

That's all you have to do to configure virt-viewer (and most other software) to cross-compile for Windows.

Now we just do make and discover ... ah, that it doesn't compile. This leads us to the hard part of porting software over to Windows. Windows uses the Win32 API instead of the usual POSIX / libc API found on Linux.

For virt-viewer there are several problems:

  1. virt-viewer uses some header files like <sys/socket.h> which aren't found under Win32.
  2. We need to include <windows.h> on Windows (but not on Linux). For Win32, this header file is analogous to <stdlib.h> or <unistd.h>, and almost every C source file should include it.
  3. virt-viewer makes some Linux-specific system calls which aren't available in the Win32 API. The problematic calls are:

    • usleep (sleep for a specified number of microseconds)
    • fork (create a subprocess)
    • socketpair (create a pipe to communicate with the subprocess)

Problems (1) and (2), the missing header files, are easily solved in a very portable way. For each header file which is missing on Windows or Linux, we will just add a configure-time test and some #ifdef magic. Into configure.ac we put:

AC_CHECK_HEADERS([sys/socket.h sys/un.h windows.h])

and then into the C sources files we put:

#ifdef HAVE_SYS_SOCKET_H
#include <sys/socket.h>
#endif

and so on.

Problem (3) -- missing APIs -- is the hardest to solve. In general there are three strategies we could try:

(a) Try to find an equivalent but different API which is present on Linux and Windows. As an example here, Windows has a call which is very similar to pipe, and might be used to replace socketpair.
(b) Write a replacement function for each problematic API.
(c) Comment out the particular feature in the code which uses the missing calls. This is less satisfactory of course: Windows users will now be missing some feature.

We're going to fix problems in (3) with a mixture of strategies (b) and (c).

Windows doesn't have usleep, but looking at MSDN I see that it does have a function Sleep (DWORD milliseconds) which can be used as a replacement for usleep.

You can test and replace functions conditionally by adding this to configure.ac:

AC_REPLACE_FUNCS([usleep])

Remember that you don't want to replace usleep on Linux or on any other platform that already has it, and that is exactly what AC_REPLACE_FUNCS takes care of.

The code to implement usleep is now placed into a single function in a file with the same name, usleep.c:

#ifdef WIN32
#include <windows.h>

/* Replacement for usleep, using the Win32 Sleep function
 * (which takes milliseconds). */
int
usleep (unsigned int usecs)
{
  unsigned int msecs = usecs / 1000;
  if (msecs < 1)
    Sleep (1);
  else
    Sleep (msecs);
  return 0;
}
#endif


The magic of autoconf will ensure this file will only be linked into the main program when it is needed.

As for fork and socketpair, it turns out we are quite lucky. These two calls are only used to implement a specific virt-viewer feature, namely tunneling connections over ssh. If you conclude, as I did, that ssh isn't that common on Windows machines, then you can do as I did and just comment out that feature conditionally when building on Windows.

With those changes, we have now completed our port of virt-viewer to Windows (full patch). After rerunning:

autoconf
./configure --host=i686-pc-mingw32
make

we are left with virt-viewer.exe, a full Gtk application that runs on Windows.

Creating a Windows installer


To package up Windows applications into full-featured installers that include menu shortcuts, desktop icons and an uninstaller, we wrote a little helper program called nsiswrapper. As its name suggests, it is a wrapper around the NSIS Windows Installer, which we also ported over to run natively under Fedora.

You'll need to wrap up not just virt-viewer.exe, but the Gtk-related DLLs and helper modules. With nsiswrapper you would do:

nsiswrapper --run \
--name "Virt-Viewer" \
--outfile "Virt-Viewer-for-Windows.exe" \
--with-gtk \
/usr/i686-pc-mingw32/sys-root/mingw/bin/virt-viewer.exe

Saturday 20 September 2008

What can OCaml do that you can't do in other programming languages?

Depressingly predictable comments on the Haskell story on Slashdot yesterday, such as:
"I have *never* seen it being used since. To my mind they both belong in the category 'interesting, but pointless'."
and
"The point is that there's nothing those languages can do that can't be done, often more easily, with the current crop of popular languages. Elegance cannot beat convenience in the workplace, or in most at any rate."
and so on.

Here are some useful things you can do in OCaml which you cannot do in the "current crop of popular languages".

1. Change the language to suit your task.

In our case the task was to parse binary formats.

If you have a few hours to spare, try writing a C program to parse tcpdump files. Don't forget that the endianness in these files is variable so you'll need lots of if (..) { field1 = htonl (field1); ... }. OK so that's a bit hard. Let's say you want to parse a 6 bit length field 'n' followed by an n+1 bit data field (as a 1-64 bit int). Go and write it in C now.

Using our bitstring extension to OCaml, parsing binary structures is really effortless:

let bits = Bitstring.bitstring_of_file "input.data" in
bitmatch bits with
| { n : 6;
    data : n+1 } -> data

This is the complete tcpdump parser, just 113 lines of code.

And it's fast too. The resulting code compiles down to machine code and inlines direct calls to C functions at every opportunity, so in practice you can parse data at speeds approaching C.

Another task we had was to check that hundreds of SQL statements in a modest sized web application actually matched with fields in the database, in other words that they didn't reference non-existent fields or treat a string field as an integer and so forth. Doing this by testing is almost impossible because many SQL statements are only used on rare error paths. We could contemplate doing it manually once but not routinely.

So instead we extended the OCaml language to do the checking for us every time we compiled the code. The resulting PG'OCaml project gives you compile-time-checked, type-safe access to your database. It's used by mod_caml and ocsigen, or you can just use it in standalone programs. I won't go into detail because Dario Teixeira wrote a great introduction and tutorial to PG'OCaml.
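
To give a flavour, here is roughly what such a checked query looks like (a hedged sketch, not taken from the tutorial; the users table, its NOT NULL id and name columns, and the connection defaults are all assumptions):

(* The SQL string below is parsed and checked against the live database
 * schema at compile time, and the row type is inferred (here int32 * string). *)
let print_users () =
  let dbh = PGOCaml.connect () in
  let rows = PGSQL(dbh) "select id, name from users" in
  List.iter (fun (id, name) -> Printf.printf "%ld %s\n" id name) rows;
  PGOCaml.close dbh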

Martin Jambon, author of Micmatch which adds regular expressions directly into the language, has an excellent tutorial. Browse the list of syntax extensions here.

2. Get the compiler to check for errors in your logic

Why can't your compiler catch errors like using a read-only database connection for a SQL INSERT statement, or calling a library's get_word function before you have called set_word_size?

In OCaml, using phantom types [tutorial] you can do exactly this.

3. Drop down to imperative code when you need speed like C

Scripting languages are expressive but slow ... and getting slower. But there's no need to compromise expressiveness to get real speed. Statically typed languages with type inference (which excludes C, C++, Java, C# and most others, but includes SML, OCaml and F#) give you the expressiveness to write compact code without compromising on speed.

OCaml programs are typically 10-15 times faster than Python programs, and use far less memory.

More about speed and optimizing OCaml programs. Comparison of a raytracer written in C++ and OCaml.

4. Large well-established standard libraries

OCaml comes with hundreds of packages for many tasks now.

If that's not enough you can directly call out to Perl, Java, Python and .Net [on Windows] libraries. Or you can call C functions directly.

You can compile the same code on Unix, Mac OS X and Windows.

So enough of this "elegance can't beat convenience" stuff please.

Tuesday 16 September 2008

Tip: Read all lines from a file (the most common OCaml newbie question?)

Is this the most common OCaml beginners question? It comes up every few weeks on the OCaml beginners list, and I have tried to answer it before.

It seems that everyone who learns OCaml comes away with the impression that functional programming is the New Cool Thing, and imperative programming is Bad and Must Be Avoided.

I'm going to say it now: programming fashions are stupid and counterproductive. The only things that matter are that your program is short, easy to write, easy to maintain and works correctly. How you achieve this has nothing to do with programming fads.

Reading all lines from a file is an imperative problem, and the shortest solution (easy to write, easy to maintain and correct1) uses a while loop, in OCaml or any other language:
let lines = ref [] in
let chan = open_in filename in
try
  while true do
    lines := input_line chan :: !lines
  done; []
with End_of_file ->
  close_in chan;
  List.rev !lines
Actually, no, I'm lying. The best solution is this:
Std.input_list chan
which is supplied by extlib. Don't bother to duplicate functions which are already provided in commonly available libraries.
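
For completeness, here is the extlib version wrapped into a reusable function (a minimal sketch, assuming extlib is installed):

(* Read all lines of a file using extlib's Std.input_list. *)
let read_lines filename =
  let chan = open_in filename in
  let lines = Std.input_list chan in
  close_in chan;
  lines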


1This is only strictly speaking correct if you handle clean-up if input_line throws some read error (exception). In the common case where you just exit the program, leaving the channel open is perfectly acceptable.

Home server, part 3, quick update

As I alluded to on the GLLUG mailing list, I got stuck installing Red Hat Enterprise Linux (RHEL) 5.2 on the Viglen MPC-L because of this bug when installing RHEL or CentOS on a Geode LX. Consequently I trashed the Xubuntu pre-install and managed to make the machine unbootable. The BIOS claims to support all sorts of useful boot methods such as booting from USB flash drives and over the network, but in practice nothing seemed to work for me except booting from the local hard disk.

This means I'll have to remove the hard disk (it's a 2.5" laptop-sized HDD) and transplant it into another machine, and to do this I'm waiting for the correct 2.5" to IDE adaptor cable from Maplin.

I think the most common question people have asked me is how fast the Viglen is, with its AMD Geode LX central processor. When I had Xubuntu on the machine and while I was failing to install RHEL, I played with it and the conclusions are mixed. Firstly it's fair to say that it is not a fast machine, but fast enough to be useful for (eg) light browsing. But I did observe that some specific things are really slow. For example the timezone / graphical map selection during the RHEL install is really slow, but the rest of the RHEL install was perfectly fine. According to Alan Cox the Geode is quite a strange beast. Some pieces of "hardware" are in reality emulated (certain PCI devices for one), so my theory is that when you hit a piece of code which uses these emulated devices in a certain way, suddenly everything slows down.

The true test of this will come when I get RHEL installed and have it running as a server.

Friday 12 September 2008

Home server, part 2, just photos

I received the Viglen MPC-L which will be part of my home server.

The first, rather unfair impression is "cheap Mac Mini", but to Viglen's credit it is very much cheaper than a Mac, about one quarter of the price, at a mere £80 inc. tax and delivery.



Front and back:





The full set is shown below. I was surprised that Viglen even bothered supplying a mouse and keyboard, since they don't supply the obvious missing bits:



... but what about all this waste? Viglen could do better to reduce this:

Wednesday 10 September 2008

MinGW - build Windows binaries from the comfort of your Fedora

MinGW is a Windows C/C++ compiler and toolchain based on gcc and binutils which runs on Linux, but generates Windows libraries (*.dll) and executables (*.exe). This means that when a customer asks you to write a Windows program, you can do it without leaving the comfort of your familiar Linux environment.

Today the Fedora MinGW Special Interest Group made a milestone release of the compiler toolchain and many basic libraries including Gtk and all its dependencies.

These aren't part of Fedora proper yet, but you can grab source RPMs and x86-64 binaries from my website here:

http://www.annexia.org/tmp/mingw/
(Still uploading as I write this)

If you prefer you can clone our Mercurial development repository. (Read the README file)

The current draft packaging guidelines may look complicated but if you start from our example spec file (mingw-example.spec) then that takes care of most packaging scenarios and dependency generation automatically.

Tuesday 9 September 2008

Home server, part 1

This is going to be a series about how I built a home network storage server for all my photos, music, videos, software etc.

I now have many computers at home, no central storage for them, and an incoherent backup regime. I've been planning a home NAS (network attached storage) server for some time.

The basic requirements are that I have at least 1 terabyte of usable storage, it must be available on the network for everyone in the house to use, and it must be regularly backed up. Additionally I'm planning that it will use RAID to give it some resilience, it should be easily and cheaply upgradable, and it will be fully encrypted to prevent catastrophic loss in case of theft.

Now the architecture is a bit unusual. Dedicated NAS systems are quite expensive, so instead I'm planning to use USB 2.0 external drives attached to a tiny Viglen MPC-L computer. This setup sacrifices some performance for an exceptionally low price (but then again, performance isn't too important when the only way to access it will be over a wireless network).

The Viglen MPC-L is a small, cheap, AMD Geode-based Linux box, akin to a Mac Mini (but considerably cheaper at £80 inc. tax and delivery). I also bought three 500 GB USB drives. The total cost (inc. delivery and UK sales taxes) is £260 (approx. US $400). That doesn't yet include backup which will be another 1 TB USB drive for around another $150-200.

Here are the three drives. Externally these are rather elegant FreeCom cases, and internally they are standard Samsung drives:



Here I'm formatting and testing the drives using my laptop. I'll cover the precise setup in a future article:

Sunday 7 September 2008

Writing printf-like functions

Here's a quick and useful tip: failwith is the standard function to raise a general error, but it's a little bit clumsy to use because it only takes a fixed string.
if temp >= 100 then
failwith "we've reached boiling point"
If you want to have the message contain useful debugging information, you need to use sprintf to generate the fixed string, like:
if temp >= 100 then
failwith (sprintf "%d degC: boiling point reached" temp)
(I'm assuming here that you have open Printf at the top of your file, something which you should almost always do so you don't need to write Printf.sprintf all the time.)

With this simple tip we can turn failwith into a function that automatically takes a printf-like format string, and we can learn a little bit about the arcana of polymorphic types too.

First of all, here is the code:
let failwith format = ksprintf failwith format
You can see in the toplevel that it works:
# let failwith format = ksprintf failwith format ;;
val failwith : ('a, unit, string, 'b) format4 -> 'a = <fun>
# failwith "hello, %s" "world" ;;
Exception: Failure "hello, world".
# failwith "error code %d" 3 ;;
Exception: Failure "error code 3".
ksprintf is the key function here. Like sprintf it takes a format string and a variable number of parameters, and makes a fixed result string. Unlike sprintf it doesn't return the string, but passes it to the function which is its first parameter — in this case, the standard failwith function. So ksprintf is useful because it can turn almost any fixed string function into a printf-like function.
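
For example, the same one-liner turns print_endline into a printf-style logging function (a trivial sketch; logf is just an invented name):

open Printf

(* logf "loaded %d records from %s" n filename
   prints the formatted message followed by a newline. *)
let logf fmt = ksprintf print_endline fmt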

Now how about the lesson on type arcana? Well if you know anything about currying you might think that you could write the new failwith function even shorter, like this:
let failwith = ksprintf failwith
If you try this, you'll find the new function works some of the time, but fails to type-check at other times. In fact, the first time you use it, it seems to "remember" the type of all the arguments, and then refuses to work if any of those types change:
# failwith "hello, %s" "world" ;;
Exception: Failure "hello, world".
# failwith "error code %d" 3 ;;
This expression has type (int -> 'a, 'b, 'c, 'd, 'd, 'a) format6
but is here used with type
(string -> 'e, unit, string, 'e) format4 =
(string -> 'e, unit, string, string, string, 'e) format6
If we take a close look at the inferred types of the wrong definition, we can see why:
# let failwith = ksprintf failwith ;;
val failwith : ('_a, unit, string, '_b) format4 -> '_a = <fun>
'_a (with an underscore) is not a polymorphic type, but a single type that the compiler just hasn't been able to infer fully yet. As soon as you give it more information (eg. by calling the function), the compiler fixes it to some concrete type (like string -> ... above) and won't let you change it later.

A more advanced question is to work out why type inference fails to infer the more general polymorphic type. I suspect this FAQ may have the answer.

Saturday 30 August 2008

Sharing a global variable between C and OCaml

I was asked today if it is possible to share a global variable between C and OCaml. This was my first response:

OCaml cannot just directly access C globals. At best you'd need to have a function that returns the address of the C global, _and_ the C global would need to be in a form that OCaml code could understand, although that is pretty easy to arrange.
In a followup, I gave some example code:

------------------------------------------------------------ test_c.c
/* Variable shared between C and OCaml. */

#include <caml/mlvalues.h>

/* On the OCaml side, this will be made to look like a structure
* containing a single int (well, provided you don't look *too*
* closely at it).
*/
value shared_var = Val_int (0);

value
get_struct_addr (value unitv)
{
  /* Return the address of the 'structure'. */
  return ((value) &shared_var);
}

/* Increment the shared variable. */
value
increment_it (value unitv)
{
  int i = Int_val (shared_var);
  i++;
  shared_var = Val_int (i);
  return (Val_unit);
}
----------------------------------------------------------------------

------------------------------------------------------------ test.ml
(* Variable shared between C and OCaml. *)

type var = {
  shared_var : int;
}

external get_struct_addr : unit -> var = "get_struct_addr" "noalloc"
external increment_it : unit -> unit = "increment_it" "noalloc"

let var = get_struct_addr () ;;

while true do
  Printf.printf "value of the variable now is %d\n%!" var.shared_var;
  increment_it ();
  (* OCaml isn't expecting that increment_it modifies a variable, so
   * there is no guarantee that we will see the changed value next
   * time around.
   *)
  Unix.sleep 1;
done
----------------------------------------------------------------------

$ gcc -I /usr/lib/ocaml -c test_c.c
$ ocamlopt -c test.ml
$ ocamlopt unix.cmxa test_c.o test.cmx -o test
$ ./test
value of the variable now is 0
value of the variable now is 1
value of the variable now is 2
value of the variable now is 3
value of the variable now is 4
[etc.]

Before you use this code, be aware that it's a real hack.

Thursday 21 August 2008

Tip: Calling C functions directly with "noalloc"

In OCaml you can call a C function by writing:

external call_1 : float -> float = "call_1"
then just using call_1. However these calls are not direct. They go via an OCaml runtime function called caml_c_call. This is a tiny bit of assembler, so the overhead isn't large, but it does use a computed jump which on many processors is quite slow.

Luckily this indirection is only needed in order to set up the garbage collector. If your C function won't perform any OCaml allocations, then you don't need this, and you can tell OCaml to jump directly to your C function like this:

external call_2 : float -> float = "call_2" "noalloc"

Let's compare the generated assembly code for the calls in both cases:


Normal "noalloc"

pushl %eax pushl %eax
movl $call_1, %eax call call_2
call caml_c_call addl $4, %esp
addl $4, %esp
...
caml_c_call:
movl (%esp), %edx
movl %edx, G(caml_last_return_address)
leal 4(%esp), %edx
movl %edx, G(caml_bottom_of_stack)
jmp *%eax


As you can see, the "noalloc" version is much shorter.

Tuesday 19 August 2008

Just draw something on the f-ing screen

I don't believe that computers have got better over the past 40 years.

Case in point: I've spent at least 2 hours trying to debug a Gtk program which is supposed to plot some dots on the screen. Like this sort of thing in barely remembered ZX Spectrum BASIC:

10 FOR F = 0 TO 2*PI STEP 0.1
20 PLOT SIN(F)*200+100, COS(F)*200+100
30 NEXT F

The Gtk program is 20 times longer than this. And refuses to draw anything except a black window.

Computers have got worse in many ways since my first computer.

Update

My angry late-night programming rant makes it to reddit.


Ob-awesome Wikipedia page of the week: List of 8 bit computer hardware palettes.

Saturday 16 August 2008

3 things that will confuse you when reading functional programs

Here are 3 things that will confuse you when you read a program written in a functional language. They're just small stylistic differences compared to more mainstream languages, and once you understand them you'll be able to read the code much more easily.

1. Function calls


Calls to functions are written differently, without any brackets or commas.

In C you would write:

printf ("hello %s\n", name);

but in OCaml the same function call would be written like this:

printf "hello %s\n" name;


What's the difference? In functional languages you don't put brackets around the arguments, and you just put spaces between the function name and the arguments.

Why is it confusing? It doesn't look like a function call (unless you're used to this style). Functions are very common in functional programming, not surprising really, and they are often given short names, so you'll see plenty of code like this:

f (g a b) c

Just take it one step at a time and remember that g a b is a function call (in C it would be written as g (a, b)), and that f (...) c is a function call with two parameters (in C it would be written as f (g (a, b), c)).

Why is it done like this? This syntax is better because it's shorter. Functions and function calls are very common in functional programming languages, so we need to use the shortest possible syntax, which is this one.

2. Bindings are not variables


let x = foo

x is not a variable. It's just a name which refers to foo, and there is no way to change its value. The technical term is a let-binding.

You can create another name x with a different value, but that doesn't change the original x or foo:

let x = foo in
let x = bar in
...

One consequence of this is that the following code doesn't do what you think it does:

let quit = false in
while not quit do
let line = read_line () in
if line = "q" then let quit = true in ();
print_endline line
done

In fact this loop never exits. Why? Because the inner quit is just a different label from the outer one. In this case you would get a compiler warning because the inner quit label is never used.

What's the difference? Let bindings make labels, not variables.

Why is it confusing? Variables do exist in some functional languages, particularly ones based on ML like OCaml, but they aren't used very much. Most code you look at will use only let bindings, and you shouldn't confuse those with variables.

Why is it done like this? This encourages the use of immutable data structures, which is a giant topic in itself. In brief, immutable data structures make programming errors less likely because they remove the "who owns that data" problem that imperative languages have. (It's fair to say that immutable data structures also have disadvantages, which is why OCaml lets you drop down to mutable data when you need it).
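
Coming back to the quit loop above: if you really do want a mutable flag, ask for one explicitly with a reference (a minimal corrected version of that loop):

let quit = ref false in
while not !quit do
  let line = read_line () in
  if line = "q" then quit := true;   (* := assigns to the mutable cell *)
  print_endline line
done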

3. Function types use lots of '->' (arrows)


To write the type of a function, you use an arrow notation. The parameters and the return type are separated by arrows like this:

val average : float -> float -> float

This means that there is a function called average which takes two parameters, both floating point numbers, and returns a floating point number.

Here are some more examples:

val print_string : string -> unit

(Takes one parameter, a string, and returns nothing -- unit is like void in C).

val int_of_string : string -> int

(Takes one parameter, a string, and returns an int).

val open_out_gen : open_flag list -> int -> string -> out_channel

(Takes three parameters: a list of flags, an integer and a string. Returns an output channel).

What's the difference? This syntax is common in functional programming, and almost completely unknown outside of it.

Why is it confusing? The parameters and the return type aren't separated from each other.

Why is it done like this? The reason is to do with a mathematical concept called currying. The practical reason is that functional languages let you generate new functions by partially applying some arguments. Thus:

add : int -> int -> int
(add 42) : int -> int
(add 42 2) : int

The first is the general adding function. The second is a partially applied function which adds 42 to any int. The third is the number 44.
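
Written out in full (a small sketch):

let add x y = x + y      (* add : int -> int -> int *)
let add42 = add 42       (* add42 : int -> int, adds 42 to anything *)
let answer = add 42 2    (* answer : int = 44 *)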

Sunday 10 August 2008

virt-mem 0.2.9

I'm pleased to announce the latest alpha release of the virt-mem tools, version 0.2.9.

These are tools for system administrators which let you find things like kernel messages, process lists and network information of your guests.

For example:

virt-uname: 'uname' command, shows OS version, architecture, etc.
virt-dmesg: 'dmesg' command, shows kernel messages
virt-ps: 'ps' command, shows process list


Nothing needs to be installed in the guest for this to work, and the tools are specifically designed to allow easy scripting and integration with databases and monitoring systems.

Source is available from the web page here:

http://et.redhat.com/~rjones/virt-mem/

The latest version (0.2.9) reworks the internals substantially so that we have direct access to basically any kernel structure, and this will allow us to quickly add the remaining features that people have asked for (memory usage information, lists of network interfaces and so on).

As usual, patches, feedback, suggestions etc. are very welcome!

Binaries are available for Fedora through this link:
https://bugzilla.redhat.com/show_bug.cgi?id=450713

Tuesday 17 June 2008

virt-ps and the Red Hat Summit

Tomorrow (Wednesday) through to Friday is the Red Hat Summit in Boston. If you're coming, please make sure to see my talk with Dan Berrange on the "Virtualization Toolbox", or all the little, useful tools we've been writing to help you manage your virt systems. That talk is tomorrow, Wednesday 18th June, some time after 11am.

As I mentioned previously on this blog I'm working on deep inspection of the internals of running virtual machines, and dressing this up as familiar, easy to use command line tools, such as virt-df and virt-dmesg. I'll be talking a lot more about those tomorrow, so I don't want to spoil the surprises.

The real question is whether I'll get virt-ps (process listings) working today. Getting the process listing out of a stuck virtual machine is immensely useful to find out what's going on with the machine. For example, did it blow up because there are too many Apache processes? Or is some other daemon causing trouble? I had an initial implementation of this working, but it was rather slow and unsatisfactory because of all the guessing and heuristics it had to do. In the meantime, I discovered that getting the Linux kernel version is quite easy, and once you know the kernel version you immediately reduce the amount of heuristics you need by a large factor. So the new implementation should be much faster.

Faster, but it doesn't work at the moment. Today is the final push on this - can I get virt-ps working in time for the demo tomorrow?

Friday 13 June 2008

virt-df, virt-mem and bitmatch

One of the "self-evident truths" about virtualization is that in order to monitor aspects of the virtual machine such as free disk space / free memory / running processes / etc, you need to run some software inside the virtual machine. That's how the expensive proprietary virt systems such as VMWare work. The alternative to running monitoring software in the virtual machine would be to try to look at the disk image and memory image of the virtual machine and try to make sense of it, and that is usually regarded as an intractable, unimplementable nightmare.

That's not a truth any more.

A few months ago I proved with virt-df that it's perfectly possible to parse the disk image of a virtual machine to find out how much free disk space is left. And although in theory this could get inconsistent results, in practice it works well. Virt-df supports Linux, Windows NTFS, and complex partitioning schemes such as LVM as well as simple DOS-style partitions.

Now I'm proving with the virt-mem toolkit that we can snoop on the live memory image of running virtual machines and pull out such interesting details as the process list, the network configuration, the kernel messages (dmesg), and the kernel version strings.

So how is this possible?

Of course I'm writing these tools in OCaml, which allows me to express complicated ideas in a very few lines of code, yet is as fast as optimized C code. But OCaml alone isn't sufficient because all of the disk and memory images that we're parsing are made from binary structures and the base language doesn't handle binary structures very well. OCaml though does allow programmers to change and extend the base language using LISP-style macros -- these are language extensions that become one with the original language, but enable you to extend it in arbitrary and unexpected ways.

I have already extended OCaml by adding the complete PostgreSQL SQL syntax through a macro, so this should be easy.

The key to being able to parse hundreds of different variations of the Linux task_struct (process table entry) so we can print process lists, was to look at a successful functional language which already supports bitstrings directly. Erlang is used in the telecom industry to parse binary network protocols and has a rich syntax for parsing and assembling bitstrings. I copied Erlang's syntax and using macros added it directly to OCaml. The result is the OCaml bitmatch project, hosted on Googlecode and with lots of examples and documentation.

Bitmatch is now a very advanced and powerful system for dealing with binary structures (more powerful than what the Erlang archetype offered). As an example, how much code would you expect to write in order to take an ext3 partition, find the superblock and print out the number of free blocks from the superblock? (A poor man's 'df' command.) You'll need to parse the Linux kernel header file <ext3_fs_sb.h> to get the fields, their offsets, sizes, endianness, etc. Then you'll need to load the superblock from disk and parse it using bit operators.

The answer in bitmatch is just 9 lines of code:

(* bitmatch-import-c ext3.c > ext3.bmpp *)
open bitmatch "ext3.bmpp"

let () =
  let fd = Unix.openfile "/dev/sda1" [Unix.O_RDONLY] 0 in
  let bits = Bitmatch.bitstring_of_file_descr_max fd 4096 in

  bitmatch bits with
  | { :ext3_super_block } ->
      printf "free blocks = %ld\n" s_free_blocks_count
  | { _ } ->
      printf "/dev/sda1 is not an ext3 filesystem\n"

Monday 26 May 2008

Phantom types

In libvirt we have a concept of a connection to the hypervisor, and every libvirt operation happens across this connection. Connections are normally read/write, which means that you can carry out any operation, even destructive ones like shutting down a virtual machine. But since programs normally need to have root (or "elevated") privileges to do these destructive operations, we also introduced the concept of a read-only connection which can just be used for non-destructive things such as listing out virtual machines or monitoring CPU loads.








Read-only operations:
get hypervisor type
get version
get hostname
get URI
get num CPUs
list VMs
list networks
get VM CPU stats

Read/write operations:
suspend VM
resume VM
shutdown VM
destroy VM
coredump VM
create VM
change CPU pinning


The question is, what happens if a program tries to call a destructive operation using just a read-only connection? The answer, naturally, is that libvirt throws a runtime error if you try this.

Wouldn't it be nicer if the compiler could find these sorts of errors? That way you could be sure that your compiled program was completely correct, and there wasn't some sort of dark corner of untested code which might fail in rare circumstances. For the OCaml bindings to libvirt that was exactly the sort of guarantee I wanted to make, and to do it I used a language feature which is often thought of as being confusing or highly expert (but in fact is neither of those) ... phantom types.

Before we can understand phantom types, we need to set up a little bit of background: firstly asking what sort of errors it is conceivable for the compiler to find, and secondly understanding a bit more about polymorphic types and variants (which are the toolbox we use to construct phantom types).

Compile time or run time?


To ask if phantom types are useful we need to know first if they are applicable to our problem. Phantom types find only compile time errors. Let's understand what a compile time error could be in libvirt:

let conn = Libvirt.Connect.connect_readonly ()
let dom = Libvirt.Domain.lookup_by_name conn "test"
...
Libvirt.Domain.destroy dom (* fail! *)

Clearly here the dom ("domain" = virtual machine) is derived from a read-only connection (conn), so the call to destroy must fail. It is also clear that the compiler can trivially know that conn / dom are read-only, because of the special connect_readonly call.

Contrast that with the following code:

printf "Connect read-only?" ;;
let readonly_flag = read_line () = "y"
let conn = Libvirt.Connect.open readonly_flag
(* etc. *)

Now the compiler cannot possibly predict what the user will type in at runtime, so this is not a situation where the compiler or phantom types can help. (The Libvirt.Connect.open call is one I just made up, it's not a real call precisely because of this problem).

So libvirt has two separate calls to create connections, connect_readonly which makes read-only connections and connect which makes read/write connections. The connections returned have different but compatible types, and because of this "compatibility" the connections can still be passed to the common non-destructive functions. Even more amazingly, the extra "is read/write" flag which is attached to the connection simply disappears at runtime, which means there is zero loss of efficiency when using phantom types (I'll explain a bit more about that below).

It's important to keep a very clear distinction in your head about what happens at compile time versus what happens at run time.

Mysterious polymorphic types


You may be familiar with polymorphic types such as 'a list, lists of any type, where 'a (pronounced alpha) stands for all types. Here's another example of a polymorphic type, a structure which contains a "free" field that can store any particular type:

type 'a t = { data : 'a }

We can make a structure containing string data for example. Notice its type:

# { data = "hello" } ;;
- : string t = {data = "hello"}


What is perhaps not so well known is that you can add extra polymorphism to any type you like, even when it doesn't seem to need it.

This just makes t an alias for the floating point type:

type t = float

But the following (legal) definition is much more mysterious:

type 'a t = float


What does it do, and how can you create anything of this type? The answer is that any float has this type, as you can prove very easily in the OCaml toplevel:

# (3.0 : unit t);;
- : unit t = 3.
# (10.4 : string t);;
- : string t = 10.4

Because the 'a isn't needed, it can be set to any type (unit and string in the examples above).

This isn't yet the "phantom" type though. It's tempting to think we could write a function which only works on string t (I'll call them "stringies"):

# let add_stringies (a : string t) (b : string t) = (a +. b : string t) ;;
val add_stringies : string t -> string t -> string t = <fun>

But in fact this function does not work!

# add_stringies (3.0 : unit t) 5.0 ;;
- : string t = 8.

This is because unit t and string t can be freely unified with each other because the compiler knows that both are really just floats:

# ((3.0 : unit t) : string t) ;;
- : string t = 3.


To prevent this "rogue" unification and allow us to write a function like add_stringies correctly, we have to hide the real type of t inside a module, like this:

module T : sig
  type 'a t
  val add_stringies : string t -> string t -> string t
end = struct
  type 'a t = float
  let add_stringies a b = a +. b
end

Module T is completely useless by the way, it's just setting the stage for ...

Phantom types


OCaml might have saved NASA from losing a $125 million space probe if only they'd had the following Length module which prevents programmers from mixing imperial and metric measurements:

module Length : sig
  type 'a t
  val meters : float -> [`Meters] t
  val feet : float -> [`Feet] t
  val (+.) : 'a t -> 'a t -> 'a t
  val to_float : 'a t -> float
end = struct
  type 'a t = float
  external meters : float -> [`Meters] t = "%identity"
  external feet : float -> [`Feet] t = "%identity"
  let (+.) = (+.)
  external to_float : 'a t -> float = "%identity"
end

open Length
open Printf

let () =
  let m1 = meters 10. in
  let m2 = meters 20. in
  printf "10m + 20m = %g\n" (to_float (m1 +. m2));
  let f1 = feet 40. in
  let f2 = feet 50. in
  printf "40ft + 50ft = %g\n" (to_float (f1 +. f2));
  (*printf "10m + 50ft = %g\n" (to_float (m1 +. f2)) (* error *) *)

(Try compiling and running this, then try uncommenting the very last line and compiling it again).

Just for readability, we're using polymorphic variant types instead of the basic types (unit and string from the previous section). This allows us to substitute meaningful type names like [`Meters] t, which I hope makes it obvious what it contains. It also means the error messages from the compiler will be easy to read - that's important because we are expecting to get a compiler error each time the programmer makes a mistake.

Now look at the functions we have:

Two functions (meters and feet) convert floating point numbers into meters or feet respectively. But the implementation of these functions is completely efficient. They're just the identity operation (which the compiler turns into a null operation). At compile time, the values have this extra type information. But at run time the overhead evaporates completely. At run time, these are just floats.

The addition function is constrained: 'a t -> 'a t -> 'a t. This means you can use it on two meter measurements, or two feet measurements, but you cannot mix meters and feet. Furthermore the return type is the same as the input types, so this safety cascades through all code.

Finally we need a way to extract and print the results. I've included a rather naive to_float function, but for better safety we'd probably want to define special print functions which ensure that the output indicates the correct type back to the user.
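
For example, the hypothetical printers could be added like this (a sketch; print_meters and print_feet are my invented names, not part of the module above):

(* In the signature:
 *   val print_meters : [`Meters] t -> unit
 *   val print_feet   : [`Feet]   t -> unit
 * and in the implementation, where 'a t = float: *)
let print_meters m = Printf.printf "%g m\n" m
let print_feet f = Printf.printf "%g ft\n" f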

Let's go back to our read-only/-write connections from the beginning. Remember that simple non-destructive status functions are safe for anyone to call, but destructive functions must only be used when you have a read/write connection. To express this we will use a slightly different feature of polymorphic variants, that is being able to express open types using [>...]. A read-only connection will have type [`Readonly] t and a read/write connection will have type [`Readonly|`Readwrite] t which means that it's compatible with the read-only type but has the extra read/write ability.

Status functions have type : [>`Readonly] t -> ... because they work with "read-only or greater".

Here is our module:

module Connection : sig
  type 'a t
  val connect_readonly : unit -> [`Readonly] t
  val connect : unit -> [`Readonly|`Readwrite] t
  val status : [>`Readonly] t -> int
  val destroy : [>`Readwrite] t -> unit
end = struct
  type 'a t = int
  let count = ref 0
  let connect_readonly () = incr count; !count
  let connect () = incr count; !count
  let status c = c
  let destroy c = ()
end

open Connection
open Printf

let () =
  let conn = connect_readonly () in
  printf "status = %d\n" (status conn);
  (*destroy conn; (* error *) *)
  let conn = connect () in
  printf "status = %d\n" (status conn);
  destroy conn

Again, after compiling this, try uncommenting the rogue call to destroy and notice that the error is caught by the compiler.

Here's another example, again it comes from actual deployed code. A "memory map" is a view into the memory of a virtual machine. When we first load a memory map we don't know anything about the endianness or word size of the virtual machine. We have to inspect the memory map first to determine endianness (little or big endian) and word size (32 or 64 bit pointers). Once we've determined both, we offer additional operations which work on the integers and pointers contained in the memory map.

So there are four "subtypes" (states?) of memory map, summarized in the diagram on the left.

Can we use phantom types to ensure that the transitions happen before specific calls are allowed? Yes we can!

We start by defining our phantom type. This is how it appears in the module signature (the mli file). The specifics of the implementation (from the ml file) aren't important here. Note that because there are two degrees of freedom (word size and endianness), there are two phantom types attached to t:

type ('a,'b) t

The basic of_file function makes a memory map from a file descriptor and base address. It returns a memory map with no word size or endianness, which is pretty clearly expressed in the return type:

val of_file : Unix.file_descr -> addr -> ([`NoWordsize], [`NoEndian]) t

Some functions work with any memory map, whether or not word size and endianness have been set. For example, the find function searches for strings in the memory map:

val find : ('a, 'b) t -> ?start:addr -> string -> addr option


A related function is find_align which finds strings that are aligned to the word size. This function cares that word size has been set, but not endianness, and its type is therefore:

val find_align : ([`Wordsize], 'b) t -> ?start:addr -> string -> addr option


The find_pointer function looks for pointers appearing in the memory map. Pointers have both endianness and word size implicitly, so this function can only be used when both have been set on the memory map. Its type is:

val find_pointer : ([`Wordsize], [`Endian]) t -> ?start:addr -> addr ->
                   addr option


Finally of course we need a way to supply the word size and endianness to the memory map. Detecting these is a difficult and drawn-out process, so our main code will set each one individually by calling the following two functions:

val set_wordsize : ([`NoWordsize], 'b) t -> wordsize ->
                   ([`Wordsize], 'b) t
val set_endian : ('a, [`NoEndian]) t -> endian ->
                 ('a, [`Endian]) t

The compiler guarantees here are quite strict. The programmer may call set_wordsize and set_endian in either order, but each may be called only once, and functions like find_pointer cannot be called until both have been called.
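
Put together, a typical calling sequence looks something like the sketch below. Here fd and baseaddr come from the caller, and detect_wordsize and detect_endian are hypothetical stand-ins for whatever detection code the real program uses:

let load_and_search fd baseaddr =
  let map = of_file fd baseaddr in              (* ([`NoWordsize], [`NoEndian]) t *)
  let map = set_wordsize map (detect_wordsize map) in
  let map = set_endian map (detect_endian map) in
  (* only now does the compiler accept a call like this: *)
  find_pointer map baseaddr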

Personally I think this is a pretty awesome language feature. I'm not sure whether any non-functional languages can make these sorts of guarantees at compile time. (Perhaps Eiffel?)

One interesting thing is that the implementation of the memory map still has runtime checks in those functions (such as find_pointer) which need it. But you can be sure that the runtime checks will never fail. An improvement to the code would be to write it so it doesn't need any runtime checks at all.

Runtime and phantom types



My final example with phantom types will be to explain how to add runtime checks back in. Some situations simply cannot be checked at compile time, yet don't represent errors. For example in mlvirsh users can open read-only or read/write connections to the hypervisor. We deal with this by adding a limited amount of dynamic typing back into the system:

type conn_t =
| No_connection
| RO of Libvirt.ro Libvirt.Connect.t
| RW of Libvirt.rw Libvirt.Connect.t

Of course we also need to add runtime checks to the program.
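
For example, a destructive command has to unwrap the connection at run time before it can reach the statically typed libvirt API. A minimal sketch, with names assumed (this is not the actual mlvirsh code):

let get_readwrite_conn conn =
  match conn with
  | RW c -> c                   (* from here on c has the static read/write type *)
  | RO _ -> failwith "this command requires a read/write connection"
  | No_connection -> failwith "not connected to the hypervisor"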

I will argue that this is an acceptable trade-off and still much better than languages which only do dynamic typing. We get the benefits of static types / early compiler errors when it's possible to check for them, but we can slip back to dynamic typing in the situations where it's necessary. Out of the entire suite of virt tools, just one (mlvirsh) needs runtime types.

Sunday 25 May 2008

Extending immutable data structures (part 2)

[This is part 2 of my series about extending immutable data structures in OCaml. You may want to go back and read part 1 first]

Previously I took this definition of an XML document tree from xml-light and asked how we could annotate nodes with extra data.

type xml =
| Element of (string * (string * string) list * xml list)
| PCData of string

If you "think outside the box", or in this case literally outside the structure, you might see that you can add annotations using a hash table. The idea is that we'll hash the addresses of nodes to the annotated data.

Let's have a concrete example to make this all a bit clearer. Start with this simple document tree:

<a href="http://camltastic.blogspot.com/">Camltastic!</a>

which is written in OCaml as:

let str = PCData "Camltastic!"
let doc = Element ("a",
                   ["href", "http://camltastic.blogspot.com/"],
                   [str])

There are two nodes here (str and doc) and both have type xml.

A simple, and wrong, way to use a hash table to annotate these nodes is like this:

let h = Hashtbl.create 13
let add_annot node data = Hashtbl.add h node data
let get_annot node =
  try Some (Hashtbl.find h node)
  with Not_found -> None

It seems to work:

# add_annot doc "doc is annotated with this string" ;;
- : unit = ()
# get_annot doc ;;
- : string option = Some "doc is annotated with this string"
# get_annot str ;;
- : string option = None

But there are some subtle and not-so-subtle problems with this approach.

The first problem is that Hashtbl, by default, uses structural equality to hash nodes. If you had a document like <a>Link</a><a>Link</a> then there is no way to attach different data to the two links, because both links appear to be equal (they are equal, structurally). Attaching data to one link attaches the data to both.
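
To make the problem concrete, continue the toplevel session above with two structurally identical links; annotating one (apparently) annotates both:

# let link1 = Element ("a", [], [PCData "Link"]) ;;
val link1 : xml = Element ("a", [], [PCData "Link"])
# let link2 = Element ("a", [], [PCData "Link"]) ;;
val link2 : xml = Element ("a", [], [PCData "Link"])
# add_annot link1 "annotation intended for link1 only" ;;
- : unit = ()
# get_annot link2 ;;
- : string option = Some "annotation intended for link1 only"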

The two other problems with this are garbage collection and copying. Garbage collection is defeated because the hash table maintains a permanent pointer to any annotated nodes. What we would like to happen is that if a node becomes garbage, the collector frees the node and any annotations on the node. In fact this never happens: the permanent pointer held by the hash table prevents any node from becoming garbage. The copying problem is that when the main program copies a node, the annotations aren't copied across.

Enter the Weak Hash Table



OCaml has a rarely-used but rather elegant module in the standard library called Weak. This module implements weak pointers. A weak pointer is like an ordinary pointer, but it doesn't "count" towards garbage collection. In other words if a value on the heap is only pointed at by weak pointers, then the garbage collector may free that value as if nothing was pointing to it.
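
As a quick illustration of the raw mechanism (separate from the annotation problem), here is a small program using a weak array from the Weak module. Whether the string survives the collection is up to the garbage collector, which is exactly the point:

let () =
  let arr = Weak.create 1 in                      (* weak array of length 1 *)
  Weak.set arr 0 (Some (String.copy "hello"));    (* only a weak pointer remains *)
  Gc.full_major ();                               (* force a full collection *)
  match Weak.get arr 0 with
  | None -> print_endline "the string was collected"
  | Some s -> print_endline ("still alive: " ^ s)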

A weak hash table uses weak pointers to avoid the garbage collection problem above. Instead of storing a permanent pointer to the keys or values, a weak hash table stores only a weak pointer, allowing the garbage collector to free nodes as normal.

It probably isn't immediately obvious, but there are 3 variants of the weak hash table, to do with whether the key, the value or both are weak. (The fourth "variant", where both key and value are strong, is just an ordinary Hashtbl). Thus the OCaml standard library "weak hash table" has only weak values. The Hweak library has weak keys and values. And the WeakMetadata library (weakMetadata.ml, weakMetadata.mli) is a version of Hweak modified by me which has only weak keys.

Which sort of weak hash table do we need? If you look at the Hashtbl implementation above you'll see that the key corresponds to the xml node and the value corresponds to the metadata attached to nodes. We assume that the main program holds onto a pointer to the xml tree until it is no longer needed, whereupon it should be collected. So we definitely need weak keys. How about the values (annotations)? The main program probably isn't holding any references to the annotations - it expects to annotate a node and then forget about the annotation until it is needed later. So values should not be weak (otherwise our annotation library will be rather "forgetful"). You'll therefore need to use my modified WeakMetadata library to get around the garbage collection problem.

Using WeakMetadata you can annotate immutable data structures, as this test program shows. Although we have solved the garbage collection problem, the other two problems I mentioned remain. Namely, it's still using structural equality, and if the main program copies a node, then the annotations don't get copied too.

However there are some rather more insidious problems that remain with this whole approach, which are down to how weak pointers are actually implemented inside the OCaml garbage collector. I wrote about them in this posting on the OCaml mailing list which also includes some analysis of the performance and memory usage of annotating XML document nodes.

Part 3 (next week) will look at what I hope is the "final word" on safely extending data structures.

Tuesday 20 May 2008

Extending immutable data structures (part 1)

Functional languages are famous for letting the programmer build complicated data structures effortlessly. It's very compelling to describe the type of any XML document in just three lines of code:

type xml =
| Element of (string * (string * string) list * xml list)
| PCData of string
And I won't recur (pun intended) on the subject of how easy it is to write code to process these data structures.

However there is one thing which is difficult to do, and that is to meaningfully extend or annotate such data.

Let's say as a practical example I want to write a CSS renderer for XML documents. A first step to doing this is to annotate each node in the XML document with the list of styles that apply to it. (Styles are inherited, so natural recursive algorithms work quite well for this). The second step is to take what CSS calls "replaced elements" (elements like <IMG> that have an intrinsic size not known to CSS itself) and compute their sizes.

The xml type defined above is not extensible at all. In order to allow it to store styles and sizes for replaced elements we'd need to change the type definition to something like this:

type xml =
| Element of (string * (string * string) list * xml list
              * style list * size option)
| PCData of string
Unfortunately this is no use to us because all our existing code that worked on the old xml type no longer works on this new type. That is no mere theoretical concern either, because the old xml type at the top wasn't just chosen at random. It is in fact the type used by Xml-Light, a useful, simplified XML parser and printer, and we cannot change this type without stopping using this library.

PXP (another XML library -- OCaml has several) has a polymorphic XML type, allowing you to attach precisely one arbitrary datum to each node. In terms of the xml type defined above, it looks similar to this:

type 'a xml =
| Element of ('a * ...
This is half-way useful, because one can now use the library functions (they assume polymorphism, so work fine), but has some serious shortcomings of its own. One is that you can't easily build up independent modules. If there is a PXP module which uses the 'a for its own purpose then one cannot extend the XML document any further. If you avoid that, you can attach multiple data by using a tuple, but each module that wants to attach data had better know about all the other modules in advance and agree on a single tuple type.

For the CSS styles / sizes example, we could also imagine adding a 'b styles with 'b instantiated as size when that is needed. You'd end up with document types like this:

unit xml          (* The basic document *)
unit styles xml   (* After annotating with styles (pass 1) *)
size styles xml   (* After annotating with sizes (pass 2) *)
But you can't write a function that just operates on "an XML document annotated with sizes", unless it "knows" about all the other annotations and the order in which they were applied. (Is the type size styles xml or styles size xml if we happened to swap the two passes around?)

In part 2 (next week) I'll look at some radically different ways to annotate and extend immutable data structures.

Wednesday 7 May 2008

Optimizing, memory allocation and loops

Unlike C compilers such as gcc, ocamlopt doesn't do very much in the way of optimization. Really it compiles the program directly down to machine code pretty much the way you write it.

Accidental memory allocation and loops are a particular challenge. Consider for example:

let n = 100_000
let fib = Array.make n 1 ;;
for i = 2 to n-1 do
  let a, b = fib.(i-2), fib.(i-1) in
  fib.(i) <- a + b
done
ocamlopt takes the code literally, and let a, b = ... allocates a tuple (a, b) before discarding it. Simply rewriting the problematic statement as:

let a = fib.(i-2) and b = fib.(i-1) in
makes the whole loop run in half the time.
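
In full, the improved version of the loop reads:

for i = 2 to n-1 do
  let a = fib.(i-2) and b = fib.(i-1) in
  fib.(i) <- a + b
done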

The second problem with this loop is that we have lots of unnecessary references to the array fib. Each time we write fib.(i), the compiler has to emit code to calculate fib + i*sizeof(int), and unlike C, ocamlopt doesn't use code motion to simplify and share the repeated references.

We nearly halve the loop time again by accessing the array only once:

let a = ref 1 and b = ref 1 in
for i = 2 to n-1 do
  let c = !a + !b in
  a := !b; b := c; fib.(i) <- c
done
Overall these two simple optimizations reduce the total running time of this loop more than three-fold.

Monday 5 May 2008

BBC Radio 4 interview with me on the 30th anniversary of spam

I was interviewed on Saturday by the BBC Radio 4 iPM programme, on the subject of the 30th anniversary of spam. For the next few days you will be able to download the whole programme here (the segment on spam starts around 20'08). Or you can listen to a longer version here.

Persistent matches in pa_bitmatch

I'm quite pleased with the state of my pa_bitmatch language extension, which adds safe bitstring matching and construction to OCaml.

However to be truly useful as a general purpose binary parser it needs ways to import descriptions from C header files and build up reusable libraries of bitmatches.

Let me explain with a favorite example: Parsing Linux ext2 filesystem structures.

Many common binary formats are defined solely in terms of C structures in C header files, and a good example is the "struct ext2_super_block" type in the ext2_fs.h header file. In a previous iteration of bitmatch, which I called libunbin, I wrote a C header file parser which could import these directly into the XML format I was using at the time. In OCaml, the CIL library makes this quite simple.

In bitmatch at the moment we end up translating these structures by hand into OCaml code which looks like this:

bitmatch bits with
| { s_inodes_count : 32 : littleendian;        (* Inodes count *)
    s_blocks_count : 32 : littleendian;        (* Blocks count *)
    s_r_blocks_count : 32 : littleendian;      (* Reserved blocks count *)
    s_free_blocks_count : 32 : littleendian    (* Free blocks count *)
  } ->

(The real code is much much longer than the above extract).

That works, but there is no good way to save and reuse the above match statement. If there are two parts of the program which need to match ext2 superblocks, we either have to duplicate the matching code or else isolate it into an awkward OCaml library.

What we'd really like to do is to write:

let bitmatch ext2sb = { s_inodes_count : 32 : littleendian;
                        (* ... *) }

so that we can reuse this pattern in other code:

bitmatch bits with
| ext2sb ->

We could even grab the old libunbin CIL parser so we can turn C header files into these persistent matches.

Now the above is possible. Martin Jambon's Micmatch library does precisely this, but only within the same OCaml compilation unit. To make this truly useful for bitmatch we'd definitely need it working across compilation units, and that turns out to be considerably harder.

Let's look at how Martin's micmatch works first though. In micmatch, a regular expression can be named using:

RE digit = ['0'-'9']

Martin's implementation of this is to save the camlp4 abstract syntax tree (AST) in a hash table mapping from name ("digit") to AST. At the point of use later on in some match statement, the name is looked up and the AST is substituted. Thus:

match str with RE digit ->

just turns into the equivalent of:

match str with RE ['0'-'9'] ->

and is processed by micmatch in the normal way.

We can use the same approach within compilation units for bitmatch.

Making it work across compilation units means that either the AST or the whole hashtable is going to have to be saved into a file somewhere. One obvious choice would be the cmo file. In fact saving the AST into the cmo file is relatively simple: we just turn it into a string (using Marshal) and write out the string as a camlp4 substitution:

let bitmatch ext2sb = { ... }

becomes:

let ext2sb = "<string containing marshalled representation of AST>"
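
The marshalling step itself is just the standard library's Marshal module. A minimal sketch of the round trip, leaving out all the camlp4 plumbing (restore is only safe if the string really does contain a value of the expected type):

let save (ast : 'a) : string =
  Marshal.to_string ast []         (* flatten any value to a byte string *)

let restore (s : string) : 'a =
  Marshal.from_string s 0          (* read it back, starting at offset 0 *)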

That takes care of saving the string into the cmo file. Loading it from another file is a different kettle of fish. At the point where camlp4 is running, all we have is the abstract syntax tree of the OCaml code. We haven't even looked up the identifiers to see if they exist or make sense, and we haven't opened the cmi files of other modules, never mind the cmo files (which in fact never get opened during compilation). The only hope would be for our camlp4 extension itself to perform all the library path searches, identify the right cmo file, load it and extract the appropriate symbol. It is not even clear to me that we have the required information to do this, and even if we do, it would duplicate most of the work done by the linker, making the extension very complicated and liable to break when its assumptions don't hold.

If we don't save it into a cmo file, perhaps we can save the whole hash table into our own file (.bitmatches). This is tempting but has some subtle and not-so-subtle problems. We are still left with the "search path problem", assuming, that is, that we want to be able to save these matches into external libraries. There is also the subtle problem that parallel builds break (the file needs to be locked, and we may not build in the right order, so matches could be used before they are defined). It is also unclear what would (or should) happen if a match is defined in more than one file.

It's fair to say that I don't have a good solution to this yet, which is why I've held off for the moment on adding persistent matches to pa_bitmatch. If no one suggests a good solution, then I may go with the shared file solution, together with lots of caveats and limitations. Of course if you see something that I'm missing in this analysis, please let me know.

Defaming Microsoft

The BBC write to inform me that they have deleted a comment I made on a BBC blog because it is (they say) potentially defamatory.

This is what I wrote originally about the MS-Yahoo deal (which sadly has been called off):


As a Linux fan, I honestly can't wait for this deal to go through.

Microsoft will eviscerate Yahoo, turning away all the open source fans who work there and lamely attempting to port all the server-side software to Windows (that worked out well for Hotmail). And they'll have $44bn less cash in the bank to bribe politicians and standards bodies.

Bring it on, please.


And this is the form letter from the BBC:


Thank you for contributing to a BBC Blog. Unfortunately we've had to remove your content below

Postings to BBC blogs will be removed if they appear to be potentially defamatory.

You can find out more about Defamation at http://www.bbc.co.uk/dna/hub/HouseRules-Defamation

Wednesday 23 April 2008