Last week Norman Nunley got me interested in the ICFP challenge 2006, despite it being over. It involved implementing a Universal Machine following a (hilarious) spec. The challenge provided a program that, when ran on your VM, would yield instructions on what to do next.
I had never written a virtual machine before: I always imagined it too hard to even try. I was wrong. It took me around 7-8 hours; spread over an evening, the following morning and a couple of hours at work; to get a working machine. (Sorry boss, it was too addictive—I blame Norman.) I did most of the work alone, but I needed Norman's help to clarify parts of the spec and in some debugging.
Imagine my amazement when it turned out this "huge scary thing", which I'd imagined a VM to be, clocked in at less than 300 lines of C. It wasn't fast though. A self-test and benchmarking program for the UM was available from the ICFP website. Norman's UM ran this benchmark in minutes but mine, distressingly, took over 10 hours.
Last Saturday, after eliminating variables and assignments in the hotpath and using macroes instead, I managed to get the time for the benchmark program down to just over two hours. A whole lot better, but still way slower than others' VMs. From basic instrumentation I knew that the benchmark program did a lot of memory allocation and deallocation, and I surmised that this was the cause of my performance problems. During the course of the weekend I tried to speed my VM up further by caching previously allocated arrays in different ways. Whatever I tried, however, simply made it slower. It was a humbling experience that reminded me that I should have faith in the library implementors; they're most likely smarter than me.
The performance of my VM continued to bug me all Monday morning. My VM has an array of pointers to unsigned integer arrays, which the benchmark program required a whole bunch of. Indexes into the parent array identify which leaf array it required, and therefore I could not change the index of a leaf array. When freeing a leaf I end up with a sparse parent array. My fallacy was to use a sequential scan to find the first free location in the parent array, when allocating a new leaf; as it turned out, this was really expensive. When I tried to just allocate new leaf arrays at the end of the parent array and simply allowing the parent array to be sparse at the front, my VM suddenly ran the benchmark in just over three minutes. A considerable speedup, but it didn't come for free. As I didn't reuse any of the slots in the parent array the memory usage grew steeply to about 140 MB, up from around 3 MB throughout the entire run previously.
The morale is clear: Find your bottlenecks, then eliminate them; don't
try to eliminate imagined bottlenecks. I already knew this, but it
doesn't hurt (much) to be reminded. I originally assumed that memory
allocation was the source of my performance issues. I was partly
right, but it was not due to
free() being slow, as I had
assumed. It was down to my stupid algorithm for finding somewhere to
stash the newly allocated array.
Update: By making use of C99's flexible array member and introducing a simple free list I've now managed to get the memory usage for my VM down to 2.16 MB (as reported by top on my Powerbook) for the benchmark run, at a cost of adding about 40–50 seconds to the run time. The flexible array member feature allows me to do only one allocation instead of two when allocating an array wrapped in a struct; this can yield large memory savings if you're allocating a lot of structs containing short arrays. The free list simply holds indices to free slots in the parent array.
Update, 2007/09/03: Here's my Virtual Machine, should you wish to take a look.