Designing an Atomic Block File Format

In the process of the (long dead) FreeCard project of the late 90ies that aimed to clone HyperCard, I worked on its block file format. The idea was to create a format not unlike an entire web site packed into a single file, containing text, images, code, and what connected them into a whole.

XBlockFile

We called our file format XBlockFile, and it was inspired by resource files. It was a single file that contained smaller chunks of data. The chunks had a four-character type identifier (like CARD or PICT) and an ID number identifying the individual block, and optionally a name, and of course contain data.

XBlockFile supported just loading the data for an entire block into RAM, or random-access reading and writing to a block just like you would to a file, including appending to a block. If a block grew, XBlockFile would transparently move around things in the file to make room for your new data. All of this with the guarantee that a block file would not become invalid. Either it failed to write and the file was like it was before, or it successfully wrote.

These days I work on databases, and as I was reminded of these algorithms, I thought I’d give you a short tour of how these things can be achieved.

Block File Basics

The basic idea behind a block file is surprisingly simplistic: Instead of writing several tiny files, you write one file that contains all that data back-to-back. You do that by prefixing each range with its size (and maybe the other metadata, like its name, type, an ID number, a change date etc.).

A simple length-counted block file format

Now you can probably see the problem here: You have to move around the last block to resize the block in the middle. One quick way to solve this is to simply create a new larger block at the end, copy the data to that new block, and then append the new data:

A block file with an added block and the old block marked in red

Of course, this leaves “gaps” in your files. Blocks that contain the old data. Given that in practice many blocks are of similar sizes, a common optimization is that when looking for a new block (either to create a new block, or to copy a block to that needs to be enlarged), you re-use these old blocks. To achieve this, one marks the blocks as “free”. If they have a type code, just give them the type code “FREE”. If they just have a name, give them an empty name. If they just have an ID number, use 0 as the number for all free blocks … whatever works, as long as you can distinguish a block that’s actually in use from one that’s just wasted space, so you can re-use them.

Compacting a file

Another approach to reducing file size is to just write the file out again, without the “wasted” blocks. Many block file formats offer such an operation. If you remember “hard disk defragmentation” – it’s basically the same idea. It’s easy. You read your file from the start, and copy each used block to the new file, while skipping the unused ones.

The downside to this of course is that we temporarily need about twice the disk space.

Variable-size metadata

I mentioned earlier that you could associate a name with a block. This can work the same way as the block itself: Before the block data, you just write out the number of characters the name has, and then the actual name:

Blocks with an extra byte-counted string for each name

You could also just write a zero-terminated string, but then skipping a block would require you to read each character of the string until you encounter the zero. By prefixing things with their length, you can just read the length, and then skip that many bytes if you’re not interested.

However, this means that every time a name is changed and gets longer, we have to copy the entire block to enlarge it. If you have many large blocks, that can be slow. So what many block file formats do is separate out the metadata (like block size, name, ID) from the actual block data. So you have a “table of contents” that contains name, ID, type, length, and offset to the actual data in the file (because data doesn’t immediately follow the block, you need to know where to look for it).

Sometimes it’s even worth similarly placing names out-of-band from the table of contents, either as one large chunk, or even individually.

Since the TOC changes size whenever a new block is added or one is removed, we also can’t just put it at the start. We’d have to move all other data in the file whenever we add a new block. So let’s make the TOC occupy a range in our file, like any other block. But how do we do this? To have ranges, we’d need the information from the TOC?

To avoid this Catch-22, we start the file off with the offset to the table of contents. 4 bytes of fixed data for a number. The rest is all dynamic and may occupy any point in the file. Even better: When the TOC gets smaller or larger, we can declare its old location as a “wasted” block, and can re-use appropriately-sized “wasted” blocks to hold its data.

This gets a bit finicky because the TOC size itself changes depending on whether we find a “wasted” block of the right size or not, but since we also add a “wasted” entry for the previous TOC’s range, it usually evens out. We search for the TOC size we need, then simply replace the entry for the “wasted” block that we will re-use with the offset and length of the old TOC and write that out. If we find no wasted space to reuse, we instead create a new block with 1 extra “wasted” entry for the old TOC’s range.

With this setup, renaming a block writes the entire TOC again, but all the (potentially huge) block data can stay in place. And the converse holds true as well: If you’re just appending data to a block that is at the end of the file, you can update the TOC in place and just append to the file.

Atomicity

One problem with many file formats is that if you just write out the changes and e.g. run out of disk space or have a hard disk or network drive disconnect, you may end up with a partially-updated, invalid file. Since user data is important one can take a page from most modern databases to guarantee your file is always valid:

Instead of changing data, we always write a new copy.

So, if you add a new block, you write its data to a wasted block. You also write a copy of the TOC to another wasted block. In that TOC, everything is the same, except that it has the new entry for the new block, and a wasted entry where the current TOC is, and is missing the two wasted entries that we used for the new TOC and the new block (well, up to two).

At this point, anyone reading the file after a crash would still see the old state, but at least the file is valid. The last thing you do now is write the offset of the new TOC to the start of the file.

Boom! With one write, your file is completely changed. But it’s only a split-second write of 4 bytes. If it crashes anytime before, e.g. while copying the TOC, or if it crashes after … it doesn’t matter, The file is valid.