Building a loader...
One neat thing about a computer's internals is the "loader". A loader is a little bit of code (generally in the system) that takes a file of compiled code and loads it into RAM, preparing it for execution.
Preparing code for execution means that it looks where the system functions are and fixes up any calls to them to point to the current address, and does the same for other dynamically loaded libraries that might not get loaded at the same address every time.
For my experiments with code generation, runtimes etc., I recently wrote myself a little loader. I could have used the system's loader, but then my output files would have had to adhere to the Mach-O file format, and that looked a tad too complicated for me. So, I rolled my own that's fairly simple, but in essence does the same things a real loader would do:
It loads the actual code and data from a file. Then it looks at the symbol table which is also part of the loaded data, and looks up the actual function corresponding to each symbol name, and inserts it in the symbol table. The way one generally does this is to have the symbol table consist of 5-byte placeholders. Why 5 bytes? Well, because that's the size of a JMP (jump-to-address) instruction in Intel machine code.
All code that calls the function is written so it jumps to these 5 bytes and executes them, using a CALL instruction. And when the loader prepares our code for execution, it fills these 5 bytes with the opcode for the JMP instruction plus the address of the actual function to be called (in the case of our example image, that would be the address of printf()).
The nice part about this approach is that all CALL statements point to this central JMP statement in the symbol table (which is often called a "jump island"), and the loader only has to update this one place. Another nice part comes from the use of CALL and JMP: CALL remembers the location we were at when it got executed, so that the called function can return to that place. On the other hand, the JMP instruction just transfers execution to its destination. That means that when printf returns, it will not return to the JMP instruction, but rather it will directly jump back to the CALL instruction, because that was the last time someone saved where to return to. So, we don't need an additional RET (return) instruction after the JMP, and we don't waste time jumping to the symbol table just to jump back to the CALL statement.
But still, we make that extra jump. Can't we do better? Yes! Since CALL pushes the address of the instruction after it onto the stack, it's easy to find the CALL instruction that called into the jump island. So, many loaders don't put the address of printf into that symbol table entry, but rather the address of a linker-fix-up-function. This function then inserts the address of printf into the CALL instruction. That way, the first time each CALL statement is executed, we go through the jump island, the fix-up function changes the CALL instruction to call printf directly and then jumps to printf. Subsequent times, the CALL statement will call printf directly.
So, how does the fix-up function know to call printf? Well, before it updates the CALL statement, it takes the address of our jump island from there. So, we can just use the return address to find the CALL statement, and the CALL statement to find our jump island, and then do whatever we would have done before to get the real function address (e.g. get the function's name that our compiler stored before our jump island in the code), stash it in the CALL statement and jump to it ourselves.
The advantage is that we can lazily update each CALL to directly call printf, which saves us one extra jump (and maybe even flushing memory -- who knows how far the symbol table is away from the code we're executing). But since we do it only on code that is actually called, if a piece of code is never reached, we never look up that symbol. That may save us a lot of time. It's also handy for weak-linking: If the function is never called, we don't care whether the library isn't available. So, whoever wrote the code we're loading can just check for existence of a library and not call into it if it isn't there.
Neat, huh?