How many digits in infinity?

A brief tour of an exciting compiler bug.

Context is Everything

Before getting to the bug, a bit of background on the project I'm working on. Dislike being confused by how things work so writing a compiler for a Zig-like language. Unsafe memory access, arbitrary comptime execution, types as values, yada yada yada. I'm not trying to sell it to you. The vaguely relevent information for this discussion is:
  1. It's self hosted so code examples use my syntax
  2. It has two working backends: one that generates low quality machine code in memory for comptime execution and one that spits out llvm-ir as text for optimised AOT builds.
This bug occured while working on generating my own Mach-O executables without relying on someone else's linker. I wrote some innocuous bit fiddling code (shown later) and my compiler started crashing trying to compile it.

So where do we crash?

We're crashing in the compiler here so there's two pieces of code we're interested in:

Often the first is more useful because in a compiler all the code runs a lot of times because there's a lot of input, you can't really sanely step through it. The compiler was certainly working before (at least well enough to compile itself), so the new code I wrote is the input that changed.

Luckily finding the input code is easy even without nice debug info for stepping through the program. The backend has a loop over all the functions, we just print the name of the function before trying to compile it and see the last one before it crashes.

So this is the offending function. It packs the fields of a dyld_chained_import_addend for emitting a Mach-O executable.

fn encode_chained(r: ChainedReloc) i64 = {
    data: i64 = r.next.zext().bit_and(1.shift_left(12) - 1).shift_left(64-1-12);
    @match(r.payload) {
        fn Bind(it) => {
            data = data.bit_or(1.shift_left(63));
            data = data.bit_or(it.ordinal.zext().bit_and(1.shift_left(24) - 1).shift_left(0));
            data = data.bit_or(it.addend.zext().bit_and(1.shift_left(8) - 1).shift_left(24));
        }
        fn Rebase(it) => {
            data = data.bit_or(0.shift_left(63));
            data = data.bit_or(it.target.bit_and(1.shift_left(36) - 1).shift_left(0));
            data = data.bit_or(it.high8.zext().bit_and(1.shift_left(8) - 1).shift_left(36));
        }
    };
    data
}

Forgive how ugly that is, I'm certainly reconsidering my lack of syntax sugar for bitwise operators or bit fields. But I'll give you a little spoiler, that function is correct. At least correct enough for my purposes. It's not somehow dereferencing bad memory or overflowing something, and even if it was, it wouldn't matter because we're not crashing when ~running~ that code, we're crashing while compiling it. It certinly looks like a valid program. In fact it only crashes when trying to compile for llvm. It works just fine if I try to run it with my simple machine code backend.

So that didn't tell us much. Let's crack it open in lldb and see what we're running when we crash.

A Trip Down Memory Lane

When making a programming language you may find yourself wanting to print integers. Here's what I came up with:
fn display(i: i64, buf: *List(u8)) void = {
    if i < 0 {
        buf.push("-".ascii());
        display(-i, buf);
    } else {
        is_digit := i >= 0 && i < 10;
        if is_digit {
            buf.push(i + "0".ascii());
        } else {
            display(i / 10, buf);
            buf.push(i.mod(10) + "0".ascii());
        };
    };
}

I'm not claiming it's the most efficent way to print a number. It needs a stack frame per digit, its not even tail recursive, could just preallocate the buffer and loop becuase you know how many digits there are, etc, etc. But that's not the point. The point is that it works. I wrote that logic like 8 months ago, before i even had a self hosted compiler, and its been fine. Never thought about it again. Worked flawlessly...

If you still need convincing, feel free to trace through some examples on paper:
display(-123);
    push("-");
    display(123);
        display(12);
            display(1);
                push("1");
            push("2");
        push("3");

However... now it's segfaulting. In fact, it's segfaulting with a super deep stack trace. So really its an infinite loop that just manifests as running out of stack. That's strange, am I somehow printing a new edge case number never printed before?

For extra context on why code generation involves converting numbers to strings:
I generate llvm-ir as text because I found it annoying to link thier library into my program, and I'm too lazy to try to generate thier bitcode directly. The function calling display in this case looks like:

fn inst_literal(self: *EmitLlvm, value: i64, ty: Prim) EmitLlvm.Val = {
    v := self.get_var(ty);
    @match(ty) {
        // [...] non-integer cases omitted. 
        @default => {
            @fmt(self.out&.current(), "   % = add % %, 0\n", v, ty.llvm_type(), value); 
        };
    };
    v
}
Which creates an extra useless temporary value by adding an integer to zero so I can represent all values uniformly as llvm ssa names. But that's not relevant to this bug. The overarching theme here is that you shouldn't crash when you try to print a number!

Two Add Insult Two Injury

Let's take another look at what we're compiling. I can print out the bytecode of the encode_chained function and see exactly what's going on.
Note that this is a luxury I didn't have when actually debugging this problem... because when I tried to print it segfaulted again. I had a very painful adenvture in using lldb on a language without debug info. But the story makes more sense if you know what's exactly going on.
[#log_bc encode_chained] [P64]
    [b 0] ([P64])
    - (SaveSsa = (id = 0, ty = P64))
    - (AddrVar = (id = 1))
    - (LoadSsa = (id = 0))
    - (IncPtrBytes = 28)
    - (Load = I32)
    - (Intrinsic = ZeroExtend32To64)
    - (PushConstant = (value = 4095, ty = I64))
    - (Intrinsic = BitAnd)
    - (PushConstant = (value = 51, ty = I64))
    - (Intrinsic = ShiftLeft)
    - (StorePre = I64)
    - (AddrVar = (id = 2))
    - (LoadSsa = (id = 0))
    - (CopyBytesToFrom = 24)
    - (AddrVar = (id = 3))
    - (AddrVar = (id = 2))
    - (StorePre = P64)
    - (AddrVar = (id = 4))
    - (AddrVar = (id = 5))
    - (AddrVar = (id = 3))
    - (Load = P64)
    - (StorePre = P64)
    - (AddrVar = (id = 5))
    - (Load = P64)
    - (Load = I64)
    - (LastUse = (id = 5))
    - (CallDirect = (sig = 0, f = F3138, tail = false))
    - (StorePre = I64)
    - (AddrVar = (id = 4))
    - (Load = I64)
    - (Switch = 0)
    [b 4] ([])
    - (LastUse = (id = 2))
    - (LastUse = (id = 3))
    - (LastUse = (id = 4))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (LastUse = (id = 1))
    - (Ret1 = I64)
    [b 5] ([])
    - (AddrVar = (id = 6))
    - (AddrVar = (id = 3))
    - (Load = P64)
    - (IncPtrBytes = 8)
    - (PeekDup = 1)
    - (IncPtrBytes = 8)
    - (PeekDup = 1)
    - (IncPtrBytes = 8)
    - (Load = I64)
    - (StorePre = I64)
    - (Load = I64)
    - (StorePre = I64)
    - (AddrVar = (id = 1))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (PushConstant = (value = 0, ty = I64))
    - (Intrinsic = BitOr)
    - (StorePre = I64)
    - (AddrVar = (id = 1))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (AddrVar = (id = 6))
    - (Load = I64)
    - (PushConstant = (value = 68719476735, ty = I64))
    - (Intrinsic = BitAnd)
    - (PushConstant = (value = 0, ty = I64))
    - (Intrinsic = ShiftLeft)
    - (Intrinsic = BitOr)
    - (StorePre = I64)
    - (AddrVar = (id = 1))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (AddrVar = (id = 6))
    - (IncPtrBytes = 8)
    - (Load = I32)
    - (Intrinsic = ZeroExtend32To64)
    - (PushConstant = (value = 255, ty = I64))
    - (Intrinsic = BitAnd)
    - (PushConstant = (value = 36, ty = I64))
    - (Intrinsic = ShiftLeft)
    - (Intrinsic = BitOr)
    - (StorePre = I64)
    - (LastUse = (id = 6))
    - (Goto = (ip = B4, slots = 0))
    [b 7] ([])
    - (AddrVar = (id = 7))
    - (AddrVar = (id = 3))
    - (Load = P64)
    - (IncPtrBytes = 8)
    - (CopyBytesToFrom = 8)
    - (AddrVar = (id = 1))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (PushConstant = (value = -9223372036854775808, ty = I64))
    - (Intrinsic = BitOr)
    - (StorePre = I64)
    - (AddrVar = (id = 1))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (AddrVar = (id = 7))
    - (Load = I32)
    - (Intrinsic = ZeroExtend32To64)
    - (PushConstant = (value = 16777215, ty = I64))
    - (Intrinsic = BitAnd)
    - (PushConstant = (value = 0, ty = I64))
    - (Intrinsic = ShiftLeft)
    - (Intrinsic = BitOr)
    - (StorePre = I64)
    - (AddrVar = (id = 1))
    - (AddrVar = (id = 1))
    - (Load = I64)
    - (AddrVar = (id = 7))
    - (IncPtrBytes = 4)
    - (Load = I32)
    - (Intrinsic = ZeroExtend32To64)
    - (PushConstant = (value = 255, ty = I64))
    - (Intrinsic = BitAnd)
    - (PushConstant = (value = 24, ty = I64))
    - (Intrinsic = ShiftLeft)
    - (Intrinsic = BitOr)
    - (StorePre = I64)
    - (LastUse = (id = 7))
    - (Goto = (ip = B4, slots = 0))
    [b 9] ([])
    - (CallDirect = (sig = 1, f = F484, tail = false))
    - (Unreachable = ())
    - (Goto = (ip = B4, slots = 0))
That's a pretty big blob of garbage but this is the instruction being converted to llvm-ir when we crash:
    - (PushConstant = (value = -9223372036854775808, ty = I64))

You might be wondering, as i was, where on earth did that number come from and why does it look so familiar? It certainly isn't typed in the code i showed you before. Maybe some other function got inlined? But no that doesn't really make sense, we're just doing a bunch of bitwise math... on constants.

Another fun bit of trivia about this language is I aggressively do constant folding in the front end. Because sort of like Zig, it's a big deal semantically which things are known at comptime. Here's the declaration of the shift_left function:
#intrinsic(.ShiftLeft) #fold
fn shift_left(value: i64, shift_amount: i64) i64
That #fold annotation just means that if all the arguments are known at comptime, then the call is required to be evaluated at comptime and the result substituted in for the expression. So when I wrote 1.shift_left(63), it didn't make its way its way to the backend before being folded away into the value -9223372036854775808.
0000000000000000000000000000000000000000000000000000000000000001
shifted left by 63 bits = 
1000000000000000000000000000000000000000000000000000000000000000
and that happens to be the smallest value representable by a 64 bit two's complement number [0]. Since that's a constant in the program, the compiler tried to write it out as an integer literal in a string of llvm-ir.

If you've encountered Two's Complement before you can probably guess where this is going but I'm still going to work through it.

Let's think about what happens when we try to print that number. First we check if i is less than 0. Clearly this is a negative number so we take that branch and recurse on -i. What happens when we ask the cpu to negate a number? Well trick question you can't (on arm), neg is a psudo instruction that assembles to subtracting the number from zero. Subtraction also isn't really a thing. That's the same as adding the Two's Complement of the thing. So to recap:

(-i) === (0 - i) === (0 + twos_complement_of(i)) === twos_complement_of(i) 
// https://en.wikipedia.org/wiki/Two%27s_complement
start with our number
1000000000000000000000000000000000000000000000000000000000000000
flip all the bits
0111111111111111111111111111111111111111111111111111111111111111
add one
1000000000000000000000000000000000000000000000000000000000000000
hmmm... thats the same number we started with!
There are more representable negative numbers than positive numbers (because there's no negative 0), so when we try to compute the absolute value of this special number we get itself again and recurse forever.

It's almost a shame I buffered the output. If I'd printed one character at a time I'd have a thrilling terminal full of infinite negative signs.

Making Amends

Here's a lazy fix:
if i != 0 && i == -i {  // two's compliment :hack
    s.push_all("-9223372036854775808");
    return();
};
// [...]
Alternatively, we could we could say the magic words and make the problem go away:
If you attempt to print the most negative integer, behaviour is undefined.

Take Aways

I'm not really sure what I can learn from this except man am I glad I don't have to write software that runs in planes. Don't underestimate the infinte well of suffering that is number representations. And just for the record, this was easily in the simpler half of bugs I've encountered on this project. I'm starting to see the argument for just living in fear and never writing any code yourself.

But if you think cryptoGPT, or whatever it's called this month, could find that problem without prompting from someone who already knew the answer, i call bullshit. Just now I tried pasting in the tiny broken function and it did know the problem, but its fix was comparing i to the most negative integer, which wouldn't work... because I couldn't compile programs that used that number! If you tell it you tried its fix and the compiler segfaulted, it suggests there might be a compiler bug... no shit. And anyway the problem statement is starting with 30000 lines of broken compiler. Once you know where you're looping it's less impressive. It also suggested my Zig syntax was wrong. Thanks friend-o.

Footnotes

[0] Much to my confusion it seems that "Two's Complement" is the name for a process you can perform on a number, not a way of representing numbers. But it's awkward to say "[...] by the strategy for representing numbers where negative numbers are stored as the Two's Complement of thier absolute value".