The Process of Finding a Bug
I spent a lot of time tracking down and fixing this bug, and thought it would make a good blog post about how I went about solving it, and the mistakes and insights I gathered along the way.
What Started It All
Yesterday I was working on my programming language, Skull. In hopes of speeding things
up, I decided to start reducing my dependence on a C compiler, since having to invoke cc
for the final linking of our executables
means we have to call a whole other program, which is quite expensive. So how did I go about doing that?
Typically, C programs use main
as their entrypoint, but in reality, _start
is the true entrypoint, created for you by
the compiler. It does a bunch of nice things for you, like setting up the parameters for your main
function, and so on.
So why would you want to change that? Well, suppose you have a programming language, and you want to allow people to create their own
functions called main
. By creating your own _start
function, you don't have to worry about conflicts with users
who want to name a function main
, since we aren't going to use C's main at all.
We could just change the entrypoint of our program, but that would only move the problem elsewhere. Since we want to eventually not even use a C compiler at all, doing this to solve our linker issue is just a band-aid fix.
So I wrote commit 5b022ce0, and pushed it up to GitHub. I got an email saying that my CI job failed, and so I went to look:
<SNIP>
Step 26/26 : RUN cd test/docs/c-integration && make
 ---> Running in 8c94c05bb0e6
skull hello.sk -- hello.c -no-pie
./hello
make: *** [Makefile:5: all] Segmentation fault (core dumped)
The command '/bin/sh -c cd test/docs/c-integration && make' returned a non-zero code: 2
Oh No! A Segfault!
I assumed that I just hadn't properly tested it before I pushed it up, so I re-ran the pre-commit
file on my local machine:
$ ./pre-commit
<SNIP>
../../../build/skull/skull hello.sk -- hello.c -no-pie
./hello
hello, world!
../../../build/skull/skull hello2.sk -c
cc .hello2.sk.o hello2.c -no-pie -o hello2
./hello2
hello, world 2!
<SNIP>
It passed! The next step is to build the same Dockerfile
that the GitHub Actions workflow actually uses to do all of the
CI testing:
$ docker build -f .github/workflows/Dockerfile .
<SNIP>
Step 25/26 : RUN cc test/libskull.c -o skull-shim && ./skull-shim
---> Running in 78b487694f22
Skull v0.7.0-63-g5b022ce0
Removing intermediate container 78b487694f22
---> 6e444c728e70
Step 26/26 : RUN cd test/docs/c-integration && make
---> Running in 175ca8540f0d
skull hello.sk -- hello.c -no-pie
./hello
make: *** [Makefile:5: all] Segmentation fault (core dumped)
The command '/bin/sh -c cd test/docs/c-integration && make' returned a non-zero code: 2
Investigate Further
So now that I am able to reproduce it on my machine, I need to figure out what is causing the segfault. The image
that was built before the failing step has an image ID of 6e444c728e70
, so if I spin up a container using that
image, I can drop into a shell and poke around:
$ sudo docker run --rm -it 6e444c728e70 bash
I have no name!@928fdd7f9083:/app$
--rm will remove the container when we are done, since we won't need it again.
-it can be remembered as "interactive terminal". Basically, it allows us to talk to the container over our terminal.
For sanity, make sure the build still fails:
I have no name!@928fdd7f9083:/app$ cd test/docs/c-integration/ && make
skull hello.sk -- hello.c -no-pie
./hello
make: *** [Makefile:5: all] Segmentation fault (core dumped)
Now let's start poking around.
For clarity (and to make things look nicer), I am going to change the
bash prompt
so that it is easier to tell
which commands are being run in the container (container $
), and which ones are running on my host machine
(just a $
):
I have no name!@928fdd7f9083:/app$ echo $PS1
${debian_chroot:+($debian_chroot)}\\u@\\h:\\w\$
I have no name!@928fdd7f9083:/app$ PS1="container \$ "
container $
Much better!
First things first, we check that the hash of the compiled hello
program in the Ubuntu Docker
container matches the one on my Arch system:
Note: You shouldn't expect compiled binaries to be the same across different machines, let alone different operating systems. Since this is a pretty trivial program, I had a suspicion that the files might be the same, or at the very least, very similar.
$ shasum ./test/docs/c-integration/hello
7e70e18a4136106a124b5dab27deff8e2b92122f ./test/docs/c-integration/hello
container $ shasum test/docs/c-integration/hello
9cc04720933cba9328fd353f97b35dc8494e8dbd test/docs/c-integration/hello
They don't match, which is somewhat expected. Let's see why they are so different by dumping the assembly
via objdump
.
On the host machine:
$ objdump -d ./test/docs/c-integration/hello
<SNIP>
0000000000401050 <hello>:
401050: 55 push %rbp
401051: 48 89 e5 mov %rsp,%rbp
401054: 48 83 ec 10 sub $0x10,%rsp
401058: 48 89 7d f8 mov %rdi,-0x8(%rbp)
40105c: 48 8b 45 f8 mov -0x8(%rbp),%rax
401060: 48 89 c6 mov %rax,%rsi
401063: 48 8d 05 9c 0f 00 00 lea 0xf9c(%rip),%rax # 402006 <_int_pow+0xf1e>
40106a: 48 89 c7 mov %rax,%rdi
40106d: b8 00 00 00 00 mov $0x0,%eax
401072: e8 99 ff ff ff call 401010 <printf@plt>
401077: 90 nop
401078: c9 leave
401079: c3 ret
<SNIP>
In the docker container:
container $ objdump -d ./test/docs/c-integration/hello
<SNIP>
0000000000401050 <hello>:
401050: f3 0f 1e fa endbr64
401054: 55 push %rbp
401055: 48 89 e5 mov %rsp,%rbp
401058: 48 83 ec 10 sub $0x10,%rsp
40105c: 48 89 7d f8 mov %rdi,-0x8(%rbp)
401060: 48 8b 45 f8 mov -0x8(%rbp),%rax
401064: 48 89 c6 mov %rax,%rsi
401067: 48 8d 05 98 0f 00 00 lea 0xf98(%rip),%rax # 402006 <_int_pow+0xf16>
40106e: 48 89 c7 mov %rax,%rdi
401071: b8 00 00 00 00 mov $0x0,%eax
401076: e8 95 ff ff ff call 401010 <printf@plt>
40107b: 90 nop
40107c: c9 leave
40107d: c3 ret
<SNIP>
Weird. The only difference between the two outputs is that there is this endbr64
instruction that is
sprinkled into the Ubuntu output. Let's look up what that means:
It stands for "End Branch 64 bit" -- or more precisely, Terminate Indirect Branch in 64 bit.
Doesn't sound harmful. Other than that instruction, these binaries look almost identical. What happens if we just copy over and run the working binary from the Arch system?
First we get the container id of our container (in another terminal):
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
928fdd7f9083 6e444c728e70 "bash" 32 minutes ago Up 32 minutes naughty_cannon
Then we actually copy it:
$ sudo docker cp test/docs/c-integration/hello 928fdd7f9083:/tmp/hello
Then we check the hash (just to double check), and run it:
container $ shasum /tmp/hello
7e70e18a4136106a124b5dab27deff8e2b92122f /tmp/hello
container $ /tmp/hello
Segmentation fault (core dumped)
Ok, so the binary is not the cause of the issue. It must be either Ubuntu, or the environment we created after installing all of our dependencies, or both!
MVP Time
Now that we have a good idea of what the root cause is, let's create the smallest possible environment which will recreate the issue.
We will create a new folder somewhere, add a Dockerfile, and a simple C program:
FROM ubuntu:22.04
RUN apt update && \
apt upgrade && \
apt install gcc-11 -y
WORKDIR /app
COPY file.c .
RUN gcc-11 -nostartfiles file.c
RUN ./a.out
Note that we are using Ubuntu Jammy (22.04). This will be important for later.
Add the following to our file.c
file:
#include <stdio.h>
#include <stdlib.h>
void _start(void) {
puts("hello world");
exit(0);
}
For good measure, you should build this on your host machine first, and make sure that it runs as expected. You don't want to spend a bunch of time fixing the wrong error!
Now build the container, and see what happens!
$ sudo docker build .
Step 1/6 : FROM ubuntu:22.04
---> 3f4714ee068a
Step 2/6 : RUN apt update && apt upgrade && apt install gcc-11 -y
---> Running in af582fe738f5
<SNIP apt stuff>
Removing intermediate container af582fe738f5
---> 58eee8b4c01a
Step 3/6 : WORKDIR /app
---> Running in 777fc78a948d
Removing intermediate container 777fc78a948d
---> 10defe2ee273
Step 4/6 : COPY file.c .
---> 101465eeb7a0
Step 5/6 : RUN gcc-11 -nostartfiles file.c
---> Running in 9fca80a45234
Removing intermediate container 9fca80a45234
---> caa232378a47
Step 6/6 : RUN ./a.out
---> Running in eb7fd3b7f0ec
Segmentation fault (core dumped)
The command '/bin/sh -c ./a.out' returned a non-zero code: 139
Now that we have a smaller area to search, we can start digging deeper.
Debugging Time
Perhaps we can install gdb
, and see if that can give us a better indication of what
is going on:
RUN gcc-11 -nostartfiles file.c
+
+RUN apt install gdb -y
+RUN gdb ./a.out -q -ex r
RUN ./a.out
When running gdb
from the command line, we need to use the -ex
flag to pass the commands we want to run,
since we won't be able to type them during the build. -q
means quiet, which hides the banner when first starting.
Let's rebuild and see what happens:
Step 7/8 : RUN gdb ./a.out -q -ex r
 ---> Running in 31b33923158c
Reading symbols from ./a.out...
(No debugging symbols found in ./a.out)
Starting program: /app/a.out
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
4375    ./malloc/malloc.c: No such file or directory.

Program received signal SIGSEGV, Segmentation fault.
0x00007fa75da97540 in _int_malloc (av=av@entry=0x7fa75dc0cc80 <main_arena>,
    bytes=bytes@entry=640) at ./malloc/malloc.c:4375
(gdb) quit
A debugging session is active.

        Inferior 1 [process 16] will be killed.

Quit anyway? (y or n) [answered Y; input not from terminal]
Huh, we didn't call _int_malloc
. What is going on here? Let's add -ex bt
to our gdb
command to print the backtrace,
and re-run:
Step 7/8 : RUN gdb ./a.out -q -ex r -ex bt
 ---> Running in 57cc66e861e1
Reading symbols from ./a.out...
(No debugging symbols found in ./a.out)
Starting program: /app/a.out
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
4375    ./malloc/malloc.c: No such file or directory.

Program received signal SIGSEGV, Segmentation fault.
0x00007f428808f540 in _int_malloc (av=av@entry=0x7f4288204c80 <main_arena>,
    bytes=bytes@entry=640) at ./malloc/malloc.c:4375
#0  0x00007f428808f540 in _int_malloc (
    av=av@entry=0x7f4288204c80 <main_arena>, bytes=bytes@entry=640)
    at ./malloc/malloc.c:4375
#1  0x00007f428808fa49 in tcache_init () at ./malloc/malloc.c:3245
#2  0x00007f428809025e in tcache_init () at ./malloc/malloc.c:3241
#3  __GI___libc_malloc (bytes=bytes@entry=4096) at ./malloc/malloc.c:3306
#4  0x00007f4288069c24 in __GI__IO_file_doallocate (
    fp=0x7f4288205780 <_IO_2_1_stdout_>) at ./libio/filedoalloc.c:101
#5  0x00007f4288078d60 in __GI__IO_doallocbuf (
    fp=fp@entry=0x7f4288205780 <_IO_2_1_stdout_>) at ./libio/libioP.h:947
#6  0x00007f4288077fe0 in _IO_new_file_overflow (
    f=0x7f4288205780 <_IO_2_1_stdout_>, ch=-1) at ./libio/fileops.c:744
#7  0x00007f4288076755 in _IO_new_file_xsputn (n=11, data=<optimized out>,
    f=<optimized out>) at ./libio/libioP.h:947
#8  _IO_new_file_xsputn (f=0x7f4288205780 <_IO_2_1_stdout_>,
    data=<optimized out>, n=11) at ./libio/fileops.c:1196
#9  0x00007f428806bf9c in __GI__IO_puts (str=0x5604a28d8000 "hello world")
    at ./libio/libioP.h:947
#10 0x00005604a28d7067 in _start ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 17] will be killed.

Quit anyway? (y or n) [answered Y; input not from terminal]
Huh! It seems to be hitting our puts("hello world")
call, but after that, it just goes into neverland.
Let's install Valgrind
and see what it has to say about our program. Valgrind
is typically used for finding memory leaks and illegal/undefined behavior, but it can also do a lot more.
Adding it is super easy:
RUN gcc-11 -nostartfiles file.c
-RUN apt install gdb -y
-RUN gdb ./a.out -q -ex r -ex bt
-RUN ./a.out
+RUN apt install valgrind -y
+RUN valgrind ./a.out
Re-running:
Step 7/7 : RUN valgrind ./a.out
 ---> Running in 703095486b3a
==7== Memcheck, a memory error detector
==7== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==7== Command: ./a.out
==7==
hello world
==7==
==7== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==7==  General Protection Fault
==7==    at 0x48F7F42: __pthread_once_slow (pthread_once.c:114)
==7==    by 0x49C6A52: __rpc_thread_variables (rpc_thread.c:59)
==7==    by 0x4A19D8C: free_mem (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==7==    by 0x4A198C1: __libc_freeres (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==7==    by 0x483F1B2: _vgnU_freeres (in /usr/libexec/valgrind/vgpreload_core-amd64-linux.so)
==7==    by 0x48A3551: __run_exit_handlers (exit.c:136)
==7==    by 0x48A360F: exit (exit.c:143)
==7==    by 0x109070: (below main) (in /app/a.out)
==7==
==7== HEAP SUMMARY:
==7==     in use at exit: 0 bytes in 0 blocks
==7==   total heap usage: 1 allocs, 1 frees, 4,096 bytes allocated
==7==
==7== All heap blocks were freed -- no leaks are possible
==7==
==7== For lists of detected and suppressed errors, rerun with: -s
==7== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
This is even weirder! We seem to be getting our hello world
printed, but then the program dies somewhere in
exit
, which is different from what we were getting with gdb. Valgrind will swap out malloc()
calls (among others)
with its own version to be able to do its memory analysis, which might explain this difference in behavior.
The ==7==
output in Valgrind is to differentiate between our program's output and Valgrind's output. The 7
means that our hello
program is running under PID (process ID) 7.
Ok, this is all very weird. Our code is segfaulting in parts which are not in our control. It works just fine on my Arch machine, but not in an Ubuntu 22.04 container. What if we downgrade to an older Ubuntu docker image, say 21.04 or 20.04?
FROM ubuntu:21.04
RUN apt update && \
apt upgrade && \
apt install gcc-11 -y
WORKDIR /app
COPY file.c .
RUN gcc-11 -nostartfiles file.c
RUN ./a.out
Running:
Step 6/6 : RUN ./a.out
 ---> Running in 9931a6a2e7ee
hello world
Success! Well, not really. We still don't know what is causing this. And, we cannot downgrade too far, because we need
LLVM 13 in order to build Skull. The llvm-13
package is only available in Ubuntu Impish (21.10) and up. Still, having code
that fails on the newest LTS version of Ubuntu is not good, and we should fix that problem instead.
The Solution
Although anticlimactic, I found the solution in a Stack Overflow comment after many sanity-questioning hours:
Note that this _start
is unsafe, violating the ABI when it calls my_main
; you tell the compiler it's a normal function, but actually it's entered with the stack-pointer already aligned (e.g. on x86-64, RSP % 16 == 0), not RSP % 16 == 8 like on entry to a normal function after a call
that pushes an 8-byte return address. You can fix that with __attribute__((force_align_arg_pointer))
for _start
to tell GCC that the stack pointer may be "misaligned" on entry to that one "function", as shown in Get arg values with inline asm without Glibc?
That explains the crash: glibc is built assuming the ABI's 16-byte stack alignment, so when our _start enters libc functions with the stack off by 8 bytes, aligned SSE instructions deep inside malloc and friends can fault. So, we set our Dockerfile back to version 22.04, and update our C file to the following:
-void _start(void) {
+void __attribute__((force_align_arg_pointer)) _start(void) {
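Putting it all together, our file.c now looks like this:
#include <stdio.h>
#include <stdlib.h>

/* force_align_arg_pointer tells GCC to re-align the stack on entry, since
   _start is not entered like a normal C function */
void __attribute__((force_align_arg_pointer)) _start(void) {
    puts("hello world");
    exit(0);
}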
And re-run:
Step 6/6 : RUN ./a.out
 ---> Running in d5407535deb7
hello world
Actual success! Now we just need to figure out what we need to change or add in our LLVM codegen to make our compiler spit out proper code.
Up until now we have been using gcc for our compiler, but for any LLVM related shenanigans, we should use Clang. Since we have figured out what the issue is for the most part, we don't need to use docker anymore, and can just run this on our local machine.
To figure out what we need to change, we will compile both the original (non-working) version and the new (working)
version of file.c
, and compare the LLVM IR between the 2:
$ clang file-old.c -S -emit-llvm
$ clang file.c -S -emit-llvm
$ diff file-old.ll file.ll
20c20
< attributes #0 = { noinline nounwind optnone sspstrong uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
---
> attributes #0 = { noinline nounwind optnone sspstrong uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "stackrealign" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
Seems like the only thing that changed is the addition of the "stackrealign"
attribute in the new version, so now we just need to
figure out how to add that attribute to our _start
function and we should be set.
By the time I had figured out what the issue was, it was midnight, and I needed some sleep. After some well-deserved rest, I spent another 2-3 hours searching online, trying to find out how to add attributes to functions using the LLVM C API. Eventually I found it, and here is the one-line change in all of its glory:
--- a/skull/codegen/llvm/write.c
+++ b/skull/codegen/llvm/write.c
@@ -348,6 +348,7 @@ static void add_start_shim(SkullStateLLVM *state) {
);
LLVMSetLinkage(start_func, LLVMExternalLinkage);
+ LLVMAddTargetDependentFunctionAttr(start_func, "stackrealign", "");
LLVMTypeRef exit_func_type = type_to_llvm_func_type(
&TYPE_VOID,
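If you want to see that attribute land in the IR without building all of Skull, here is a minimal standalone sketch using the LLVM C API. The module and function setup below is made up for illustration; only the LLVMAddTargetDependentFunctionAttr call mirrors the actual fix:
/* Minimal sketch (not Skull's codegen): declare a void _start(void) function,
 * attach the "stackrealign" target-dependent attribute to it, and print the
 * IR so the attribute shows up in the function's attribute group. */
#include <stdio.h>
#include <llvm-c/Core.h>

int main(void) {
    LLVMModuleRef module = LLVMModuleCreateWithName("demo");

    LLVMTypeRef start_type = LLVMFunctionType(LLVMVoidType(), NULL, 0, 0);
    LLVMValueRef start_func = LLVMAddFunction(module, "_start", start_type);

    /* the same call used in the Skull patch above */
    LLVMAddTargetDependentFunctionAttr(start_func, "stackrealign", "");

    char *ir = LLVMPrintModuleToString(module);
    puts(ir);

    LLVMDisposeMessage(ir);
    LLVMDisposeModule(module);
    return 0;
}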
Fin
That's it! A total of about 6 hours spent investigating a single segfault, which manifested itself on a single operating system (as far as I could tell), and was solved with a single line of code. It goes without saying, but without a proper test suite, I might not have found out about this bug for a very, very long time!
There were a lot of dead ends and major derails which I will not delve into; this blog is already long enough! I hope this can teach someone about the debugging process, what things to look for, and how to stay sane when debugging something you have no idea how to approach.