MicroPython Segfault Analysis: How Compiling Unusual Code Leads to a Crash

by Sebastian Müller

Introduction

Hey guys! Today, we're diving into a rather intriguing issue encountered while using MicroPython. Specifically, we're talking about a crash that traces back to the compilation stage when feeding MicroPython some unusual code. This isn't your everyday bug; it involves segfaults, corrupted stack traces, and a fascinating journey through the depths of MicroPython's internals. So, buckle up, and let's get started!

The Bug: A Crash Course

So, the main issue at hand is a segmentation fault that occurs when running MicroPython with a specific snippet of code. Imagine you're just trying to calculate (-1) ** 2.3 and then accidentally type aa (which, of course, isn't defined). Instead of getting a friendly NameError, the whole thing crashes. Not cool, right? This unexpected behavior is definitely something we need to investigate.

The Setup

Before we delve deeper, let's lay the groundwork. This bug was observed on the unix port (coverage variant) on x86_64, using MicroPython version v1.26.0-preview.521.g658a2e3dbd, built on August 2, 2025, with GCC 12.2.0. So, if you're trying to reproduce this, make sure you've got a similar setup.

The Code Snippet

The culprit code snippet is deceptively simple:

ans = (-1) ** 2.3; aa

As you can see, it's a straightforward calculation followed by an undefined variable. The expected behavior? A NameError, telling us that aa is not defined. But what we get instead is a segmentation fault, which is a much bigger problem.
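For reference, the expected behavior is easy to confirm in CPython, where a negative base raised to a fractional power gives a complex number and the bare name then raises NameError. This is only a minimal sketch; the namespace dict and the variable names are made up for illustration, and it says nothing about what a particular MicroPython build returns for the power expression.

```python
# Run the snippet in an isolated namespace and record what happens.
namespace = {}
try:
    exec("ans = (-1) ** 2.3; aa", namespace)
    outcome = "no error"
except NameError as exc:
    outcome = f"NameError: {exc}"

# In CPython, a negative base with a fractional exponent gives a complex result.
result_type = type(namespace["ans"]).__name__
print(result_type)
print(outcome)
```

The assignment runs to completion before aa is looked up, so ans is already set when the NameError fires.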

Observed Behavior: Segfault City

When running the code, MicroPython crashes with a segmentation fault. The stack trace is, unfortunately, corrupted, making it difficult to pinpoint the exact location of the crash. Sanitizers like UBSan and ASan didn't provide much additional info, which is always a bummer. Here's what the stack trace looks like:

Program received signal SIGSEGV, Segmentation fault.
0x0000555555756530 in mp_state_ctx ()
(gdb) where
#0  0x0000555555756530 in mp_state_ctx ()
#1  0x0000000000000000 in ?? ()

Yeah, not very helpful. It's like trying to navigate in the dark with a broken flashlight.
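If you want to tell a real crash apart from an ordinary uncaught exception without opening a debugger, a small harness along these lines can help. It is a sketch under assumptions: the helper name run_snippet is made up, and you need to point it at your own MicroPython unix binary (the example below just uses the running Python interpreter as a stand-in). It relies on the -c option, which the do_str frames in the traces suggest was used here.

```python
import subprocess
import sys

def run_snippet(interpreter, code):
    """Run `code` via `<interpreter> -c` and classify how the process ended."""
    proc = subprocess.run([interpreter, "-c", code], capture_output=True, text=True)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the process was killed by that signal.
        return f"killed by signal {-proc.returncode}"
    if proc.returncode == 0:
        return "ok"
    return "exited with error"

if __name__ == "__main__":
    # Stand-in interpreter; replace with the path to your MicroPython build (an assumption).
    print(run_snippet(sys.executable, "ans = (-1) ** 2.3; aa"))
```

A healthy interpreter should land in the "exited with error" branch for this snippet, while a segfault shows up as "killed by signal 11" on Linux.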

Valgrind to the Rescue

Thankfully, there's Valgrind, our trusty debugging companion! Valgrind produced some interesting diagnostics, starting with this:

==1669951== Invalid write of size 8
==1669951==    at 0x16C3B4: nlr_jump (nlrx64.c:104)
==1669951==    by 0x1B23DE: fun_bc_call (objfun.c:352)
==1669951==    by 0x19E61E: mp_call_function_n_kw (runtime.c:727)
==1669951==    by 0x1A0DEA: mp_call_function_0 (runtime.c:701)
==1669951==    by 0x264DB8: execute_from_lexer (main.c:162)
==1669951==    by 0x264E67: do_str (main.c:315)
==1669951==    by 0x2658D3: main_ (main.c:656)
==1669951==    by 0x26619F: main (main.c:494)
==1669951==  Address 0x1ffefff888 is on thread 1's stack
==1669951==  232 bytes below stack pointer

This points to an invalid write within nlr_jump, the function that performs the jump in MicroPython's non-local return (NLR) mechanism, which is how exceptions are raised. The write targets an address 232 bytes below the stack pointer, which is stack memory that no longer belongs to any live frame. This is where things start to get interesting!
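To make the "below stack pointer" remark concrete, here is the arithmetic on the two numbers from the Valgrind report. The variable names are just for illustration.

```python
# Values copied from the Valgrind report above.
fault_addr = 0x1FFEFFF888   # address of the invalid 8-byte write
below_sp = 232              # "232 bytes below stack pointer"

# The stack grows downward on x86_64, so "below" means a lower address than SP.
stack_pointer = fault_addr + below_sp
print(hex(stack_pointer))   # 0x1ffefff970
```

Anything at a lower address than the stack pointer is not owned by a live frame, which is why Valgrind flags the write.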

Diving Deeper: The nlr_jump Mystery

The suspicion is that an nlr_buf_t (the buffer that records where a non-local return should land) registered during constant folding, inside fold_constants, is still in play when the NameError is thrown. To understand this, we need to peek into the execution flow.
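Before looking at the debugger output, it may help to see the suspicion as a toy model. The sketch below is not MicroPython code: ToyNlrStack and its methods are invented names, and the "push without pop" step is an assumption that mirrors the hypothesis rather than something confirmed from the source.

```python
class ToyNlrStack:
    """Toy model of a chain of NLR buffers (all names are illustrative)."""

    def __init__(self):
        self.top = None            # plays the role of nlr_top
        self.live_frames = set()   # functions that have not returned yet

    def enter(self, frame):
        self.live_frames.add(frame)

    def leave(self, frame):
        self.live_frames.discard(frame)

    def push(self, frame):
        # Register a handler owned by `frame`, remembering the previous top.
        self.top = {"frame": frame, "prev": self.top}

    def jump(self):
        # Unwind to the current top handler; report whether its owner is still alive.
        target = self.top
        self.top = target["prev"]
        return target["frame"], target["frame"] in self.live_frames


nlr = ToyNlrStack()
nlr.enter("execute_from_lexer"); nlr.push("execute_from_lexer")

# Constant folding: a handler is pushed, but the function returns without popping it.
nlr.enter("binary_op_maybe"); nlr.push("binary_op_maybe")
nlr.leave("binary_op_maybe")

# Bytecode execution: the VM pushes its own handler on top of the stale one.
nlr.enter("mp_execute_bytecode"); nlr.push("mp_execute_bytecode")

first = nlr.jump()                 # exception raised inside the VM
nlr.leave("mp_execute_bytecode")
second = nlr.jump()                # exception propagated to the caller
print(first)    # ('mp_execute_bytecode', True)
print(second)   # ('binary_op_maybe', False)
```

The second jump lands on a handler whose owner has already returned, which in a real C program means jumping into a dead stack frame.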

Breakpoints and Backtraces

GDB (the GNU Debugger) is our friend here. By setting breakpoints in nlr_push (which saves the state for a non-local return) and examining the call stack, we can trace the execution.

Here's a snippet from the GDB session:

Breakpoint 1, nlr_push (nlr=nlr@entry=0x7fffffffdb10) at ../../py/nlrx64.c:55
55    unsigned int nlr_push(nlr_buf_t *nlr) {
(gdb) where
#0  nlr_push (nlr=nlr@entry=0x7fffffffdb10) at ../../py/nlrx64.c:55
#1  0x00005555556b0ae5 in execute_from_lexer (source_kind=source_kind@entry=1, 
    source=0x7fffffffe1a9, input_kind=input_kind@entry=MP_PARSE_FILE_INPUT, 
    is_repl=is_repl@entry=false) at main.c:123
#2  0x00005555556b0e68 in do_str (str=<optimized out>) at main.c:315
#3  0x00005555556b18d4 in main_ (argc=argc@entry=3, argv=argv@entry=0x7fffffffddd8)
    at main.c:656
#4  0x00005555556b21a0 in main (argc=3, argv=0x7fffffffddd8) at main.c:494

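This shows that nlr_push is called from execute_from_lexer, the unix port's function that parses, compiles, and runs a piece of source code. This is the outermost handler, with its buffer at 0x7fffffffdb10. Let's continue and see what happens.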

Breakpoint 1, nlr_push (nlr=nlr@entry=0x7fffffffd900) at ../../py/nlrx64.c:55
55    unsigned int nlr_push(nlr_buf_t *nlr) {
(gdb) where
#0  nlr_push (nlr=nlr@entry=0x7fffffffd900) at ../../py/nlrx64.c:55
#1  0x00005555555c5ea3 in binary_op_maybe (op=op@entry=MP_BINARY_OP_POWER, 
    lhs=0xffffffffffffffff, rhs=0x7ffff7c491e0, res=res@entry=0x7fffffffd998)
    at ../../py/parse.c:672
#2  0x00005555555c6d42 in fold_constants (parser=parser@entry=0x7fffffffda30, 
    rule_id=rule_id@entry=42 '*', num_args=2) at ../../py/parse.c:780
#3  0x00005555555c6ac2 in push_result_rule (parser=parser@entry=0x7fffffffda30, src_line=1, 
    rule_id=rule_id@entry=42 '*', num_args=<optimized out>) at ../../py/parse.c:1033
#4  0x00005555555c86b7 in mp_parse (lex=lex@entry=0x7ffff7c48bc0, 
    input_kind=input_kind@entry=MP_PARSE_FILE_INPUT) at ../../py/parse.c:1263
#5  0x00005555556b0b5c in execute_from_lexer (source_kind=source_kind@entry=1, 
    source=<optimized out>, input_kind=input_kind@entry=MP_PARSE_FILE_INPUT, 
    is_repl=is_repl@entry=false) at main.c:147
#6  0x00005555556b0e68 in do_str (str=<optimized out>) at main.c:315
#7  0x00005555556b18d4 in main_ (argc=argc@entry=3, argv=argv@entry=0x7fffffffddd8)
    at main.c:656
#8  0x00005555556b21a0 in main (argc=3, argv=0x7fffffffddd8) at main.c:494

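Ah, here we see fold_constants in the stack trace! This is where constant expressions are evaluated at compile time: binary_op_maybe is called with MP_BINARY_OP_POWER, so the parser is trying to fold (-1) ** 2.3, and it registers a buffer at 0x7fffffffd900, presumably to catch any exception the operation might raise. It seems like we're on the right track.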
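For readers who haven't met constant folding before, here is a tiny folder built on Python's ast module. It is only a sketch of the general idea (try the operation at compile time, give up if it fails), not a port of MicroPython's fold_constants; ToyFolder and fold are made-up names, and declining to fold complex results is an assumption of this sketch.

```python
import ast
import operator

# Operators this toy folder knows how to evaluate at "compile time".
_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

def _is_number(node):
    return isinstance(node, ast.Constant) and type(node.value) in (int, float)

class ToyFolder(ast.NodeTransformer):
    def visit_UnaryOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.USub) and _is_number(node.operand):
            return ast.copy_location(ast.Constant(-node.operand.value), node)
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)
        fn = _BINOPS.get(type(node.op))
        if fn is None or not (_is_number(node.left) and _is_number(node.right)):
            return node
        try:
            value = fn(node.left.value, node.right.value)
        except ArithmeticError:
            return node   # e.g. 1 / 0: leave it for run time to raise
        if isinstance(value, complex):
            return node   # assumption of this sketch: do not fold complex results
        return ast.copy_location(ast.Constant(value), node)

def fold(source):
    """Parse an expression and return its (possibly folded) top-level node."""
    return ToyFolder().visit(ast.parse(source, mode="eval")).body

for src in ("2 ** 10", "1 / 0", "(-1) ** 2.3"):
    print(src, "->", type(fold(src)).__name__)
```

The try/except here plays the role that an nlr_push/nlr_pop pair plays in C: the folder has to catch failures from the operation itself, and it has to clean up no matter which way it leaves.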

Let's keep going:

Breakpoint 1, nlr_push (nlr=nlr@entry=0x7fffffffd990) at ../../py/nlrx64.c:55
55    unsigned int nlr_push(nlr_buf_t *nlr) {
(gdb) where
#0  nlr_push (nlr=nlr@entry=0x7fffffffd990) at ../../py/nlrx64.c:55
#1  0x00005555556218fa in mp_execute_bytecode (code_state=code_state@entry=0x7fffffffda20, 
    inject_exc=<optimized out>, inject_exc@entry=0x0) at ../../py/vm.c:301
#2  0x00005555555fe288 in fun_bc_call (self_in=0x7ffff7c48be0, n_args=0, n_kw=0, args=0x0)
    at ../../py/objfun.c:295
#3  0x00005555555ea61f in mp_call_function_n_kw (fun_in=0x7ffff7c48be0, 
    n_args=n_args@entry=0, n_kw=n_kw@entry=0, args=args@entry=0x0) at ../../py/runtime.c:727
#4  0x00005555555ecdeb in mp_call_function_0 (fun=<optimized out>) at ../../py/runtime.c:701
#5  0x00005555556b0db9 in execute_from_lexer (source_kind=source_kind@entry=1, 
    source=<optimized out>, input_kind=input_kind@entry=MP_PARSE_FILE_INPUT, 
    is_repl=is_repl@entry=false) at main.c:162
#6  0x00005555556b0e68 in do_str (str=<optimized out>) at main.c:315
#7  0x00005555556b18d4 in main_ (argc=argc@entry=3, argv=argv@entry=0x7fffffffddd8)
    at main.c:656
#8  0x00005555556b21a0 in main (argc=3, argv=0x7fffffffddd8) at main.c:494

Now we see mp_execute_bytecode, which is where the compiled code is actually run. The parser is gone from the stack by now, and the VM registers its own buffer at 0x7fffffffd990 so it can handle exceptions raised during bytecode execution.

The Jump

Let's examine the nlr_jump calls:

Breakpoint 2, nlr_jump (val=0x7ffff7c48ba0) at ../../py/nlrx64.c:103
103   MP_NORETURN void nlr_jump(void *val) {
(gdb) p mp_thread_get_state ()->nlr_top
$3 = (nlr_buf_t *) 0x7fffffffd990

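Here, nlr_jump is called to raise an exception; val is the exception object (presumably our NameError). The nlr_top pointer is the most recently registered buffer, in this case 0x7fffffffd990, the one pushed by mp_execute_bytecode. So far, so good.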

Breakpoint 2, nlr_jump (val=val@entry=0x7ffff7c48ba0) at ../../py/nlrx64.c:103
103   MP_NORETURN void nlr_jump(void *val) {
(gdb) p mp_thread_get_state ()->nlr_top
$4 = (nlr_buf_t *) 0x7fffffffd900
(gdb) where
#0  nlr_jump (val=val@entry=0x7ffff7c48ba0) at ../../py/nlrx64.c:103
#1  0x00005555555fe3df in fun_bc_call (self_in=<optimized out>, n_args=0, n_kw=0, args=0x0)
    at ../../py/objfun.c:352
#2  0x00005555555ea61f in mp_call_function_n_kw (fun_in=0x7ffff7c48be0, 
    n_args=n_args@entry=0, n_kw=n_kw@entry=0, args=args@entry=0x0) at ../../py/runtime.c:727
#3  0x00005555555ecdeb in mp_call_function_0 (fun=<optimized out>) at ../../py/runtime.c:701
#4  0x00005555556b0db9 in execute_from_lexer (source_kind=source_kind@entry=1, 
    source=<optimized out>, input_kind=input_kind@entry=MP_PARSE_FILE_INPUT, 
    is_repl=is_repl@entry=false) at main.c:162
#5  0x00005555556b0e68 in do_str (str=<optimized out>) at main.c:315
#6  0x00005555556b18d4 in main_ (argc=argc@entry=3, argv=argv@entry=0x7fffffffddd8)
    at main.c:656
#7  0x00005555556b21a0 in main (argc=3, argv=0x7fffffffddd8) at main.c:494

Here's the crucial observation: at this second nlr_jump, nlr_top is 0x7fffffffd900, which is exactly the nlr_buf_t that was pushed inside binary_op_maybe (called from fold_constants), even though those functions returned long ago and are no longer on the stack. In other words, the non-local return is about to jump back into a context that no longer exists, which leads to the crash.
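The same cross-check can be done mechanically on the values from the GDB session. The addresses, callers, and frame names below are copied from the transcripts; the variable names are just for illustration.

```python
# nlr_push calls seen in the GDB session, in order: (buffer address, caller).
pushes = [
    (0x7FFFFFFFDB10, "execute_from_lexer"),
    (0x7FFFFFFFD900, "binary_op_maybe"),
    (0x7FFFFFFFD990, "mp_execute_bytecode"),
]
# nlr_top values printed at the two nlr_jump breakpoints.
jump_tops = [0x7FFFFFFFD990, 0x7FFFFFFFD900]

# Functions on the call stack at the second nlr_jump, per its backtrace.
live_at_second_jump = {
    "nlr_jump", "fun_bc_call", "mp_call_function_n_kw", "mp_call_function_0",
    "execute_from_lexer", "do_str", "main_", "main",
}

owner = dict(pushes)
second_target = owner[jump_tops[1]]
is_live = second_target in live_at_second_jump
print(second_target, is_live)   # binary_op_maybe False
```

The only buffer whose owner is still on the stack at that point is the one at 0x7fffffffdb10, so that is where nlr_top should have been pointing.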

The Hypothesis: A Mismatched Jump

So, the hypothesis is that nlr_jump is attempting to jump to an nlr_buf_t that was registered during constant folding but is no longer valid by the time the NameError is raised. This mismatch causes the jump to land in an unexpected location, resulting in the segmentation fault.

Root Cause

After analyzing the crash, the root cause appears to be an issue with how MicroPython's non-local return mechanism interacts with constant folding and error handling. Specifically, the nlr_buf_t registered inside binary_op_maybe during constant folding is still at the top of the NLR chain when the NameError is thrown later, even though the stack frame in which it was registered is no longer valid.

This leads to a jump to an invalid stack location, causing the segmentation fault.

The Fix

To fix this, the non-local return mechanism needs to be more carefully managed during constant folding and error handling. One potential solution is to ensure that any nlr_buf_t pushed during constant folding is popped again (via nlr_pop) on every path out of the function that registered it, before that function returns. This would prevent a later jump from landing in an invalid context.
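The "pop on every exit path" idea can be sketched in Python with a context manager. This is a toy stand-in, not the actual patch: nlr_chain, nlr_scope, and binary_op_maybe_toy are hypothetical names, and the early return for complex results is an assumed example of a path that would be easy to leave unbalanced.

```python
from contextlib import contextmanager

nlr_chain = []   # toy stand-in for the chain headed by nlr_top

@contextmanager
def nlr_scope(owner):
    """Push a handler for `owner` and guarantee it is popped on every exit path."""
    nlr_chain.append(owner)
    try:
        yield
    finally:
        nlr_chain.pop()

def binary_op_maybe_toy(lhs, rhs):
    """Hypothetical folding helper: returns (ok, value)."""
    with nlr_scope("binary_op_maybe_toy"):
        try:
            value = lhs ** rhs
        except ArithmeticError:
            return False, None
        if isinstance(value, complex):
            return False, None   # early return, but the handler is still popped
        return True, value

print(binary_op_maybe_toy(2, 10), nlr_chain)     # (True, 1024) []
print(binary_op_maybe_toy(-1, 2.3), nlr_chain)   # (False, None) []
```

C has no finally, so the equivalent discipline there is an explicit nlr_pop() before each return that leaves the protected region normally.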

Conclusion

This bug highlights the complexities of exception handling and non-local control flow in a language like MicroPython. The interaction between constant folding and error handling, in this case, led to a rather nasty crash. By carefully tracing the execution and examining the stack, we were able to pinpoint the root cause and propose a solution.

It's fascinating how seemingly simple code snippets can expose deep-seated issues in a system. Bugs like these are a reminder of the importance of thorough testing and debugging, especially in embedded systems where stability is paramount. Remember guys, keep your code clean, and happy debugging!

This crash was found with AFLplusplus and minimized manually, which goes to show the value of fuzzing in uncovering such issues.
