The Hunt for Error -22

66 points by mpalme 6 months ago

I can see how the author came to the conclusion that libmodem written in rust would have prevented the issue, but isn't it simply pushing the problem further down the stack?

The author needed to use unsafe in order to pass his pointer to libmodem, but libmodem is going to require a pointer with static lifetime itself. Which would have prevented the issue in the first place had the author done this.

I can see why you wouldn't want to use static, it hinders testability, but that means you need to ensure that the pointer you supply libmodem outlives libmodem. I would use RAII to do that in C++ and I am sure in rust you could/would do the same.

I guess I am asking, is there anything here that a libmodem written in rust would have magically solved? It feels like wishful thinking, but I am open to learn where I am mistaken.

In any case, kudos for finding this bug. Having worked with Zephyr/NRF connect SDK and this exact chip myself I can definitely relate to the pain they (can) bring.

swatcoder 6 months ago

The existing C interface doesn't have means to describe the lifetime of the data being passed in. It just takes a pointer. An experienced C programmer would often understand what's happening by convention and not encounter the problem.
But the custom Rust wrapper was composed as a game of telephone (ugh), with the author blindly mimicking "Jonathan" who seemed to have been blindly mimicking a sloppy (and later repaired) example from Nordic.
The argument is that if the library and its internals were originally written in Rust, which has richer semantics for object lifetimes, Rust would have been able to formally convey that the input data needed to outlive the individual function call, throwing an error at compile time.
The wrapper could have enforced this constraint itself, as it probably does now, but the handoff between Rust and C needs somebody to account for and understand the by convention stuff in C so that it can be expressed formally in Rust, and that human process failed to happen here.
- rcxdude 5 months ago
  
  Part of the issue is that there's not really a convention in C. If it's not documented, you should probably read the source code to find out. (C programmers often think there's a convention, but that's because there's one option that's obvious to them but then other programmers will have a different 'obvious' option, which is why this is so often not documented at all)
- wrs 6 months ago
  
  If I read the article correctly, Nordic changed the rules on this function without saying anything. It used to work with a stack-allocated config and now it doesn’t. The only way for the caller to know about that in C is a comment.
  - swatcoder 6 months ago
    
    That's possible, but unlikely given conventions in embedded development and how something like this interface would generally need to work.
    More likely (but not necessarily), Nordic's early example was either bugged or conditionally valid (benefiting from other implicit details of their implementation) and then was revised either because the mistake was identified or something else about the example change.
    That's all pretty common in this domain. Inadvertently stumbling because you uncritically followed some vendor example is also pretty common and completely understandable. Better tools, like using a language with richer semantics, are indeed something that can help with that.
    
    wrs 6 months ago
    
    I dunno, in my experience if you see something worded the way this was:
    int something_init(const something_init_params* params);
    the convention is that the params are temporary -- really just a way of passing a bunch of parameters to the init function. It would be a surprise that the params are expected to be static. E.g., the whole STM32 HAL is done this way, and it would be a disaster if you thought the init structs all had to be static!
    BTW, you can see the assumptions of the non-embedded programmers talking about "taking ownership" being the default interpretation of a signature like that...if you don't have a heap, what does that even mean?
    In any case, C is a mess, embedded is a mess, no argument there!
    
    cozzyd 5 months ago
    
    Making a copy is much less common in embedded if it's a large data structure.
    
    exmadscientist 6 months ago
    
    Not to mention that one of the things experienced people learn is that vendor code is hot flaming garbage and must never be trusted. Writing a Rust implementation based on vendor code is like building a skyscraper on a landfill. Don't do that. If you have to do that, tread bloody carefully!
    I am more on the hardware side these days, but Nordic's hardware docs are pretty crap. As in, they're pretty, and they're crap. (The prettiness lulls people, especially managers, into a false sense of confidence. Don't fall for the trap!) There are obvious poor choices in there, and if you call FAEs out on them, they say to just follow the docs. Experienced engineers should not follow the docs.
    I can't see their software side being any better.
orf 6 months ago

> I guess I am asking, is there anything here that a libmodem written in rust would have magically solved?
I'm not following your comment, but I think the point is simply "the lifetime of the config is in the function signature, rather than hopefully (sometimes) being in the documentation, and hopefully (sometimes) correct".
HALtheWise 5 months ago

It sounds like one function in libmodem accepts a pointer to a configuration struct, then stores that pointer (or an interior pointer from within it), which is then later used by another libmodem function later. If all of libmodem were written in Rust, this could be done without any use of unsafe, but it would require the lifetime on the original "reference" to provably outlive the second function getting called, probably by being static.
consp 6 months ago

The author mentioned in the first chapter that everything works fine in rust, since it solves all problems. So I guess they throw "better in rust" against every problem.
The assumption nobody ever makes mistakes is mistake one.

Neywiny 5 months ago

I'll admit like others I'm more C biased. This whole debugging story is not very flattering. Tbh my team would not be happy if I told them I didn't 3-4 weeks on something because I used an unsupported language, library, debugger, changed compilers often (did I see him say nightly builds?? On production??), etc etc.

I'm still not sure I understand why he couldn't just diff between versions. And the black box thing seems like a fool's errand. If changing the order of random things makes the issue go away, you can't change anything. The only thing you can do is use the binary you already have. Especially because even if you have 2 not working versions, fixing one doesn't necessarily fix the other. This debug effort felt very sloppy.

It's also weird looking at a lot of this code. The first assembly function pushes 4 values on the stack and only needed to push 2. I've had my fair share of bugs that make me go to dissembly but that also felt very time wastey here. The author evidently did not have enough of a grasp of what to expect for it to help at all.

While true that nRF should've put something in a log, the author admits they don't support this development flow. It's like the old addage about APIs. Any change no matter how small will break someone's usage.

wrs 6 months ago

Two general debugging lessons to note here: (1) Data watchpoints are a lifesaver that should be in your toolbox. (2) When something funky is happening to a data structure in a C program, look at its address to see if it’s in the stack or the heap, in case the answer is a surprise.

fch42 6 months ago

Caution: I'm a C programmer "by history" and therefore I see "C Interfaces" as that and don't always / immediately view them through the rust lens of ownership...

Reading the article (nice troubleshooting story!), my summary, as a C programmer, is that the "C Interface" here "takes ownership". Given C cannot express this properly, a pointer is passed - and the called function "simply" makes the assumption that from hence-on, what was given to it will remain.

As "semantics" this (the need to pass an "owned" piece of data to a function) isn't unusual irrespective of the programming language. Just in case of Rust, this is explicit in the interface (if the func takes a non-ref arg, or a shared smart ref of sorts), while in C ... this can lead to errors of the observed kind. I haven't looked whether any of the sources or docs of libmodem say "this pointer must be either global/static or malloc'ed (and the caller shall not free it)".

A rust wrapper for this could / should possibly "leak a reference" here; Something that prevents the initialisation object from being dropped. yea, accepted, needs "nasty hacks" whether static lifetime, Pin, manual drops, explicit Arc leaks, ... possible though.

It'd be nice if libmodem were stricter about such ownership, agreed, and then a rust wrapper could take advantage. Takes time to evolve; is there a bug report / enhancement request out there for this in libmodem ?

steveklabnik 6 months ago

> Takes time to evolve; is there a bug report / enhancement request out there for this in libmodem ?
The end of the post says
> This would have been so simple to put in the docs. I've opened a ticket on their DevZone forum. As of writing they've still not updated the docs of the init function.
And they've replied
> Thank you for reporting this, it will be fixed in the next `libmodem` release by the end of the month.
kevin_thibedeau 6 months ago

C would benefit from a new keyword that allows a function to require a pointer argument be a non-stack object when passed from a function that returns. That would cover the most common cases where you accidentally pass in a temporary object that must persist.
- bennettnate5 5 months ago
  
  The problem with this is that a pointer value may be passed through multiple functions or nontrivial control flow before reaching a function with this keyword. To effectively check if the pointer is in the stack or .rodata, it would need to do some kind of reverse data dependency analysis of all the possible places that pointer could come from to ensure none of the control flows could cause that pointer to be changed to a stack pointer.
  - fch42 5 months ago
    
    Also ... the lifetime issue here doesn't preclude the use of stack-based variables. A "local" variable of main() is "as good as" a global/static to pretty much all parts of the program.
- cozzyd 5 months ago
  
  It seems like it wouldn't be hard to support a vendor-specific function parameters attribute like __attribute__((ptr_in_section(".data",".rodata"))__ or __attribute__((ptr_not_in_section(".stack"))
  - fch42 5 months ago
    
    the stack (or stacks, for multithreaded processes) is not a "section" as far as executable file formats are concerned. It is amazingly hard at compile time to "know" that a specific pointer is on any stack.

cozzyd 5 months ago

Even if libmodem were written in rust, since they ship it as a blob, it would have to use the C ABI and you could easily have the same problem.

sagacity 6 months ago

This reminds of a similar issue I had recently when working with Rust and dx12 using the excellent windows-rs crates. Dx12 is designed around the concept of command buffers that only get submitted and evaluated at (usually) the end of the frame. This means that you need to take care to keep references around to the data you're submitting because otherwise dx12 will be looking at broken pointers once the command buffers actually get executed.

The tooling around this is much easier to deal with it, of course, since it's all just Windows and there's a bunch of sane debug layers that you can use.

Massive respect to be able to debug the issue on an embedded system!

diffuse_l 6 months ago

I'm not sure about the exact hardware setup used here, but is it possible that something like valgrind would have helped here?

I guess not, as the seems to run on the microcontroller, but I remember getting at least some warning from valgrind in similar situations

antithesis-nl 6 months ago

Yeah, overly complicated firmware like this just isn't great. It's always hard to debug and fix these issues, and there is no "if everything were just written in X" miracle-world where that goes away.

For example, I've encountered hardware that would occasionally write unexpected-error details to a memory location that was completely undocumented. And if you expect more than a shrug from a vendor after pointing out such things, well...

heywire 6 months ago

I love reading debugging stories like these. Thanks for sharing!