Why is "cargo clean" required to get my inline asm to work?

Just for fun I had the urge to learn me some assembler on my M1 MacBook, as inline asm in Rust of course, when I discovered this curious phenomenon:

Step 1: I write some inline asm that has a simple one-character mistake:

#![feature(naked_functions)]
#![no_main]

use core::arch::naked_asm;

#[naked]
extern "C" fn print_hex_64(n: u64) {
    unsafe {
        naked_asm!(
            // Make space for output string on stack. Stack pointer must have 16-byte alignment.
            // See: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop
            "sub   sp, sp, #(16) ",
            // Number to print passed as parameter in X0
            //
            // Set X1 to the end of the destination on stack
            "mov    x1, sp",
            "add    x1, x1, #15",
            // The loop is FOR x5 = 16 TO 1 STEP -1
            "mov	x5, #16",          // 16 digits to print
            "42:	and	x6, x0, #0xf", // mask of least sig digit
            // If x6 >= 10 then goto letter
            "CMP	x6, #10", // is 0-9 or A-F
            "b.ge	1f",
            // Else it's a number so convert to an ASCII digit
            "add	x6, x6, #'0'",
            "b      2f", // goto to end if
            "1:",        // handle the digits A to F
            "add	x6, x6, #('A'-10)",
            "2:",             // end if
            "strb	x6, [x1]",  // store ascii digit
            "sub	x1, x1, #1", // decrement address for next digit
            "lsr	x0, x0, #4", // shift off the digit we just processed
            // next x5
            "subs	x5, x5, #1", // step x5 by -1
            "b.ne	42b",        // another for loop if not done
            //
            // Setup the parameters to print our hex number
            // and then call Linux to do it.
            "mov	x0, #1",  // 1 = StdOut
            "mov	x1, sp",  // Start of string
            "mov	x2, #16", // length of our string
            "mov	x16, #4", // linux write system call
            "svc	#0x80",   // Call linux to output the string
            //
            // Restore stack and return
            "add   sp, sp, #(16)",
            "ret",
        );
    }
}

#[unsafe(no_mangle)]
pub extern "C" fn main(_argc: isize, _argv: *const *const u8) -> isize {
    let n = 0x0123456789abcdef;
    print_hex_64(n);
    0
}

This fails to build with:

cargo run
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
error: invalid operand for instruction
   |
note: instantiated into assembly here
  --> <inline asm>:16:6
   |
16 | strb    x6, [X1]

Step 2: I edit my code changing "strb x6, [x1]" to "strb w6, [x1]".
This compiles and runs but produces garbage output:

$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/rust_asm`
`{�k% 

Step 3: I run cargo clean and cargo run:

$ cargo clean
$ cargo run 
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/rust_asm`
0123456789ABCDE

Which is the output I want.

This had me scratching my head for a long while. What is going on?

While we are here, does anyone know how to build this as no_std? As I'm using syscalls to get everything done, I wondered how we could jettison all the unused library junk for a tiny executable.

Not an explanation for the issue you have, but why are you using direct syscalls when running on macOS? By chance macOS currently uses the same syscall number as Linux for write, but macOS has no syscall ABI stability at all, so this can break at any time. macOS requires you to call through libc.
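As a rough illustration of what calling through libc could look like, here is a minimal sketch that declares the C library's write function and calls it from Rust instead of issuing the syscall by hand. The wrapper name write_stdout is made up for this example, and the unsafe extern block syntax assumes Rust 1.82 or newer. It relies on std already linking the platform C library, so no extra crate is assumed.

```rust
use std::ffi::{c_int, c_void};

// Declaration of the C library's write(2) wrapper.
// size_t is mapped to usize and ssize_t to isize here.
unsafe extern "C" {
    fn write(fd: c_int, buf: *const c_void, count: usize) -> isize;
}

/// Hypothetical helper: writes `bytes` to file descriptor 1 (stdout)
/// through libc and returns whatever write() returned
/// (the number of bytes written, or -1 on error).
fn write_stdout(bytes: &[u8]) -> isize {
    // SAFETY: the pointer and length describe a valid, live byte slice.
    unsafe { write(1, bytes.as_ptr() as *const c_void, bytes.len()) }
}
```

Calling write_stdout(b"0123456789ABCDEF") would hand the 16 bytes to the C library, which is then responsible for using whatever syscall convention the running kernel expects.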

It would break especially reliably if you forget to pass the syscall number in x8, as XNU needs. That probably explains the whole mess: one time x8 was set to write, the other time it was set to some other syscall number.

When you have UB in your program, like this, it's hard to predict what it will do.

No reason really. As I said in my OP "Just for fun...".

Then it occurred to me it would be fun to get rid of all the libs and achieve tiny executables. A little challenge if you like.

Then I found out the Mac syscalls are so wobbly that I have yet to find them documented anywhere.

Perhaps I have to do this on my Raspberry Pi or Jetson boards...

You would find them documented only in the XNU sources, and then only for that particular version of XNU.

Linux is pretty much the only OS that actually has a stable syscall ABI. Most other OSes only provide a stable ABI at the level of libc (except Windows, which has the special NTDLL library so that libc can be unstable, too).

Although some older ones like AIX or HP-UX are stable by virtue of not being updated all that much.

This really puzzles me.

With all this talk of "libc" one would get the impression that this is something to do with the C language. But as far as I know the C language has no defined ABI and the C standard goes out of its way to avoid defining an ABI. Why don't they call it "libsys" or something?

Is it really so that the Mac OS sys calls have been changing all the time?

Is it then true that one can never even make a statically linked executable? I mean if the sys calls change your static binary would be broken.

What a horrible can of worms.

Well, duh. The C language was literally created for this very task. As in: it's C's raison d'être, it literally exists to provide libc.

The fact that C was adopted for some other usages is kinda-sorta abuse of C, if you look at how, where and why it was developed.

Because the idea was that the compiler provided by the OS developer would define the ABI, yes.

To ensure that OS developers are free to define their own ABIs.

Well, on macOS the library that you have to use is technically called libSystem. But it's common to call it libc, less confusion.

Oh, yeah. Not only does Apple explicitly reject the idea that syscalls would be stable, but, as the Go developers have found out, it's not an idle threat. They switched to libSystem after a few breakages.

Yup. That's what Apple writes extremely explicitly.

Why? They have to provide one stable ABI surface, because normal apps use libc. Why spend extra effort to provide anything more? Pure effort savings.

Linux is special because one group of people provides the kernel and two others (GNU and Google) provide libc, and they are barely on speaking terms. Thus Linux has to support a stable syscall interface.

But where the kernel and libc are developed by the same group… declaring syscalls an internal unstable ABI just makes things simpler for OS developers and doesn't affect most users, so why not?

Indeed. My understanding was that C was created specifically to be able to reimplement Unix in a portable fashion. Accreting features as they found they needed them for that task.

I've always thought so. Having been weaned on a few other compiled to native languages before C. ALGOL, Coral, PL/M, Ada...

Ah well, time to dig out those Raspberry Pi and Jetsons...

Now, what about that "cargo clean" issue?

You would need to disassemble the generated binary and see what could have been in x8 there.

We have no idea what syscall your program was even calling and what it was supposed to do!

If your program went from incrementally built to built from scratch, then the x8 value could have changed in an unpredictable fashion…

To avoid all the confusion with Mac sys calls I have removed all that. My asm now converts a u64 to a hexadecimal ASCII string in a 16 byte buffer. The new code is:

#![feature(naked_functions)]
#![no_main]

use std::arch::naked_asm;

#[naked]
extern "C" fn u64_to_hex(n: u64, buff: *const u8) {
    unsafe {
        naked_asm!(
            // Number to print passed as parameter in X0
            // Output buffer pointer passed as parameter in X1
            //
            // Set x1 to the end of the destination buffer
            "add   x1, x1, #15",
            // The loop is FOR x5 = 16 TO 1 STEP -1
            "mov	x5, #16",          // 16 digits to print
            "42:	and	x6, x0, #0xf", // mask of least sig digit
            // If x6 >= 10 then goto letter
            "cmp	x6, #10", // is 0-9 or A-F
            "b.ge	1f",
            // Else it's a number so convert to an ASCII digit
            "add	x6, x6, #'0'",
            "b	2f", // goto to end if
            "1:",   // handle the digits A to F
            "add	x6, x6, #('A'-10)",
            "2:",             // end if
            "strb	x6, [x1]",  // store ascii digit
            "sub	x1, x1, #1", // decrement address for next digit
            "lsr	x0, x0, #4", // shift off the digit we just processed
            // next x5
            "subs	x5, x5, #1", // step x5 by -1
            "b.ne	42b",        // another for loop if not done
            "ret",
        );
    }
}

#[unsafe(no_mangle)]
pub extern "C" fn main(_argc: isize, _argv: *const *const u8) -> isize {
    let buff: &mut [u8; 16] = &mut [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    u64_to_hex(0x0123456789abcdef, buff.as_ptr());
    println!("{:?}", buff);
    0
}

Which fails to build with:

cargo run
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
error: invalid operand for instruction
   |
note: instantiated into assembly here
  --> <inline asm>:14:6
   |
14 | strb    x6, [x1]

So I change that strb instruction to use w6 instead.
Which then builds without error or warning but produces the wrong result:

cargo run
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/rust_asm`
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Then we build from clean:

cargo clean
     Removed 51 files, 1.5MiB total
cargo run
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.25s
     Running `target/debug/rust_asm`
[48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 65, 66, 67, 68, 69, 70]

Which is now the correct result!
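For anyone wanting to double-check that expected output without any asm involved, here is a small plain-Rust sketch of the same digit loop (mask the low nibble, turn it into an ASCII character, store from the end of the buffer backwards, shift right by four). It is only a reference for comparison, not the code under test, and unlike the asm version it returns the buffer instead of writing through a pointer.

```rust
/// Reference conversion of a u64 to 16 upper-case hex ASCII digits,
/// mirroring the loop structure of the asm routine above.
fn u64_to_hex(mut n: u64) -> [u8; 16] {
    let mut buf = [0u8; 16];
    // Fill from the last byte towards the first, like the asm does
    // by starting at buff + 15 and decrementing the address.
    for slot in buf.iter_mut().rev() {
        let digit = (n & 0xf) as u8; // least significant hex digit
        *slot = if digit >= 10 {
            digit + (b'A' - 10) // 10..=15 become 'A'..='F'
        } else {
            digit + b'0' // 0..=9 become '0'..='9'
        };
        n >>= 4; // shift off the digit just processed
    }
    buf
}
```

For 0x0123456789abcdef this yields the bytes of "0123456789ABCDEF", which is the same list of numbers the clean build prints.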

That looks like some crazy problem with incremental builds. File an issue?

OK. Where?

GitHub, obviously.

Try to create a full reproducer: incremental builds have had issues in the past, so it's important to know exactly how to reproduce what you are seeing: the exact version of the toolchain, cargo, etc.

It may even be a known issue if you are not using the latest version of rustc.

Interesting. When I look at the disassembled code with objdump -d target/debug/rust_asm the troublesome strb instruction is missing:

0000000100000db0 <u64_to_hex>:
100000db0: 91003c21     add     x1, x1, #15
100000db4: d2800205     mov     x5, #16
100000db8: 92400c06     and     x6, x0, #0xf
100000dbc: f10028df     cmp     x6, #10
100000dc0: 5400006a     b.ge    0x100000dcc <u64_to_hex+0x1c>
100000dc4: 9100c0c6     add     x6, x6, #48
100000dc8: 14000002     b       0x100000dd0 <u64_to_hex+0x20>
100000dcc: 9100dcc6     add     x6, x6, #55
100000dd0: d1000421     sub     x1, x1, #1
100000dd4: d344fc00     lsr     x0, x0, #4
100000dd8: f10004a5     subs    x5, x5, #1
100000ddc: 54fffee1     b.ne    0x100000db8 <u64_to_hex+0x8>
100000de0: d65f03c0     ret

After the cargo clean and cargo run, the strb is where it should be:

0000000100000dac <u64_to_hex>:
100000dac: 91003c21     add     x1, x1, #15
100000db0: d2800205     mov     x5, #16
100000db4: 92400c06     and     x6, x0, #0xf
100000db8: f10028df     cmp     x6, #10
100000dbc: 5400006a     b.ge    0x100000dc8 <u64_to_hex+0x1c>
100000dc0: 9100c0c6     add     x6, x6, #48
100000dc4: 14000002     b       0x100000dcc <u64_to_hex+0x20>
100000dc8: 9100dcc6     add     x6, x6, #55
100000dcc: 39000026     strb    w6, [x1]
100000dd0: d1000421     sub     x1, x1, #1
100000dd4: d344fc00     lsr     x0, x0, #4
100000dd8: f10004a5     subs    x5, x5, #1
100000ddc: 54fffec1     b.ne    0x100000db4 <u64_to_hex+0x8>
100000de0: d65f03c0     ret

Very curious.
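If one wanted to check for this mechanically rather than by eye, a rough sketch could scan the instruction words for the store-byte encoding. The word lists below are copied from the two objdump listings above; the function names are made up for this example, and the mask only recognises the STRB (immediate, unsigned offset) form, which is the one that appears here.

```rust
/// True if `word` is an A64 STRB (immediate, unsigned offset) instruction,
/// i.e. its top ten bits are 0011_1001_00.
fn is_strb_unsigned_offset(word: u32) -> bool {
    (word & 0xFFC0_0000) == 0x3900_0000
}

/// True if any instruction word in the slice is such a STRB.
fn contains_strb(words: &[u32]) -> bool {
    words.iter().any(|&w| is_strb_unsigned_offset(w))
}

// Words from the listing produced by the stale (pre-clean) build.
const STALE_BUILD: [u32; 13] = [
    0x91003c21, 0xd2800205, 0x92400c06, 0xf10028df, 0x5400006a,
    0x9100c0c6, 0x14000002, 0x9100dcc6, 0xd1000421, 0xd344fc00,
    0xf10004a5, 0x54fffee1, 0xd65f03c0,
];

// Words from the listing produced after cargo clean.
const CLEAN_BUILD: [u32; 14] = [
    0x91003c21, 0xd2800205, 0x92400c06, 0xf10028df, 0x5400006a,
    0x9100c0c6, 0x14000002, 0x9100dcc6, 0x39000026, 0xd1000421,
    0xd344fc00, 0xf10004a5, 0x54fffec1, 0xd65f03c0,
];
```

With these lists, contains_strb(&STALE_BUILD) is false and contains_strb(&CLEAN_BUILD) is true, matching what the listings show.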

Yeah, it looks as if the first attempt produced an error at the LLVM level that wasn't properly reported to rustc, so rustc cached that output. It was then reused as something “already built” in the incremental build, which produced that nonsense.

I made a minimal demonstration of the problem:

This code has a deliberate error and fails to build:

#![feature(naked_functions)]
use std::arch::naked_asm;

#[naked]
extern "C" fn add_u64(x: u64, y: u64) -> u64 {
    unsafe { naked_asm!("add x0, x0, w1", "ret") }
}

pub fn main() {
    println!("{}", add_u64(3u64, 7u64));
}

cargo run
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
error: too few operands for instruction
  |
note: instantiated into assembly here
 --> <inline asm>:4:1
  |
4 | add x0, x0, w1
  | ^^^^^^^^^^^^^^

We fix the source by changing w1 to x1. It now builds without error but produces the wrong result:

#![feature(naked_functions)]
use std::arch::naked_asm;

#[naked]
extern "C" fn add_u64(x: u64, y: u64) -> u64 {
    unsafe { naked_asm!("add x0, x0, x1", "ret") }
}

pub fn main() {
    println!("{}", add_u64(3u64, 7u64));
}

cargo run
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/rust_asm`
3

Looking at the disassembled executable with objdump -d target/debug/rust_asm | less we find the add instruction is missing:

0000000100003db0 <__ZN8rust_asm7add_u6417h0dca69e1dd632e86E>:
100003db0: d65f03c0     ret

Rebuilding with cargo clean and cargo run then produces the correct result:

cargo clean
     Removed 60 files, 1.3MiB total
cargo run 
   Compiling rust_asm v0.1.0 (/Users/me/rust_asm)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/rust_asm`
10

And of course the add instruction is in the executable where we expect it to be:

0000000100003dac <__ZN8rust_asm7add_u6417h0dca69e1dd632e86E>:
100003dac: 8b010000     add     x0, x0, x1
100003db0: d65f03c0     ret

Done with:

cargo --version 
cargo 1.87.0-nightly (ab1463d63 2025-03-08)
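
For comparison only, here is a sketch of the same addition written with the stable asm! macro inside an ordinary function instead of a naked function. Whether this path is affected by the same problem is not something established in this thread. The non-aarch64 fallback is an addition for this example so that the snippet builds on other machines too.

```rust
/// aarch64 version: lets the compiler pick the registers and
/// emits a single add instruction via stable inline asm.
#[cfg(target_arch = "aarch64")]
fn add_u64(x: u64, y: u64) -> u64 {
    use std::arch::asm;
    let sum: u64;
    // SAFETY: the instruction only reads two registers and writes one;
    // it touches neither memory nor the stack.
    unsafe {
        asm!(
            "add {sum}, {x}, {y}",
            sum = lateout(reg) sum,
            x = in(reg) x,
            y = in(reg) y,
            options(pure, nomem, nostack),
        );
    }
    sum
}

/// Fallback for other architectures (assumption for this sketch):
/// plain wrapping addition, matching the wrap-around behaviour of add.
#[cfg(not(target_arch = "aarch64"))]
fn add_u64(x: u64, y: u64) -> u64 {
    x.wrapping_add(y)
}
```

Calling add_u64(3, 7) should give 10 in either variant.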

Did it work before the PR "codegen `#[naked]` functions using global asm" (rust-lang/rust Pull Request #128004, by folkertdev), and does this PR break it?

Could you please file an issue containing this information? naked_functions is getting stabilized soon and if it’s broken like this, that should be taken into account.

Will do. Thanks all.

Bug reported here: "Instructions missing from naked_asm blocks.", rust-lang/rust Issue #139407.

I hope that is clear enough.