Why is `extend_from_slice` off by 4 bytes?

I am building a semi-hosted RISCV64 interpreter. I have noticed that when calling `extend_from_slice` with an argument longer than 16 bytes, 4 bytes before the beginning of the slice are read. This screams memory alignment issues, but I have no clue where to begin looking.


The broken code:

    let slice_a = b"ABCDABCDABCDABCDA";
    let mut vec_a = Vec::with_capacity(slice_a.len());
    vec_a.extend_from_slice(slice_a);
    io::log(&alloc::format!("vec_a = {:?}", vec_a));
    io::log(&alloc::format!("slice_a = {:?}", slice_a));

Below are the logs of the vector allocation and the debug output of both `vec_a` and `slice_a`; notice the incorrect first four bytes of `vec_a`.

DEBUG vm@0x00000000002e54: ALLOC@0x005620fdce6300       Layout { size: 17, align: 1 (1 << 0) }
 INFO vm@0x00000000001968: vec_a = [101, 114, 115, 126, 65, 66, 67, 68, 65, 66, 67, 68, 65, 66, 67, 68, 65]
 INFO vm@0x000000000019e8: slice_a = [65, 66, 67, 68, 65, 66, 67, 68, 65, 66, 67, 68, 65, 66, 67, 68, 65]

I can achieve the correct output by instead using `vec_a.extend_from_slice(&slice_a[4..]);`.

Advice on where to look and what is causing this is much appreciated :slight_smile:

Is this the `Vec` from the standard library? If so, this is a serious problem, and you should definitely report the bug.

This is `alloc::vec::Vec`. However, I do not believe this to be a bug in the library; I rather expect it to be an issue with me breaking an expected behavior.

I've narrowed the issue down to an SD (store doubleword) instruction which is storing [101, 114, 115, 126, 65, 66, 67, 68] into the vector, but I'm still working out how it gets these incorrect first 4 values and why.

This SD instruction is within `compiler_builtins::mem::memcpy`, so next I'm working out whether the function is getting a bad pointer.
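
For reference, SD just stores the 64-bit value in rs2 to rs1 plus a sign-extended immediate. A minimal sketch of its semantics (hypothetical names, not my actual decoder):

    // Sketch of RISCV64 SD semantics: store the 64-bit value in rs2 to the
    // address rs1 + sign-extended immediate, little-endian.
    fn exec_sd(regs: &[u64; 32], mem: &mut [u8], rs1: usize, rs2: usize, imm: i64) {
        let addr = regs[rs1].wrapping_add(imm as u64) as usize;
        mem[addr..addr + 8].copy_from_slice(&regs[rs2].to_le_bytes());
    }

So if the pointers reaching memcpy are good, either the address calculation or the value sitting in rs2 must already be wrong by the time the SD executes.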

If you don't need `unsafe` to trigger this error, it is definitely a bug; if not in the library, then in the compiler.


This is running inside a RISCV64 hypervisor of my own making, so there is plenty of `unsafe`.

So far I know that `alloc::vec::Vec<T, A>::extend_from_slice` calls `compiler_builtins::mem::memcpy`, and somewhere along that chain I get 4 bad bytes, probably because I am dumb.

Oh, I see. I am not familiar with RISCV. Are you emulating at the instruction level, or are you hooking/patching the program at API entry?

Some ideas:

  • Compare execution with another RISCV emulator (or the real hardware if you have it); QEMU supports RISCV, for example. If you set the same breakpoint in both your emulator and the other one, you should be able to compare memory and registers and find out where behaviour starts to diverge (see the sketch after this list).
  • Compare assembly between the working and non-working versions.
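
A minimal sketch of such a lockstep comparison, assuming both emulators can be driven through a small hypothetical `Cpu` interface (the names are mine, not from any real tool):

    /// Hypothetical minimal interface both emulators would expose.
    trait Cpu {
        fn pc(&self) -> u64;
        fn step(&mut self);
        fn registers(&self) -> [u64; 32];
    }

    /// Step both implementations in lockstep and return the PC of the first
    /// instruction after which their register files diverge.
    fn first_divergence(a: &mut dyn Cpu, b: &mut dyn Cpu, max_steps: u64) -> Option<u64> {
        for _ in 0..max_steps {
            let pc = a.pc();
            a.step();
            b.step();
            if a.registers() != b.registers() {
                return Some(pc); // the instruction at `pc` is suspect
            }
        }
        None
    }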

You should look at the generated assembly to ensure it's correct. And if it is correct, stepping through it may help you notice where it goes wrong.

You can paste this code into the Rust playground and it compiles with just a few changes (playground link):

#[inline(never)]
pub fn f() -> *mut u8 {
    let slice_a = b"ABCDABCDABCDABCDA";
    let mut vec_a = Vec::with_capacity(slice_a.len());
    vec_a.extend_from_slice(slice_a);

    // try to minimize cleanup instructions
    Box::into_raw(vec_a.into_boxed_slice()) as *mut u8
}

I took a look at the "Release" mode "Show Assembly", and it looks reasonable. It uses `movups` to load 16 bytes from the byte literal and another `movups` to store them into the `Vec` buffer, and it embeds the last `A` right in the assembly as an immediate `$65`. It's very much optimized for the particular byte literal in the code, but it's hard to see how this would go wrong. So the weirdness you're seeing is likely something specific to your platform, compilation options, or some such detail. You'll probably need to provide more information to get an answer.

Is your project all Rust? Do you let Cargo link the executable? What platform? And so on.

This is a good idea. I've been trying to avoid setting up totally new tooling, but I might give Unicorn (GitHub: unicorn-engine/unicorn) a go once I fail to work out what is going on with GDB.

The project is all Rust using a `riscv64ima-unknown-none-elf` target (a modification of `riscv64imac-unknown-none-elf` to remove the compressed instruction set). The executable is linked by LLVM's lld.

I've narrowed the issue down to `compiler_builtins::mem::memcpy`. The arguments being passed to this function appear to be correct. Below is the register state going into the function, followed by the instructions annotated in execution order, up to the point where the bad [101 'e', 114 'r', 115 's', 126 '~', 65 'A', 66 'B', 67 'C', 68 'D'] values are stored into `vec_a`.

[
    0x0,                // 0 - zero
    0x7ffffff49960,     // 1 - ra
    0x7ffffff44998,     // 2 - sp
    0x7ffffff45d88,     // 3 - gp
    0x0,                // 4 - tp
    0x7ffffff4992c,     // 5 - t0
    0x7ffffff4df30,     // 6 - t1
    0x3,                // 7 - t2
    0x11,               // 8 - s0
    0x7ffffff4e44e,     // 9 - s1
    0x555555c76380,     // 10 - a0 = ptr to vec_a
    0x7ffffff4e44e,     // 11 - a1 = ptr to slice_a
    0x11,               // 12 - a2 = slice_a.len()
    0x11,               // 13 - a3
    0x11,               // 14 - a4
    0x7,                // 15 - a5
    0x7ffffff45a71,     // 16 - a6
    0x4e23,             // 17 - a7
    0x7ffffff45cc8,     // 18 - s2
    0x0,                // 19 - s3
    0x39,               // 20 - s4
    0x6,                // 21 - s5
    0x7ffffff45aaa,     // 22 - s6
    0x7ffffff45cc8,     // 23 - s7
    0x7ffffff45a28,     // 24 - s8
    0x7ffffff45a69,     // 25 - s9
    0x7ffffff45aeb,     // 26 - s10
    0x7ffffff4e44e,     // 27 - s11
    0xe0,               // 28 - t3
    0xa0,               // 29 - t4
    0xed,               // 30 - t5
    0xffffffffffffff9f  // 31 - t6
]

│           0x080080c8      93060001       li a3, 16                    ; compiler_builtins::mem::memcpy::h8080bdef9a1b6ee8
|           ; src/mem/impl.rs:117   if n >= WORD_COPY_THRESHOLD {
│       ┌─< 0x080080cc      636cd608       bltu a2, a3, 0x8008164       ; 1.    branch if 0x11 (17) < 0x10 (16) (no branch)
|       |   ; src/mem/impl.rs:120   let dest_misalignment = (dest as usize).wrapping_neg() & WORD_MASK;
│       │   0x080080d0      bb06a040       negw a3, a0                  ; 2.    negate word in a0 (0x555555c76380)
│       │   0x080080d4      93f67600       andi a3, a3, 7               ; 3.    a3 & 7 = a3 (0)
|       |   ; src/mem/impl.rs:48    let dest_end = dest.wrapping_add(n) as *mut usize;
│       │   0x080080d8      3307d500       add a4, a0, a3               ; 4.    a0 + a3 = a4 (0x555555c76380)
│       │   0x080080dc      93870500       mv a5, a1                    ; 5.    a5 = ptr to slice_a (0x7ffffff4e44e)
│       │   0x080080e0      13080500       mv a6, a0                    ; 6.    a6 = ptr to vec_a (0x555555c76380)
|       |   ; src/mem/impl.rs:62    while dest_usize < dest_end {
│      ┌──< 0x080080e4      637ce500       bleu a4, a0, 0x80080fc       ; 7.    branch if a4 (0x555555c76380) <= a0 (0x555555c76380) (branch)
│      ││   ; CODE XREF from sub.loc._x_202_80080c8 @ 0x80080f8(x)
│     ┌───> 0x080080e8      83c80700       lbu a7, 0(a5)
│     ╎││   0x080080ec      23001801       sb a7, 0(a6)
│     ╎││   0x080080f0      13081800       addi a6, a6, 1
│     ╎││   0x080080f4      93871700       addi a5, a5, 1
│     └───< 0x080080f8      e368e8fe       bltu a6, a4, 0x80080e8
│      ││   ; CODE XREF from sub.loc._x_202_80080c8 @ 0x80080e4(x)
|      ||   ; src/mem/impl.rs:122    dest = dest.wrapping_add(dest_misalignment)
│      └──> 0x080080fc      b385d500       add a1, a1, a3               ; 8.    a1 + a3 (0) = a1 (0x7ffffff4e44e)
│       │   0x08008100      3306d640       sub a2, a2, a3               ; 9.    a2 - a3 (0) = a2 (0x11)
│       │   0x08008104      937786ff       andi a5, a2, -8              ; 10.   a2 & -8 = a5 (0x10)
|       |   ; src/mem/impl.rs:127   let src_misalignment = src as usize & WORD_MASK;
│       │   0x08008108      13f87500       andi a6, a1, 7               ; 11.   a1 & 7 = a6 (0x6)
│       │   0x0800810c      b306f700       add a3, a4, a5               ; 12.   a4 (0x555555c76380) + a5 (0x10) = a3 (0x555555c76390) (&vec_a + 16)
|       |   ; src/mem/impl.rs:128   if likely(src_misalignment == 0) {
│      ┌──< 0x08008110      630c0806       beqz a6, 0x8008188           ; 13.   branch if a6 (0x6) = 0 (no branch)
|      ||   ; src/mem/impl.rs:77    let shift = offset * 8;
│      ││   0x08008114      93983500       slli a7, a1, 0x3             ; 14.   a1 (0x7ffffff4e44e) << 0x3 = a7 (0x3ffffffa72270)
│      ││   0x08008118      13f88803       andi a6, a7, 56              ; 15.   a7 & 56 = a6 (0x30)
│      ││   0x0800811c      93f285ff       andi t0, a1, -8              ; 16.   a1 (0x7ffffff4e44e) & -8 = t0 (0x7ffffff4e448)
|      ||   ; src/mem/impl.rs:85    let mut prev_word = core::intrinsics::atomic_load_unordered(src_aligned);
│      ││   0x08008120      03b30200       ld t1, 0(t0)                 ; 17.   loads &slice_a-6 ([10 '\n', 60 '<', 101 'e', 114 'r', 114 'r', 62 '>', 65 'A', 66 'B']) into t1
│      ││   0x08008124      bb081041       negw a7, a7                  ; 18.   negate word in a7 (0x3ffffffa72270)
│      ││   0x08008128      93f88803       andi a7, a7, 56              ; 19.   a7 & 56 = a7 (0x10)
│      ││   0x0800812c      93828200       addi t0, t0, 8               ; 20.   t0 (0x7ffffff4e448) + 8 = t0 (0x7ffffff4e450) (&slice_a + 2)
│     ┌───< 0x08008130      6374d702       bleu a3, a4, 0x8008158       ; 21.   branch if a3 (0x555555c76390) <= a4 (0x555555c76380) (no branch)
│     │││   ; CODE XREF from sub.loc._x_202_80080c8 @ 0x8008154(x)
│    ┌────> 0x08008134      83b30200       ld t2, 0(t0)                 ; 22.   loads &slice_a+2 ([67 'C', 68 'D', 65 'A', 66 'B', 67 'C', 68 'D', 65 'A', 66 'B']) into t2
│    ╎│││   0x08008138      33530301       srl t1, t1, a6               ; 23.   t1 ([10 '\n', 60 '<', 101 'e', 114 'r', 114 'r', 62 '>', 65 'A', 66 'B']) >> a6 (0x30) = t1 ([101 'e', 114 'r', 114 'r', 62 '>', 65 'A', 66 'B', 0 '\000', 0 '\000']) 
│    ╎│││   0x0800813c      339e1301       sll t3, t2, a7               ; 24.   t2 ([67 'C', 68 'D', 65 'A', 66 'B', 67 'C', 68 'D', 65 'A', 66 'B']) << a7 (0x10) = t3 ([0 '\000', 0 '\000', 67 'C', 68 'D', 65 'A', 66 'B', 67 'C', 68 'D'])
│    ╎│││   0x08008140      33636e00       or t1, t3, t1                ; 25.   t3 ([0 '\000', 0 '\000', 67 'C', 68 'D', 65 'A', 66 'B', 67 'C', 68 'D'])| t1 ([101 'e', 114 'r', 114 'r', 62 '>', 65 'A', 66 'B', 0 '\000', 0 '\000']) = t1 ([101 'e', 114 'r', 115 's', 126 '~', 65 'A', 66 'B', 67 'C', 68 'D'])
│    ╎│││   ; STRN XREF from sub.entry0_8000000 @ 0x8000144(r)
│    ╎│││   0x08008144      23306700       sd t1, 0(a4)                 ; 26.   store t1 ([101 'e', 114 'r', 115 's', 126 '~', 65 'A', 66 'B', 67 'C', 68 'D']) into 0(a4) (&vec_a)
│    ╎│││   0x08008148      13078700       addi a4, a4, 8               ; 27.   a4 (&vec_a) + 8 = a4 (&vec_a + 8)
│    ╎│││   0x0800814c      93828200       addi t0, t0, 8               ; 28.   t0 (&slice_a) + 8 = t0 (&slice_a + 8)
│    ╎│││   0x08008150      13830300       mv t1, t2
│    └────< 0x08008154      e360d7fe       bltu a4, a3, 0x8008134
│     │││   ; CODE XREFS from sub.loc._x_202_80080c8 @ 0x8008130(x), 0x800818c(x), 0x80081a4(x)
│   ┌┌└───> 0x08008158      b385f500       add a1, a1, a5
│   ╎╎ ││   0x0800815c      13767600       andi a2, a2, 7
│   ╎╎┌───< 0x08008160      6f008000       j 0x8008168
│   ╎╎│││   ; CODE XREF from sub.loc._x_202_80080c8 @ 0x80080cc(x)
│   ╎╎││└─> 0x08008164      93060500       mv a3, a0
│   ╎╎││    ; CODE XREF from sub.loc._x_202_80080c8 @ 0x8008160(x)
│   ╎╎└───> 0x08008168      3386c600       add a2, a3, a2
│   ╎╎ │┌─< 0x0800816c      63fcc600       bleu a2, a3, 0x8008184
│   ╎╎ ││   ; CODE XREF from sub.loc._x_202_80080c8 @ 0x8008180(x)
│   ╎╎┌───> 0x08008170      03c70500       lbu a4, 0(a1)
│   ╎╎╎││   0x08008174      2380e600       sb a4, 0(a3)
│   ╎╎╎││   0x08008178      93861600       addi a3, a3, 1
│   ╎╎╎││   0x0800817c      93851500       addi a1, a1, 1
│   ╎╎└───< 0x08008180      e3e8c6fe       bltu a3, a2, 0x8008170
│   ╎╎ ││   ; CODE XREF from sub.loc._x_202_80080c8 @ 0x800816c(r)
│   ╎╎ │└─> 0x08008184      67800000       ret
│   ╎╎ │    ; CODE XREF from sub.loc._x_202_80080c8 @ 0x8008110(r)
│   ╎╎ └──> 0x08008188      13880500       mv a6, a1
│   └─────< 0x0800818c      e376d7fc       bleu a3, a4, 0x8008158
│    ╎      ; CODE XREF from sub.loc._x_202_80080c8 @ 0x80081a0(x)
│    ╎  ┌─> 0x08008190      83380800       ld a7, 0(a6)
│    ╎  ╎   0x08008194      23301701       sd a7, 0(a4)
│    ╎  ╎   0x08008198      13078700       addi a4, a4, 8
│    ╎  ╎   0x0800819c      13088800       addi a6, a6, 8
│    ╎  └─< 0x080081a0      e368d7fe       bltu a4, a3, 0x8008190
└    └────< 0x080081a4      6ff05ffb       j 0x8008158

This assembly is generated from `unsafe fn copy_forward(mut dest: *mut u8, mut src: *const u8, mut n: usize)`.
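
The misaligned branch (steps 22 to 25) stitches each output word together from two aligned source words. A safe sketch of the idea (not compiler_builtins' actual code; little-endian and word-aligned dest assumed):

    /// Sketch of the misaligned word-copy trick: for a source that sits
    /// `byte_offset` bytes past an 8-byte boundary, combine the high bytes
    /// of the previous aligned word with the low bytes of the next one.
    fn copy_misaligned(dest: &mut [u64], src_words: &[u64], byte_offset: usize) {
        debug_assert!((1..8).contains(&byte_offset)); // offset 0 takes the aligned path
        let shift = byte_offset * 8; // e.g. offset 6 -> shift of 48 (0x30)
        let mut prev = src_words[0];
        for (i, out) in dest.iter_mut().enumerate() {
            let next = src_words[i + 1];
            // Little-endian: the low bytes of the result come from the top
            // of `prev` (the srl), the high bytes from the bottom of `next`
            // (the sll).
            *out = (prev >> shift) | (next << (64 - shift));
            prev = next;
        }
    }

With the source 6 bytes past alignment, the shift amounts should be 48 for the srl and 16 for the sll, matching a6 = 0x30 and a7 = 0x10 in the trace above.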

I have now solved this issue. The problem is the `srl` instruction at step 23 above: it is shifting by 0x10 (16) rather than 0x30 (48).

This was due to me using an incorrect shift mask of 0b11111 (five bits, valid for RISCV32) rather than 0b111111 (six bits) for RISCV64. Interestingly enough this slipped past the official RISCV64 instruction set tests, so it might be a good one to add to protect against idiots like me.
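
A minimal sketch of the fix (hypothetical function name, not my emulator's real decoder):

    // RISCV64's register-register shifts (SLL/SRL/SRA) take their shift
    // amount from the low six bits of rs2; RISCV32 uses only five.
    fn exec_srl_rv64(rs1: u64, rs2: u64) -> u64 {
        // let shamt = rs2 & 0b11111; // the bug: RV32 mask turns 0x30 into 0x10
        let shamt = rs2 & 0b111111;   // correct for RISCV64: 0x30 stays 0x30
        rs1 >> shamt
    }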

Thank you all for your interest and help! :slight_smile:
