Flaky results with memory access violations

Hi,
I made an account since I need help with the following 'funky' behavior. As this is commercial software, I don't want to expose function names, so bare with my generic names, please.

Situation

there is an external (fortran based) windows dll. I make use of the routines the following way:

#[link(name = "dllname", kind = "dylib")]
extern "C" {
    fn external_function(
        index1: *mut c_long,
        index2: *mut c_long,
        option: *const c_char,
        strlen: c_long,
        number_of_values: *mut c_long,
        values: *mut c_double,
        string_array: *const c_char,
        strlen: c_long,
        noerr: *mut c_long,
    );
}

I am wrapping this function this way:

pub fn wrapped_func(option: EnumType, index1: i32, index2: i32) -> Result<ResultStruct, String> {

    let mut ca_index1: c_long = index1 + 1; // from 0-based to 1-based indexing
    let mut ca_index2: c_long = index2 + 1; // from 0-based to 1-based indexing

    let mut ca_option = format!("{:?}", option); // Display trait to convert to String representation
    let len = ca_option.len() as i32; 

    // these are going to be set from the fortran side, initialize all as 0
    let mut ca_nvals = 0; // number of values
    let mut vals = [0.; CONST1 as usize]; // CONST1 is a hardcoded parameter, as is CASTRLEN
    let mut buf = [0 as c_char; CASTRLEN * CONST1 as usize];

    let mut return_code: c_long = 10;

    unsafe {
        external_function(
            &mut ca_index1 as *mut _,
            &mut ca_index2 as *mut _,
            ca_option.as_ptr() as *mut _,
            len as c_long,
            &mut ca_nvals as *mut _,
            vals.as_mut_ptr() as *mut _,
            buf.as_mut_ptr() as *mut c_char,
            CASTRLEN as c_long,
            &mut return_code as *mut _,
        );
    }
    // this piece is irrelevant
    if return_code != 0 {
        return Err(error_fn());
    }

    // in my desperation, I copy the values 1-by-1. I don't think I should need a copy at all!
    let mut buf_copy = [0; CASTRLEN * CONST1 ];
    let mut vals_copy = [0.; CONST1 ];

    for cnt in 0..TQFNLEN * ca_nvals as usize {
        buf_copy[cnt] = buf[cnt];
    }
    for cnt in 0..ca_nvals as usize {
        vals_copy[cnt] = vals[cnt];
    }

    // Truncate the length to the actual number of parameters
    let return_vals = vals_copy[..ca_nvals as usize].to_vec();

    // Convert the C strings to Rust strings
    let mut reply = Vec::with_capacity(ca_nvals as usize);
    let mut offset = 0;
    for _ in 0..ca_nvals as usize {
        let c_str = unsafe { CStr::from_ptr(buf_copy[offset..offset + TQFNLEN].as_ptr()) };
        let str_slice = c_str.to_str().unwrap();
        reply.push(str_slice[..TQFNLEN].trim().to_string());
        offset += TQFNLEN;
    }

    Ok(ResultStruct{
        values: return_vals,
        expressions: reply,
    })
}

As you can see, i tried a few more or less convoluted ways to convert.

I had many iterations on this, but it never got robustly running - running my tests on this code sometimes (!) gives me an access violation / invalid read error, and sometimes runs.

Interestingly enough, it seems that adding dbg!(&buf_copy) somewhere after the unsafe block stabilizes things - but not to an extent that it never happens (specifically as obviously if doing a non-development build, the dbg statements get dropped.

As I know that this is not a running example of the problem, let me confirm that the Fortran code actually writes 'good' data. The f64 values can't be bad, and the returned char-arrays are all simple ASCII bytecodes. I can confirm that these are correct. The Fortran program would crash if any non-sane characters were in there much earlier.

The issue really seems to be that the either the pointers/buffers are loosing validity. I have a suspicion that they may get optimized away?

LLMs suggested to use std::ptr::read_volatile(buf.as_ptr()), but that fundamentally did not change the issue.

So I guess I am either looking for a way to stop any (premature?) optimization.

I guess there are many ways to swap a tire, but the conversion of the char-array to String can be done in multitudes of way. I know I do not need to go through the unsafe CStr if I know the length yadayada. This is merely still in place from my attempts at debugging. But coming back to my question, I don't think I should need any copy of the buffers before returning?

I haven't used fortran in a really long time, but I looked it up just now and fortran strings are not null terminated. So perhaps CStr::from_ptr is not what you want above.
https://doc.rust-lang.org/std/ffi/struct.CStr.html#method.from_ptr

  • The memory pointed to by ptr must contain a valid nul terminator at the end of the string.

Just a guess.

Hey, thank you for that answer, fair and correct point. In between my debugging sessions I apparently dropped the part where I specifically set the expected end of the char-array to a null value.

However, the problem persists, even when implemented like this:

pub fn wrapped_func(option: EnumType, index1: i32, index2: i32) -> Result<ResultStruct, String> {

    let mut ca_index1: c_long = index1 + 1; // from 0-based to 1-based indexing
    let mut ca_index2: c_long = index2 + 1; // from 0-based to 1-based indexing

    let mut ca_option = format!("{:?}", option); // Display trait to convert to String representation
    let len = ca_option.len() as i32; 

    // these are going to be set from the fortran side, initialize all as 0
    let mut ca_nvals = 0; // number of values
    let mut vals = [0.; CONST1 as usize]; // CONST1 is a hardcoded parameter, as is CASTRLEN
    let mut buf = [0 as c_char; CASTRLEN * CONST1 as usize];

    let mut return_code: c_long = 10;

    unsafe {
        external_function(
            &mut ca_index1 as *mut _,
            &mut ca_index2 as *mut _,
            ca_option.as_ptr() as *mut _,
            len as c_long,
            &mut ca_nvals as *mut _,
            vals.as_mut_ptr() as *mut _,
            buf.as_mut_ptr() as *mut c_char,
            CASTRLEN as c_long,
            &mut return_code as *mut _,
        );
    }
    // this piece is irrelevant
    if return_code != 0 {
        return Err(error_fn());
    }

    // in my desperation, I copy the values 1-by-1. I don't think I should need a copy at all!
    let mut buf_copy = [0; CASTRLEN * CONST1 ];
    let mut vals_copy = [0.; CONST1 ];

    for cnt in 0..CASTRLEN * ca_nvals as usize {
        buf_copy[cnt] = buf[cnt];
        if cnt % CASTRLEN == 0 && cnt > 1 {
            buf_copy[cnt - 1] = 0;
        }
    }
    for cnt in 0..ca_nvals as usize {
        vals_copy[cnt] = vals[cnt];
    }

    // Truncate the length to the actual number of parameters
    let mut return_vals = vals_copy.clone().to_vec();
    return_vals.truncate(ca_nvals as usize);


    // Convert the C strings to Rust strings
    let mut reply = Vec::with_capacity(ca_nvals as usize);
    let mut offset = 0;
    for _ in 0..ca_nvals as usize {
        let c_str = unsafe { CStr::from_ptr(buf_copy[offset..].as_ptr()) };
        let str_slice = c_str.to_str().unwrap();
        reply.push(str_slice.trim().to_string());
        offset += CASTRLEN;
    }

    Ok(ResultStruct{
        values: return_vals,
        expressions: reply,
    })
}

I'm sorry that obfuscating the code is making things a bit more complicated, I had a few incorrect renames in the first code snippet.

I don't think this applies, as the mangling would be relevant if the Rust function would be exposed externally, but I am using the FFI to call into a Fortran DLL.

This will not add a null terminator on the last string in the array. cnt never reaches CASTRLEN * ca_nvals (it reaches one less), so cnt % CASTRLEN == 0 is never true for the last string.

You don't need CStr. You can treat the raw strings as slices of bytes (u8) where each slice has length CASTRLEN, and convert those to str slices. Then trim and convert to a String.

This will also avoid overwriting the last char in each array with a null, which would truncate every string to length CASTRLEN - 1.

And of course, there is no reason to copy from one local array to another.

1 Like

One thing i don't feel very good about is not specifying types and then casting the pointers, because you don't really know what type it was before the cast. (especially with numbers where the compiler has a default)
Another thing: you comment that you use the Display trait, but you do {:?} which calls the Debug trait.

1 Like

omg, pls excuse me. I did indeed copy wrong code, as I got totally lost in the debugging process.

Bear with me, here is a version that definitely works sometimes, but fails at others:

    let mut ca_option = format!("{:?}", option);
    let len = ca_option.len() as i32;

    let mut ca_index1: c_long = indexp + 1;
    let mut ca_index2: c_long = indexc + 1;

    let mut ca_nvals = 0 as c_long;
    let mut vals = [0.; CONST1 as usize];
    let mut buf = [0u8; CASTRLEN * CONST1 as usize];

    let mut return_code: c_long = 10;
    unsafe {
        external_function(
            &mut ca_index1 as *mut c_long,
            &mut ca_index2 as *mut c_long,
            ca_option.as_mut_ptr() as *mut c_char,
            len as c_long,
            &mut ca_nvals,
            vals.as_mut_ptr() as *mut _,
            buf.as_mut_ptr() as *mut _,
            CASTRLEN as c_long,
            &mut return_code as *mut c_long,
        );
    }
    let mut buf_copy = [0; CASTRLEN * CONST1];
    let mut vals_copy = [0.; CONST1];
    for cnt in 0..CASTRLEN * ca_nvals as usize {
        buf_copy[cnt] = buf[cnt];
    }
    for cnt in 0..ca_nvals as usize {
        vals_copy[cnt] = vals[cnt];
    }
    if return_code != 0 {
        return Err(tqerr());
    }

    let mut return_vals = vals_copy.clone().to_vec();
    return_vals.truncate(ca_nvals as usize);
    // dbg!(ca_nvals);

    // Convert the C strings to Rust strings
    let mut reply: Vec<String> = Vec::with_capacity(ca_nvals as usize);
    let mut offset = 0;
    for i in 0..ca_nvals as usize {
        let r = buf_copy[offset..offset + CASTRLEN ]
            .iter()
            .map(|c| *c as char)
            .collect::<String>();
        //dbg!(&r);
        reply.push(r.trim().to_string());
        offset += CASTRLEN ;
    }

    Ok(FnExpression {
        values: return_vals,
        expressions: reply,
    })
}

This version definitely compiles and works with an associated test. However, wenn I "build" this (using maturin), the associated pytest fails with an access violation.

I'm not sure which things you're refering to, but the as *mut _ is simply a shorthand as the signature of that extern function is typed and therefore the type can be inferred, or am I mistaken?

Another thing: you comment that you use the Display trait, but you do {:?} which calls the Debug trait.

Yea, for these simple enums the Display and Debug trait are identical. Good observation, but the input validation of that parameter in the external dll works flawlessly.

Yes the result of the cast is clear from the function signature but the type before the cast isn't. as casts can cast a ptr from one type to a ptr to another type without you noticing and you never specify which type your variables have.

let mut vals: [c_double; CONST1 as usize] = [0.; CONST1 as usize];

with this you would make sure that vals actually has the right type.

not really good advice in this case

That's also why you shouldn't use as casts. ptr types have a lot of methods that only cast mut/const or only cast type and with those you are sure you don't change something you don't want to change.

1 Like

okay!

I did explicitly add the types, however the result remains 'access violation'. I could also remove most of the "as ..." parts, as the types already fit.

Is there any technical reason the stack allocation style would throw a wrench in the call-by-reference style of the called dll?

Any comment on why e.g. for the above, tests that are run as:

cargo.exe test --package packagename --lib -- ca_ffi::wrapped_func --exact --show-output

work, but

cargo nextest run ca_ffi::wrapped_func

fails?

I think it is generally good advice to get rid of as many as casts as possible. I think the only you need is from *const u8 to *const i8.

Another thing I would do is to split the function into a safe and an unsafe part, i.e. you write a minimal safe wrapper which takes slices and do the argument construction in safe rust.

That can be done with pointer - Rust.

So, as I could remove those casts and the problem persists - any suggestions how to debug this? I added some assertions on e.g. the array sizes, but those never fail. Is it possible that Rust deallocates prematurely in some way?

I briefly read up about the differences in the stack/heap allocation form, but fundamentally can't see how the deallocations of this memory would make the called dll fail, as the called routine only acts on the reference for that call and does not take ownership, at least to my understanding.

You could (just for debugging) put all the memory the function needs on the heap and just leak it all. If it then still happens you know for sure that it's not a livetime issue.

Do you know how the access violation happens? which array gets indexed etc?

So I tried to put the data on the heap, but nothing changed.

that's the curious thing both of the arrays seem to be affected. (That's why I put the "flaky" in the title). If I remove the code that deals with either of those arrays, it sometimes works, but as I already said, it also sometimes works completely correctly. If I use debug prints after the FFI call, the data in those arrays looks correct.

May it be possible that the dll holds on to the memory past the point where rust gives it up? (and then tries to write?). I would not know how to check this.

That is what i wanted to check by leaking the memory. Just putting it on the heap by itself won't help there. Did you use Box::leak() (or something similar) when putting it on the heap?

Yea okay, that sounds interesting. Is this about right?

    let mut buf: Box<[c_char; CASTRLEN * CONST1 as usize]> =
        Box::new([0; CASTRLEN * CONST1 as usize]);
    let static_buf: &'static mut [i8] = Box::leak(buf);

and then

            static_buf.as_mut_ptr(),

in the arguments?

I did not obverse any changes in behavior, other than that the test-runner now occasionally hangs on the execution (without actually triggering the access violation, I think).

Have you done that with all parameters? If you then still get access violations it's probably not a lifetime issue.
I don't know how you run your tests but it could be that the test runner doesn't clean up before starting the next test which might be why it hangs.
Are the lenghts that you pass in measured in elements or bytes? that could be an issue, but it seems fine to me.

Have you tried running it with valgrind and asan/ubsan?

Are you still using cargo nextest, or cargo test? cargo test by default runs tests in many threads, and I recall (possibly incorrectly) that Fortran code often uses global variables where other languages would use stack variables, so it is not thread-safe. If that’s true of yours, then you need a static Mutex guarding calls to your external_function so that only one at a time can execute.

3 Likes