Restrict reference from being borrowed by async/Weird segfault in OCaml-Rust interop

Lupus · June 17, 2023, 7:15pm

Hi!

I'm trying to build an interop layer between asynchronous Rust and asynchronous OCaml with primary goal to be able to use asynchronous Rust libraries from OCaml application.

I've got it in a somewhat usable state, and even have it open-sourced on Github [1]. The project is built on top of excellent ocaml-rs and ocaml-interop crates.

I'm running into a weird segmentation fault that I can't really understand. OCaml GC is a moving one, and most likely this is the cause as Rust is not quite prepared to the fact that some things are relocating behind it's back.

I've came up with a smart pointer for OCaml referenced objects, called CamlRef (full sources):

pub struct CamlRef<'a, T>
where
    T: Debug,
{
    dynbox_root: ocaml::root::Root,
    ptr: *const T,
    marker: PhantomData<&'a T>,
}

impl<'a, T> CamlRef<'a, T>
where
    T: Debug + 'static,
{
    pub fn new(dynbox: OCaml<DynBox<T>>) -> Self {
        let dynbox_root = unsafe { ocaml::root::Root::new(dynbox.get_raw()) };
        let ptr: &T = dynbox.borrow();
        Self {
            dynbox_root,
            ptr: ptr as *const T,
            marker: PhantomData,
        }
    }
}

impl<'a, T> Deref for CamlRef<'a, T>
where
    T: Debug + Sized,
{
    type Target = T;

    fn deref(&self) -> &'a Self::Target {
        unsafe { &*self.ptr }
    }
}

Basically it roots the OCaml value upon creation (rooting stops the value from being garbage-collected by OCaml GC), and dereferences it via Deref trait.

I was very happy that Rust refused to move executor into async in the following example:

#[ocaml::func]
#[ocaml::sig("executor -> (int -> promise) -> unit")]
pub fn lwti_executor_test(executor: CamlRef<Executor>, _f: ocaml::Value) {
    eprintln!("executor at #1: {:?}", executor);
    let task = executor.spawn(async move {
        eprintln!("executor at #2: {:?}", executor);
        executor.spawn(async {}).detach();
    });
    task.detach();
}

But if I change the async to not be a move one, it happily compiles, and crashes as well. Debug prints will print the following lines:

executor at #1: CamlRef { dynbox_root: Root(0x7f3d6d150028), ptr: 0x556cc720f600, marker: PhantomData<&ocaml_lwt_interop::local_executor::LocalExecutor> }
executor at #2: CamlRef { dynbox_root: Root(0x7f3d6d93b3c0), ptr: 0x556cc6006682, marker: PhantomData<&ocaml_lwt_interop::local_executor::LocalExecutor> }
Segmentation fault

Somehow CamlRef contents got corrupted, leading to segfault.

I'll try to figure out what's happening with GC interaction (although some ideas are welcome if someone knowledgeable about OCaml integration comes by!), but I thought that maybe it makes sense to just prevent this from happening using type system capabilities somehow. If I clone my executor before sending it to async, everything works just fine. Ideally I want my CamlRef to not leak into an async that outlives the current function, and I want Rust compiler to enforce that.

I've added explicit lifetime parameter to my CamlRef implementation, and assumed that it worked with async move, but looks like it's not enough. Some recommendations on how this can be tightened up are welcome!

Thanks in advance!

Lupus · June 17, 2023, 7:18pm

Sorry, was unable to post more than 2 links in the original topic, so adding some details on OCaml<DynBox<T>> magic, based on my understanding, this mechanism stores Pin<Box<dyn Any>> in OCaml heap, making it safe with regards to OCaml moving it's heap around during compation phase. See boxing and dereferencing in ocaml-interop crate.

SkiFire13 · June 17, 2023, 7:47pm

What's happening is that the Future you pass to spawn contains a reference to a local variable (executor), but then you detach the task and return from the function while the task is still executing. By doing this the reference contained in the Future becomes dangling, because the executor local variable gets destroyed at the end of the function call, so when you try to print it you're reading garbage data from the stack.

The soundness issue is in the fact that you need to ensure that whatever the Future borrows remains alive until spawned task ended. Most executors solve this problem by requiring the Future passed to spawn to be 'static.

Lupus · June 18, 2023, 7:09am

Whoa! That did the trick, thanks a lot! Looks like I don't have to chase GC integration problems at least in this case, as they are not the root cause here...

That's quite surprising that one can drive Rust to segfault even without using unsafe {}. I assumed that "if it compiles - it works" is applicable to Rust when I was refactoring multi-threaded executor from async_executor crate to actually be true thread-local one. One such refactoring was removing of lifetime parameter for the executor, which I assumed was not necessary for my simpler problem domain as Rust did not complain when I removed it.

Most executors solve this problem by requiring the Future passed to spawn to be 'static.

Is there any advantage/disadvantage of marking the Future just 'static instead of adding a lifetime parameter 'a to the executor itself and marking Future with that 'a instead of 'static?

SkiFire13 · June 18, 2023, 7:38am

But you did use unsafe in Executor::spawn in order to call async_task::spawn_unchecked which includes in its safety requirements:

If future is not 'static, borrowed variables must outlive its Runnable.

Which was exactly the problem causing the segfault.

Allowing non-'static lifetimes is generally regarded as something very very difficult to do safely.

With your idea I can see happening either:

you'll be greatly limited by how much the executor can run (because it won't be able to outlive any Future it contains)
you mess up something with e.g. variance and you allow again spawning Futures with shorter lifetimes than what should be allowed

Lupus · June 18, 2023, 8:55am

Oh my... Thanks for pointing this out and please forgive my ignorance, I'm quite new to Rust, and was unable to clearly understand that part of the safety requirements when I was reading them, and did not put enough effort as I was mostly focused on thread-related stuff at that moment. I removed lifetime parameter later on and completely forgot that there was something about lifetime in safety requirements in async_task::spawn_unchecked. My bad, lesson learned!

I've returned back the lifetime parameter as it was in original async_executor::Executor code (commit), it seems to work as expected, and even forced me to do some .clone()s on some parameters coming from OCaml in hyper bindings that I'm working on on top of ocaml-lwt-interop. Error messages are hard to understand though...

For this function for example:

#[ocaml::func]
pub fn hyper_server_stop(
    executor: CamlRef<LocalExecutor>,
    server: CamlRef<ServerHandle>,
    f: ocaml::Value,
) -> () {
    let f = ensure_rooted_value(f);
    let task = executor.spawn(async move {
        let gc = ambient_gc();
        let f = ocaml::function!(f, (x: ()) -> ());
        server.stop_server().await.unwrap();
        f(gc, &()).unwrap()
    });
    task.detach();
}

Rust complains about the following:

error: lifetime may not live long enough
   --> hyper/src/stubs.rs:320:16
    |
315 |       executor: CamlRef<LocalExecutor>,
    |       -------- has type `CamlRef<'_, LocalExecutor<'2>>`
316 |       server: CamlRef<ServerHandle>,
    |       ------ has type `CamlRef<'1, server::ServerHandle>`
...
320 |       let task = executor.spawn(async move {
    |  ________________^
321 | |         let gc = ambient_gc();
322 | |         let f = ocaml::function!(f, (x: ()) -> ());
323 | |         server.stop_server().await.unwrap();
324 | |         f(gc, &()).unwrap()
325 | |     });
    | |______^ argument requires that `'1` must outlive `'2`

To overcome this I added .clone() to my server, which makes sense - if OCaml garbage-collects the server handle before the async tasks completes, everything will explode. Is that the limitations that you mentioned about the ability to run taks on such executor?

Executor lifetime is actually managed by OCaml, won't there be any issues if Futures are 'static? Is the executor also considred 'static in this scenario from Rust's view?

SkiFire13 · June 18, 2023, 9:19am

Yes this is what I meant.

T: 'static just means that T doesn't contain non-'static lifetimes, and not containing lifetimes also counts (for example i32: 'static holds). At a lower level this just means that you want to you could hold onto instances of T until the program end (so for example you can hold onto a i32 until the program end just fine, but if you have a reference to a stack allocated value then you can't hold onto that until the program end because before that it will become dangling).

Are you sure it will respect the lifetime parameter of your LocalExecutor struct then?

Or viewed from another point of view, is there any way you could ever create a CamlRef<LocalExecutor<'non_static>>? Both new and FromValue seem to require T: 'static (i.e. LocalExecutor<'a>: 'static, which in turn requires 'a = 'static)

Lupus · June 18, 2023, 11:26am

I've tried with 'static, and the same limitations apply as with 'a approach - I need to clone the server reference. I'm not saying that this is the bad thing - no, this is the desired behavior, just trying to grasp the difference between 'a and 'static that you mentioned. The error message has changed, and now is arguably easier to understand:

error[E0521]: borrowed data escapes outside of function
   --> hyper/src/stubs.rs:320:16
    |
316 |       server: CamlRef<ServerHandle>,
    |       ------
    |       |
    |       `server` is a reference that is only valid in the function body
    |       has type `CamlRef<'1, server::ServerHandle>`
...
320 |       let task = executor.spawn(async move {
    |  ________________^
321 | |         let gc = ambient_gc();
322 | |         let f = ocaml::function!(f, (x: ()) -> ());
323 | |         server.stop_server().await.unwrap();
324 | |         f(gc, &()).unwrap()
325 | |     });
    | |      ^
    | |      |
    | |______`server` escapes the function body here
    |        argument requires that `'1` must outlive `'static`

All materials on 'static that I looked through mention that 'static has to exist till the program end, like static variables in C. Maybe I'm missing something here. I indeed had to specify 'static nearly everywhere and that defeats the purpose of explicit lifetime parameter in executor altogether, thanks for explanation! Interestingly enough requirement on 'static comes from ocaml-interop crate:

// Be careful about not deriving anything on OCaml to
// uphold the Borrow contract on Eq/Ord/Hash
impl<'a, A: 'static> Borrow<A> for OCaml<'a, DynBox<A>> {
    fn borrow(&self) -> &A {
        Pin::get_ref(Pin::as_ref(
            unsafe { self.custom_ptr_val::<Pin<Box<dyn Any>>>().as_ref() }
                .expect("Custom block contains null pointer"),
        ))
        .downcast_ref::<A>()
        .expect("DynBox of wrong type, cannot downcast")
    }
}

Does explicit lifetime parameter in CamlRef make sense if T: 'static is required to construct any CamlRef? I mostly added it to bind the dereferenced value to the same lifetime as CamlRef, since CamlRef is holding OCaml GC root that prevents it from being collected, if CamlRef is gone, referenced value will become dangling. Deref implementation that illustrates my point is cited below for convenience:

impl<'a, T> Deref for CamlRef<'a, T>
where
    T: Debug + Sized,
{
    type Target = T;

    fn deref(&self) -> &'a Self::Target {
        unsafe { &*self.ptr }
    }
}

SkiFire13 · June 18, 2023, 12:03pm

The comparison with static variables is fundamentally flawed because being 'static applies to types not variables or values. It's the type that has to be valid till the end of the program, not the values of that type (and again, look at i32. i32: 'static holds but not every i32 lives until the end of the program)

Yes, if the two lifetimes describe different things. The explicit lifetime parameter of CamlRef could describe for how long the reference itself is valid. The reference becoming invalid however doesn't directly correlate to the value being referenced not existing anymore though! That value could live potentially forever (maybe due to the GC? I don't know the details here), so that's why its type may need to be bounded by 'static.

system · September 16, 2023, 12:04pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Rust/OCaml bindings and Traits help	3	231	December 15, 2024
Using Rust's advanced memory management features without invalidating a bound C library's references help	7	1668	May 17, 2020
Ocaml rust ffi bindings announcements	3	1251	January 12, 2023
Integrating Rust crates into OCaml package management help	1	314	December 12, 2023
Async closure that takes a reference help	2	584	February 27, 2021

Restrict reference from being borrowed by async/Weird segfault in OCaml-Rust interop

Related topics