Using raw pointers to point to a parent structure

Hello,

Our team of 5 is building a NoSQL engine as a student project for our course. Rust has been a great choice so far because of it's ownership model and clear way to do things. But today we ran into a kind of a problem. If you don't know NoSQL engine terms I am going to use, it's not that important.

The LSM structure, which handles SSTables and the Memory Table Pool is the main entry point of operation. It contains a Vector of SSTable paths (not relevant for this question), and a memory pool object that handles all memory tables. The supported operations are insert, delete, and get. Get is pretty straightforward and not important, but delete and insert have a caveat.

Insert and delete are pretty similar so I'm going to talk only about one of them.
When inserting a Key-Value pair into the Memory Table pool, the pair is being inserted into the read-write memory table. Once enough inserts have been made, the read-write memory table is pushed to the read-only memory table vector. Once enough read-only tables exist in the vector, and a new one is pushed, the last one will be popped and sent to be flushed. The flush function is in the LSM structure. Here, the problem manifested. How can we call the LSM flush function form the memory pool that is a field in the LSM? We tried everything with Rc, RefCell, Weak, lifetimes but to no avail.

One solution I found that works pretty well is using raw pointers.

use std::ptr;

struct LSM {
    mem_pool: MemPool,
}

impl LSM {
    fn new() -> Self {
        let new_mempool = MemPool::new();

        let mut lsm = LSM {
            mem_pool: new_mempool
        };

        lsm.mem_pool.lsm_ref = &mut lsm;

        lsm
    }

    // Should imitate inserting a new key value pair
    fn insert(&mut self) {
        self.mem_pool.insert();
    }

    // Should imitate flushing one memory table into an sstable
    fn flush(&mut self, text_to_be_flushed: String) {
        println!("{}", text_to_be_flushed);
    }
}

#[derive(Clone)]
struct MemPool {
    pub lsm_ref: *mut LSM,
    pub count: usize
}

impl MemPool {
    fn new() -> Self {
        MemPool {
            lsm_ref: ptr::null_mut(),
            count: 0
        }
    }

    // Inserts the "key value pair" into the memory table pool
    fn insert(&mut self) {
        self.count += 1;

        // If all memory tables are filled up, the last one should be popped and sent to the "flush function" in the LSM
        // Not the exact behavior is happening here, but it's irrelevant to the question
        if self.count > 2 {
            unsafe {
                // Unwraping the pointer is always safe, because the owner of the mem pool is the LSM,
                // and when the LSM is dropped, the mem pool is also dropped so there can never be an invalid pointer
                self.lsm_ref.as_mut().unwrap().flush("Flush".to_string());
            }
        }
    }
}

fn main() {
    let mut lsm = LSM::new();

    // Inserting a first KV pair
    lsm.insert();

    // Inserting a second KV pair
    lsm.insert();

    // Inserting a third KV pair
    lsm.insert()

    // Here, the flush should happen
}

The main questions of this post are:
Can this code break at all?
Is there a better way to do this with or without raw pointers?

I would just change MemPool::insert to return a bool indicating whether a flush is needed, and call flush from LSM::insert when MemPool::insert returns true. Or instead of a bool an enum could be used to make the code more readable.

If you do that, you could also choose to move the count field into LSM, and do the increment and check for flush in LSM::insert. It depends on the details of MemPool as to whether it should be responsible for incrementing and checking. I know you have only shown a skeleton and the details are probably much different.

Yeah I thought about that, but it may bloat the code a bit because then Mempool::insert needs to return an io::Result<Option> and LSM always needs to check whether the option is some or not (which may cost time on a lot of data). It definitely is a solution in the general sense, but in this project we are trying to explore Rust and try to make use of all of it's features to make the code more readable and performant.

The first stop with unsafe is Miri (run it in the playground under Tools in the top right).

error: Undefined Behavior: out-of-bounds pointer use: alloc874 has been freed, so this pointer is dangling
   --> /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mut_ptr.rs:689:57
    |
689 |         if self.is_null() { None } else { unsafe { Some(&mut *self) } }
    |                                                         ^^^^^^^^^^ out-of-bounds pointer use: alloc874 has been freed, so this pointer is dangling
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
help: alloc874 was allocated here:
   --> src/main.rs:11:13
    |
11  |         let mut lsm = LSM {
    |             ^^^^^^^
help: alloc874 was deallocated here:
   --> src/main.rs:18:5
    |
18  |     }
    |     ^
    = note: BACKTRACE (of the first span):
    = note: inside `std::ptr::mut_ptr::<impl *mut LSM>::as_mut::<'_>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mut_ptr.rs:689:57: 689:67
note: inside `MemPool::insert`
   --> src/main.rs:55:17
    |
55  |                 self.lsm_ref.as_mut().unwrap().flush("Flush".to_string());
    |                 ^^^^^^^^^^^^^^^^^^^^^
note: inside `LSM::insert`
   --> src/main.rs:22:9
    |
22  |         self.mem_pool.insert();
    |         ^^^^^^^^^^^^^^^^^^^^^^
note: inside `main`
   --> src/main.rs:71:5
    |
71  |     lsm.insert()
    |     ^^^^^^^^^^^^

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

Or with println debugging:

Addr in new:  0x7ffe2428b8b0
Addr in main: 0x7ffe2428b950
Pre-flush
LSM { mem_pool: MemPool { lsm_ref: 0x7ffe2428b8b0, count: 2 } }
Flush
LSM { mem_pool: MemPool { lsm_ref: 0x5569bff3a880, count: 140729505069320 } }

You move the LSM so the pointer dangles.

Yeah this is a big problem that I haven't noticed. Is there any way to fix it?

There's other problems, like creating a &mut invalidating previous borrows (which is why a common pattern in Rust is to move into raw pointers, do all your work in raw pointer land, then throw out all the raw pointers if you need to return to reference land).

In particular, creating two aliasing &mut (two active &mut to the same LSM) is UB, and doing so is inherent in this design:

&mut LSM method
  &mut MemPool method
    &mut LSM method on a freshly minted `&mut`

There's no way to fix it with that design.[1] The best fix is probably going to avoid unsafe.

If it's possible to factor the LSM so that the fields needed for flushing are in their own struct, you can perhaps do something like this.


  1. without going whole-hog on raw pointers or something ↩ī¸Ž

1 Like

Thank you. We will not use raw pointers since there is no time for us to grasp the unsafe rust mechanics until the project deadline.

Returning and checking an Option once for every new memtable couldn't possibly have a performance impact.

One of the features of Rust you can take advantage of (for readability and performance) is that it has optimized the use of nice abstractions like Error and Option. They have almost no cost, and certainly no significant cost in this case. Unlike in some languages, these abstractions don't require any memory allocation. They are just simple structs and calling Option::is_some has roughly the same cost as checking a bool for true or false.

Another option is to "split" the LSM into a memory pool and the other data you need for flushing, and then pass the fragment of the LSM you need in to the MemPool methods that may trigger a flush.

use std::ptr;

struct LSM {
    mem_pool: MemPool,
}

impl LSM {
    fn new() -> Self {
        let new_mempool = MemPool::new();

        let mut lsm = LSM {
            mem_pool: new_mempool,
        };

        lsm.mem_pool.lsm_ref = &mut lsm;

        lsm
    }

    fn split(&mut self) -> (LSMFragment, &mut MemPool) {
        (LSMFragment {}, &mut self.mem_pool)
    }

    // Should imitate inserting a new key value pair
    fn insert(&mut self) {
        let (mut fragment, mempool) = self.split();
        mempool.insert(&mut fragment);
    }
}

struct LSMFragment {}

impl LSMFragment {
    // Should imitate flushing one memory table into an sstable
    fn flush(&mut self, mempool: &mut MemPool, text_to_be_flushed: String) {
        println!("{}", text_to_be_flushed);
    }
}

#[derive(Clone)]
struct MemPool {
    pub lsm_ref: *mut LSM,
    pub count: usize,
}

impl MemPool {
    fn new() -> Self {
        MemPool {
            lsm_ref: ptr::null_mut(),
            count: 0,
        }
    }

    // Inserts the "key value pair" into the memory table pool
    fn insert(&mut self, fragment: &mut LSMFragment) {
        self.count += 1;
        if self.count > 2 {
            fragment.flush(self, "Flush".to_string());
        }
    }
}

fn main() {
    let mut lsm = LSM::new();

    // Inserting a first KV pair
    lsm.insert();

    // Inserting a second KV pair
    lsm.insert();

    // Inserting a third KV pair
    lsm.insert()

    // Here, the flush should happen
}

LSMFragment is obviously just a stub here, which makes it an easy change. In your actual code it may be considerably harder to use this trick now, but it's a good technique to have in your toolbox for Rust I think.

One other note: If LSMFragment is going to need a bunch of fields copied from LSM, you will probably be better off just storing the fragment inside LSM and then having split return references to both rather than constructing a new LSMFragment on every split.