Properly preallocating memory for a collection of items

To learn more about unsafe rust I wanted to implement a simple HashTable for myself. Just a simple implementation with a string as key and value and simple insertion and retrieval. Nothing too fancy

Initially I tried to do it completely in safe Rust and simply stored the items inside a Vec<HashTableItem>. I preallocated memory with the with_capacity function, but inserting didn't work.
Then I took the opportunity to learn more about unsafe Rust and made the switch.

My table structs looks like this:

pub struct HashTableItem {
    key: Box<String>,
    value: RefCell<String>,
}
pub struct HashTable {
    ptr: *mut HashTableItem,
    cap: usize,
    len: usize,
    _marker: PhantomData<HashTableItem>,
}

I initialize the HashTable like this:

pub fn new(cap: usize) -> Self {
    let layout = if cap <= isize::MAX as usize {
        alloc::Layout::array::<HashTableItem>(cap).unwrap()
    } else {
        panic!("Capacity must be less than `isize::MAX`");
    };

    let ptr = unsafe { alloc::alloc_zeroed+(layout) };
    let ptr = ptr as *mut HashTableItem;

    HashTable {
        ptr,
        cap,
        len: 0,
        _marker: PhantomData,
    }
}

and want to insert an item like this:

pub fn insert(&mut self, key: String, _value: String) -> Result<()> {
    if self.len == self.cap {
        return Err(HashTableError::TableFull);
    }

    let index = hash_function(&key)?;

    unsafe {
        let current_item = self.ptr.add(index);
        if current_item.is_null() {
            // If the key is null, then there's no item at the given index and the new item can be inserted as-is
            println!("PTR is null");
        }
    }
 
    println!("Unimpl");
    Ok(())
}

However, the "PTR is null" clause is never hit. I guess I did something wrong when allocating the memory. Is my thinking too C-like?
I know there's MaybeUninit but I'm not quite sure if it's appropriate here.

I uploaded the whole code, with all methods to the Playground

Why it should be null? self.ptr always points to allocated memory, therefore self.ptr.add(index) always points at some offset from the start of this allocation.

1 Like

In C I can do

table->items = (Ht_item**) calloc (table->size, sizeof(Ht_item*));
for (int i=0; i<table->size; i++)
    table->items[i] = NULL;

and then test if the slot in the bucket is already filled with an item via

Ht_item* current_item = table->items[index];
if (current_item == NULL) {
    // Insert iten
} else {
    // Handle collision
}

I wanted to do the same in Rust, but I'm not entirely sure if that's a good way to do things (that's what I meant with "thinking too C like").

Fine that you're trying to learn unsafe rust, but concerned why the vector code didn't work - as that is the bread and butter of most rust datastructures (including BTreeMap); learning the rust-way will certainly be useful.

So as Cerber said, you should never get a null pointer after doing address arithmetic (even in C). What you're missing is you didn't dereference the pointer then look at it's contents.

Other comments, I think the Box is redundant (just use String directly).. You have a quadruple indirection instead of a double indirection. (hashtable -> HashTableItem -> boxed-String -> raw-string data).. Compare this to Rust's SwissHash which has hashtable -> raw-string (single indirection :slight_smile: )

Not sure if you're doing open or closed hash-table; if open, you'd need a *mut HashTableItem in your HashTableItem.

Not sure if you need the PhantomData; I'm not 100% sure how to use it, but I thought it was when you have a generic to the struct that isn't explicit anywhere else in the struct (e.g. if only the member impl functions would use it). Doesn't hurt either way - it's just compiler foo.

for cap <= isize::MAX that might be the good way to do it, but if you just say cap as isize as usize it'll panic for you; so just cleaner looking code IMO - not sure why 2^63 is a deal breaker anyway. :slight_smile:

You'd probably do best to use MaybeInit - which IIRC uses the value "1" to mean it's never been initialized. It's going to be more expensive than a memset(0) but I doubt by much memset(1u64) basically. This gives you 3 states; never-initialized, NULL (in the C sense) and valid. Technically since you're in raw-space land, you can do whatever you want and could even use 2 and 3 (as even 4 byte aligned pointers would always have bottom 2 bits as cleared) Most allocators are 8 or 16 byte aligned.

I think it's not safe nor sound to just cast the zero pointer as *mut HashtableItem and use it directly.. I think you should use the ptr::write to write a Default value - even if it's inherently all zeros. (similarly whenever 'taking' the data use ptr::read). I know you CAN do it - the killer would be if you ever added Enums - because it's difficult to make assumptions about what the compiler might do.

For me, the main reasons to use unsafe are

  1. talking to C FFI
  2. life-time games are too crazy or performance hurting; so just ptr::forget(val) and cast to raw *mut T .. then later restore - such that drop code works again.

Any other time (when there's a valid use case), there's bound to be some 3p library that wrapped the unsafe.. It's no different than you using it directly, but at least YOUR code is 100% safe/sound; and other's have at least inspected the unsafe paths to make sure they're not missing something. endian and other bitstream processors are big ones I use (or at least copy paste. :slight_smile:

But certainly have fun as a learning exercise.

There seems to be some confusion here. In your Rust code, self.ptr is a pointer to a list of items. But in your C snippet, table->items is a pointer to a list of pointers to items. Which one do you intend to use? (Also, how do you intend to handle collisions?)

1 Like

What you're missing is you didn't dereference the pointer then look at it's contents

D'Oh!

Not sure if you need the PhantomData

According to the Rustnomicon it's needed because *mut T doesn't indicate ownership and thus the DropChecker will not drop any values when the container itself is dropped.

not sure why 2^63 is a deal breaker anyway.

isize is only 2^63 on an 64-bit system. I'm currently on an x86/64 system, but that's not always the case; just yesterday I had to do some stuff on an 8-bit CPU interfacing with an x86 system. It was... weird.

As for the (short) answer to the question: LLVM and two's-complement. LLVM's GEP use signed indices, therefore you can't use an unsigned integer or you'll get an overflow. This is less of a problem on x64 system, since they usually only expose 2^48 of address space, but on anything less it's possible to f*** it up.

I think it's not safe nor sound to just cast the zero pointer as *mut HashtableItem and use it directly..

Yeah, that was just a trial. I took the code from the Chapter "Impementing Vec" from the Rustnomicon and changed it to try some stuff out. The actual line should be along the lines of:

let new_ptr =  unsafe { alloc::alloc(layout) };
self.ptr = match NonNull::new(new_ptr as *mut HashTableItem) {
    Some(p) => p,
    None => alloc::handle_alloc_error(new_layout),
};

Which one do you intend to use? (Also, how do you intend to handle collisions?)

  1. Pointer to a list of pointers to items
  2. Open Hashing

You may be interested in this thread; TLDR, if you don't use #[may_dangle], dropck does the right thing without it today (but if it's not a hinderence, might as well leave it in IMO).

(Just a passing comment; I didn't review the code.)

2 Likes

These days, I suggest instead trying

pub struct HashTable {
    buffer: Box<[MaybeUninit<HashTableItem>]>,
    len: usize,
}

Using a boxed slice instead of pointers can simplify things, since the Box is responsible for the memory (de)allocation, and you're responsible for only giving safe access to the initialized elements.

2 Likes

Because a hash table is a sparse datastructure, you’ll need some kind of flag for each slot to tell you whether or not it’s currently occupied— I’d consider using Option<HashTableItem> for this in place of MaybeUninit<HashTableItem>. The option will let you both store and check for None safely, and probably won’t take up any more memory space due to niche optimizations.

2 Likes

Some people make good suggestions, but they miss the point:
It's not about implementing a HashTable. It's about implementing a HashTable in UNSAFE Rust

Unsafe Rust is a superset of safe Rust, and idiomatic unsafe Rust uses safe constructs everywhere that doesn’t adversely affect performance or functionality.

4 Likes

The point is not about I could write idiomatic unsafe Rust. It's to learn how unsafe rust works! At some point I likely have to work with raw pointers (because much of the stuff I do is currently in C).
I'm not implementing a HashTable because I need a HashTable, I'm implementing it so I can learn more about unsafe Rust. Just following the Rustnomicon on how to write Vec won't get me far because I don't need to think for myself and I did that ad nauseam in university.

Sorry, but going your route I'm just reimplementing the already existing HashTable with a slightly different spin.

Thanks for the link. I'll read it when I got some time.

And that is fine – but the kind of unsafe you will write while interacting with C FFI is very different (and in fact much easier) than the kind of unsafe you would write for managing Rust-allocated memory yourself. The latter is much harder to justify and much less frequently-needed than doing FFI.

If you are trying to learn how to interface with C, I suggest you do that by actually interfacing with C (e.g. wrapping a C library), and not by writing a data structure.

I worded myself wrong there: For me it's not about learning how data structures work -- not anymore. I have implemented my fair share of them over the past years. It's also not mainly about FFIs; I want to understand how unsafe rust memory management works. And that's actually one of the main problems I have with the rust documentation: The topic of unsafe rust is spread all over the place.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.