How is a Vector of owned heap-allocated data (e.g. Strings) represented?

#1

I am writing a program for sorting a &mut [T] to help me learn generics. I have already had a working version of the simple sorting function (optimised bubble sort) when I decided that the swapping part could be its own function. I am now stuck trying to get the swap function to compile.

I think a first step to trying to fix it would be to understand how exactly a Vec<T> works with other owned, heap-allocated data types. Would I be correct to say that a Vec<String> consists of the pointer, capacity and length in a heap-allocated array and the String's chars is stored in some other random place in the heap? So that a Vec<String> containing "Hello" and ", world!" would look like this:

(assume 0xabffff00 is on the heap and the value in the Vec<T>'s pointer)

0xabffff00 => ptr = 0xabffffa0
0xabffff04 => cap >= 5
0xabffff08 => len = 5
0xabffff0c => ptr = 0xabffffd4
0xabffff10 => cap = >=8
0xabffff14 => len = 8

0xabffffa0 => ‘h’
0xabffffa1 => ‘e’
0xabffffa2 => ‘l’
0xabffffa3 => ‘l’
0xabffffa4 => ‘o’

0xabffffd4 => ‘,’
0xabffffd5 => ’ ’
0xabffffd6 => ‘w’
0xabffffd7 => ‘o’
0xabffffd8 => ‘r’
0xabffffd9 => ‘l’
0xabffffda => ‘d’
0xabffffd4 => ‘!’

In this case, I want to swap the ptr, cap and len of the first and second Strings, but keep the heap allocations just as they are. But I also want the swap() function to be generic, so it should work with any data type, e.g. with i32.

Here is the swap() function:

fn swap<T>(a: &mut T, b: &mut T) {
    let temp = a;
    a = b;
    b = temp;
}

It doesn’t compile yet, but this should make clearer what I am trying to do.

Here is the full program:

use std::fmt::Display;

fn sort<T: PartialOrd>(list: &mut [T]) {
    for i in (0..list.len()).rev() {
        let mut swapped = false;

        for j in 0..i {
            if list[j] > list[j+1] {
                swap(&mut list[j], &mut list[j+1]);
                swapped = true;
            }
        }
        if !swapped {
            break;
        }
    }
}

fn swap<T>(a: &mut T, b: &mut T) {
    let temp = a;
    a = b;
    b = temp;
}

fn print<T: Display>(list: &[T]) {
    for (i, item) in list.iter().enumerate() {
        print!("{}", item);
        print!("{}", if i < list.len()-1 { ", " } else { "\n\n" });
    }
}

fn main() {
    let mut number_list = vec![40, 31, 98, 25, 67, 59, 72, 14];

    print(&number_list);
    sort(&mut number_list);
    print(&number_list);

    let mut string_list = vec!["Johan", "Louis", "Louis", "Clarice", "JP",
                               "Hannes", "Thys", "Bernard", "Mariandri"];

    print(&string_list);
    sort(&mut string_list);
    print(&string_list);

    let mut char_list = vec!['b', 'e', 'z', 'l', 'y', 'a', 'r', 'o', 'n', 'k', 'q', 'c', 'v',
                             'm', 'x', 'd', 'h', 'f', 'g', 's', 'w', 't', 'p', 'i', 'u', 'j'];

    print(&char_list);
    sort(&mut char_list);
    print(&char_list);

}
  1. Am I correct in my hypothesis about how Strings are stored in a Vec?
  2. Can I somehow do what I want and write a generic swap function that swaps two instances of the same type efficiently?

Thanks in advance for your consideration!

#2
  1. Yes, you are completely right;

  2. Efficiently swapping two places of memory is not something you can implement by yourself without unsafe code while respecting ownership semantics (I guess that’s what you meant by a generic function); it must thus be something provided by the language itself, at its core: mem::swap.

    • the aformentioned non-unsafe swap function requires two unique references (&mut), which is “not easy” to get from a Vec (or more generally, any slice), given the granularity of borrows (at the binding / variable level, i.e., the whole slice / array). Thus we also have something provided by Rust, from its std library: <[T]>::swap.

      let mut array: [_; 2] = [", world!", "Hello"];
      {
          let slice: &mut [_] = &mut array[..];
          slice.swap(0, 1);
      }
      assert_eq!(array.concat(), "Hello, world!");
      

      As you can see, it is a method more general than one targeting Vecs (i.e., ptr, len, cap), since it can target any (fat) reference to a slice (i.e., ptr, len). This is definitely what you are looking for, when sorting arrays slices.


3 Likes
#3

Ok, so your proposals would do the following:

  1. with Copy types, like i32, swap() would copy one of the two values in its entirety, then copy the other one and write it to the first one, then write the temporary copy to the second one.
  2. with Drop types (i.e. with heap-allocated data) it would copy the representation which can be stored on the stack (i.e. ptr, len, cap for String) and swap these in the same way as the whole value was swapped with Copy types, but leave the heap data (i.e. the Vec<u8> of the String untouched.

?

Anyway, thanks for the clarification! I originally wanted to try to not use any library methods on the list I send in to get it sorted, but looks like this little project would need to wait until I am a bit more proficient in Rust. It seems to me I would need to use unsafe code to swap the values efficiently myself.

(I have also looked at the std modules’s functions and how exactly they do this, and it is a little more complicated than I anticipated.)

#4

All swap does is copying data in a triangular fashion

But it is, as I said, a complex / subtle pattern for Rust ownership semantics. Let’s dive into it!

First of all, this is the intuitive swapping implementation that comes to mind:

  • I call move the fact to first copy, and afterwards set the source to an “uninitialized-memory state” uninit/ empty (this is important to avoid double dropping values)

    //! Pseudo-code
    fn swap<T> (p1: &mut T, p2: &mut T)
    {
        #[move] let temp: T = *p1; // temporary copy of the first pointee elsewhere, now
        // *p1 is uninit
        #[move] *p1 = *p2; // reinit the first pointee with a copy of the second pointee, now
        // *p2 is uninit
        #[move] *p2 = temp; // reinit the second pointee with a copy of the temporary copy, which is now uninit and will thus not be dropped.
    }
    

Now, with non-unsafe Rust, it is not possible to express these semantics, since pointees cannot be in a uninit state!

  • When the data type is marked Copy, then the uninit / empty state is not needed, since we can replace moves by plain old copies. We can thus then write the dollowing intuitive code:

    use core::{mem, ops::Not};
    
    fn swap<T> (p1: &'_ mut T, p2: &'_ mut T)
    where
        T : Copy, // this line is required for the code to compile
    {
        assert!(mem::needs_drop::<T>().not());
        let temp = *p1; // temporary copy of the first pointee
        *p1 = *p2; // copy of the second pointee overwrites the first pointee
        *p2 = temp; // overwrite the second pointee with a copy of the temporary copy
    }
    

Note also that if we use an actual valid value to replace the data, we can “move out and uninit data” by replacing it with such a value; that’s what core::mem::replace is for:

use core::mem;

/// let's use Option<T> since `None` is guaranteed to be a valid Option<T>
fn swap<T> (p1: &mut Option<T>, p2: &mut Option<T>)
{
    let temp: Option<T> = mem::replace(&mut *p1, None);
    *p1 = mem::replace(&mut *p2, None);
    *p2 = temp; // move semantics make this Just Work
}

and this is such a pervasive pattern that Option offers a shortcut for it: Option::take

/// let's use Option<T> since it provides take(&mut Self) -> Self
fn swap<T> (p1: &mut Option<T>, p2: &mut Option<T>)
{
    let temp: Option<T> = p1.take();
    *p1 = p2.take();
    *p2 = temp; // move semantics make this Just Work
}

We are slightly generic with this approach, but not as much as we would like.

Solution

For full genericity we need unsafe, since it enables raw copying, and it is up to us to emulate the move semantics. There are two scenarios when wanting to do *a ← b:

  1. either *a is uninit, in which case we just have to

    ptr::write(&mut *a, b); // this consumes b
    
  2. or (the usual case) *a does have a value, in which case it has to be dropped to reach state 1.

    ptr::drop_in_place(&mut *a); // may panic btw
    // now we know *a is uninit
    ptr::write(&mut *a, b);
    

And how do we “move out” data? Easy, with ptr::read(& *a). This call evaluates to *a, and thus *a must now be considered in an uninit state (else we would double drop *a, which is UB).

Now we can swap:

fn swap<T> (p1: &'_ mut T, p2: &'_ mut T)
{
    let temp: T = ptr::read(& *p1);
    // *p1 is now uninit
    ptr::write(&mut *p1, ptr::read(& *p2));
    // *p2 is now uninit
    ptr::write(&mut *p2, temp); // temp is consumed: all is good
}

So, when you think about it, swapping two variables destroys neither one; hence Drop plays no role here (it would only play a (sad) role if the implementation were wrong).

This is what core does, except it uses copy_non_overlapping(src, dst, 1) instead of our write(dst, read(src)):

let temp = ptr::read(p1);
ptr::copy_nonoverlapping(p2, p1, 1);
ptr::write(p2, temp);
9 Likes
#5

That was some deep dive! I didn’t follow quite along with everything, but now I have a basic idea! Thanks! I will definitely revisit this when I am more proficient with Rust!

1 Like
#6

I will try to expand on a very Rust specific point, which was crucial for my explanation (why we need move semantics instead of copy ones). Let’s take the T : Copy code, and imagine we don’t have such bound:

/// Sketch code: this does not compile
fn swap<T> (p1: &'_ mut T, p2: &'_ mut T)
{
    // temporary copy of the first pointee
    let temp = *p1;
    // copy of the second pointee overwrites the first pointee, thus
    *p1 = *p2; // *p1 is dropped !!!!!
    // overwrite the second pointee with a copy of the temporary copy, thus
    *p2 = temp; // *p2 is dropped !!!!!
}

That is, in Rust, whenever you mutate a “variable” (place expression) with an assignment, that “variable” is dropped before it is overwritten.

This is why Rust has ownership-based move semantics.

If we take the above code and use owning bindings instead of &mut _ references, all is fine:

type T = ...;
let mut x: T = ...;
let mut y: T = ...;

// === time to swap === //
// x contents are "moved out" of the binding 'x', into a new binding 'temp'
let temp = x;
// the binding 'x' gets a value back: y contents, which are "moved out" of the binding 'y'
x = y; // x is not dropped because Rust knows it is uninit / its contents have been moved out
// the binding 'y' gets a value back: temp contents, which are "moved out" of the binding 'temp'
y = temp; // y is not dropped because Rust knows it is uninit / its contents have been moved out

In this case, Rust can know that x = y (and y = temp) are not mutations but “reinitializations”, and thus that x (and y) do not need to be dropped.

But Rust cannot grant this property from &mut _ references / pointers (not allowed to “move out of borrowed content”), thus it must be told to trust us and let us do all by hand with unsafe (but sound) ptr::reads and ptr::writes.

1 Like
#7

Interesting, thanks!