Why is Vec implemented un-safely?

As a new comer to any language, I try to implement my own versions of basic data structures. Trees, vectors, linked-lists, etc. Then, after beating my head against a wall enough, I accept my inferiority and see how the language gurus do it. Sometimes the answer is that I just didn't know about a feature. Other times it's because there is a much more idiomatic way of doing the same thing (e.g. Writing an array structure in Haskell, for example, instead of using the basic cons list).

When I tried to write my own vector type structure, I ultimately failed. I'd ask other Rust developers how to allocate slices or make mutable copies of immutable slices. In the end, they'd end up saying things like, "Are you trying to avoid using Vec for some reason?"

So, I decided to jump into the Vec source code. Here it is for convenience:

https://doc.rust-lang.org/stable/src/collections/vec.rs.html#142-146

Vec, as you can see, uses unsafe memory manipulation. Making the assumption that the language authors intended to do it this way and this isn't just crusty Rust 0.1 code (correct me if I'm wrong, please), it raises a very important question for me:

Why is it impossible for a fundamental data structure to be implemented safely, when one of the core claims of the language is that it is fundamentally safe? Am I to walk away with the lesson that it's appropriate and idiomatic to Rust to develop unsafe code when building data structures?

To me, this would be analogous to someone saying Linux is secure because all the controls on the system calls are really tightly locked down, but then finding out that cat needs a special kernel driver to work properly.

Can someone help set me straight here? What am I missing?

2 Likes

The point of rust isn't to remove unsafe code entirely - it's to limit and control it. By only doing unsafe things in controlled places, you can limit what code could cause memory safety errors, and continue writing completely safe code elsewhere.

What you are doing in an unsafe block is asking the compiler to trust you, and then everywhere else it will make sure. The compiler can't check that everything is safe - there are some operations which just by their nature require manual control.

In rust, if you do need to use unsafe, you typically do it in a very limited amount of code, and then you only have to scrutinize that specific code for memory errors, and then build up a completely safe API on top of it. Typically in end-programs, you will never need unsafe code - you will just use libraries which are widely-used and scrutinized, and they will guarantee that the unsafe code functions correctly.

When building your own data structures, yes, it is appropriate and idiomatic to use unsafe code when needed. The important part that you have to make sure is that your unsafe code will handle all edge cases for parameters - and that you only use it for the very internals. Once you've built a trusted unsafe base, you can build up a safe API on top of it.

I would highly suggest reading `unsafe` if you haven't done so already.

6 Likes

To build on what @daboross said, a common theme in rust is abstraction. Vectors, iterators, doubly linked lists, etc. all use unsafe blocks because they are core building blocks for other data structures and couldn't do be built without unsafe*. However, now that you have these core building blocks, you can often build your own data structures without writing any unsafe code of your own.

*Rust also uses unsafe when it doesn't strictly need to to optimize hotpaths.

4 Likes

Thanks for the reply.

I agree with you completely and see no problem with corner cases such as FFI or, say, programming a DMA controller in a kernel. These areas necessitate unsafe code because the semantics of something outside the control of the compiler require the developer to outright break the memory safety to do it. There's no way around this kind of problem unless you bake in every single use case imaginable into the language semantics, which is of course intractable.

Also, I understand the need for optimization of hotpaths in code as @stebalien mentioned. One might write pure Rust code, determine that the orders of magnitude in savings is worth the risk and then re-write with unsafe code. After all, that's what they do in things like fftw3 and libgmp. (OK, so there's no safety in those libraries, but they're dropping to assembler when they can take advantage of SIMD.)

This is where I get lost. Why would unsafe code ever be needed in a purely Rust-implemented data structure, let alone be idomatic? Memory safety is Rust's bread and butter. Why would we eschew that in data structures, which are the main consumers of memory?

Let's look at Java. Java, with all its shortcomings, was a boon to the industry because it introduced a really neat type system and memory safety. Can you imagine how awful the Java community would be today if Sun said, "Yeah, we have no way of implementing ArrayLists in pure Java because the type system is too restrictive in many places. Therefore, we're going to say it's idiomatic Java to implement data structures with JNI--but only when needed"? I shudder when thinking about what reflection and Java beans would have become with JNI intermingled there. Or, worse, native void doGet()!

I'm OK with the answer "it's a well-calculated performance-safety trade off that we've weighed carefully for use within our standard library." What I'd like to see is a pure solution so I can see whether I agree that your trade off makes sense. However, I'm frustrated if Vec is impossible to implement in the memory safety system altogether.

1 Like

If nothing else, Vec<T> implementation needs to cast a pointer to a raw block of allocated memory to "array of T". Rust compiler cannot check this transformation, so you have to say "just trust me" by way of unsafe.

This sort of thing happens in most (all?) programming languages.
Look, for example, at the C++ standard library. You will find there the most convoluted and un-idiomatic code there is this side of boost. However STL exposes nice idiomatic interface to the users and that's what matters.

The reason that it's idiomatic to use unsafe is really just that it is practically impossible for a compiler to check if you are using raw pointers correctly. Not all data structures can be represented in safe code, just due to the nature of what rust guarantees with "safety".

We could try to build a compiler which would allow this, but it would be extremely hard to guarantee that you are using memory safely when creating structures like Vec or LinkedList without creating some kind of runtime overhead.

For the analogy with Java, unsafe is a feature in rust, not a completely separate language. unsafe is there to be used! It's not like when you're using unsafe it's completely non-rust code. Literally the only difference between safe code and unsafe code is that you are allowed to access raw pointers in unsafe code (unsafe code can also access other unsafe code, but most of the time it just boils down to raw pointers).

In short: it's impossible for the compiler to know every way you can do things, and unsafe is for those cases where the memory safety compiler checks just can't figure it out.

1 Like

Yeah, we have no way of implementing ArrayLists in pure Java because the type system is too restrictive in many places.

I don't think this is equivalent - it is possible to implement a 'list' interface backed by a vec in safe rust code.

To me, the equivalent would be asking you to implement the primitive array in pure Java. I suspect you'll find this isn't possible without trickery - you can't request contiguous blocks of memory in Java, so you simply can't get the O(1) performance one would expect from an array.

What safe languages are you aware of that give array support in a (safe) library?

3 Likes

ArrayList in Java is a bit of a special case because (IIRC) it's implemented using arrays, which are a built-in, unalterable feature. You can't implement arrays themselves efficiently in pure Java; you could only simulate them with some tree structure. In contrast, since Rust allows more down-to-the-metal operations within unsafe blocks, you can implement arrays in pure (but unsafe) Rust, and that's basically what Vec does, except that it integrates the auto-resizing behavior of ArrayList rather than putting it in a separate wrapper. Any structure that can be implemented efficiently in pure Java using arrays can be implemented in safe Rust using Vec. However, it may be desirable to use unsafe code as a performance optimization.

In the case of non-array-based data structures - especially the many tree- or graph-like structures that can be implemented reasonably idiomatically in safe Java - Java has a big advantage in having a garbage collector. Depending on the structure, it may be possible to implement it in safe Rust using Box or Rc, and it's definitely possible if you use ugly hacks like heap-simulation-via-vecs, or just adopt a real garbage collection library for Rust once one gets written (but as a library writer that requires you to put a big burden on users - GC will never be idiomatic in Rust). Java's pervasive garbage collection doesn't require you to weigh different strategies here, and the amount of tuning put into it makes it likely more efficient than naive use in Rust of Rc, garbage collectors, and possibly Vec-heaps (if you can implement a structure with Box, though, it's probably optimal anyway), but compared to manual allocation and deallocation using unsafe code, GC typically has unavoidable memory and time overhead. Rust's type system has some almost unprecedented features to allow you to safely do things that in other languages would require additional overhead or GC to be safe, but for low-level data structures it's not expressive enough. There are some languages that can truly safety check almost any low-level operation (ATS), but only via extremely complicated manual proofs, relegating those languages to academia for now. In lieu of that, Rust values performance, and takes the philosophy that hiding small amounts of carefully written unsafe code behind safe interfaces will still prevent almost all memory safety vulnerabilities in practice.

(edited for more detail)

6 Likes

@comex, I think you basically summed it all up, thank you. Thanks everyone who contributed above, too.

Here's what I concluded. Vec is ultimately a primitive in the sense that it cannot be implemented safely otherwise. This would be similar to Java's primitive arrays, not with one of the higher level collections.

My confusion has come from three points.

First, I expected such primitive structures to be part of the language syntax, much like [T;N] is. I suppose it would just have been syntactic sugar to turn Array<T> into something with more of an intrinsic feeling syntax. I guess that's just me projecting prejudices I've learned from other languages.

Second, slices and Vec share an enormous amount of interface, but slices are simpler. Therefore, I presumed that Vec should have been implementable with something more primitive that's safe and related to slices. Actually, at this moment, I'm still unsure why we even have slices when Vec seems to handle all slice's interface plus allocation, but maybe that's a separate question.

Third, maybe it would have been more appropriate to split Vec into two structures. Something like an Array structure that just does allocation of run-time sized arrays, and then Vec which uses Array to safely implement growable and shrinkable arrays (and iterators, stacks, etc.).

I don't think I'm completely satisfied, but I understand the philosophy behind it thanks to you all. My issue now is more about whether I think the safety claims are actually going to work out in practice making fighting to learn the borrow checker worth the time. Toward that,

I sure hope that works out. I want it to work out. Rust is cool. I fear the day when the corpus of Rust code is so large that no one person can fit into their mind at once all the unsafe code that they have to rely on. Calling unsafe idomatic hurries that day along toward us.

For the record: slices exist because a Vec always represents a whole allocation, while a slice can point to part of a Vec (or a fixed-length array or other structure), hence the name, and is never its own allocation. Having them as separate types allows the slice abstraction to be zero-cost - slices don't have to have a destructor that checks whether they're owned or not, decreases a reference count, etc.

1 Like

Slices borrow memory, but don't own it. These two things are fundamentally different.

1 Like

Another way of looking at unsafe blocks is they are not really "unsafe". They are actually safe, but their safety is guaranteed by programmer, not a compiler.

5 Likes

Here's what I concluded. Vec is ultimately a primitive in
the sense that it cannot be implemented safely otherwise. This would be
similar to Java's primitive arrays, not with one of the higher level
collections.

I wish I had seen this message earlier. I think you've gotten some really off-target advice in this thread, and all the preaching about "unsafe" completely misses the point: You can use safe public interfaces in the library to get a lower (than Vec) level array of data, but they just didn't make it very obvious.

What you want is a "slice" with a "Box" to own it:

    use std::boxed::Box;

    fn alloc<T: Clone + Default>(size: usize) -> Box<[T]> {
        return vec![Default::default(); size].into_boxed_slice();
    }

    fn main() {
        let mut a: Box<[u8]> = alloc(20);
        a[9] = 123;
        print!("got: {} {:?}\n", a.len(), a[19]);
    }

This is basically tricking the Vec type to allocate the slice and then disregarding the Vec it came from. The boxed slice knows it's length and performs array bounds checking. Nothing unsafe about it as far as I can see, and I believe the borrow checker will keep tabs on the dynamically allocated memory and free it when appropriate.

If they did expose a direct way to allocate a boxed slice, then Vec itself could've been implemented in safe code. This is something I think the Rust library creators should consider - both to minimize their own exposure to unsafe code (what's good for the goose), and because it would make a great model for how end users should implement any needed containers which are not in the std library (Matrices, MultiArrays, ???).

1 Like

And how are you going to allocate it? You will need to call allocator, that is operate on raw pointers, which is unsafe.

In this case Vec allocates for you. If "direct interface" to allocate such slice as you describe existed, it would be implemented with unsafe code. Either way you end up with some unsafe piece of code buried into some nice API.

You just can't make an omlet without breaking some eggs.

1 Like

It can't be implemented in unsafe code even with a direct way to create a slice (see below for more discussion on this), since a Box<[T]> assumes that all elements are valid instance of type T, which is not at all guaranteed for a Vec: it has slop space to be able to efficiently fill in more elements, and these are in various states of invalidity. This means that one can't store Box<[T]> and a usize indicating how many of those are "real", because Rust has no universal "safe" default object state (like null in Java). And, we can't require that all objects implement Default.

Anyway, as @kstep says, creating a Box<[T]> would either require unsafe code itself, or would require making allocating Box<[T]>s a language feature. Rust baking "opinionated" features like this (allocation) into the language would require a very strong reason, and this doesn't seem to be one of them (it probably wouldn't remove that much unsafe code by LoC, and is effectively just shifting the unsafety into the compiler's code-generation).

2 Likes

And how are you going to allocate it? You will need to call allocator, that is operate on raw pointers, which is unsafe.

In this case Vec allocates for you. If "direct interface" to allocate
such slice as you describe existed, it would be implemented with unsafe
code. Either way you end up with some unsafe piece of code buried into
some nice API.

So your logic is, "since you need some unsafe code to allocate a basic array, you might as well have unsafe code in every place that needs a basic array"? Having unsafe code buried in a nice API is exactly what you want.

Seems like we don't understand each other. Unsafe impl wrapped into safe API — is one of main Rust motto. What I wanted to say, there's always some unsafety under the hood. Now this unsafety is hidden in Vec, you propose to hide it under Box<[T]>. It doesn't matter what API is, there's always some unsafe allocation somewhere.

Where have you seen I told you need unsafe code for any basic array? I haven't said it in any way.
"Basic arrays" are implemented with unsafe code, but I never meant you need unsafe code to use them.

It can't be implemented in unsafe code even with a direct way to create a
slice (see below for more discussion on this), since a Box<[T]> assumes that all elements are valid instance of type T, which is not at all guaranteed for a Vec: it has slop space to be able to efficiently fill in more elements, and these are in various states of invalidity.

"Can't" is a strong word, but I understand your point about not letting null pointers creep into the language. The only counter argument I'll make is that you you could have a low level array which hides nulls by returning Option<> for reference types. Then Vec and friends could be built safely from that. Maybe using Option like this would be a performance problem, I dunno.

Anyway, as @kstep says, creating a Box<[T]> would either require unsafe code itself, or would require making allocating Box<[T]>s a language feature. Rust baking "opinionated" features like this (allocation) into the language would require a very strong reason, and this doesn't seem to be one of them (it probably wouldn't remove that much unsafe code by LoC, and is effectively just shifting the unsafety into the compiler's code-generation).

I already can create a Box<[T]>, so call it a language feature or not, that's what I'd use if I was building a data structure and didn't want Vec's unnecessary baggage added to the mix. The opinionated part I worry about is this notion that end users don't need that kind of ability unless they're willing to use unsafe blocks in their own code.

This worry is the reason for my original question. As an end user, I want an array-like object (similar to Java's primitive arrays). With that array, I want to not have to worry about writing unsafe code to build more complex data structures. As you describe, Box<[T]> fits that intuition nicely. Since Vec seems like a complex data structure (e.g. it expands and shrinks), I expected that Vec would be made with a Java-like primitive array. Of course, Vec doesn't use anything like this.

As a user, I want the basic language to be powerful enough that I can implement any algorithm or data structure I want without having to use unsafe code. My question was two fold: Does Rust have this property? If so how does that work for basic Computer Science 101 class data structures such as an expandable and shrinkable array list?

At this point, the answer seems to be: Yes, Rust has this property, and I use Vec--I just get over the "baggage" and move on. So, if in the future I want to write a rope or a matrix or any other host of different data structures, I need to either use Vec or roll my own unsafe code.

The Vec is exactly this:

basic Computer Science 101 class data structures such as an expandable and shrinkable array list