What are intrusive linked lists and does Rust really struggle with them?

akrauze · September 22, 2024, 8:38pm

Hi. From time to time I've heard term "intrusive linked list". I also remember that I've read claims that they are particularly (more than standard ones) difficult to implement in Rust. I've found a nice article Intrusive linked lists, to understand what they are, and maybe why their implementation would be difficult. This article is really put well, however after reading it I have more questions than before.

I think that the main point of this article is that in C you can define a linked list either like this:

struct list_node {
    struct list_node *next;
    struct list_node *prev;
    char *data; // type erased linked list value
};

Where data and node are allocated separately, or:

struct list_node {
    struct list_node *next;
    struct list_node *prev;
    struct Data data; // concrete Data type
};

Since C does not support generics the former definition is easier to use, since one can erase type of data using another layer of indirection. And the latter one does not result in memory fragmentation, but requires defining a unique node type for each "generic instance" of a list.

However an intrusive linked list is defined as:

struct list {
    struct list *next;
    struct list *prev;
};

Data stores it inline:

typedef struct item {
  int val;
  struct list items;
} item;

And later one can use pointer offsets to go from pointer to node, to pointer to data.

What I don't understand is what is the difference between intrusive linked list (as defined above) and a linked list, which defines node which stores data inline. It's as if node and data were two boxes, one containing the other. It doesn't matter witch one is the outer, and which one is the inner one. Outer box will always occupy the same amount of space, and getting from one to the other can always be defined as some pointer offset. Is the real usefulness of intrusive linked lists only that they allow C code to have a "generic" implementation of list algorithms?

Does any of this changes when implementation is written in Rust, and not in C? Is there anything more than standard headache when it comes to this variant of lists? I've found quite popular intrusive_collections crate, so there must be something that intrusive lists provide, that cannot be done using std::collections::LinkedList, is there?

erelde · September 22, 2024, 8:53pm

There two things in your question:

Intrusive data structures, in short they are the poor man's generics. We, mostly, don't need them, we have generics.
Linked list and co are hard to implement in Rust, but not impossible, because they have intrinsic semantics hard to reconcile with the borrow checker.

the8472 · September 22, 2024, 8:57pm

github.com/rust-lang/unsafe-code-guidelines

Storing an object as &Header, but reading the data past the end of the header

opened 04:26AM - 11 Nov 20 UTC

thomcc

C-open-question A-aliasing-model A-SB-vs-TB

This is related to https://github.com/rust-lang/unsafe-code-guidelines/issues/2 …but the read is not out of bounds of the allocation, not being written to by other threads, not the bytes of a `&mut Blah`, etc. That is to say, really the code is trying to model a dynamically sized type, that for one reason or another does not support (Note that ther are a number of custom DST proposals). So, I heard that it was UB for you to have a &T and read outside the bounds of that T, even if conceptually it's a totally in-bounds read. E.g. `T` here may be a ZST, or it may be a header after which a trailing array is expected, or standing that sits at the head of a trailing array, or it may be a struct that's the common shared fields of some set of other struct... These are pretty common in unsafe code as it's a pattern which is both legal and useful in C and C++. It's pretty common in Rust too: - It's not unheard of in C apis to use a `#[repr(C)] struct Foo { _priv: [u8; 0] }`, as this is what bindgen uses. Some of these APIs then go on use `&Foo` in the Rust code. (This is essentially a workaround for a lack of a stable `extern Type`). This code doesn't read the data, so the only issue would be if we told LLVM it could assume things about the pointer that turn out to be untrue in a situation like cross-lang LTO, probably. - Similarly, I've seen other FFI code that used a `struct CStr([u8; 0])` for a similar purpose — as a version of `std::ffi::CStr` that you can actually pass to C directly. (I even almost did this for [ffi_support::FfiStr](https://docs.rs/ffi-support/0.4.2/ffi_support/struct.FfiStr.html), but went with a pointer inside so I could easily check for code passing in null). - `bitvec` has a `BitSlice` type which acts a lot like a slice that magically has bit-level indexing. Internally it's something like `struct BitSlice { _mem: [()] }` which lets it behave like an unsized type, The "pointer" and length are both specially encoded values that contain both the actual pointer/length as well as bit-level offsets for tracking where withing byte things are. There are a lot of reasons this might be illegal, but I had not thought `mem::size_of_val` returning the wrong value was the actual one. - `anyhow::Error` internally wraps a `Box<ErrorImpl<()>>`, where `ErrorImpl<T>` contains a vtable, a backtrace, and then the `T`. `ErrorImpl<()>` is used as it behaves as the "common header" for real ErrorImpl values. On construction, `Box<ErrorImpl<T>>` is converted to `Box<ErrorImpl<()>>`, when stored in the Error. Whenever a method is called that needs to delegate to the vtable, the `Box<ErrorImpl<()>>` is converted into the right pointer type for the vtable function (one of `&ErrorImpl<()>`, `&mut ErrorImpl<()>`, `Box<ErrorImpl<()>>`) which is called with that pointer. The first thing the vtable function generally does is convert the reference to e.g. `&ErrorImpl<T>`, example: https://github.com/dtolnay/anyhow/blob/99c982128458fecb8d1d7aff9478dd77dac0ee3b/src/error.rs#L538-L545. (I had always kind of thought it wasn't okay to use `Box<T>` here, but I'm surprised that stuff like `&ErrorImpl<()>` to `&ErrorImpl<RealType>` isn't okay either). - `wio-rs` contains `VariableSizedBox` which provides this pattern in a library form, and IIUC is mostly intended for the flexible-array-member case. The API attempts to launder pointers to the object, which is... very non-obvious. It seems like it plausibly avoids the issue here, though, but it's insanely subtle, and if this is the recommended pattern, I suspect it will need a very good nomicon entry. https://github.com/retep998/wio-rs/blob/9bf021178b2d02485f1bd35e6cff41bf52d4a9a2/src/vsb.rs#L98-L113 - I do [something similar](https://github.com/thomcc/arcstr/blob/main/src/arc_str.rs#L725-L728) in `arcstr`, where there's a header and a variable length segment that trails it. I avoided issues here by luck, as I took great care to avoid ever putting the inner type behind a reference. This was lucky since I wasn't aware of this at all, and did it for other reasons. This was painful as it required field hard-coding offsets. - This isn't to say anything of the numerous C or C++ apis which expose polymorphism in this way — In c++ this is how single non-virtual inheritance is represented, so it's especially common, although it was common in C too. Additionally, C code with a flexible array member is in tons of places, and not just windows APIs. This is just a few off the top — there's a lot of unsafe code that does this. Personally, I had thought it was allowed so long as you don't go past the actual bounds of the allocation, it makes *some* sense that it's not though, unfortunately. (Somehow, I don't think I've ever had miri trouble me about it, but it's seeming like it's just because of luck && coincidence more than anything else). Anyway, I think if this is UB we should start being way more vocal about it, because it's a totally legal pattern in C and C++, and common.

jumpnbrownweasel · September 22, 2024, 9:01pm

Hm, maybe I'm wrong. I have always thought intrusive linked lists were used when the list element must be allocated separately, for whatever reason, and the links are placed in that element's data type.

This avoids another data structure with additional allocations and overhead. There is less indirection to traverse the list.
Perhaps more importantly, it allows putting elements into a linked list when those elements are part of some other data structure, such as a hashmap. I believe this is how linked hashmaps work.

I don't know of any reason that intrusive linked lists would be more or less difficult in Rust than extrusive linked lists.

akrauze · September 22, 2024, 9:02pm

Ad 2. I understand that. I've even linked Learn Rust With Entirely Too Many Linked Lists.

Ad 1. This is my real question. If they are really that, then why would they be defficult to do in Rust. Is it caused by uncertainty around provenence (alluded by @the8472)?

akrauze · September 22, 2024, 9:07pm

Wouldn't that add space overhead of next/prev pointers to each element? Or is this tradeoff that such data structure makes? And how would it be different from storing Node<T> instead of T with adjacent data, other that conceptually being more straightforward?

jumpnbrownweasel · September 22, 2024, 9:14pm

Yes, it is called intrusive because the links have to be defined when the element's data type is defined. For example, if you want the elements in two linked lists, you need to add two sets of link fields! However, the total overhead is the same for intrusive or extrusive -- the size of the links is the same in either case.

In an extrusive linked list (as in Rust's std linked list), the items (the wrapped elements along with the link fields) are allocated in a single array (a Vec for example), or perhaps in chunks. In the intrusive linked list, normally the elements are each allocated separately as I mentioned, or as part of some other data structure.

afetisov · September 23, 2024, 1:13am

You can't implement shared ownership between multiple linked lists via Node<T>. If you have this kind of node wrapper, when would you drop the node? When it's removed from the list, the only reasonable answer. But if the node is supposed to be still present in a different list, how do you handle it? That's where the fundamental clash between Rust and intrusive linked lists happens: Rust wants unique ownership, while intrusive linked lists are fundamentally about shared ownership.

On its own, it's not a big deal. After all, ordinary (double) linked lists are also a form of shared ownership, as well as many other memory management primitives. This just means you need to implement it via unsafe code. But intrusive linked lists can include an object in an arbitrary number of lists, while you can't write code polymorphic over the number of fields. This means the user needs to write unsafe code, or use a macro.

There may also be issues with mutation. An object placed in two intrusive linked lists can be accessed by both lists simultaneously. Mutating the data requires some synchronization, but removing the node from different collections, or adding it to them, can be safely performed concurrently, since they use different links. I think this kind of constraint may be difficult (impossible?) to express via a safe API.

Overall, I don't think Rust handles intrusive linked lists worse than C, it's just not as good as we usually expect. And personally I don't think intrusive linked lists are common outside of certain niche C code. Linked lists are already niche collections, multiple-ownership intrusive linked lists are even more so.

Unfortunately, this niche includes the Linux kernel, where intrusive linked lists are commonly used to manage resources.

riking · September 23, 2024, 4:26am

Take a look at the ListLinks type in the Rust-for-Linux documentation.

akrauze · September 23, 2024, 8:21am

Do you mean this? Or do you mean some other documentation?

akrauze · September 23, 2024, 8:41am

This is very interesting. I recall that intrusive linked lists were indeed mentioned in the context of many lists sharing the same nodes. However, now when I think about it, I cannot understand how it is possible. Say we have two circular double linked lists A & B, both having n elements. a_i points to a_(i - 1) and a_(i + 1). Even if b_i points to a_i as its successor, a_i already uses both of its pointers, so it cannot point to b_(i + 1) as "another" successor.

Could you point to me where I make deduction error? Because it feels like sharing nodes between lists makes sense, but I cannot make it click.

2e71828 · September 23, 2024, 8:56am

In this scenario, each node will contain 4 pointers, two for its position in list 𝒜 and another two for its position in list ℬ.

afetisov · September 23, 2024, 8:56am

You just include 2 pairs of pointers: one for a and one for b.

2e71828 · September 23, 2024, 9:00am

It also includes the implementation of some better-behaved collection types, like B+ Trees, which link the leaf pages together for more efficient iteration.

akrauze · September 23, 2024, 9:01am

Ok. That makes sense. So we must statically know in how many lists given node should be in. Still, how does this differ from something like this?

struct DoubleNode<T> {
    first_next: *mut DoubleNode,
    first_priv: *mut DoubleNode,
    second_next: *mut DoubleNode,
    second_priv: *mut DoubleNode,
    data: T,
}

2e71828 · September 23, 2024, 9:16am

Right. What makes it intrusive is that the payload type is aware of and cooperates with the list implementation instead of being an encapsulated passenger— The overall collection and the individual items can be entangled together in whatever arbitrary ways suit the problem domain.

afetisov · September 23, 2024, 9:36am

That's pretty much how it works, except that a pair of list pointers would be its own type.

I don't think there is some deep reason why "Rust is bad with intrusive linked lists". Rust is bad with linked lists, period. That's well known. Intrusive linked lists just crank up the difficulty an extra level: now there are multiple lists with shared ownership of data, and there is no longer a clean separation between the data and the collection.

Right, but in that case the linked lists are an implementation detail, not part of a public API, so I'd argue that their complexities don't matter that much. It's just part of the implementation of an (internally unsafe) collection. Also those inner lists would probably be non-owning, which significantly simplifies the issue.

akrauze · September 23, 2024, 10:58am

Thanks. That makes sense on a conceptual level, but I still struggle to understand usefulness of this approach. Could you point me to some implementation, where payload indeed uses this information? I've found and linked earlier kernel crate and its List type. Would this be example of what I am looking for?

ShangBrol · September 23, 2024, 11:00am

Are they doing it only for better cache locality (the data is stored along the next/prev pointers) or do they also use the feature, that one item can be in multiple lists?

alice · September 23, 2024, 11:43am

That's an extremely old and incorrect the kernel List implementation. Please see this link for the latest version:

Topic		Replies	Views
C-like List implementation in rust help	19	2202	January 12, 2023
When is a linked list not a linked list?	98	4777	February 20, 2021
Intrusive linked list of different generic types which implement the same trait	9	1001	June 16, 2023
What's everyone working on this week (8/2016)?	19	4258	January 12, 2023
Offsetof a field of a "C" struct (to implement an intrusive linked list) help	28	1703	May 20, 2021

What are intrusive linked lists and does Rust really struggle with them?

Related topics