An Introduction to Programming using ECS & EBP, in Rust

This started as a problem with extending an existing OOP application. The domain model didn't fit the problem any longer. A serious refactoring was required. It bothered me.

Why did I write OOP in the first place? Why did I spend so much time migrating CRUD? All that indirection wasn't free. Where did I go wrong?

It is very hard to go up against mainstream ideas. Every textbook is OOP + CRUD, and so every coding agent wants to write OOP + CRUD. But the pointer chasing fights raw performance, and the rigidity of the domain model makes development slow.

After twenty years I've had enough. I committed to Entity-Component-Systems (ECS) and Existence-Based Processing (EBP). This book is the lessons I learned.

The shape is tables and the systems that walk them: data laid out the way the machine reads it, and a new feature added as a column rather than a refactor.

Read online

Source

Forty-four sections. Every performance claim is backed by a measurement you can run on your own laptop in under a minute. Work in progress. No paywall.


Summary

The book teaches programming from first principles of data-oriented design: entity-component-systems (ECS) and existence-based processing (EBP). Rust is the only language and it assumes no prior programming experience - but it is not a "learn Rust" book. Rust is the vehicle; the subject is laying data out so programs stay fast and stay easy to change as they grow.

The spine is one ecosystem simulator that grows from a hundred wandering creatures to a hundred million streamed ones. Each part moves up one order of magnitude in problem size, and each step forces the next technique - you feel the array-of-structures layout hurt before you are handed structure-of-arrays, you feel the cost of a per-entity conditional before existence-based processing removes it.

Two things set the book apart from most intro material. It is measured, not opinionated: dependencies are priced, not banned (write the from-scratch version, read the crate, then decide), and every performance claim ships with a benchmark you run on your own hardware. There is a cross-machine table - Pi 4, an old i7, a 2015 NUC, a modern Ryzen - so the numbers are yours, not mine. And concepts are named only after they are built: two to three pages of prose, then four to twelve compounding exercises; the vocabulary is earned through working code, never front-loaded.

Is it worth the effort? That was the first question on the Python edition's thread, so let me answer it here, ranked by what actually pays off:

  1. Complexity, first and largest. A new feature is an addition, not a restructuring - a new column, or a new presence table plus the system that drives it, bolted on without touching what already works. The cost is front-loaded into learning the structure-of-arrays mindset; after that the marginal cost per feature drops instead of compounding. This is the win I came for, and it is structural rather than a tuning trick.
  2. Footprint at scale. Past a million rows, memory is the limit. A structure-of-arrays layout drops the padding an array-of-structures drags through cache; in the book's ยง7 benchmark the Pi 4 - no L3, a tight memory channel - pays 5.7ร— for that waste, while a modern desktop with generous L3 mutes it to about 2ร—. Same principle, slope set by the cache budget.
  3. Performance, conditional. It holds as long as your systems stay array-shaped. EBP makes idle behaviour genuinely free: a million-creature world with zero hungry creatures spends zero cycles in the hunger system - work is proportional to active rows, not to population times behaviours, which is one or two orders of magnitude cheaper than a flag-based design for a sparsely active world.

Call for feedback
It is a work in progress (first complete draft). What I would most value from this forum is a close read - where the Rust is non-idiomatic, where a measurement looks wrong on your hardware, where the teaching skips a step.

300 pages -- might be a great work.

I had a short random look, took me two minutes to find the first possible minor issue:

Rust gives you several integer widths: u8 (one byte, range 0..256),

Sorry, :slight_smile: OK, in Rust notation it is correct, as you did not wrote 0..=256, but still for a text book I would prefer common 0..255. Perhaps we could write "with values ranging from zero up to 255 (inclusive)."

Another point, which irritated me a few seconds before, is

A cache line is 64 bytes.

Is that really true for all architectures? I would bet that we could find at least one exotic one with a different size. I am always a bit skeptical for such absolute statements.

Thanks Stefan - you have keen eyes. The few seconds of irritation are fair; an unqualified "a cache line is 64 bytes," in a book about precision, deserves it.

Two things came out of it. The [0, 256) range notation is swept and gone, and ยง2 no longer states 64 as a constant. ยง33 (false sharing) now carries a table of cache-line sizes by architecture, marked by which the book has actually measured - x86-64 and a Cortex-A Raspberry Pi - versus which it cites from documentation: 128 B on Apple Silicon and POWER, 256 B on s390x, and so on. If you or any reader has one of the unmeasured machines and runs the suite, the numbers go in, with credit.

The line the book keeps: memory moves in fixed-size lines, and that granularity is the budget. The value is the anchor; the granularity is the lesson.

It is interesting that you published a Python version simultaneously. I don't know too much about Python, but I have some doubts about

The minimum size, even for 0, is 28 bytes. As the value grows past one digit, the object grows by four bytes per additional digit.

A grow by each digit would suggest a number storage format as string, which I doubt strongly. I would more guess more a storage increase when the four-, 8-byte boundaries are reached. Well, I am not a native English speaker --at least I have some trouble understanding it correctly.

Might I ask if you are a native speaker? Your Rust text is not written too fluent -- more a style typical for a presentation with slides.

Agreed. And the table just below that line already contradicts it - 1_000 shows 28 bytes, so it isn't four bytes per decimal digit. CPython stores an int as 30-bit limbs, four bytes each, and the size steps up one limb at a time: 28 bytes covers everything below 2ยณโฐ (~10โน), then +4 bytes per limb, roughly every ninth decimal digit. No decimal/string storage. Fixed in ยง2.

Thank you.

in exercise 5 "pointer chasing" of ยง1, the book states it should be running at RAM speed, yet a sidebar also notes, the nodes are allocated at consecutive addresses in the heap so the pointer chainsg looks free.

this is the dilemma for many educational materials in modern technology: in order to explain the reason why it is better, you need to demonstrate why it was worse, but you actually need to intentionally make extra efforts to do the "worse" things, thanks to improvements in technology.

and indeed, on my laptop with a zen3 cpu, the per element measurement averages around 2.3ns for a list with 1_000_000 elements. for reference, the sequential baseline is roughtly 0.35ns to 0.4ns.

one might think this is because the entire heap of allocated nodes fits in the L3 cache (I have 16MiB), so I increased the element count to 100 million, but the average measurement actually went down, consistently under 2ns now. supprise! and good job, prefetcher!

so the RAM speed can only be observable when the nodes are shuffled, but the code in the solution didn't do this.

since it's not easy to shuffle the nodes of a linked list after the it is already constructed, I propose a two step list construction process (pseudo code):

// step 1: allocate all nodes, save them in an array (vec)
let mut nodes: Vec<Box<Node>> = (0..N).map(|_| Box::new(Node::empty())).collect();

// step 2: shuffle the nodes in random order and link them to the list
let mut head = None;
while !nodes.is_empty() {
    let i = random() % nodes.len();
    let mut node = nodes.swap_remove(i);
    node.next = head;
    head = Some(node);
}

a side note:

since the Node is a recursive type, the default drop glue (a.k.a destructor) could overflow the stack for long lists. although this doesn't affect the result for the experiment, since it happens at scope exit, after the measurement is done. but for readers who are new to rust, it might be confusing to see a stackoverflow message when they try to run the code.

I think it worth adding a note to the example code in the solution document. this is unnecessary for the book though: if a reader can do the exercise themselves, they already know rust well.

so the RAM speed can only be observable when the nodes are shuffled, but the code in the solution didn't do this.

Good point, and thanks for the careful read. The published solution built the list in a tight for loop, so the allocator handed back the boxes at consecutive addresses and the chase walked straight up contiguous cache lines. The reference benchmark (code/measurement/src/bin/pointer_chase.rs) already did the allocate-all-then-shuffle construction, but the inline solution in the chapter had drifted to the naive build and was quoting a number it couldn't reproduce.
Fixed now: It shows the sequential build as a labelled anti-pattern, then the two-step shuffled version, so the chain visits the nodes in scrambled order and each next is a real DRAM round-trip.

Your prefetcher numbers match what I see. Without the shuffle the ratio collapses toward 1x, which is the sub-2ns/elem I believe you measured at 100M.
Your second point is in there too: The default recursive Drop of a long Box chain overflows the stack at scope exit even when build and sum are iterative, so the solution now names all three stack-overflow traps and tears the list down with an iterative drop_list. It also prompted a re-measure across the boxes I have here (Pi 4, a couple of old Intel laptops, a Ryzen); the shuffled ratio runs ~60-270x depending on the chip, which is the order of magnitude the chapter claims.

Thanks @nerditation

another feedback, this one is a pitpick: I find this sentence in ยง2 troubling:

If your loop touches one element per cache line, the u64 version makes 8ร— as many memory loads as the u8 version.

this is talking about cache efficiency and wasted bus bandwidth, and I almost skimmed over it, but then I read it again, "if your loop touches one element per cache line", hmmmm, that's a strange access pattern, isn't it?

if I would translate this sentence to code, it probably looks like this:

let xs: Vec<u8> = todo!();
let ys: Vec<u64> = todo!();

let mut i = 0;
while i < xs.len() {
    let x = xs[i];
    do_something(x);
    i += 64 / size_of::<u8>();
}

let mut i = 0;
while i < ys.len() {
    let y = ys[i];
    do_something(x);
    i += 64 / size_of::<u64>();
}

tbh, I don't think this is a good example to explain the concept, but I don't know what wording is better either. I just want to point it out for discussion.

I'm now at ยง8, and looking at exercise 5, I get the point you are making, but I believe the performance claim is wrong for this particular example.

I ran the example code on my laptop, with 100K and 1M elements (the reference solution code is 100K), the indexed loop version is consistently faster than the iterator version.

when I increased the number of entries to 10M, the iterator version is faster most of the time, but the variance can flip the result occasionally. only when the number is 100M, I get the claimed result consistently.

while I agree with the performance explanation in principle, this is not a good example to demonstrate the concept of this chapter.

first of all, the indexed loop is too simple to make a significant difference.

the main reason is the .collect() call and the intermediate Vec<bool>. although there's only a single allocation thanks to the iterator's size hint, heap allocation already is expensive. the benefit of the aggresive vectorization enabled by it can only offset the allocator overhead when the dataset is very large.

so it's the dilemma of performance-aware programming education again: the performance gain is real, but it's hard to make it obvious with simplified examples.

Thanks for your carefully investigations. I had similar doubts -- a decade ago I played with Robin_hood and Hopscotch hashing, and read a few tutorials about the benefits of cache use and localized memory access. But my experience was inconsistent and not too convincing. But that was still for the Nim language.

Thanks @nerditation - these are two good catches and they're related.

On ยง2, "one element per cache line." You're right that it's a strange access pattern to reach for. Reworded to the case the exercise actually exercises: walk the whole vector and the u64 version pulls in 8ร— as many cache lines as the u8 version - same element count, eight times the bytes. No hypothetical stride.

On ยง8 exercise 5. I agree, and the benchmark was measuring the wrong thing. face_cards(&ranks) allocates and fills an N-byte Vec<bool> and then reads it back, so it's slower than a loop that just counts, at every N - the "2-5ร— faster" was backwards. Even with the allocation removed, a plain indexed for i in 0..n already vectorises, so "array-first" was never "iterators beat loops." The real cost is per-entity opaque dispatch (a non-inlined method called N times) versus one fused pass, and that's a machine-reality argument that simply doesn't bite at 52 cards.
So rather than patching the number, I've cut the exercise and dropped the speed claim from ยง8. ยง8 goes back to being what it's for: The architectural move, array as the unit of code, singleton as N=1.
The cost argument now lives only in ยง19 (EBP dispatch), grounded at a million rows in the simulator where it's actually measurable - memory traffic proportional to active rows, with an assembly-inspection exercise to see the branch that isn't emitted.

@StefanSalewski - I agree on the point about the hashing tutorials. The cache/SIMD benefit doesn't show in a simplified example because the optimiser fuses the simplified example. The honest place to make the claim is at the scale where the work no longer fits, not in a toy where it does. Putting the claim where it can be felt, and removing it where it can't, is the fix.

have read a couple more chapters today. the information of this book is very dense and it cost a lot mental engergy, but I'm really enjoying it.

some small questions and suggestions so far.

  • in exercise 9-3, I think there's a typo:

    Remove the card at slot 7 with swap_remove(7) on each column. Print player 1โ€™s hand. Note that the cards at slots [17, 21, 28, 41] are unchanged but slot 3 may now hold what was previously the last card;

    I believe "slot 7" should be "slot 3", player 1 has the cards [3, 17, 21, 28, 41], there's no 7

  • starting from chapter 10, the example code uses the identifier gen for "generation", but gen is a reserved keyword[1] in rust edition 2024. if readers copy-n-paste the code locally, they may get a compilation error.

    however, I'm not sure what better solution: change to raw identifier r#gen, or change it to a different name (shorter g or longer generation?), or replace with a misspelled form (I've seen names like klass or kontinue in people's code, but I don't think the same can be done for gen, I mean, can people guess what dgen or jen actually stands for?)

  • exercise 12-1 uses a Vec for an event queue. since I've not finished the book yet, I wonder when the VecDeque will be introduced, or if it will be introduced at all?

    at least for me, my first reach for an event queue is VecDeque, not Vec. but this isn't directly related to the topic of the chapter, and after all, the main focus of the book is not about teaching rust the language and its standard library. just a thought I had when reading.

  • in reference solution for exercise 12-6, there's this line of code:

    while events.first().map(|e| e.0 <= sim_now).unwrap_or(false) {
        //...
    }
    

    there's nothing wrong to map an Option to a boolean predicate in this way, but I prefer to use is_some_and() (or its counterpart is_none_or()) for similar situations:

    while events.first().is_some_and(|e| e.0 <= sim_now) {
        //...
    }
    

    again, this is just a personal preference I'd like to share, I'm not sure if it is considered more idiomatic or not by the community though.

  • in exercise 15-3, the reference code has this code:

    for &id in to_remove.iter() {
        //...
    }
    to_remove.clear();
    

    this can be rewritten using the drain() iterator. in fact, the very next line in the same snippet does use drain() for the to_insert table:

    for row in to_insert.drain(..) {
        //...
    }
    

    I'm a little confused why the inconsistency between these two loops? is the code generated by LLM?

  • in chapter 21, there's a paragraph about swap_remove():

    The mechanism is small: take the last element, move it into the deleted slot, shrink the table by one. Two memory writes and a length decrement. O(1) regardless of N.

    I think the "two memory writes" part is wrong. swap_remove() actually has two memory reads and one memory write, or more accurately, one memory read[2] and one memory copy.

  • for exercise 21-3, the reference solution and it's explanation:

    for i in 0..v.len() {
        if v[i] % 2 == 0 { v.swap_remove(i); }
    }
    

    ... About half the deletions get correct, half are skipped. The remaining Vec is not โ€œall odd valuesโ€ - it is some mix.

    this exercise is meant to show a wrong way to remove elements, but the explanation is not accurate. this code will in fact panic if any element gets removed at all, because the for loop will run for the original length of the vec.

    if I would guess, the conclusion is probably meant to explain this code snippet instead, which is also wrong, but it is wrong in the way that fits the explanation:

    let mut i = 0;
    while i < v.len() {
        if v[i] % 2 == 0 { v.swap_remove(i); }
        i += 1;
    }
    

  1. reserved for the future "generator" feature โ†ฉ๏ธŽ

  2. for the return value, and it can be eliminated if the return value is discarded and its type doesn't have drop glue โ†ฉ๏ธŽ

Thank you @nerditation ! - This is exactly the kind of close read the book needs.

  • 9-3: typo, fixed. It's slot 3; there's no 7 in that hand.
  • gen: you're right, it's reserved in 2024 and breaks copy-paste. Renamed to generation throughout.
  • 12-1 / VecDeque: deliberate. The event queue is ordered by timestamp and drained in time order, not FIFO, so it's a sorted Vec, not a VecDeque. VecDeque does show up later, in ยง30's streaming chapter.
  • 12-6 is_some_and: agreed, it's cleaner - switched both call sites.
  • 15-3: no good reason for the asymmetry; it's now drain(...) to match the line below it.
  • ยง21 "two memory writes": correct, that's wrong. It's one read and one write (plus the length decrement). Fixed in the prose and the solution.
  • 21-3: good catch, and it sharpened the exercise. The shown for i in 0..v.len() panics (range bound fixed at the original length); the while i < v.len() variant is the one that silently corrupts. Rather than pick one, the solution now shows both - same mistake, two failure modes, which is the actual point.

Keep them coming :wink:

as I read more, I'm starting to enter unfamiliar territory, but I'll keep trying. here's more feedback for today:


first of all, the same inconsistent loops (as the previously reported one in excercise 15-3) also appear in later chapters, one such instance is the code snippet in the prose of ยง22, it's the same cleanup() system. maybe this is already fixed in the source but not yet published, I didn't check; if not, could be a good LLM task.


it seems exercise 22-6 is a duplicate of excercise 15-5? the wording is slightly different, but they are essentially the same problem.


the exercise 23-2 and the reference solution does not match. the exercise is talking about a parallel hungry_membership: Vec<bool> table so is_hungry(id) is a query of O(1), but the code in the solution is a function is_alive() which simply checks if the id maps to a valid slot.

btw, the is_alive(world, id) logic is mostly duplicate of slot_of(world, id), it can be simplified using Option::is_some(), something like this:

fn is_alive(world: &World, id: u32) -> bool {
    slot_of(world, id).is_some()
}

fn slot_of(world: &World, id: u32) -> Option<usize> {
    let slot = *world.id_to_slot.get(id as usize)?;
    if slot == INVALID { None } else { Some(slot as usize) }
}

the solution for exercise 23-3 has a if branch which can be eliminated, if you reorder the code. here's the reference code:

fn delete_by_id(world: &mut World, id: u32) {
    let slot = world.id_to_slot[id as usize] as usize;
    world.creatures.swap_remove(slot);
    if slot < world.creatures.len() {
        // Some row was moved into `slot`. Update its mapping.
        let moved_id = world.creatures[slot].id;
        world.id_to_slot[moved_id as usize] = slot as u32;
    }
    world.id_to_slot[id as usize] = INVALID;
}

simplified code without if condition:

fn delete_by_id(world: &mut World, id: u32) {
    let slot = world.id_to_slot[id as usize] as usize;
    // grab the id of the last row **before** calling `swap_remove()`
    let moved_id = world.creatures.last().unwrap().id;
    world.creatures.swap_remove(slot);
    world.id_to_slot[moved_id as usize] = slot as u32;
    world.id_to_slot[id as usize] = INVALID;
}

I believe this has the same behavior as the reference. in the reference code, the if condition will be false when the id is mapped to the last row/slot, which means moved_id equals to id, which means the unconditional update of moved_id's slot will be overwritten in this edge case.

the difference is subtle and the code is tricky to reason about. I didn't measure the difference and it probably makes no practical difference.

still, I want to mention it to make myself look smart. :winking_face_with_tongue:


ยง25 has a disastrous clash of nomenclature: ownership.

I understand what ownership is supposed to mean in this chapter, but unfortunately, the term has a different meanng in the context of rust.

rust ownership is more about destructors and dropping values. it's related but not equivalent to unique access permission. heck, there's even the concept of "shared ownerships" in rust, see the documentation of Rc and Arc. the concept of "the sole function that has the access permission to the variable" corresponds to "exclusive borrows" (a.k.a. "mutable references", &mut T) in rust.

to be honest, I have no idea how to reconcile this. it feels awkward if the prose uses the word "borrow", but it would be more confusing if the ownership is used in a complete different way than its rust definition. maybe the almighty AI can give a good answer?


the entire ยง26 chapter is confusing to me. to me, the hot/cold split is mostly relevant to AoS layouts, while SoA already naturally splits hot and cold fields into separate columns. but the chapter says:

The fix is a split: fields touched on the hot path go in one table; fields read rarely go in another. Two tables, same length, same id alignment.

maybe I'm just getting tired today. I'll re-read tomorrow and check out the code to see what I am missing.

@nerditation - Thank you for the detailed notes.

First I want to apologize for the delayed response. I wanted it to be the right response on ยง26. Not just a quick one.

The current deployment of the book includes your feedback as follows:

On ยง22 loops. Unified on drain(..) for both passes; prose and solution read the same now.

On the ยง22-6 / ยง15-5 duplicate. Confirmed and gone. ยง22-6 was a second copy of the tick-delayed-visibility exercise; it is now "Buffers keep their capacity" - measure capacity() settling across 100 ticks, then argue why reusing the buffers matters against the tick budget.

On the ยง23-2 solution. Fixed; the solution is the sparse-set build the prompt asks for, O(1) subscribe/unsubscribe/is_member with no boolean.

On ยง25, "ownership." You are right that the bare word collides with the one meaning a Rust reader already has reserved. The chapter is now "One writer, many readers." The word survives only inside a callout that states the overlap on purpose: write-ownership, meaning which single system may mutate a table, is exactly what &mut T expresses at the type level, so the two senses are relatives rather than a coincidence. But the title no longer makes you carry the ambiguity to find that out. I hope it helps?

On ยง26, the hot/cold split. This was the one that mattered. Your explicit expression of confusion was the signal: a hot/cold field split is an AoS idiom, and SoA already subsumes it - every field is its own column, so a "cold" field is just a column the loop never touches. The chapter was teaching a non-problem. It is rewritten as "Subscription tables, keyed by slot": a system iterates a Vec<slot> of its members and indexes the attribute columns directly. The old AoS-to-SoA bandwidth number moved back to ยง7, where it belongs as the case for SoA in the first place. The rewrite opened a real question: Key the subscription tables by slot or by id? So I measured it: slots win the amortized cost roughly 2x, because the id_to_slot hop is a scattered miss per element in the hot loop. That pulled the membership tables in ยง17 through ยง23 over to holding slots, not ids. The new benchmark is in code/measurement/src/bin/ebp_partition.rs if you want to run it.

Changelog:

Since you are reading the chapters in the order...The ยง26 fix did not stay in ยง26 - keying the subscription tables by slot meant the membership tables back in ยง17 through ยง23 had to hold slots too, so a few chapters you have already passed have changed under you. I did not want you re-reading them on the old model.

The short version: membership- and subscription-tables are now Vec<slot>, id_to_slot has retreated to resolving identity only at the boundaries (saves, replay, the wire, the UI), and ยง21's swap_remove is now explicitly the wrong-way stepping stone that ยง22 to ยง24 walk you off. ยง28 is also new ("Proximity is a property of position", replacing the old sort-for-locality chapter). No obligation to go back over any of it, but I would rather tell you, than let you discover a contradiction with your own notes.

Hi @root-11.
In the first section you conditioned me to write sums of vector.
So when I did exercise 4 in section 3 "Indexing cost", I wrote this :

let _: u64 = std::hint::black_box(my_hashmap.into_values().sum());

I had to look up the solution to understand that you expected me to use a for loop to iterate over the HashMap in order of the indices to see the actual ratio of x373.

You can either put this on user-error ( I won't mind :slight_smile: ) or try to rewrite the exercise so people won't simply pull out the into_values projection to iterate over all values.

@Idix Good catch, and thanks for the exact repro.

You're right - into_values().sum() drains the map in bucket order, a sequential walk, not the per-index hash-and-probe the exercise is trying to price. The earlier sum exercises prime exactly that reflex, so I'd call it a wording gap more than user error. I'll make the indexed access explicit (sum both v[i] and map[&i] over the same indices) and call out that into_values() measures the wrong thing, so the next reader doesn't hit it. Fix going out shortly - thanks for reading closely.

Another issue, in exercise 2 section 6 "Mishandle the alignment" you ask to sort the suits but by the nature on how they are built for s in 0..4u8 { they are already sorted.

You probably wanted to sort by ranks in this exercise to demonstrate the corruption.

Edit : I tested to sort by ranks and by shear luck (or is it misfortune ?) location 17 actually ends up on the same rank after being sorted.