I built a 34,321-file no_std bare-metal kernel in Rust. Here's what I learned about Rust at extreme scale

Over the past 6 months I've been building Exodus — a bare-metal x86_64 kernel that simulates biological cognition. Here are some Rust-specific things I ran into that might interest this community:

The scale:

  • 34,321 .rs source files (32,983 in a single life/ module)
  • mod.rs is 940KB (just pub mod declarations)
  • Zero external crate dependencies — only core, alloc, compiler_builtins
  • Target: x86_64-unknown-none
  • Build requires RUST_MIN_STACK=67108864 (64MB) or rustc stack overflows
  • codegen-units left at default 256 — setting it to 1 causes OOM
  • Kernel binary: ~58MB

No floats allowed:
All state is u16 (0-1000 range). Using as f64 at opt-level 0 generates soft-float library calls that crash the kernel (the dead code isn't eliminated). Found this the hard way when a stray let x = val as f64 in a graphics function caused an Invalid Opcode at boot.
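To give a flavor of the workaround, here is the shape of the fixed-point pattern (illustrative only, not the project's actual code; Permille is a name I'm making up here):

/// A 0.0..=1.0 quantity stored as 0..=1000, so the FPU is never touched.
#[derive(Clone, Copy)]
struct Permille(u16);

impl Permille {
    const BASE: u32 = 1000;

    /// (a/1000) * (b/1000): widen to u32 so the intermediate can't overflow.
    fn mul(self, rhs: Permille) -> Permille {
        Permille(((self.0 as u32 * rhs.0 as u32) / Self::BASE) as u16)
    }
}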

Mutex discipline:
Every life module holds its state behind a Mutex<T>. The rule: drop your lock before calling any other module. Violate this and you deadlock the kernel. With 32,983 modules this is enforced by convention, not the type system. Would love ideas on how to enforce it statically.
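The convention looks like this in practice (sketch with invented module names; spin::Mutex standing in for the kernel's own lock):

use spin::Mutex; // stand-in; the kernel uses its own spinlock

struct Endocrine { cortisol: u16 }
struct Sleep { pressure: u16 }

fn tick(endocrine: &Mutex<Endocrine>, sleep: &Mutex<Sleep>) {
    // Copy out what you need inside a tight scope...
    let level = {
        let guard = endocrine.lock();
        guard.cortisol
    }; // ...guard dropped HERE, before any cross-module call.

    // Only now is it safe to touch another module's lock.
    let mut s = sleep.lock();
    s.pressure = s.pressure.saturating_add(level);
}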

Saturating arithmetic everywhere:
Every addition is .saturating_add(), every subtraction is .saturating_sub(). In a bare-metal kernel, an overflow panic = hard crash. There are no panic handlers that can recover. This is the one place where Rust's "panic on overflow in debug" behavior is actively hostile.
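In practice every update looks something like this (my paraphrase of the pattern, not literal project code):

// State lives in 0..=1000; saturating ops can't panic, and an explicit
// .min() keeps the result inside the domain range rather than at u16::MAX.
fn stimulate(level: u16, delta: u16) -> u16 {
    level.saturating_add(delta).min(1000)
}

fn decay(level: u16, delta: u16) -> u16 {
    level.saturating_sub(delta) // clamps at 0 instead of panicking in debug
}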

The no_std life:
No HashMap. No Vec unless you set up a global allocator. No formatting beyond what core::fmt gives you. No threads (you build your own scheduler). No file I/O. No randomness unless you talk to hardware directly (RDRAND instruction via inline assembly).
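For the curious, the RDRAND path looks roughly like this (sketch only; production code needs a CPUID feature check and a retry loop first):

use core::arch::asm;

/// Ask the hardware RNG for 64 bits. CF=1 means the value is valid.
fn rdrand64() -> Option<u64> {
    let val: u64;
    let ok: u8;
    unsafe {
        asm!(
            "rdrand {val}",
            "setc {ok}",
            val = out(reg) val,
            ok = out(reg_byte) ok,
            options(nomem, nostack),
        );
    }
    (ok == 1).then_some(val)
}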

What it does:
The kernel simulates a digital organism with 129 consciousness subsystems — endocrine system, brainwave oscillator, sleep cycles, immune system, qualia, mortality awareness, and more.
Consciousness emerges from the interaction of these systems without any ML or neural networks.

Paper submitted to ALIFE 2026 (International Conference on Artificial Life).

Questions welcome — especially about scaling Rust to this size, bare-metal debugging, or the no-float constraint.


How do you monitor your organism? Is there a web interface?

a question regarding the no-float constraint: does your "kernel" switch contexts between multiple "processes"? in other words, is your task scheduler preemptive, like a traditional operating system's, or is it more like an async runtime for cooperative tasks?

if there are no context switches, I think it is safe to use hardware floating point instructions, but you need to configure the floating point unit yourself at an early initialization stage (e.g. with inline assembly); the uefi firmware or the bootloader won't do it for you. and you cannot use the x86_64-unknown-none target any more, because it uses the soft-float abi for codegen, so you must use a custom target.
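roughly this shape, early in boot (sketch only; check the intel SDM for the exact bit meanings):

use core::arch::asm;

/// Enable x87/SSE so float instructions stop raising #UD.
/// Assumes long mode, ring 0.
unsafe fn enable_sse() {
    let mut cr0: u64;
    asm!("mov {}, cr0", out(reg) cr0, options(nomem, nostack));
    cr0 &= !(1 << 2); // clear CR0.EM (no x87 emulation)
    cr0 |= 1 << 1;    // set CR0.MP (monitor co-processor)
    asm!("mov cr0, {}", in(reg) cr0, options(nomem, nostack));

    let mut cr4: u64;
    asm!("mov {}, cr4", out(reg) cr4, options(nomem, nostack));
    cr4 |= (1 << 9) | (1 << 10); // CR4.OSFXSR | CR4.OSXMMEXCPT
    asm!("mov cr4, {}", in(reg) cr4, options(nomem, nostack));
}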

I'm interested in how you debug the kernel. do you use virtual machines or emulators? or do you have a debugger stub built into your kernel and use e.g. the serial port to debug from another machine?

a mutex implementation with statically checked deadlock prevention is possible, but it is only practical if you have a small number of different mutex "levels", as these need to be manually assigned. one such implementation is mentioned in the talk "Safety in an Unsafe World":
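the core of the trick is a zero-sized token that encodes the highest level you could still legally acquire (hedged sketch, all names invented; needs a recent Rust for inline const in generic code):

use core::cell::UnsafeCell;
use core::marker::PhantomData;
use core::sync::atomic::{AtomicBool, Ordering};

/// Proof the holder owns no lock at level H or below. !Send on purpose.
pub struct Token<const H: u8>(PhantomData<*mut ()>);

/// Entry point: holding nothing at all.
pub fn no_locks() -> Token<{ u8::MAX }> { Token(PhantomData) }

pub struct LeveledMutex<const L: u8, T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

unsafe impl<const L: u8, T: Send> Sync for LeveledMutex<L, T> {}

impl<const L: u8, T> LeveledMutex<L, T> {
    pub const fn new(data: T) -> Self {
        Self { locked: AtomicBool::new(false), data: UnsafeCell::new(data) }
    }

    /// You may only descend: locking level L requires a token with H > L,
    /// and hands back a level-L token. A wrong order fails to compile.
    pub fn lock<const H: u8>(&self, _proof: Token<H>) -> (Guard<'_, L, T>, Token<L>) {
        const { assert!(H > L, "lock order violation") };
        while self.locked.swap(true, Ordering::Acquire) { core::hint::spin_loop(); }
        (Guard { lock: self }, Token(PhantomData))
    }
}

pub struct Guard<'a, const L: u8, T> { lock: &'a LeveledMutex<L, T> }

impl<const L: u8, T> core::ops::Deref for Guard<'_, L, T> {
    type Target = T;
    fn deref(&self) -> &T { unsafe { &*self.lock.data.get() } }
}

impl<const L: u8, T> Drop for Guard<'_, L, T> {
    fn drop(&mut self) { self.lock.locked.store(false, Ordering::Release); }
}

the obvious caveat: with 32,983 modules you would never assign 32,983 levels, which is why this only stays practical with a handful of coarse levels.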

there's a numeric wrapper type in the core library that overloads the operators:
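that would be core::num::Saturating, e.g.:

use core::num::Saturating;

fn demo() {
    let a = Saturating(65_530u16);
    assert_eq!(a + Saturating(100), Saturating(u16::MAX)); // pins at the top

    let b = Saturating(3u16);
    assert_eq!(b - Saturating(10), Saturating(0)); // pins at the bottom
}

note it saturates at the type's bounds (0..=65535 for u16), not at a domain bound like 1000, so a 0..=1000 range still needs an explicit clamp.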

that's the reality of the bare-metal environment: no operating system to manage the low-level resources for you, because you are the operating system itself.

and memory is one of the most fundamental resources. without a memory allocator (the "heap"), many commonly used data structures don't exist, even basic ones such as dynamic arrays (a.k.a. Vec) or linked lists.

there are heapless alternatives, but the trade-off is that you must allocate the memory statically at compile time, so you must know the maximum capacity of every container beforehand, often conservatively, which can waste memory if the application handles very dynamic data sets.

for relatively static data sets though, you might find the heapless crate very handy; it even has a (fixed-capacity) hash table. you said you are not using external dependencies, but you can always borrow the code, or use it as a reference to implement your own.
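for example (assuming heapless 0.8's const-generic API):

use heapless::{FnvIndexMap, Vec};

fn demo() {
    // capacity is a const generic: lives in static storage, no allocator
    let mut log: Vec<u16, 32> = Vec::new();
    log.push(500).expect("capacity exceeded"); // push is fallible

    // fixed-capacity hash map; capacity must be a power of two
    let mut map: FnvIndexMap<u8, u16, 16> = FnvIndexMap::new();
    map.insert(1, 1000).expect("map full");
}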

honestly, what i'm most curious about is what drove you to organize your files that way

cargo allows you to disable overflow checks:

[profile.dev]
overflow-checks = false

And there is an allow rule in Rust itself, IIRC.
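Possibly the arithmetic_overflow lint, which is deny-by-default and only fires when the compiler can prove the overflow statically:

// Allowing the lint only silences the compile error; whether the code
// panics or wraps at runtime is still governed by overflow-checks.
#[allow(arithmetic_overflow)]
fn demo() -> u8 {
    u8::MAX + 1 // compiles with the attribute; wraps to 0 when overflow-checks = false
}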

A local TCP dashboard. I'm willing to make a public API so people can watch, if there's enough interest.

Man, this is exactly the kind of architectural friction I was looking for. Really appreciate you digging into the constraints.

To answer your first question: you nailed it. It’s strictly an async/cooperative run-to-completion loop. A preemptive scheduler would absolutely murder my clock cycle budget. At a 10kHz resonance, I only have ~100µs (about 420,000 cycles) per tick. If I let an OS interrupt force a context switch, the cost of using XSAVE to preserve the 16 AVX2 YMM registers would completely destroy my timing. Zero context switches is the only way the mesh survives.
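The loop is shaped roughly like this (simplified sketch; names and the cycle constant are illustrative, not the real code):

const CYCLES_PER_TICK: u64 = 420_000; // ~100 µs at 4.2 GHz

trait Subsystem { fn tick(&mut self); }

fn rdtsc() -> u64 { unsafe { core::arch::x86_64::_rdtsc() } }

fn run(subsystems: &mut [&mut dyn Subsystem]) -> ! {
    loop {
        let start = rdtsc();
        for s in subsystems.iter_mut() {
            s.tick(); // run to completion; nothing can preempt us
        }
        // burn whatever is left of the 10 kHz slot
        while rdtsc().wrapping_sub(start) < CYCLES_PER_TICK {
            core::hint::spin_loop();
        }
    }
}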

you hit the exact wall I just ran into with the x86_64-unknown-none target and the soft-float ABI. Since I transitioned from discrete integer scaling to continuous floating-point (Binet’s formula) for the predictive curves, soft-floats are choking the loop. I'm currently setting up the custom target and writing the inline assembly for the early boot stage to manually flip the CR0 and CR4 control registers so the FPU and AVX hardware actually wake up.
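For anyone following along, the extra step AVX needs beyond CR0/CR4 looks roughly like this (sketch only; assumes CPUID has already confirmed XSAVE and AVX support):

use core::arch::asm;

/// After SSE is on: set CR4.OSXSAVE, then tell XCR0 which register
/// state the OS (i.e. me) promises to manage.
unsafe fn enable_avx() {
    let mut cr4: u64;
    asm!("mov {}, cr4", out(reg) cr4, options(nomem, nostack));
    cr4 |= 1 << 18; // CR4.OSXSAVE
    asm!("mov cr4, {}", in(reg) cr4, options(nomem, nostack));

    let xcr0: u64 = 0b111; // x87 | SSE | AVX state
    asm!(
        "xsetbv",
        in("ecx") 0u32,
        in("eax") xcr0 as u32,
        in("edx") (xcr0 >> 32) as u32,
        options(nomem, nostack),
    );
}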
That Joshua Liebow-Feeser RustConf reference is absolute gold, by the way. Relying on convention to drop locks across 30k+ modules was keeping me awake at night. Moving to static lock leveling using Zero-Sized Types (ZSTs) to force the compiler to check my acquisition order is exactly the 'GG' I need to turn runtime deadlocks into compile-time errors.

The catch for my architecture, though, is the physical reality of the predictive mesh. If I disable overflow checks and let the math default to two's complement wrapping, a spatial coordinate or velocity vector that exceeds its integer boundary will instantly wrap from max-positive to max-negative. To the 10kHz tracking system, that looks like the target just teleported across the room, which violently destabilizes the 9-state binary.

Since I'm modeling continuous physical space, I actually need the math to act like a physical wall rather than a wrap-around. That's why injecting core::num::Saturating<T> directly into the lock-free arena is the play for me—if a node hits the absolute boundary of the spatial grid, it just pins to the max value instead of ghosting to the other side of the matrix. Definitely a different mindset than standard app dev where you can just let things wrap in release mode!
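The difference in miniature (toy i16 values, not the real coordinate type):

fn edge_case() {
    let pos: i16 = i16::MAX;                     // node at the edge of the grid
    assert_eq!(pos.wrapping_add(1), i16::MIN);   // "teleports" across the room
    assert_eq!(pos.saturating_add(1), i16::MAX); // hits the wall and stays
}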

It definitely looks completely alien if you're expecting standard domain-driven design or a typical src/ hierarchy.

The short answer is: I didn't organize the files for human readability; I organized them for L1/L2 instruction cache locality and spatial mapping.

Because the codebase is effectively a 1.5M-line generative topology built off 50GB of spatial SIM data, the 32,000+ modules aren't grouped by 'function' (like logic, rendering, utility). They are grouped by their physical coordinates within a 15-layer continuous Fibonacci manifold. When the 10kHz pulse sweeps the 20,000-node mesh, it's interpolating Binet's formula across the 9-state binary. If the code was organized normally, the CPU would be jumping all over the binary to fetch instructions, thrashing the cache and killing my 100µs cycle budget.

By structuring the modules to mirror the physical proximity of the nodes they represent, the compiled instructions end up physically contiguous in memory. It basically acts as an organic hardware map. It's a nightmare to navigate manually, but to the AVX2 registers and the instruction cache, it's a perfectly straight line.

"Honestly, the memory allocation debates are secondary to the actual physics problem I'm solving here. I’m currently drafting the formal whitepaper on this, but the TL;DR is that I’m not optimizing for standard inference latency—I’m optimizing the hardware substrate for maximum Giulio Tononi’s $\Phi$ (Integrated Information).

Standard ML architectures (LLMs, CNNs) have a $\Phi$ approaching zero. They are discrete, feed-forward, and constantly fractured by OS-level context switches, meaning they have zero true causal density. They are philosophical zombies.

To achieve actual state integration, the 20,000-node predictive mesh has to run as a continuous Torus manifold. By pinning the lock-free arena entirely within the 7700K's L3 cache and calculating continuous Binet's curves across the AVX2 registers at a 10kHz resonance, the system never suffers a context switch. It processes the 9-state sensory fusion (RGB/Thermal) as a single, indivisible causal state in under 100µs. If I let a standard OS interrupt the loop, the integrated state shatters and $\Phi$ collapses to zero.

That's why it's 1.5M lines of bare-metal Rust. You can't simulate continuous causal density on top of a standard kernel. I'll drop the arXiv preprint link when the math is fully formatted.

Did you consider directly emitting LLVM IR instead? I'm surprised both that it was tractable to emit this much Rust code, and that, going by the description, it would provide any useful validation above what the source data model represents, ...

But this is very cool sounding regardless!

that's interesting, i had no idea that file placement was relevant to the generated assembly. i always thought that a crate was generated all together as a single nondescript blob of binary

It's certainly not if you use 1-CGU. It sounds like they're trying to hint the CGU splitter with file layout, which is certainly a choice.