Releasing dtact v0.2.2 and rssn-advanced v0.1.0

panayang · May 26, 2026, 2:37pm

Hello Rustaceans!

Today we are releasing dtact v0.2.2 and rssn-advanced v0.1.0. Both are experimental, high-performance projects still under active development. Dtact is an async concurrency engine, and rssn-advanced is a next-generation symbolic core for rssn. Both are described in detail below.

Dtact

Dtact is a coordinative, truly lockless async runtime designed for maximum work coordination speed and the highest throughput. In short, we mainly rely on a P2P network, a Lock-Free Context Arena, and Zero-Copy Future Migration to achieve this. For a detailed analysis of the architecture, please refer to https://dtact.apich.org/ and the repository (GitHub - Apich-Organization/dtact: Dtact: The Universal Topology-Affinity Async Runtime · GitHub). The website's UI was written by AI because we really aren't that good at writing UI, so please forgive us for that. A detailed benchmark report and CI run links are also available on the official website; the tables below summarize the results (CI run: perf: optimize memory allocation by implementing tiered mmap strategi… · Apich-Organization/dtact@13db4ff · GitHub):

Task Spawn Efficiency

This benchmark measures the time required to spawn and execute a batch of asynchronous tasks.

Task Scale	Runtime	Min Bound	Mean	Max Bound	Dtact Speedup
1M	Dtact	103.92 ms	104.92 ms	105.94 ms	6.31x faster
1M	Tokio	648.02 ms	662.11 ms	676.42 ms	Reference
100k	Dtact	11.667 ms	11.807 ms	11.954 ms	5.34x faster
100k	Tokio	61.064 ms	63.067 ms	65.043 ms	Reference
10k	Dtact	1.9672 ms	2.0144 ms	2.0620 ms	2.63x faster
10k	Tokio	5.1595 ms	5.2986 ms	5.4395 ms	Reference
1k	Dtact	152.301 µs	157.731 µs	163.311 µs	4.76x faster
1k	Tokio	719.65 µs	750.891 µs	783.411 µs	Reference
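
For readers who want to check the "Dtact Speedup" column, it appears to be the Tokio mean divided by the Dtact mean for the same task scale. A minimal sketch of that arithmetic follows; the helper names `speedup` and `round2` are made up for illustration and are not part of dtact or its benchmark harness.

```rust
/// Ratio of the reference mean to the candidate mean.
/// Both arguments must be in the same unit (ms with ms, µs with µs).
fn speedup(reference_mean: f64, candidate_mean: f64) -> f64 {
    reference_mean / candidate_mean
}

/// Rounds to two decimals, matching the "N.NNx" format used in the tables.
fn round2(value: f64) -> f64 {
    (value * 100.0).round() / 100.0
}
```

For the 1M row, `round2(speedup(662.11, 104.92))` gives 6.31, and for the 1k row `round2(speedup(750.891, 157.731))` gives 4.76, which agrees with the table.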

Yield Efficiency

This test measures the time taken for 10 concurrent tasks to perform 100 cooperative yield_now operations each.

Test Case	Runtime	Min Bound	Mean	Max Bound	Dtact vs. Tokio
10 tasks	Dtact	795.651 µs	827.511 µs	860.191 µs	4.41x slower
10 tasks	Tokio	179.981 µs	187.731 µs	195.651 µs	Reference
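
To make the shape of this workload concrete, here is a std-only sketch of "N tasks, each yielding M times" driven by a toy round-robin loop. It is not the actual benchmark and uses neither dtact nor Tokio: `YieldNow`, `NoopWake`, and `run_yield_workload` are hypothetical names, and the single-threaded queue stands in for a real scheduler.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hand-rolled stand-in for `yield_now`: Pending on the first poll, Ready on the second.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            Poll::Pending
        }
    }
}

/// Waker that does nothing; the loop below re-queues pending tasks itself.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Runs `tasks` futures round-robin, each yielding `yields` times.
/// Returns the total number of polls performed.
fn run_yield_workload(tasks: usize, yields: usize) -> usize {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    let mut queue: VecDeque<Task> = (0..tasks)
        .map(move |_| {
            Box::pin(async move {
                for _ in 0..yields {
                    YieldNow { yielded: false }.await;
                }
            }) as Task
        })
        .collect();

    let mut polls = 0;
    while let Some(mut task) = queue.pop_front() {
        polls += 1;
        if task.as_mut().poll(&mut cx).is_pending() {
            queue.push_back(task); // not finished: back of the line
        }
    }
    polls
}
```

With the benchmark's parameters, `run_yield_workload(10, 100)` performs 1010 polls: each task is suspended 100 times and needs one final poll to complete. A yield benchmark therefore mostly measures how cheaply a runtime can suspend and reschedule a task, not how fast it spawns one.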

Work Deflection (Hot Core) Performance

This benchmark simulates task dispatching and throttle coordination under heavy load imbalances across a multi-core scheduler.

Task Scale	Runtime	Min Bound	Mean	Max Bound	Dtact Speedup
10M	Dtact	1.6624 s	1.6792 s	1.6962 s	4.13x faster
10M	Tokio	6.8482 s	6.9386 s	7.0291 s	Reference
100k	Dtact	17.472 ms	17.659 ms	17.847 ms	2.84x faster
100k	Tokio	49.110 ms	51.112 ms	53.114 ms	Reference
10k	Dtact	2.4961 ms	2.5315 ms	2.5675 ms	2.31x faster
10k	Tokio	5.7240 ms	5.8411 ms	5.9605 ms	Reference
1k	Dtact	273.791 µs	285.231 µs	297.07 µs	2.70x faster
1k	Tokio	739.641 µs	769.841 µs	801.701 µs	Reference

The extensive use of unsafe and naked assembly in dtact may raise some doubts about this project, but we are holding ourselves to higher engineering standards to keep the project safe, and that work is ongoing. Also, special thanks to @newpavlov and @SkiFire13 for their helpful advice when we first tried the stackful approach in the bincode-next UAF backend async fiber module.

rssn-advanced

This project is highly experimental, and we are eager for external reviews. Its core design is my own, but it was primarily built out by @cn-starlabs (it seems he doesn't even have a Rust-lang user account) with the help of Claude. After making some architectural fixes and code quality improvements myself, I decided to release it alongside dtact. That said, the project has not yet had extensive code review, is at an early stage, and does not yet fully meet OSSF standards. Still, I personally think it is impressive to see a full JIT symbolic computing engine with almost unlimited extensibility that genuinely combines symbolic and numerical computing. The official website is also under development, so you might prefer to check the repository first: GitHub - Apich-Organization/rssn-advanced: This is rssn-advanced: The next generation symbolic core of rssn. · GitHub, or check out https://rssn-advanced.apich.org
And the first performance reports:

==============================================================================
   RSSN-Advanced JIT vs NumPy — Bulk Evaluation Benchmark
   N = 1,000,000 rows per expression  |  5 repeats, best time reported
==============================================================================

──────────────────────────────────────────────────────────────────────────────
  1. Trivial (baseline)
  x + y + 10.0
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              1.868 ms     1.87 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.133 ms     1.13 ns/eval
  NumPy (SIMD / C, hand-optimised)                2.824 ms     2.82 ns/eval
  SymPy lambdify → numpy backend                  2.763 ms     2.76 ns/eval

  JIT bulk  vs NumPy:  1.51x faster
  JIT batch vs NumPy:  2.49x faster

  Accuracy  bulk  max|Δ|=0.00e+00  ✔
            batch max|Δ|=0.00e+00  ✔

  NumPy intermediate arrays: ~2 ops → ~15 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

──────────────────────────────────────────────────────────────────────────────
  2. Degree-4 polynomial  (x-y)^4  [2 vars]
  x^4 - 4*x^3*y + 6*x^2*y^2 - 4*x*y^3 + y^4
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              2.729 ms     2.73 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.276 ms     1.28 ns/eval
  NumPy (SIMD / C, hand-optimised)               19.207 ms    19.21 ns/eval
  SymPy lambdify → numpy backend                 18.933 ms    18.93 ns/eval

  JIT bulk  vs NumPy:  7.04x faster
  JIT batch vs NumPy: 15.06x faster

  Accuracy  bulk  max|Δ|=5.46e-12  ✔
            batch max|Δ|=5.46e-12  ✔

  NumPy intermediate arrays: ~16 ops → ~122 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

──────────────────────────────────────────────────────────────────────────────
  3. Cubic surface  [3 vars, 10 terms]
  x^3 + y^3 + z^3 - 3*x*y*z + x^2*y - x*y^2 + y^2*z - y*z^2 + z^2*x - z*x^2
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              3.733 ms     3.73 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.772 ms     1.77 ns/eval
  NumPy (SIMD / C, hand-optimised)               75.751 ms    75.75 ns/eval
  SymPy lambdify → numpy backend                 78.214 ms    78.21 ns/eval

  JIT bulk  vs NumPy: 20.29x faster
  JIT batch vs NumPy: 42.74x faster

  Accuracy  bulk  max|Δ|=2.84e-13  ✔
            batch max|Δ|=2.84e-13  ✔

  NumPy intermediate arrays: ~27 ops → ~206 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

──────────────────────────────────────────────────────────────────────────────
  4. Rational w/ CSE  [2 vars, repeated subexpr]
  (x^2 + y^2) / (x^2 + y^2 + 1.0) + x*y*(x^2 - y^2) / (x^2 + y^2 + 1.0)^2
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              2.534 ms     2.53 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.266 ms     1.27 ns/eval
  NumPy (SIMD / C, hand-optimised)               15.913 ms    15.91 ns/eval
  SymPy lambdify → numpy backend                 22.961 ms    22.96 ns/eval

  JIT bulk  vs NumPy:  6.28x faster
  JIT batch vs NumPy: 12.57x faster

  Accuracy  bulk  max|Δ|=0.00e+00  ✔
            batch max|Δ|=0.00e+00  ✔

  NumPy intermediate arrays: ~20 ops → ~153 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

==============================================================================
  SUMMARY: JIT speedup vs hand-optimised NumPy
  Expression                                          bulk     batch
  ──────────────────────────────────────────────  ────────  ────────
  1. Trivial (baseline)                             1.51x    2.49x
  2. Degree-4 polynomial                            7.04x   15.06x
  3. Cubic surface                                 20.29x   42.74x
  4. Rational w/ CSE                                6.28x   12.57x

  Observation: speedup grows with expression complexity because
  NumPy's intermediate arrays overflow L2/L3 cache at N=1,000,000.
  JIT maintains register-resident computation across the entire
  expression, paying one memory read/write per input element.
==============================================================================

I ran the benchmark from Python through CPython FFI, so the JIT timings are significantly slower than they would be if everything were compiled into a Rust binary. For the sake of fairness, I have kept the numbers as is.
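
The observation about intermediate arrays can be illustrated with a plain-Rust sketch of expression 1 (`x + y + 10.0`). This is not rssn-advanced's JIT output or NumPy's implementation; the function names are hypothetical. `temp_mib` assumes every op materializes one `f64` array of `n` rows, which seems to reproduce the "peak temp memory" figures in the report (2 ops gives about 15 MiB, 16 ops about 122 MiB).

```rust
/// Array-library style: each operation materializes a full temporary array.
fn eval_with_temporaries(x: &[f64], y: &[f64]) -> Vec<f64> {
    let t1: Vec<f64> = x.iter().zip(y).map(|(a, b)| a + b).collect(); // temp 1: x + y
    let t2: Vec<f64> = t1.iter().map(|v| v + 10.0).collect(); // temp 2: t1 + 10.0
    t2
}

/// Fused style: one pass over the inputs, intermediates stay in registers.
fn eval_fused(x: &[f64], y: &[f64]) -> Vec<f64> {
    x.iter().zip(y).map(|(a, b)| a + b + 10.0).collect()
}

/// Temporary memory in MiB, assuming one f64 array of `n` rows per op.
fn temp_mib(ops: usize, n: usize) -> f64 {
    (ops * n * std::mem::size_of::<f64>()) as f64 / (1024.0 * 1024.0)
}
```

Both functions return identical results, since the additions happen in the same order. The difference is memory traffic: the first version writes and re-reads `t1`, and that cost grows with the number of ops in the expression.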

Also from our docs:
Architecture

Module	Role
[`dag`]	Hash-consed expression DAG — the canonical, deduplicated store for all symbolic nodes
[`ast`]	Lightweight local tree projection of a DAG subgraph via relative `i32` pointers
[`parser`]	`nom`-based infix parser: `"x^2 + 2*x + 1"` → DAG root
[`jit`] (feature: `cranelift-jit`/`jit`)	Cranelift JIT; emits scalar `f64` closures and 2-row ILP batch functions
[`heuristic`]	Configurable greedy/beam simplifier with a pluggable [`heuristic::rule_registry::RuleRegistry`]
[`egraph`]	Lightweight equality saturation over the DAG (no `egg` dependency)
[`custom`]	Unified custom-operator system — one [`custom::descriptor::CustomOpDescriptor`] wires into JIT + simplifier + e-graph
[`simd`]	Slice-level batch arithmetic using the inline-asm presets
[`asm_presets`]	Hand-written `f64×2` / `f64×4` kernels for `x86_64` (SSE2/AVX2/AES-NI), `AArch64` (NEON/crypto), `riscv64` (RVV/Zkn)
[`ffi`]	Flat `extern "C"` surface generated by `cbindgen`; includes a fiber-backed async bridge
[`parallel`]	Fiber-based parallel simplification via the `dtact` runtime
[`storage`]	Disk-backed DAG spillover and a frequency-based hot-node cache
[`error`]	Cold-path error types and the `rssn_error!` macro
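
For readers unfamiliar with hash-consing, the idea behind the `dag` row can be sketched in a few lines: identical nodes are interned once, so structurally equal subexpressions share one id. This is only an illustration under assumed names (`Node`, `Dag`, `intern`); it is not the actual rssn-advanced data layout or API.

```rust
use std::collections::HashMap;

type NodeId = u32;

/// Hypothetical node type; children are referenced by id, not by pointer.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Node {
    Var(String),
    Const(u64), // f64 bit pattern, so the node can derive Eq and Hash
    Add(NodeId, NodeId),
    Mul(NodeId, NodeId),
}

/// Deduplicating node store: `index` maps each distinct node to its id.
#[derive(Default)]
struct Dag {
    nodes: Vec<Node>,
    index: HashMap<Node, NodeId>,
}

impl Dag {
    /// Returns the existing id for `node`, or stores it and returns a new id.
    fn intern(&mut self, node: Node) -> NodeId {
        if let Some(&id) = self.index.get(&node) {
            return id;
        }
        let id = self.nodes.len() as NodeId;
        self.nodes.push(node.clone());
        self.index.insert(node, id);
        id
    }

    fn var(&mut self, name: &str) -> NodeId {
        self.intern(Node::Var(name.to_string()))
    }

    fn constant(&mut self, value: f64) -> NodeId {
        self.intern(Node::Const(value.to_bits()))
    }

    fn add(&mut self, a: NodeId, b: NodeId) -> NodeId {
        self.intern(Node::Add(a, b))
    }

    fn mul(&mut self, a: NodeId, b: NodeId) -> NodeId {
        self.intern(Node::Mul(a, b))
    }

    /// Number of distinct nodes stored.
    fn len(&self) -> usize {
        self.nodes.len()
    }
}
```

Building `x*x + y*y` twice in the same `Dag` returns the same root id both times and stores only five nodes. That sharing is what lets a repeated subexpression such as `x^2 + y^2` in benchmark 4 be represented once.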

Bincode-next

Finally, a small update on bincode-next, included here to avoid posting too frequently. Bincode-next has released v3.0.0-rc.15, and fuzzing is ongoing. We have decided to release the stable version after joining the OSS-Fuzz project and running under it for a while. Link: GitHub - Apich-Organization/bincode: Bincode-next: The next official rust implementation of bincode · GitHub

Common Links

Discord Server: Apich Organization
Contact E-mail: info@apich.org
OSSF Registration (bincode-next): BadgeApp
OSSF Registration (dtact): BadgeApp
Score Card: OpenSSF scorecard report
Discussions on Dtact-style designs: Discussion on Synchronous Crate Concurrency Refactor using Stackful Coroutines Model in Rust
