Releasing dtact v0.2.2 and rssn-advanced v0.1.0

Hello Rustaceans!

Today we are releasing dtact v0.2.2 and rssn-advanced v0.1.0. Both are experimental, high-performance projects still under active development: dtact is an async concurrent engine, and rssn-advanced is the next-generation symbolic core for rssn. Details on each follow below.

Dtact

Dtact is a coordination-oriented, truly lock-free async runtime designed for maximum work-coordination speed and throughput. In short, it achieves this mainly through a peer-to-peer (P2P) network, a Lock-Free Context Arena, and Zero-Copy Future Migration. For a detailed analysis of the architecture, see https://dtact.apich.org/ and the GitHub repository Apich-Organization/dtact ("Dtact: The Universal Topology-Affinity Async Runtime"). The website's UI was written by AI because, frankly, we aren't very good at writing UIs, so please forgive us for that. A detailed benchmark report and links to the CI runs are also on the official website; the short version is in the tables below (CI run: "perf: optimize memory allocation by implementing tiered mmap strategi…", Apich-Organization/dtact@13db4ff):

Task Spawn Efficiency

This benchmark measures the time required to spawn and execute a batch of asynchronous tasks.

Task Scale  Runtime  Min Bound   Mean        Max Bound   Dtact Speedup
1M          Dtact    103.92 ms   104.92 ms   105.94 ms   6.31x faster
            Tokio    648.02 ms   662.11 ms   676.42 ms   Reference
100k        Dtact    11.667 ms   11.807 ms   11.954 ms   5.34x faster
            Tokio    61.064 ms   63.067 ms   65.043 ms   Reference
10k         Dtact    1.9672 ms   2.0144 ms   2.0620 ms   2.63x faster
            Tokio    5.1595 ms   5.2986 ms   5.4395 ms   Reference
1k          Dtact    152.301 µs  157.731 µs  163.311 µs  4.76x faster
            Tokio    719.65 µs   750.891 µs  783.411 µs  Reference
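The speedup column appears to be the ratio of the two mean times (Tokio mean divided by Dtact mean); for the 1M row, 662.11 ms / 104.92 ms ≈ 6.31. A minimal sketch of that calculation, assuming this is how the column was derived (`fmt_speedup` is our own helper name, not part of any benchmark tooling):

```rust
/// Formats `baseline_mean / candidate_mean` as an "N.NNx" speedup.
/// Assumption: the tables compare mean times, both given in the same unit.
/// A value above 1 means the candidate is faster than the baseline.
fn fmt_speedup(baseline_mean: f64, candidate_mean: f64) -> String {
    format!("{:.2}x", baseline_mean / candidate_mean)
}
```

For example, `fmt_speedup(662.11, 104.92)` gives `"6.31x"`, and swapping the roles of the two runtimes gives the yield comparison, `fmt_speedup(827.511, 187.731)` = `"4.41x"` in Tokio's favour.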

Yield Efficiency

This test measures the time taken for 10 concurrent tasks to perform 100 cooperative yield_now operations each.

Test Case  Runtime  Min Bound   Mean        Max Bound   Comparison
10 tasks   Dtact    795.651 µs  827.511 µs  860.191 µs  ~4.41x slower
           Tokio    179.981 µs  187.731 µs  195.651 µs  4.41x faster
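For readers unfamiliar with what a cooperative yield does, here is a minimal, std-only sketch: a future that returns `Pending` exactly once after asking to be woken, driven by a toy single-threaded round-robin executor running the same shape of workload as the benchmark (10 tasks × 100 yields). It is an illustration under our own assumptions (`YieldNow`, `run_round_robin`, and `yield_workload` are hypothetical names) and not how Dtact or Tokio actually implement `yield_now`.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical yield future: returns `Pending` once, then `Ready`.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask to be polled again; the task gives up its turn for now.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

fn yield_now() -> YieldNow {
    YieldNow { yielded: false }
}

/// Waker that does nothing: the round-robin loop re-polls pending tasks anyway.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Runs every task to completion on the current thread, round-robin,
/// and returns the total number of polls performed.
fn run_round_robin(tasks: Vec<Task>) -> usize {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut queue: VecDeque<Task> = tasks.into();
    let mut polls = 0;
    while let Some(mut task) = queue.pop_front() {
        polls += 1;
        if task.as_mut().poll(&mut cx).is_pending() {
            queue.push_back(task); // back of the line after yielding
        }
    }
    polls
}

/// `tasks` tasks, each performing `yields` cooperative yields.
fn yield_workload(tasks: usize, yields: usize) -> Vec<Task> {
    (0..tasks)
        .map(move |_| {
            Box::pin(async move {
                for _ in 0..yields {
                    yield_now().await;
                }
            }) as Task
        })
        .collect()
}
```

Each task is polled `yields + 1` times (one `Pending` per yield, then a final `Ready`), so `run_round_robin(yield_workload(10, 100))` performs 1010 polls. In a real runtime, every one of those round trips goes through the scheduler, which is exactly the per-yield cost this benchmark isolates.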

Work Deflection (Hot Core) Performance

This benchmark simulates task dispatching and throttle coordination under heavy load imbalances across a multi-core scheduler.

Task Scale  Runtime  Min Bound   Mean        Max Bound   Dtact Speedup
10M         Dtact    1.6624 s    1.6792 s    1.6962 s    4.13x faster
            Tokio    6.8482 s    6.9386 s    7.0291 s    Reference
100k        Dtact    17.472 ms   17.659 ms   17.847 ms   2.89x faster
            Tokio    49.110 ms   51.112 ms   53.114 ms   Reference
10k         Dtact    2.4961 ms   2.5315 ms   2.5675 ms   2.31x faster
            Tokio    5.7240 ms   5.8411 ms   5.9605 ms   Reference
1k          Dtact    273.791 µs  285.231 µs  297.07 µs   2.70x faster
            Tokio    739.641 µs  769.841 µs  801.701 µs  Reference
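This post does not spell out dtact's deflection policy, so the following is only a hypothetical illustration of the kind of decision a "hot core" benchmark exercises: every task originates on one home core, stays there until that core's queue reaches a threshold, and is otherwise deflected to the least-loaded core. The names `pick_core` and `simulate` and the threshold rule are our own assumptions, not dtact's scheduler.

```rust
/// Hypothetical deflection rule (illustration only, not dtact's policy):
/// keep the task on `home` while its queue is below `threshold`,
/// otherwise send it to the least-loaded core (lowest index wins ties).
fn pick_core(queue_lens: &[usize], home: usize, threshold: usize) -> usize {
    if queue_lens[home] < threshold {
        return home;
    }
    queue_lens
        .iter()
        .enumerate()
        .min_by_key(|&(_, &len)| len)
        .map(|(core, _)| core)
        .unwrap_or(home)
}

/// Dispatches `n` tasks that all originate on `home` and returns the
/// resulting per-core queue lengths.
fn simulate(cores: usize, home: usize, n: usize, threshold: usize) -> Vec<usize> {
    let mut queue_lens = vec![0; cores];
    for _ in 0..n {
        let target = pick_core(&queue_lens, home, threshold);
        queue_lens[target] += 1;
    }
    queue_lens
}
```

With 4 cores, a threshold of 8, and 20 tasks all arriving on core 0, `simulate(4, 0, 20, 8)` fills core 0 to 8 and spreads the remaining 12 tasks evenly, giving `[8, 4, 4, 4]`.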

The extensive use of unsafe code and naked assembly in dtact may raise doubts about the project's soundness. We are working toward stricter engineering standards to keep it safe, and that work is ongoing. Special thanks also go to @newpavlov and @SkiFire13 for their helpful advice when we first tried the stackful approach in the async fiber module of the bincode-next UAF backend.
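To give a feel for what "lock-free" means for an arena like the Lock-Free Context Arena named above, here is a minimal, generic sketch of a lock-free slot allocator: a Treiber-style free list of slot indices whose head word also carries a generation tag, so that a stale compare-and-swap fails even if the same index was freed and re-pushed in between (the ABA problem). `SlotPool`, `alloc`, and `free` are hypothetical names; this is not dtact's actual arena.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Sentinel index meaning "free list is empty".
const NIL: u32 = u32::MAX;

/// Hypothetical illustration, not dtact's Lock-Free Context Arena:
/// a fixed-capacity pool of slot indices handed out without locks.
pub struct SlotPool {
    /// Packed head: high 32 bits = ABA tag, low 32 bits = first free index.
    head: AtomicU64,
    /// `next[i]` is the free slot that follows slot `i` in the free list.
    next: Vec<AtomicU32>,
}

fn pack(tag: u32, idx: u32) -> u64 {
    ((tag as u64) << 32) | idx as u64
}

fn unpack(word: u64) -> (u32, u32) {
    ((word >> 32) as u32, word as u32)
}

impl SlotPool {
    pub fn new(capacity: u32) -> Self {
        // Initial free list: 0 -> 1 -> ... -> capacity - 1 -> NIL.
        let next = (0..capacity)
            .map(|i| AtomicU32::new(if i + 1 < capacity { i + 1 } else { NIL }))
            .collect();
        let first = if capacity > 0 { 0 } else { NIL };
        SlotPool { head: AtomicU64::new(pack(0, first)), next }
    }

    /// Pops a free slot index, or returns `None` if the pool is exhausted.
    pub fn alloc(&self) -> Option<u32> {
        let mut cur = self.head.load(Ordering::Acquire);
        loop {
            let (tag, idx) = unpack(cur);
            if idx == NIL {
                return None;
            }
            let succ = self.next[idx as usize].load(Ordering::Relaxed);
            // Bumping the tag makes this CAS fail if the head changed in any way.
            match self.head.compare_exchange_weak(
                cur,
                pack(tag.wrapping_add(1), succ),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(idx),
                Err(actual) => cur = actual,
            }
        }
    }

    /// Pushes `idx` back onto the free list.
    pub fn free(&self, idx: u32) {
        let mut cur = self.head.load(Ordering::Acquire);
        loop {
            let (tag, top) = unpack(cur);
            // Link before publishing; the release CAS makes this store visible.
            self.next[idx as usize].store(top, Ordering::Relaxed);
            match self.head.compare_exchange_weak(
                cur,
                pack(tag.wrapping_add(1), idx),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return,
                Err(actual) => cur = actual,
            }
        }
    }
}
```

A real context arena would also own the slot storage and track slot lifetimes; the sketch only shows the lock-free index handout that such a design is typically built around.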

rssn-advanced

This project is highly experimental, and we are eager for external review. Its core design is my own, but most of the implementation was completed by @cn-starlabs (who doesn't seem to have an account on the Rust forum) with the help of Claude. After making some architectural fixes and code-quality improvements myself, I decided to release it alongside dtact. That said, the project has not yet had extensive code review, is at an early stage, and does not yet fully meet OSSF standards. Still, I personally find it impressive to see a full JIT symbolic computing engine with almost unlimited extensibility that genuinely combines symbolic and numerical computing. The official website is also still under development, so you may prefer to start with the repository, Apich-Organization/rssn-advanced on GitHub ("This is rssn-advanced: The next generation symbolic core of rssn."), or check out https://rssn-advanced.apich.org.
Here are the first performance results:

==============================================================================
   RSSN-Advanced JIT vs NumPy — Bulk Evaluation Benchmark
   N = 1,000,000 rows per expression  |  5 repeats, best time reported
==============================================================================

──────────────────────────────────────────────────────────────────────────────
  1. Trivial (baseline)
  x + y + 10.0
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              1.868 ms     1.87 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.133 ms     1.13 ns/eval
  NumPy (SIMD / C, hand-optimised)                2.824 ms     2.82 ns/eval
  SymPy lambdify → numpy backend                  2.763 ms     2.76 ns/eval

  JIT bulk  vs NumPy:  1.51x faster
  JIT batch vs NumPy:  2.49x faster

  Accuracy  bulk  max|Δ|=0.00e+00  ✔
            batch max|Δ|=0.00e+00  ✔

  NumPy intermediate arrays: ~2 ops → ~15 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

──────────────────────────────────────────────────────────────────────────────
  2. Degree-4 polynomial  (x-y)^4  [2 vars]
  x^4 - 4*x^3*y + 6*x^2*y^2 - 4*x*y^3 + y^4
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              2.729 ms     2.73 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.276 ms     1.28 ns/eval
  NumPy (SIMD / C, hand-optimised)               19.207 ms    19.21 ns/eval
  SymPy lambdify → numpy backend                 18.933 ms    18.93 ns/eval

  JIT bulk  vs NumPy:  7.04x faster
  JIT batch vs NumPy: 15.06x faster

  Accuracy  bulk  max|Δ|=5.46e-12  ✔
            batch max|Δ|=5.46e-12  ✔

  NumPy intermediate arrays: ~16 ops → ~122 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

──────────────────────────────────────────────────────────────────────────────
  3. Cubic surface  [3 vars, 10 terms]
  x^3 + y^3 + z^3 - 3*x*y*z + x^2*y - x*y^2 + y^2*z - y*z^2 + z^2*x - z*x^2
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              3.733 ms     3.73 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.772 ms     1.77 ns/eval
  NumPy (SIMD / C, hand-optimised)               75.751 ms    75.75 ns/eval
  SymPy lambdify → numpy backend                 78.214 ms    78.21 ns/eval

  JIT bulk  vs NumPy: 20.29x faster
  JIT batch vs NumPy: 42.74x faster

  Accuracy  bulk  max|Δ|=2.84e-13  ✔
            batch max|Δ|=2.84e-13  ✔

  NumPy intermediate arrays: ~27 ops → ~206 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

──────────────────────────────────────────────────────────────────────────────
  4. Rational w/ CSE  [2 vars, repeated subexpr]
  (x^2 + y^2) / (x^2 + y^2 + 1.0) + x*y*(x^2 - y^2) / (x^2 + y^2 + 1.0)^2
──────────────────────────────────────────────────────────────────────────────
  Rust JIT bulk  (scalar, Rust loop)              2.534 ms     2.53 ns/eval
  Rust JIT batch (2-row ILP vectorised)           1.266 ms     1.27 ns/eval
  NumPy (SIMD / C, hand-optimised)               15.913 ms    15.91 ns/eval
  SymPy lambdify → numpy backend                 22.961 ms    22.96 ns/eval

  JIT bulk  vs NumPy:  6.28x faster
  JIT batch vs NumPy: 12.57x faster

  Accuracy  bulk  max|Δ|=0.00e+00  ✔
            batch max|Δ|=0.00e+00  ✔

  NumPy intermediate arrays: ~20 ops → ~153 MB peak temp memory
  JIT: 0 intermediate arrays — all values kept in CPU registers

==============================================================================
  SUMMARY: JIT speedup vs hand-optimised NumPy
  Expression                                          bulk     batch
  ──────────────────────────────────────────────  ────────  ────────
  1. Trivial (baseline)                             1.51x    2.49x
  2. Degree-4 polynomial                            7.04x   15.06x
  3. Cubic surface                                 20.29x   42.74x
  4. Rational w/ CSE                                6.28x   12.57x

  Observation: speedup grows with expression complexity because
  NumPy's intermediate arrays overflow L2/L3 cache at N=1,000,000.
  JIT maintains register-resident computation across the entire
  expression, paying one memory read/write per input element.
==============================================================================

I ran the benchmark from Python through the CPython FFI, so the JIT side is significantly slower than it would be in a compiled Rust binary; for the sake of fairness, I kept that setup as is.
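The benchmark output attributes the gap to NumPy's intermediate arrays. As a rough illustration of that difference, and of what the "2-row ILP" label plausibly refers to, here are three hand-written Rust evaluations of case 1 (`x + y + 10.0`): array-at-a-time with a full temporary, as NumPy evaluates it; a fused single pass; and a fused pass that handles two rows per iteration so the CPU can overlap two independent dependency chains. These are our own sketches under those assumptions, not the code rssn-advanced's Cranelift JIT actually emits.

```rust
/// Array-at-a-time, NumPy style: `x + y` materialises a full temporary,
/// then `+ 10.0` produces the result array.
fn eval_array_at_a_time(x: &[f64], y: &[f64]) -> Vec<f64> {
    let tmp: Vec<f64> = x.iter().zip(y).map(|(a, b)| a + b).collect(); // temporary
    tmp.iter().map(|t| t + 10.0).collect()
}

/// Fused single pass: the intermediate `a + b` never leaves a register.
fn eval_fused(x: &[f64], y: &[f64], out: &mut [f64]) {
    for ((o, &a), &b) in out.iter_mut().zip(x).zip(y) {
        *o = a + b + 10.0;
    }
}

/// Fused pass over two rows per iteration; `r0` and `r1` do not depend on
/// each other, which gives an out-of-order core instruction-level parallelism.
fn eval_fused_2row(x: &[f64], y: &[f64], out: &mut [f64]) {
    let n = out.len().min(x.len()).min(y.len());
    for i in 0..n / 2 {
        let j = 2 * i;
        let r0 = x[j] + y[j] + 10.0;
        let r1 = x[j + 1] + y[j + 1] + 10.0;
        out[j] = r0;
        out[j + 1] = r1;
    }
    if n % 2 == 1 {
        // Odd tail row.
        out[n - 1] = x[n - 1] + y[n - 1] + 10.0;
    }
}
```

All three perform the same floating-point operations in the same order, so their results are bitwise identical; the difference is only in memory traffic and scheduling, which is why the gap widens as expressions (and NumPy's temporaries) grow.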

Also from our docs:
Architecture

Module         Role
[dag]          Hash-consed expression DAG: the canonical, deduplicated store for all symbolic nodes
[ast]          Lightweight local tree projection of a DAG subgraph via relative i32 pointers
[parser]       nom-based infix parser: "x^2 + 2*x + 1" → DAG root
[jit]          (feature: cranelift-jit/jit) Cranelift JIT; emits scalar f64 closures and 2-row ILP batch functions
[heuristic]    Configurable greedy/beam simplifier with a pluggable [heuristic::rule_registry::RuleRegistry]
[egraph]       Lightweight equality saturation over the DAG (no egg dependency)
[custom]       Unified custom-operator system: one [custom::descriptor::CustomOpDescriptor] wires into JIT + simplifier + e-graph
[simd]         Slice-level batch arithmetic using the inline-asm presets
[asm_presets]  Hand-written f64×2 / f64×4 kernels for x86_64 (SSE2/AVX2/AES-NI), AArch64 (NEON/crypto), riscv64 (RVV/Zkn)
[ffi]          Flat extern "C" surface generated by cbindgen; includes a fiber-backed async bridge
[parallel]     Fiber-based parallel simplification via the dtact runtime
[storage]      Disk-backed DAG spillover and a frequency-based hot-node cache
[error]        Cold-path error types and the rssn_error! macro
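The [dag] module is described as a hash-consed, deduplicated store. As a generic illustration of hash-consing (not rssn-advanced's real node type or API), here is a minimal sketch: every node is interned through a map from node to id, so structurally equal subexpressions are stored once and share an id. `Dag`, `Node`, `NodeId`, and `x_squared_plus_2x_plus_1` are hypothetical names, and constants are integers only so that `Node` can derive `Eq` and `Hash`.

```rust
use std::collections::HashMap;

/// Index of a node in the DAG's node vector.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Hypothetical node shape for illustration (integer constants only).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Node {
    Var(String),
    Const(i64),
    Add(NodeId, NodeId),
    Mul(NodeId, NodeId),
    Pow(NodeId, NodeId),
}

/// Hash-consed store: each structurally distinct node is stored exactly once.
#[derive(Default)]
struct Dag {
    nodes: Vec<Node>,
    index: HashMap<Node, NodeId>,
}

impl Dag {
    /// Returns the existing id for `node`, or stores it and returns a new id.
    fn intern(&mut self, node: Node) -> NodeId {
        if let Some(&id) = self.index.get(&node) {
            return id;
        }
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(node.clone());
        self.index.insert(node, id);
        id
    }

    fn len(&self) -> usize {
        self.nodes.len()
    }
}

/// Builds `x^2 + 2*x + 1`; the constant `2` is shared by `x^2` and `2*x`.
fn x_squared_plus_2x_plus_1(dag: &mut Dag) -> NodeId {
    let x = dag.intern(Node::Var("x".into()));
    let two = dag.intern(Node::Const(2));
    let one = dag.intern(Node::Const(1));
    let x2 = dag.intern(Node::Pow(x, two));
    let two_x = dag.intern(Node::Mul(two, x));
    let sum = dag.intern(Node::Add(x2, two_x));
    dag.intern(Node::Add(sum, one))
}
```

Building the expression twice yields the same root id and leaves the store at 7 nodes, which is what makes equality checks on a hash-consed DAG a cheap id comparison.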

Bincode-next

Finally, a small bincode-next update, folded into the end of this post so we don't post too often: bincode-next v3.0.0-rc.15 is out, and we are fuzzing it continuously. We plan to release the stable version once the project has joined OSS-Fuzz and run there for a while. Link: Apich-Organization/bincode on GitHub ("Bincode-next: The next official rust implementation of bincode").

Common Links

Discord Server: Apich Organization
Contact E-mail: info@apich.org
OSSF Registration (bincode-next): BadgeApp
OSSF Registration (dtact): BadgeApp
Scorecard: OpenSSF Scorecard report
Discussions on Dtact style designs: Discussion on Synchronous Crate Concurrency Refactor using Stackful Coroutines Model in Rust