
Implementing NioScheduler in the ZIO Framework: Notes on the Difficulties of Replacing the Tokio-Style Scheduler


Background

In late 2024, the ZIO project opened issue #9356 with a $2,500 bounty: re-implement ZIO’s task scheduler using the design ideas from Nio, a 150-line experimental Rust runtime that claimed to outperform Tokio’s work-stealing scheduler on several benchmarks by replacing the random-stealing strategy with a “least-loaded worker” dispatch.

I picked up this bounty in February 2026. My background is Node.js + C/C++ system programming, not JVM. After three months of implementation, benchmarking, and reading other contributors’ attempts, I concluded that the bounty as literally specified is structurally hard to deliver on the JVM, and the difficulty is not algorithmic — it is a memory-model and architectural mismatch. This post is a retrospective for myself, and a reference for anyone considering the same problem.

Outline

  • What Nio actually does
  • ZIO’s existing ZScheduler and why it’s not a simple swap
  • My approach: LRU/LFU WorkerActivityTracker
  • Why it hit a ceiling: three structural problems
    • JVM doesn’t give you Nio’s tools
    • “Minimal intervention” and Nio’s design are mutually exclusive
    • The benchmark suite is a multi-axis trap
  • Hardware reality: SMP / NUMA / cache topology
  • A C reference implementation of the lock-free idea
  • What I would do differently
  • A note on LLM-generated bounty submissions

What Nio actually does

Nio’s scheduling decision is one line:

fn least_loaded_worker(&self) -> &Worker {
    self.workers.iter().min_by_key(|q| q.len.get()).unwrap()
}

Submit a task → scan all workers → pick the one with the shortest queue → push via an MPSC channel. No global queue. No work stealing. No CAS-heavy bookkeeping. The entire scheduler is ~150 lines.

This works in Rust because:

  1. The MPSC channel is wait-free on the producer side.
  2. Each worker’s queue length counter sits in its own cache line — no false sharing.
  3. The worker struct’s memory layout is under your direct control (#[repr(align(64))]).
  4. There’s no GC writing card marks across your “padding.”

Remove any one of these and the design degrades fast.

ZIO’s existing ZScheduler

ZIO’s scheduler is a JVM port of Tokio-style work-stealing:

  • Each worker has a local deque.
  • A globalQueue absorbs overflow / external submissions.
  • Idle workers enter searching mode and randomly probe peers to steal.
  • A Supervisor watches opCount to detect blocked workers and replace them.

The synchronization primitives are synchronized blocks, AtomicInteger, ConcurrentLinkedQueue, LockSupport.park/unpark.

This is not a stupid design — it’s exactly what production JVM concurrent runtimes (ForkJoinPool, Loom, Akka dispatcher) all converge on, because it plays well with the JVM’s actual cost model: cheap allocation, expensive cache misses on object headers, GC barriers on every reference write.

My approach: WorkerActivityTracker

I tried the minimal intervention path: keep ZScheduler, inject an LRU/LFU side-structure that lets submit and searching workers go straight to the right target in O(1) instead of scanning randomly.

Design sketch:

  • Doubly linked list + ConcurrentHashMap maintaining worker activity (head = idle, tail = hot).
  • touch(worker) moves a worker to the tail when it picks up work.
  • scale(worker) cools a worker after a successful steal, to avoid thundering herd.
  • submit pulls from head; steal pulls from tail.
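A minimal sketch of that tracker (a hypothetical reconstruction of my prototype; the node layout and method bodies here are my assumption, only the names touch/scale come from the design above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Doubly linked list of workers ordered by activity: head = idle, tail = hot.
// Every list mutation takes the same monitor -- the synchronized cost that,
// as described below, ended up dominating the prototype.
final class ActivityTracker {
    static final class Node {
        final int workerId;
        Node prev, next;
        Node(int id) { this.workerId = id; }
    }

    private final Node head = new Node(-1); // sentinel: idle end
    private final Node tail = new Node(-1); // sentinel: hot end
    private final Map<Integer, Node> index = new ConcurrentHashMap<>();

    ActivityTracker(int workers) {
        head.next = tail; tail.prev = head;
        for (int i = 0; i < workers; i++) {
            Node n = new Node(i);
            index.put(i, n);
            insertBefore(tail, n);
        }
    }

    // Worker picked up work: move it to the hot end.
    synchronized void touch(int workerId) {
        Node n = index.get(workerId);
        unlink(n);
        insertBefore(tail, n);
    }

    // Worker was just stolen from: cool it toward the idle end.
    synchronized void scale(int workerId) {
        Node n = index.get(workerId);
        unlink(n);
        insertAfter(head, n);
    }

    // submit targets the idlest worker in O(1).
    synchronized int idlest() { return head.next.workerId; }

    // steal targets the hottest worker in O(1).
    synchronized int hottest() { return tail.prev.workerId; }

    private static void unlink(Node n) {
        n.prev.next = n.next; n.next.prev = n.prev;
    }
    private static void insertBefore(Node pos, Node n) {
        n.prev = pos.prev; n.next = pos;
        pos.prev.next = n; pos.prev = n;
    }
    private static void insertAfter(Node pos, Node n) {
        n.next = pos.next; n.prev = pos;
        pos.next.prev = n; pos.next = n;
    }
}
```

Note that the O(1) lookup is real, but it is bought with one shared monitor on every touch.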

Preliminary benchmarks on an 8-core GitHub Codespaces box showed a throughput improvement on ForkMany-style workloads. But the optimisation paid for it with a synchronized section on the LRU on every touch, and that lock turned out to dominate everything else.

Why it hit a ceiling — three structural problems

1. The JVM doesn’t give you Nio’s tools

Nio’s win comes from primitives the JVM either lacks or hides:

Nio (Rust)                     | JVM equivalent                         | Catch
#[repr(align(64))] per-worker  | @jdk.internal.vm.annotation.Contended  | requires -XX:-RestrictContended, not a public API
AtomicU64, no header overhead  | AtomicLong (an object)                 | object header + GC card mark on the field
manual mmap placement          | none                                   | the GC controls layout, not you
compiler_fence, per-direction  | VarHandle.acquireFence() etc.          | works, but costs more on JIT'd code
MPSC wait-free channel         | MpscArrayQueue (JCTools)               | external lib, still slower than the Rust equivalent

You can mimic the intent, but not the cost. Even the best-tuned JVM version of Nio’s least_loaded_worker will pay roughly 1.5× to 3× the per-op overhead of the Rust original.
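For concreteness, my own best-effort JVM rendering of least_loaded_worker (a sketch, not code from any PR) looks roughly like this. Even this trivial scan performs one volatile read per worker, and AtomicLongArray packs all counters into one long[], so adjacent workers share cache lines, exactly the false sharing that #[repr(align(64))] rules out in Rust:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical least-loaded dispatch on the JVM: one queue-length slot per
// worker. Padding the array (using a stride > 1) would mitigate false
// sharing at the cost of wasted memory.
final class LeastLoaded {
    final AtomicLongArray queueLen;
    final int workers;

    LeastLoaded(int workers) {
        this.workers = workers;
        this.queueLen = new AtomicLongArray(workers);
    }

    // Scan all workers; each get() is a volatile (acquire) read.
    int pick() {
        int best = 0;
        long min = Long.MAX_VALUE;
        for (int i = 0; i < workers; i++) {
            long len = queueLen.get(i);
            if (len < min) { min = len; best = i; }
        }
        return best;
    }
}
```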

2. “Minimal intervention” and Nio are mutually exclusive

Nio’s win comes from deleting things — no global queue, no stealing, no random probing. If you keep ZScheduler’s globalQueue and stealing for “compatibility”, and just bolt an LRU on the side, you’ve added overhead without subtracting any. That’s exactly what my prototype did, and it’s also why #9684 was closed: it won ChainedFork and ForkMany, but lost PingPong and YieldMany. A Pareto regression in a benchmark suite is a hard “no merge.”

3. The benchmark suite is a multi-axis trap

The ZIO scheduler benchmarks measure things that are partially in tension:

  • PingPong rewards producer-consumer locality — two fibers should stay on the same worker.
  • ForkMany rewards submission throughput — push tasks out as wide as possible.
  • ChainedFork rewards low fork-tail-latency — consecutive forks shouldn’t queue.
  • YieldMany rewards runqueue rotation fairness — no fiber should be starved.

A “least-loaded” dispatch will systematically lose PingPong, because it actively un-co-locates the two fibers in a ping-pong pair to balance load. There is no algorithmic trick around this — it’s a design choice. But the issue body never says “we’re willing to lose X% on PingPong for Y% on ForkMany,” so every PR ends up in subjective trade-off territory.
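The de-colocation effect is easy to see in a toy model (illustrative only; real fiber wake-ups involve far more machinery):

```java
// Toy model of a ping-pong pair under least-loaded dispatch. A "sticky"
// policy would keep a waking fiber on its partner's worker; least-loaded
// sends it to the emptier worker, splitting the pair across cores and
// turning every hand-off into a cross-core wake-up.
final class PingPongModel {
    static int leastLoaded(int[] load) {
        int best = 0;
        for (int i = 1; i < load.length; i++)
            if (load[i] < load[best]) best = i;
        return best;
    }

    // Returns the worker each fiber lands on after A (running on worker 0)
    // wakes its partner B.
    static int[] dispatchPair() {
        int[] load = new int[2];
        int a = 0;                 // fiber A already occupies worker 0
        load[a]++;
        int b = leastLoaded(load); // B goes to the least-loaded worker
        load[b]++;
        return new int[]{a, b};    // the pair is now split across workers
    }
}
```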

Hardware reality: SMP / NUMA / cache topology

Even if you ignore the JVM and write this in C, you have to face:

  • SMP vs AMP: Apple Silicon’s P-cores and E-cores have very different IPC. “Least loaded” by queue length is wrong if a 100-task queue on a P-core drains faster than a 50-task queue on an E-core.
  • NUMA: On a 2-socket Xeon, cross-socket cache coherence is ~80ns vs ~10ns intra-socket. A least-loaded worker on the wrong socket is a worse choice than a slightly-busier worker on the right socket.
  • Memory model: x86 is TSO (mostly sequentially consistent), ARM is weak. The same lock-free code needs explicit dmb ish on ARM that it doesn’t need on x86.
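On the JVM the per-architecture fences are chosen for you by the JIT: a release store compiles to a plain store on x86 (TSO) but needs stlr or a dmb-backed store on ARM. A sketch of single-writer publication with VarHandle (my own example, not ZIO code):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Single-writer publication: the producer fills `payload`, then publishes
// by storing `ready` with release semantics; the consumer reads `ready`
// with acquire semantics. The JIT emits whatever barrier the target
// architecture requires -- nothing extra on x86, dmb ish / stlr on ARM.
final class Publish {
    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(Publish.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    long payload;      // plain field: ordered by the fences below
    boolean ready;

    void publish(long v) {
        payload = v;                  // 1. write the data
        READY.setRelease(this, true); // 2. release-store the flag
    }

    // Returns the payload if published, or -1 if not yet visible.
    long tryConsume() {
        if ((boolean) READY.getAcquire(this)) { // acquire-load the flag
            return payload;                     // safe: ordered after flag
        }
        return -1;
    }
}
```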

A “good” scheduler cannot be one algorithm; it has to be topology-aware. Linux’s CFS and Windows’ UMS both took 10+ years of iteration to get this approximately right, with deep kernel hooks.

A C reference for the lock-free idea

The clean version of the design — what Nio’s authors actually had in mind, before the JVM compromises — looks like this in C. Per-worker head and tail cursors live on the same cache line; only one writer touches each side; __sync_synchronize provides the publish fence.

#include <stdint.h>
#include <limits.h>

#define NUM_WORKERS 4
#define QUEUE_SIZE 65536
#define QUEUE_MASK (QUEUE_SIZE - 1)
#define CACHE_LINE_SIZE 128

typedef struct {
    uint64_t id;
    int complexity;
} Task;

typedef struct {
    /* [Worker WriteZone] -- only the worker thread writes these */
    volatile unsigned long head;
    volatile uint64_t total_processed;

    /* [Allocator WriteZone] -- only the allocator thread writes this */
    volatile unsigned long tail;

    Task ring_buffer[QUEUE_SIZE];
} __attribute__((aligned(CACHE_LINE_SIZE))) WorkerSection;

/* Assumed to be defined elsewhere in the full program. */
extern volatile int system_running;
extern WorkerSection workers_mem[NUM_WORKERS];
void do_work(Task *t);
Task make_task(void);

static inline void cpu_relax(void) {
#if defined(__aarch64__) || defined(__arm__)
    __asm__ __volatile__("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

void *worker_thread(void *arg) {
    long id = (long)arg;
    WorkerSection *my_mem = &workers_mem[id];
    while (system_running) {
        unsigned long current_tail = my_mem->tail;
        unsigned long current_head = my_mem->head;
        if (current_head < current_tail) {
            Task t = my_mem->ring_buffer[current_head & QUEUE_MASK];
            my_mem->head++;
            do_work(&t);
            my_mem->total_processed++;
        } else {
            cpu_relax();
        }
    }
    return NULL;
}

void *allocator_thread(void *arg) {
    (void)arg; /* unused */
    int start_offset = 0;
    while (system_running) {
        Task t = make_task();
        int best = -1;
        unsigned long min_load = ULONG_MAX;
        start_offset = (start_offset + 1) % NUM_WORKERS;
        for (int i = 0; i < NUM_WORKERS; i++) {
            int w = (start_offset + i) % NUM_WORKERS;
            unsigned long load = workers_mem[w].tail - workers_mem[w].head;
            if (load >= QUEUE_SIZE) continue; /* queue full: skip */
            if (load < min_load) { min_load = load; best = w; }
        }
        if (best != -1) {
            WorkerSection *target = &workers_mem[best];
            target->ring_buffer[target->tail & QUEUE_MASK] = t;
            __sync_synchronize(); /* publish the slot before bumping tail */
            target->tail++;
        } else {
            cpu_relax();
        }
    }
    return NULL;
}

Key points the JVM port loses:

  • __attribute__((aligned(128))) is a hard guarantee. @Contended is a hint that JIT may ignore.
  • head and tail only have one writer each. JVM’s GC card-marking can spuriously dirty the cache line that contains them.
  • __sync_synchronize() is one mfence. The JVM equivalent (VarHandle.fullFence) often also emits a JIT-safepoint poll.
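For comparison, here is the closest JVM rendering I can write of that single-producer/single-consumer section (my own sketch, not code from any PR). It keeps the single-writer invariant and replaces __sync_synchronize with release/acquire cursor accesses, but it cannot control field placement or stop GC card marks:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// SPSC ring buffer mirroring the C WorkerSection: the consumer owns `head`,
// the producer owns `tail`. Unlike the C struct, the JVM decides where
// these fields live, so head and tail may share a cache line with anything.
final class SpscRing {
    static final int SIZE = 1 << 10, MASK = SIZE - 1;
    static final VarHandle HEAD, TAIL;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            HEAD = l.findVarHandle(SpscRing.class, "head", long.class);
            TAIL = l.findVarHandle(SpscRing.class, "tail", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    final long[] ring = new long[SIZE];
    long head, tail; // accessed only through the VarHandles above

    // Producer side: store the slot, then release-store the new tail.
    boolean offer(long task) {
        long t = (long) TAIL.getOpaque(this);
        if (t - (long) HEAD.getAcquire(this) >= SIZE) return false; // full
        ring[(int) (t & MASK)] = task;
        TAIL.setRelease(this, t + 1); // the publish fence
        return true;
    }

    // Consumer side: acquire-load tail, read the slot, bump head.
    long poll() {
        long h = (long) HEAD.getOpaque(this);
        if (h >= (long) TAIL.getAcquire(this)) return -1; // empty
        long task = ring[(int) (h & MASK)];
        HEAD.setRelease(this, h + 1);
        return task;
    }
}
```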

What I would do differently

If I were starting today, I would not try to replace ZScheduler. I would do the following in order:

  1. Open a design RFC first. Get the maintainers to commit to acceptance criteria: which benchmarks must not regress, by how much, and whether the new scheduler can be opt-in. Without this, every PR is shooting at a moving target.
  2. Ship as opt-in NioScheduler, not as replacement. Register it via Runtime.setScheduler so users who have ForkMany-shaped workloads can choose it. This sidesteps the Pareto-regression veto.
  3. Use JCTools’ MpscArrayQueue as the per-worker queue. It’s the closest thing the JVM has to Nio’s MPSC channel and is already used in Netty, Hazelcast, etc.
  4. Build a topology-aware variant for tier-2. Once the basic opt-in works, add NUMA-awareness (via Thread.setAffinity through JNI or taskset wrapping) for users who actually pin JVMs to sockets. Don’t ship this in v1.
  5. Make the C reference the README. Show the design in C first, then explain what each line costs on the JVM. This makes the trade-offs explicit instead of hidden.

A note on LLM-generated bounty submissions

Between November 2024 and March 2026, the issue accumulated 17+ /attempt claims. Most were a single-line plan posted by an LLM, with no follow-through. Two real PRs (#9684, #10220) were closed — one for Pareto regression, one for “code looks copy/pasted or AI-generated.”

The bounty mechanism plus easy LLM access has created a pattern where serious contributors arrive last and inherit the burden of proving they aren’t AI noise before any technical conversation can start. This is a real cost the OSS-bounty model has not yet adapted to.

Conclusion

I did not complete this bounty. What I did do was:

  • Implement the LRU/LFU side-tracker on top of ZScheduler and benchmark it.
  • Identify the structural reasons it cannot reach Nio-level performance on the JVM.
  • Write the lock-free reference in C to document what the intended design actually requires.
  • Propose an opt-in path that side-steps the Pareto-regression problem.

The lesson I’m taking from this is that algorithmic insight does not automatically port across runtimes. A scheduler design is a contract with the underlying memory model, the GC, the cache topology, and the existing API surface. Break any one of those contracts and the design loses its reason to exist.


Issue reference: zio/zio#9356 · My comments under @teddy1565