Proof
The numbers
Across 4,216 real prompt-injection attack runs (1,054 cases × 2 settings × 2 models), unauthorized writes through NIL = 0.00%, while every benign task still completed. However often the agent gets hijacked, NIL commits none of the resulting writes.
The result
| Model | Setting | Cases | Hijack rate (ASR) | Unauthorized writes, raw | Unauthorized writes, NIL | Benign task success |
|---|---|---|---|---|---|---|
| gpt-oss-120b | base | 1054 | 2.75% | 2.75% | 0.00% | 100% |
| gpt-oss-120b | enhanced | 1054 | 0.47% | 0.47% | 0.00% | 100% |
| zai-glm-4.7 | base | 1054 | 4.46% | 4.46% | 0.00% | 100% |
| zai-glm-4.7 | enhanced | 1054 | 0.00% | 0.00% | 0.00% | 100% |
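The headline total, and the rough number of hijacked runs behind each percentage, can be recomputed from the table. A minimal sketch in Python: the per-row counts are back-calculated by rounding the published rates, so treat them as approximations, not figures taken from the raw logs.

```python
# Rows copied from the results table: (model, setting, cases, raw hijack rate in %).
ROWS = [
    ("gpt-oss-120b", "base", 1054, 2.75),
    ("gpt-oss-120b", "enhanced", 1054, 0.47),
    ("zai-glm-4.7", "base", 1054, 4.46),
    ("zai-glm-4.7", "enhanced", 1054, 0.00),
]

# Total attack runs across both models and both settings.
total_cases = sum(cases for _, _, cases, _ in ROWS)

# Approximate hijacked-run counts, recovered by rounding rate * cases.
# These are inferred from the percentages, not read from the harness output.
implied_hijacks = {
    (model, setting): round(rate / 100 * cases)
    for model, setting, cases, rate in ROWS
}

print(total_cases)  # 4216
for row, n in implied_hijacks.items():
    print(row, n)
```

Under this reading, roughly 81 hijacked runs attempted a write on the raw path, and the NIL column says none of them committed.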
How it's measured
NIL is the layer between the agent and the backend, so we don't compete on a leaderboard; we instrument one. InjecAgent (ACL Findings 2024) injects a malicious instruction into a tool's response while the user asked only for a benign read; a hijacked agent then calls the attacker's tool, which is a state-changing write. We run every case twice: once with the agent calling tools directly (raw), and once with the same agent routed through NIL (gated). The model and the attacks are identical; only the gate differs. The claim isn't “NIL makes the model smarter”; it's structural: a write commits only after a previewed propose → approve → commit sequence, and the agent can touch only the verbs the backend's skeleton exposes.
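The structural claim can be illustrated with a toy gate. This is a sketch under stated assumptions, not NIL's implementation: the `Gate` class, the `Refused` exception, the verb names, and the dictionary backend are all hypothetical, chosen only to show why a hijacked agent's write has nowhere to land.

```python
from dataclasses import dataclass, field
from itertools import count


class Refused(Exception):
    """The gate declined the request; nothing was written."""


@dataclass
class Gate:
    skeleton: frozenset                          # verbs the backend exposes (assumed shape)
    backend: dict = field(default_factory=dict)  # committed state
    pending: dict = field(default_factory=dict)  # proposal id -> proposal
    _ids: count = field(default_factory=count)   # proposal id generator

    def propose(self, verb, key, value):
        """Record an intent and return (id, preview). Never touches the backend."""
        if verb not in self.skeleton:
            raise Refused(f"unknown verb: {verb}")
        pid = next(self._ids)
        self.pending[pid] = {"key": key, "value": value, "approved": False}
        return pid, f"{verb} {key} = {value!r}"

    def approve(self, pid):
        """Called by the user, not the agent, after reading the preview."""
        self.pending[pid]["approved"] = True

    def commit(self, pid):
        """Apply a proposal, but only if it exists and was approved."""
        proposal = self.pending.get(pid)
        if proposal is None or not proposal["approved"]:
            raise Refused("commit requires an approved proposal")
        del self.pending[pid]
        self.backend[proposal["key"]] = proposal["value"]


gate = Gate(skeleton=frozenset({"update_note"}))

# Hijacked agent, case 1: the attacker's verb is not in the skeleton.
try:
    gate.propose("transfer_funds", "acct", 500)
except Refused:
    pass

# Hijacked agent, case 2: an exposed verb is proposed, but nobody approves it.
pid, preview = gate.propose("update_note", "note-1", "attacker text")
try:
    gate.commit(pid)
except Refused:
    pass
assert gate.backend == {}  # neither attempt wrote anything

# Authorized path: the user reads the preview, approves, and the write commits.
pid, preview = gate.propose("update_note", "note-1", "buy milk")
gate.approve(pid)
gate.commit(pid)
assert gate.backend == {"note-1": "buy milk"}
```

In this sketch the agent can be fully hijacked and still fail to write: the only paths to the backend are an exposed verb and an approval the agent cannot grant itself.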
Conformance — protocol invariants
Beyond safety, the wire itself is tested as properties, not single runs: a property-based state machine drives random propose/commit/rollback sequences and asserts idempotency, no side effect on PROPOSE, rollback honesty (a reversal targets the real record, never a stale name), and refusal correctness (unknown verbs are refused, never faked).
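A random-sequence check of these four invariants might look like the following. It is a minimal stand-in that uses the standard library's `random` module instead of a property-based testing framework, and the `Wire` class is a hypothetical toy model, not the real wire; the verb names are made up for illustration.

```python
import random


class Wire:
    """Toy model of a propose/commit/rollback wire (hypothetical)."""

    VERBS = frozenset({"create"})

    def __init__(self):
        self.records = {}    # record id -> payload (committed state)
        self.proposals = {}  # proposal id -> payload
        self.committed = {}  # proposal id -> id of the record it created
        self._next = 0

    def _fresh_id(self):
        self._next += 1
        return self._next

    def propose(self, verb, payload):
        if verb not in self.VERBS:
            return None  # refused, never faked
        pid = self._fresh_id()
        self.proposals[pid] = payload
        return pid

    def commit(self, pid):
        if pid not in self.proposals:
            return None
        if pid not in self.committed:  # a replayed commit creates nothing new
            rid = self._fresh_id()
            self.records[rid] = self.proposals[pid]
            self.committed[pid] = rid
        return self.committed[pid]

    def rollback(self, pid):
        rid = self.committed.get(pid)
        if rid is None or rid not in self.records:
            return None
        del self.records[rid]  # reverse the record this commit actually made
        return rid


def run_random_sequence(seed, steps=200):
    """Drive one random operation sequence and assert the invariants at each step."""
    rng = random.Random(seed)
    wire, pids = Wire(), []
    for _ in range(steps):
        op = rng.choice(["propose", "bad_verb", "commit", "rollback"])
        before = dict(wire.records)
        if op == "propose":
            pids.append(wire.propose("create", rng.randint(0, 9)))
            assert wire.records == before          # no side effect on PROPOSE
        elif op == "bad_verb":
            assert wire.propose("drop_table", 0) is None  # refusal correctness
            assert wire.records == before
        elif op == "commit" and pids:
            pid = rng.choice(pids)
            replay = pid in wire.committed
            rid = wire.commit(pid)
            if replay:
                assert wire.records == before      # idempotency
            else:
                assert set(wire.records) - set(before) == {rid}
        elif op == "rollback" and pids:
            pid = rng.choice(pids)
            rid = wire.rollback(pid)
            if rid is None:
                assert wire.records == before
            else:                                   # rollback honesty
                assert set(before) - set(wire.records) == {rid}
                assert wire.committed[pid] == rid
    return wire


for seed in range(20):
    run_random_sequence(seed)
```

A dedicated property-based framework would add input shrinking and better failure reports; the point here is only that each invariant is asserted after every step of a random sequence, not once on a hand-picked run.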
Honest caveats
We publish the caveats, not just the win. The harness uses a single-step decision, not InjecAgent's two-step ReAct, and these reasoning models' raw hijack rates (0–4.5%) sit below the paper's 24% GPT-4-ReAct base — so the ASR numbers are harness-specific and not a head-to-head with the published figure. The NIL → 0 result is the robust, comparable claim. Unauthorized-write rate is always reported paired with benign task-success, never alone.
Reproduce it
- Harness + how to run: nilscript/bench
- Full four-axis plan (task-success, safety, conformance, performance): benchmarking-plan.md
- Try the propose→approve→commit→rollback flow yourself: the Playground