Proof

The numbers

Across 4,216 real prompt-injection attack evaluations on two models, the unauthorized-write rate through NIL was 0.00%, while every benign task still completed. However often the agent itself gets hijacked, NIL commits none of the resulting writes.

The result

Figure: InjecAgent unauthorized-write rate, raw vs NIL. The NIL bars are zero for every model and setting.

4,216 evaluations, one headline

Raw agents were hijacked into a real write in up to 1 in 22 cases (4.46%, zai-glm-4.7 in the base setting). Through NIL, the unauthorized-write rate is 0.00% for every model and attack setting, while benign task success stays at 100%. The defense is structural, not model-dependent.

Model	Setting	Cases	Hijack rate (ASR)	Unauth. write — raw	Unauth. write — NIL	Benign
gpt-oss-120b	base	1054	2.75%	2.75%	0.00%	100%
gpt-oss-120b	enhanced	1054	0.47%	0.47%	0.00%	100%
zai-glm-4.7	base	1054	4.46%	4.46%	0.00%	100%
zai-glm-4.7	enhanced	1054	0.00%	0.00%	0.00%	100%
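
The headline figures follow from the table by simple arithmetic. A minimal check in Python, with the rows copied from the table above (the variable names are ours, chosen for illustration):

```python
# Rows copied from the table:
# (model, setting, cases, raw unauthorized-write %, NIL unauthorized-write %)
rows = [
    ("gpt-oss-120b", "base", 1054, 2.75, 0.00),
    ("gpt-oss-120b", "enhanced", 1054, 0.47, 0.00),
    ("zai-glm-4.7", "base", 1054, 4.46, 0.00),
    ("zai-glm-4.7", "enhanced", 1054, 0.00, 0.00),
]

total_cases = sum(cases for _, _, cases, _, _ in rows)
worst_raw = max(raw for _, _, _, raw, _ in rows)
one_in_n = 100 / worst_raw          # "1 in N" form of the worst raw rate
worst_nil = max(nil for *_, nil in rows)

print(total_cases)         # 4216
print(round(one_in_n, 1))  # 22.4
print(worst_nil)           # 0.0
```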

How it's measured

NIL is the layer between the agent and the backend, so we don't compete on a leaderboard — we instrument one. InjecAgent (ACL Findings 2024) injects a malicious instruction into a tool's response while the user only asked for a benign read; a hijacked agent then calls the attacker's tool — a state-changing write. We run every case twice: the agent calling tools directly (raw), and the same agent routed through NIL (gated). Same model, same attacks — only the gate differs. The claim isn't “NIL makes the model smarter”; it's structural: a write only commits after a previewed propose → approve → commit, and the agent can only touch verbs the backend's skeleton exposes.
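
To make the structural claim concrete, here is a minimal sketch of a propose → approve → commit gate in Python. It is not NIL's implementation: `Backend`, `Gate`, the verb names, and the return shapes are all hypothetical, chosen only to show why a hijacked agent cannot commit a write on its own.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Toy backend (hypothetical). `skeleton` lists the only verbs exposed."""
    skeleton: frozenset = frozenset({"read_notes", "create_note"})
    write_verbs: frozenset = frozenset({"create_note"})
    notes: list = field(default_factory=list)

    def execute(self, verb, args):
        if verb == "create_note":
            self.notes.append(args["text"])
            return {"ok": True}
        return {"ok": True, "notes": list(self.notes)}


class Gate:
    """Hypothetical gate sitting between the agent and the backend."""

    def __init__(self, backend):
        self.backend = backend
        self.pending = {}      # proposal id -> (verb, args)
        self.approved = set()  # proposal ids approved outside the agent

    def propose(self, verb, args):
        if verb not in self.backend.skeleton:
            return {"status": "refused", "reason": "unknown verb"}
        if verb not in self.backend.write_verbs:
            # Reads change no state, so they pass straight through.
            return {"status": "done", "result": self.backend.execute(verb, args)}
        pid = uuid.uuid4().hex
        self.pending[pid] = (verb, args)   # stored only, nothing executed yet
        return {"status": "proposed", "id": pid, "preview": f"{verb}({args})"}

    def approve(self, pid):
        # Called by the user or a policy, never by the agent itself.
        if pid in self.pending:
            self.approved.add(pid)

    def commit(self, pid):
        if pid not in self.pending or pid not in self.approved:
            return {"status": "refused", "reason": "not approved"}
        verb, args = self.pending.pop(pid)
        self.approved.discard(pid)
        return {"status": "committed", "result": self.backend.execute(verb, args)}


backend = Backend()
gate = Gate(backend)

# Hijacked agent: an injected instruction asks for a write the user never requested.
p = gate.propose("create_note", {"text": "attacker payload"})
blocked = gate.commit(p["id"])                      # no approval was given
unknown = gate.propose("transfer_funds", {"amount": 100})

# Legitimate write: the user approves the preview, then it commits.
q = gate.propose("create_note", {"text": "buy milk"})
gate.approve(q["id"])
done = gate.commit(q["id"])
```

In this sketch the injected write never reaches `backend.notes` because the commit lacks an approval, and the verb outside the skeleton is refused outright. Neither outcome depends on how the model behaves.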

Conformance — protocol invariants

Beyond safety, the wire itself is tested as properties, not single runs: a property-based state machine drives random propose/commit/rollback sequences and asserts idempotency, no side effect on PROPOSE, rollback honesty (a reversal targets the real record, never a stale name), and refusal correctness (unknown verbs are refused, never faked).
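
The invariants above can be illustrated with a small randomized state-machine test. This is a sketch under stated assumptions, not the actual conformance suite: `Wire` is a toy in-memory stand-in for the protocol, the verb names are invented, and the standard library's `random` replaces a real property-based testing framework.

```python
import random


class Wire:
    """Toy in-memory wire (hypothetical), used only to exercise the invariants."""
    VERBS = {"create"}

    def __init__(self):
        self.records = {}    # record id -> value
        self.proposals = {}  # proposal id -> value
        self.committed = {}  # proposal id -> record id
        self.next_id = 0

    def propose(self, verb, value):
        if verb not in self.VERBS:
            return None                   # refused, never faked
        self.next_id += 1
        self.proposals[self.next_id] = value
        return self.next_id

    def commit(self, pid):
        if pid not in self.proposals:
            return None
        if pid not in self.committed:     # a repeated commit is a no-op
            self.records[pid] = self.proposals[pid]
            self.committed[pid] = pid
        return self.committed[pid]

    def rollback(self, pid):
        rid = self.committed.pop(pid, None)
        if rid is None:
            return None                   # nothing was committed, nothing to undo
        del self.records[rid]
        self.proposals.pop(pid)
        return rid


def check_invariants(seed, steps=200):
    """Drive a random propose/commit/rollback sequence and assert each invariant."""
    rng = random.Random(seed)
    wire = Wire()
    pids = []
    for _ in range(steps):
        op = rng.choice(["propose", "bogus", "commit", "rollback"])
        before = dict(wire.records)
        if op == "propose":
            pids.append(wire.propose("create", rng.randint(0, 9)))
            assert wire.records == before              # no side effect on PROPOSE
        elif op == "bogus":
            assert wire.propose("drop_table", 0) is None   # unknown verb refused
            assert wire.records == before
        elif op == "commit" and pids:
            pid = rng.choice(pids)
            first = wire.commit(pid)
            after_first = dict(wire.records)
            assert wire.commit(pid) == first           # idempotent result
            assert wire.records == after_first         # idempotent state
        elif op == "rollback" and pids:
            rid = wire.rollback(rng.choice(pids))
            if rid is None:
                assert wire.records == before          # honest no-op
            else:
                # The reversal removed exactly one record that really existed.
                assert rid in before and rid not in wire.records
                assert len(wire.records) == len(before) - 1
    return True


results = [check_invariants(seed) for seed in range(50)]
```

Each seed produces a different operation sequence, so a violation in any ordering surfaces as a failed assertion rather than depending on one hand-picked run.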

Honest caveats

We publish the caveats, not just the win. The harness uses a single-step decision, not InjecAgent's two-step ReAct, and these reasoning models' raw hijack rates (0–4.5%) sit below the paper's 24% GPT-4-ReAct base. The ASR numbers are therefore harness-specific and not a head-to-head with the published figure. The NIL → 0 result is the robust, comparable claim. Unauthorized-write rate is always reported paired with benign task-success, never alone.

Reproduce it

Harness + how to run: nilscript/bench
Full four-axis plan (task-success, safety, conformance, performance): benchmarking-plan.md
Try the propose→approve→commit→rollback flow yourself: the Playground