LAYER 1 Causal Graph Engine THE REAL INNOVATION
Nodes = lots, warehouses, crossdocks, retailers. Edges = shipment and repack events. Hidden edges = the inference problem.
Ground truth is a DAG with latent interventions — the agent never sees it directly. 30–50% of edges are hidden at episode start.
Each reset() generates a unique procedural graph. No two episodes share the same topology or contamination pattern.
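A minimal sketch of what one reset() might generate, assuming a plain random DAG over integer node ids. The helper name generate_episode_graph and its parameters are hypothetical, and the sketch does not enforce uniqueness across episodes; only the 30–50% hidden-edge range comes from the text above.

```python
import random

def generate_episode_graph(n_nodes=12, edge_prob=0.3, seed=None):
    """Build a random DAG and hide 30-50% of its edges (hypothetical helper)."""
    rng = random.Random(seed)
    # Edges only run from a lower to a higher node index, so the graph
    # is acyclic by construction.
    edges = [(u, v)
             for u in range(n_nodes)
             for v in range(u + 1, n_nodes)
             if rng.random() < edge_prob]
    # Draw the hidden fraction for this episode, then pick which edges to hide.
    hidden_fraction = rng.uniform(0.30, 0.50)
    n_hidden = round(hidden_fraction * len(edges))
    hidden = set(rng.sample(edges, n_hidden))
    visible = [e for e in edges if e not in hidden]
    return {"edges": edges, "visible": visible, "hidden": hidden}
```

The agent would only ever be handed the "visible" list; the "hidden" set stays inside the environment for scoring.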
LAYER 2 Hidden Intervention Layer CAUSAL, NOT CORRELATIONAL
Each episode samples its hidden intervention from 3 types: lot_relabel, mixing_event, record_deletion
Agent must infer which intervention occurred — not just where contamination spread. This is causal reasoning, not graph traversal.
Adversary chooses placement: source, midstream, or downstream nodes. Adds decoys, red herrings, and phantom lots.
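As an illustration of the hidden intervention record, here is a hedged sketch. The Intervention dataclass, the n_decoys field, and the uniform random placement are assumptions made for the example; in the design above the placement is chosen by the adversary, not drawn at random.

```python
import random
from dataclasses import dataclass

# The three intervention types and three placements named in the text.
INTERVENTION_TYPES = ("lot_relabel", "mixing_event", "record_deletion")
PLACEMENTS = ("source", "midstream", "downstream")

@dataclass(frozen=True)
class Intervention:
    kind: str        # which of the three intervention types occurred
    placement: str   # where in the supply chain it was applied
    n_decoys: int    # hypothetical count of decoy / phantom lots added

def sample_intervention(rng, max_decoys=3):
    """Random stand-in for the adversary's choice (assumption)."""
    return Intervention(
        kind=rng.choice(INTERVENTION_TYPES),
        placement=rng.choice(PLACEMENTS),
        n_decoys=rng.randint(0, max_decoys),
    )
```

The agent's task is then to recover the kind field from evidence, not only the set of contaminated nodes.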
LAYER 3 Agent Tool Calls 3 CATEGORIES
🔍 Observe
inspect_node()
Reveals hidden edges and local evidence at a node
trace_lot()
Returns full movement history of a lot ID
🧠 Hypothesize
cross_reference()
Checks shared origin between two lots
request_lab_test()
Confirms contamination at a specific node
✅ Commit
quarantine()
Containment action — penalized if target is safe
finalize()
Triggers ground truth evaluation and scoring
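The six tool calls could be exposed roughly as follows. This is a toy sketch, not the environment's actual interface: the class name ToyTraceEnv, the argument names, the return types, and the single shared set of unsafe identifiers are all assumptions for illustration.

```python
class ToyTraceEnv:
    """Toy stand-in exposing the six tool calls (all signatures assumed)."""

    def __init__(self, hidden_edges, lot_history, unsafe):
        self._hidden_edges = hidden_edges  # node -> list of (src, dst) edges
        self._lot_history = lot_history    # lot id -> ordered list of nodes
        self._unsafe = set(unsafe)         # ground truth, never shown to the agent
        self.quarantined = set()
        self.steps = 0

    # Observe
    def inspect_node(self, node):
        self.steps += 1
        return list(self._hidden_edges.get(node, []))

    def trace_lot(self, lot_id):
        self.steps += 1
        return list(self._lot_history.get(lot_id, []))

    # Hypothesize
    def cross_reference(self, lot_a, lot_b):
        self.steps += 1
        a = self._lot_history.get(lot_a, [])
        b = self._lot_history.get(lot_b, [])
        # "Shared origin" is read here as "same first node in the history".
        return bool(a and b and a[0] == b[0])

    def request_lab_test(self, target):
        self.steps += 1
        return target in self._unsafe

    # Commit
    def quarantine(self, target):
        self.steps += 1
        self.quarantined.add(target)

    def finalize(self):
        # Compare commitments against ground truth; does not cost a step here.
        tp = len(self.quarantined & self._unsafe)
        fp = len(self.quarantined - self._unsafe)
        return {"true_positives": tp, "false_positives": fp, "steps": self.steps}
```

Counting every Observe, Hypothesize, and Commit call as one step makes the cost of gathering evidence explicit.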
LAYER 4 Belief State Tracker THEME 3.1 — WORLD MODELING
After each tool call, environment returns: P(edge exists) per hidden arc, P(contaminated) per node.
Agent decides: is this belief certain enough to quarantine, or should it spend a step to reduce entropy?
Trained agent learns to stop gathering evidence when marginal information gain < step cost. Untrained agent over-explores.
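The stopping rule above can be sketched for a single node belief. This assumes the next evidence step is a noisy binary test with known sensitivity and specificity, and that step cost is expressed in bits; both are illustrative assumptions, as are the function names and default values.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli belief p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(p, sensitivity=0.9, specificity=0.9):
    """Expected entropy reduction (bits) from one noisy test on belief p."""
    # Probability of each test outcome under the current belief.
    p_pos = p * sensitivity + (1 - p) * (1 - specificity)
    p_neg = 1 - p_pos
    # Bayes update of P(contaminated) for each outcome.
    post_pos = p * sensitivity / p_pos if p_pos > 0 else 0.0
    post_neg = p * (1 - sensitivity) / p_neg if p_neg > 0 else 0.0
    expected_posterior = (p_pos * binary_entropy(post_pos)
                          + p_neg * binary_entropy(post_neg))
    return binary_entropy(p) - expected_posterior

def should_keep_exploring(p, step_cost_bits=0.1):
    """Explore only while the expected gain is at least the step cost."""
    return expected_information_gain(p) >= step_cost_bits
```

Under these assumed numbers, a belief of 0.5 still justifies another test, while a belief of 0.99 does not.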
LAYER 5 Composable Reward
RECALL
+2.0
per unsafe lot correctly quarantined
PRECISION
−1.5
per safe lot incorrectly blocked
CALIBRATION
+0.3
if P(contam) > 0.8 before quarantine
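A worked sketch of how the three reward terms might compose. The function name and inputs are hypothetical, and it assumes the calibration bonus applies to every quarantine whose prior belief exceeded 0.8, whether or not the target turned out to be unsafe; the source does not say which.

```python
def episode_reward(quarantined, unsafe, belief_at_quarantine):
    """Sum the recall, precision, and calibration terms over quarantine actions."""
    reward = 0.0
    for lot in quarantined:
        if lot in unsafe:
            reward += 2.0   # RECALL: unsafe lot correctly quarantined
        else:
            reward -= 1.5   # PRECISION: safe lot incorrectly blocked
        # CALIBRATION: belief was above 0.8 before committing (assumed per action).
        if belief_at_quarantine.get(lot, 0.0) > 0.8:
            reward += 0.3
    return reward
```

For example, quarantining A (unsafe, belief 0.9), B (unsafe, belief 0.6), and C (safe, belief 0.85) gives 2.3 + 2.0 − 1.2 = 3.1.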
LAYER 6 Adversarial Curriculum THEME 4 — SELF-PLAY
Replaces static difficulty tiers. Adversary agent tracks investigator failure modes and adapts episode generation.
If agent over-quarantines → next episode has more safe stock (decoys, false positives). If agent under-quarantines → next episode adds more hidden relabel hops.
Recursive skill amplification: both agents improve simultaneously. The benchmark teaches itself to be harder. Neither agent is told the strategies it discovers.
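The adaptation rule described above can be illustrated with a rule-based stand-in. The real adversary is described as an agent; this sketch replaces it with two fixed rules, and EpisodeConfig, its fields, and adapt_config are hypothetical names.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EpisodeConfig:
    n_decoy_lots: int = 2     # safe stock that looks suspicious
    n_relabel_hops: int = 1   # hidden relabel hops on the contamination path

def adapt_config(config, false_positives, false_negatives):
    """Harden the next episode against the investigator's dominant failure mode."""
    if false_positives > false_negatives:
        # Over-quarantining: add more safe decoy stock.
        return replace(config, n_decoy_lots=config.n_decoy_lots + 1)
    if false_negatives > false_positives:
        # Under-quarantining: add more hidden relabel hops.
        return replace(config, n_relabel_hops=config.n_relabel_hops + 1)
    return config
```

A learned adversary would replace the two if-branches with a policy trained on the investigator's failure history.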
LAYER 7 What Judges See
1
Procedural generation: reset() live, with a new graph, a new hidden intervention sampled, and a unique topology every episode
2
World modeling visible — belief tracker panel shows P(contaminated) rising as agent inspects nodes in real time
3
Two orthogonal improvements — F1 curve 0.24→0.79 and belief calibration score rising together over 200 episodes
4
Learning is legible: side-by-side, the untrained agent scattershots across 6 nodes, while the trained agent stops once P > 0.85 and makes 2 precise quarantines
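For reference, the F1 figure in item 3 combines precision and recall over quarantine decisions. A minimal sketch, assuming decisions and ground truth are available as sets of lot ids (the function name and inputs are hypothetical):

```python
def f1_score(quarantined, unsafe):
    """Harmonic mean of precision and recall over quarantined lots."""
    true_positives = len(quarantined & unsafe)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(quarantined)
    recall = true_positives / len(unsafe)
    return 2 * precision * recall / (precision + recall)
```

Tracking this per episode alongside a calibration score would give the two curves shown in the demo.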