RecallTrace — Architecture

LAYER 1 Causal Graph Engine THE REAL INNOVATION

Nodes = lots, warehouses, crossdocks, retailers. Edges = shipment and repack events. Hidden edges = the inference problem.

Ground truth is a DAG with latent interventions — the agent never sees it directly. 30–50% of edges are hidden at episode start.

Each reset() generates a unique procedural graph. No two episodes share the same topology or contamination pattern.
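The generation step described above can be sketched as follows. This is a minimal illustration, not the project's generator: the tier ordering, the `SupplyGraph` container, and the `generate_graph` and `visible_edges` names are assumptions made for this example. Edges only run from an upstream tier to the next tier down, which keeps the graph acyclic, and roughly 30–50% of them are hidden.

```python
import random
from dataclasses import dataclass

# Assumed tier order, upstream to downstream. Edges only point forward,
# so the resulting graph cannot contain a cycle.
TIERS = ["lot", "warehouse", "crossdock", "retailer"]


@dataclass
class SupplyGraph:
    nodes: dict   # node id -> tier name
    edges: set    # all ground-truth (src, dst) edges
    hidden: set   # subset of edges the agent cannot see at episode start


def generate_graph(seed, nodes_per_tier=3):
    """Build a random layered DAG and hide roughly 30-50% of its edges."""
    rng = random.Random(seed)
    nodes = {f"{tier}_{i}": tier for tier in TIERS for i in range(nodes_per_tier)}
    by_tier = [[n for n, t in nodes.items() if t == tier] for tier in TIERS]

    edges = set()
    for upstream, downstream in zip(by_tier, by_tier[1:]):
        for src in upstream:
            # Each upstream node ships to one or two nodes in the next tier.
            for dst in rng.sample(downstream, rng.randint(1, 2)):
                edges.add((src, dst))

    # The hidden share is drawn from [0.30, 0.50]; rounding to a whole
    # number of edges can push the realized share slightly outside that range.
    n_hidden = round(rng.uniform(0.30, 0.50) * len(edges))
    hidden = set(rng.sample(sorted(edges), n_hidden))
    return SupplyGraph(nodes, edges, hidden)


def visible_edges(graph):
    """Edges the agent can observe before any tool call."""
    return graph.edges - graph.hidden
```

Seeding the generator makes an episode reproducible. Different seeds will usually, though not provably, give different topologies, so a real implementation would need an extra check to guarantee that no two episodes repeat.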

LAYER 2 Hidden Intervention Layer CAUSAL, NOT CORRELATIONAL

Each episode samples its hidden intervention from 3 types: lot_relabel, mixing_event, record_deletion

Agent must infer which intervention occurred — not just where contamination spread. This is causal reasoning, not graph traversal.

Adversary chooses placement: source, midstream, or downstream nodes. Adds decoys, red herrings, and phantom lots.
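A minimal sketch of how one hidden intervention might be drawn for an episode. It assumes a single intervention per episode and uses uniform random sampling as a stand-in for the adversary's choice of placement and decoy count. The `Intervention` record, `sample_intervention`, and `max_decoys` are hypothetical names introduced for this example.

```python
import random
from dataclasses import dataclass

INTERVENTION_TYPES = ("lot_relabel", "mixing_event", "record_deletion")
PLACEMENTS = ("source", "midstream", "downstream")


@dataclass(frozen=True)
class Intervention:
    kind: str        # one of INTERVENTION_TYPES
    placement: str   # one of PLACEMENTS
    n_decoys: int    # number of decoy lots added to mislead the agent


def sample_intervention(rng, max_decoys=3):
    """Draw one hidden intervention (uniform stand-in for the adversary)."""
    return Intervention(
        kind=rng.choice(INTERVENTION_TYPES),
        placement=rng.choice(PLACEMENTS),
        n_decoys=rng.randint(0, max_decoys),
    )
```

The agent never receives this record. It only sees its downstream effects on shipment data, so it has to infer `kind` from evidence.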

LAYER 3 Agent Tool Calls 3 CATEGORIES

🔍 Observe

inspect_node(): Reveals hidden edges and local evidence at a node

trace_lot(): Returns full movement history of a lot ID

🧠 Hypothesize

cross_reference(): Checks shared origin between two lots

request_lab_test(): Confirms contamination at a specific node

✅ Commit

quarantine(): Containment action, penalized if target is safe

finalize(): Triggers ground truth evaluation and scoring
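The six tool calls can be mocked up on a toy environment to show how they fit together. This is a sketch under assumptions: the source gives only the tool names, so the argument lists, return types, the `ToyRecallEnv` class, and its fields are all invented for illustration. Here "shared origin" is simplified to "same first node in the movement history".

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecallEnv:
    edges: set          # ground-truth (src, dst) edges
    hidden: set         # edges not yet revealed to the agent
    lot_paths: dict     # lot id -> ordered list of nodes the lot visited
    contaminated: set   # ground-truth contaminated targets
    quarantined: set = field(default_factory=set)
    steps: int = 0

    # --- Observe ---
    def inspect_node(self, node):
        """Reveal every hidden edge that touches this node."""
        self.steps += 1
        revealed = {e for e in self.hidden if node in e}
        self.hidden -= revealed
        return sorted(revealed)

    def trace_lot(self, lot_id):
        """Return the recorded movement history of a lot."""
        self.steps += 1
        return list(self.lot_paths.get(lot_id, []))

    # --- Hypothesize ---
    def cross_reference(self, lot_a, lot_b):
        """True if both lots start at the same node (assumed definition)."""
        self.steps += 1
        a = self.lot_paths.get(lot_a, [])
        b = self.lot_paths.get(lot_b, [])
        return bool(a and b and a[0] == b[0])

    def request_lab_test(self, node):
        """Noise-free test: report whether the target is contaminated."""
        self.steps += 1
        return node in self.contaminated

    # --- Commit ---
    def quarantine(self, node):
        self.steps += 1
        self.quarantined.add(node)

    def finalize(self):
        """Compare quarantines with ground truth and return the counts."""
        return {
            "tp": len(self.quarantined & self.contaminated),
            "fp": len(self.quarantined - self.contaminated),
            "fn": len(self.contaminated - self.quarantined),
        }
```

Every call except `finalize()` increments `steps`, which gives a step cost something to count.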

LAYER 4 Belief State Tracker THEME 3.1 — WORLD MODELING

After each tool call, environment returns: P(edge exists) per hidden arc, P(contaminated) per node.

Agent decides: is this belief certain enough to quarantine, or should it spend a step to reduce entropy?

Trained agent learns to stop gathering evidence when marginal information gain < step cost. Untrained agent over-explores.
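The stopping rule can be made concrete with a small calculation. The sketch below assumes that the next action is a lab test on a single node, that the test has known sensitivity and specificity, and that the step cost is expressed in bits so it can be compared directly with information gain. None of these details are specified in the source, and all function and parameter names are illustrative.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no belief with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(p, sensitivity=0.9, specificity=0.9):
    """Expected entropy reduction (bits) from one noisy test on one node."""
    p_pos = sensitivity * p + (1 - specificity) * (1 - p)
    p_neg = 1 - p_pos
    # Posterior P(contaminated) after each possible test outcome (Bayes' rule).
    post_pos = sensitivity * p / p_pos if p_pos > 0 else 0.0
    post_neg = (1 - sensitivity) * p / p_neg if p_neg > 0 else 0.0
    expected_posterior = (
        p_pos * binary_entropy(post_pos) + p_neg * binary_entropy(post_neg)
    )
    return binary_entropy(p) - expected_posterior


def should_keep_gathering(p, step_cost_bits=0.1, **test_params):
    """Gather more evidence only while the expected gain beats the step cost."""
    return expected_information_gain(p, **test_params) > step_cost_bits
```

With these defaults, a belief of 0.5 still justifies another test, while a belief of 0.99 does not: the node is already almost certainly contaminated, so another test is expected to reveal less than the step costs.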

LAYER 5 Composable Reward

RECALL       +2.0   per unsafe lot correctly quarantined
PRECISION    −1.5   per safe lot incorrectly blocked
CALIBRATION  +0.3   if P(contaminated) > 0.8 before quarantine
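The three reward terms combine additively. The sketch below uses the weights given above and assumes the calibration bonus is paid once per quarantine action whose recorded belief exceeded 0.8, whether or not that quarantine turned out to be correct. The source does not settle that detail. The function name and argument layout are illustrative.

```python
def episode_reward(quarantined, unsafe, belief_at_quarantine):
    """Return the three reward components and their total.

    quarantined: set of lot ids the agent quarantined
    unsafe: set of lot ids that are truly contaminated
    belief_at_quarantine: lot id -> P(contaminated) just before quarantining
    """
    recall = 2.0 * len(quarantined & unsafe)        # +2.0 per correct quarantine
    precision = -1.5 * len(quarantined - unsafe)    # -1.5 per safe lot blocked
    confident = [
        lot for lot in quarantined
        if belief_at_quarantine.get(lot, 0.0) > 0.8
    ]
    calibration = 0.3 * len(confident)              # +0.3 per confident quarantine
    return {
        "recall": recall,
        "precision": precision,
        "calibration": calibration,
        "total": recall + precision + calibration,
    }
```

For example, quarantining lots A, B, and C when A and B are unsafe, with beliefs of 0.9, 0.6, and 0.85 respectively, gives 4.0 − 1.5 + 0.6 = 3.1.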

LAYER 6 Adversarial Curriculum THEME 4 — SELF-PLAY

Replaces static difficulty tiers. Adversary agent tracks investigator failure modes and adapts episode generation.

If agent over-quarantines → next episode has more safe stock (decoys, false positives). If agent under-quarantines → next episode adds more hidden relabel hops.

Recursive skill amplification: both agents improve simultaneously. The benchmark teaches itself to be harder. Neither agent was told the strategies they discover.
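The adaptation rule can be illustrated with a hand-written stand-in. The source describes a learned adversary agent. The sketch below is deliberately simpler: a fixed rule that nudges two generator settings in response to the investigator's last episode. `GeneratorConfig`, its fields, and `adapt` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GeneratorConfig:
    n_decoys: int = 2         # safe lots that look suspicious
    n_relabel_hops: int = 1   # hidden relabel steps along the contaminated path


def adapt(config, false_positives, false_negatives, max_value=10):
    """Return the next episode's config, targeting the observed failure mode."""
    n_decoys = config.n_decoys
    hops = config.n_relabel_hops
    if false_positives > false_negatives:
        # Over-quarantining: add more safe stock to punish broad blocking.
        n_decoys = min(n_decoys + 1, max_value)
    elif false_negatives > false_positives:
        # Under-quarantining: bury the contamination behind more relabels.
        hops = min(hops + 1, max_value)
    return GeneratorConfig(n_decoys, hops)
```

A learned adversary would replace the fixed rule with a policy trained on the investigator's failure history. The interface of taking failure statistics in and returning generator settings could stay the same.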

LAYER 7 What Judges See

1. Procedural generation: reset() runs live, producing a new graph with a newly sampled hidden intervention and a unique topology every episode

2. World modeling visible: the belief tracker panel shows P(contaminated) rising in real time as the agent inspects nodes

3. Two orthogonal improvements: the F1 curve (0.24→0.79) and the belief calibration score rise together over 200 episodes

4. Learning is legible: side by side, the untrained agent scattershots 6 nodes, while the trained agent stops once P > 0.85 and makes 2 precise quarantines
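For reference, the F1 score on the curve is the harmonic mean of quarantine precision and recall. A minimal helper, assuming it is computed per episode from counts of correct quarantines (tp), safe lots blocked (fp), and unsafe lots missed (fn). The function name and the zero-division convention are choices made for this example.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0.0 when nothing is at stake."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator
```

An agent that quarantines one unsafe lot, blocks one safe lot, and misses one unsafe lot scores 2 / 4 = 0.5.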