feat: Persistent causal arena, BoolQ binary task fix, SBERT-only ablation baseline
1. Persistent Causal Arena (canonical.py)
Before: Per-choice SCMs were rebuilt from scratch for every benchmark item with uniform Dirichlet CPTs. They got ~1 observation per iteration, producing nearly identical energies: pure noise.
After: A domain SCM library (_domain_scm_library) persists across items within a task session. When a new item arrives:
- The pipeline looks up existing SCMs for the item's domain (e.g. "science", "causal", "logic")
- Per-choice SCMs are seeded with the domain SCM's accumulated CPTs instead of uniform priors
- After feedback, the gold-label observation updates BOTH the per-choice SCM and the persistent domain SCM
This means the causal arena gets progressively better at discriminating within a domain as it processes more items. Item 50 has much richer causal priors than item 1.
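The seeding-and-update flow above can be sketched as follows. This is a minimal illustration, not the real classes in canonical.py: `DomainSCMLibrary`, its method names, and the reduction of a CPT to a single Dirichlet count vector per variable are all hypothetical stand-ins for the persistence idea.

```python
from collections import defaultdict
import numpy as np

class DomainSCMLibrary:
    """Accumulates per-domain CPT counts across items (hypothetical sketch)."""
    def __init__(self, n_states=2, prior=1.0):
        # domain -> variable name -> Dirichlet pseudo-counts
        self._library = defaultdict(lambda: defaultdict(
            lambda: np.full(n_states, prior)))

    def seed_choice_scm(self, domain, variables):
        """New per-choice SCM starts from accumulated domain counts,
        not uniform priors."""
        return {v: self._library[domain][v].copy() for v in variables}

    def observe(self, domain, variable, state):
        """Gold-label feedback updates the persistent domain SCM; the
        caller would also update the per-choice SCM."""
        self._library[domain][variable][state] += 1.0

lib = DomainSCMLibrary()
# Two items in the "science" domain confirm state 1 of "hypothesis"
lib.observe("science", "hypothesis", 1)
lib.observe("science", "hypothesis", 1)
# A later item is seeded with the accumulated counts [1., 3.]
seeded = lib.seed_choice_scm("science", ["hypothesis"])
```

Item 1 would have been seeded with the flat prior `[1., 1.]`; later items inherit progressively sharper counts.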
2. BoolQ Binary Task Fix (controller.py)
Before: BoolQ was ~12% because the template parser detected negation words ("not", "doesn't") in passages and injected false evidence. PR#1's negation-flip removal helped (~23% → ~12%) but the parser still generated spurious relations from passage keywords that happened to match "yes"/"no" hypothesis labels.
After: _observation_to_vector() detects binary yes/no tasks (2 active hypotheses where at least one is "yes"/"no"/"true"/"false") and returns a zero evidence vector. The template parser is completely bypassed for these tasks. SBERT sentence similarity in the canonical pipeline provides the actual signal: it compares the passage+question against each answer option in embedding space, which is what works for reading comprehension.
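The guard can be sketched like this. The signature and the pluggable `parse_fn` are hypothetical simplifications of the actual _observation_to_vector() in controller.py; only the detection rule (2 hypotheses, at least one in the yes/no/true/false set) and the zero-vector return come from the change described above.

```python
import numpy as np

BINARY_LABELS = {"yes", "no", "true", "false"}

def observation_to_vector(active_hypotheses, parse_fn, passage):
    """Return an evidence vector; bypass the template parser for
    binary yes/no tasks (hypothetical simplified signature)."""
    labels = [h.strip().lower() for h in active_hypotheses]
    is_binary = len(labels) == 2 and any(l in BINARY_LABELS for l in labels)
    if is_binary:
        # Zero evidence: downstream SBERT similarity carries the signal.
        return np.zeros(len(active_hypotheses))
    return parse_fn(passage, active_hypotheses)

# BoolQ-style item: parser is bypassed even though it would inject evidence
vec = observation_to_vector(
    ["yes", "no"], lambda p, h: np.ones(len(h)), "The passage says ...")
```

Non-binary tasks still flow through the parser unchanged.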
3. SBERT-Only Ablation Script (scripts/ablation_sbert_only.py)
This is the single most important experiment for intellectual honesty. The script runs the same benchmark tasks using ONLY SBERT cosine similarity:
score(choice_i) = cosine_sim(sbert(prompt), sbert(prompt + choice_i))
No NGC, no causal arena, no Hopfield, no belief updates, no falsification. The cognitive layer's delta over this baseline is the only number that justifies the "manifold reasons" framing.
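The baseline scoring rule is small enough to spell out. In the real script the embedder would be a sentence-transformers model's encode(); here a toy deterministic bag-of-words embedder stands in so the sketch is self-contained, and `sbert_only_scores` is a hypothetical name.

```python
import numpy as np

def toy_embed(text, dim=64):
    """Stand-in for sbert(): deterministic bag-of-words hashing embedder."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[sum(map(ord, tok)) % dim] += 1.0
    return vec

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sbert_only_scores(prompt, choices, embed=toy_embed):
    """score(choice_i) = cosine_sim(embed(prompt), embed(prompt + choice_i))"""
    p = embed(prompt)
    return [cosine_sim(p, embed(prompt + " " + c)) for c in choices]

scores = sbert_only_scores("The sky is", ["blue today", "made of cheese"])
pred = int(np.argmax(scores))  # baseline prediction: highest-similarity choice
```

Because the prompt is embedded into both sides, scores are shifted toward 1; only the ranking across choices matters for accuracy.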
Usage:
python scripts/ablation_sbert_only.py --max-samples 100
python scripts/ablation_sbert_only.py --tasks copa,boolq,sciq --output ablation.json