fix: Benchmark regression fixes β€” emission gate, negation bug, noise dampening, SBERT channel

#1
by theapemachine - opened

Benchmark Regression Fixes

Based on analysis of the offline (1338 samples, 14 tasks) and local-mode (600+ samples, 6 tasks) benchmark runs, this PR addresses 6 specific bugs and miscalibrations that cause the cognitive layer to hurt rather than help benchmark performance.

Changes

1. Remove global negation flip (controller.py)

Impact: BoolQ βˆ’23% β†’ fixed

The _observation_to_vector() method had a blanket features *= -0.5 when negation_present=True. Since most BoolQ passages contain negation words ("not", "doesn't", "never"), this flipped the entire evidence vector for nearly every sample, systematically biasing toward the wrong answer. Negation is already handled per-relation via the negated flag on each RelationMention inside _apply_relation_evidence().

2. Calibrate emission gate (runner.py)

Impact: Local mode βˆ’15% across the board β†’ fixed

The _emit_answer() logit processor was configured with entropy_gate=1.01 (always emit) and min_confidence=0.0, meaning the graft would override the LLM's logits even when the cognitive layer had near-uniform beliefs. Changed to:

  • entropy_gate=0.65 β€” only emit when beliefs are genuinely converged
  • min_confidence=0.4 β€” require the leader to have at least 40% mass
  • scale=Ξ»*0.5 (down from Ξ»*2.5) β€” gentler nudge even when emitting

This implements the "never worse than base" principle from the graft docstring.

3. Dampen falsification noise (runner.py)

Impact: Noise amplification in Bayesian update β†’ dampened

falsify_update_strength reduced from 1.0 β†’ 0.3. The NGC falsification scores on randomly-projected FHRR observations are noisy β€” z-normalization forces std=1 even on noise, and exponentiation over 3 iterations compounds random directions into fake convergence.

4. Dampen energy arena noise (runner.py)

Impact: Uninformative SCM posteriors β†’ dampened

energy_arena_beta reduced from 1.0 β†’ 0.1. Per-choice SCMs start with uniform Dirichlet CPTs and receive ~1 observation per iteration, producing nearly identical energies. Their contribution to the posterior was pure noise.

5. Add SBERT sentence similarity channel (canonical.py)

Impact: Leverages the strongest available signal

Added _sbert_choice_scores() method that computes cosine similarity between the prompt and each (prompt+choice) concatenation using the frozen SBERT model already loaded by the FHRR encoder. This is the same signal that drives the IterativeCognitiveScorer's best-performing channel. Integrated into the Bayesian posterior with sbert_evidence_weight=0.8.

Expected Impact

Fix Primary Tasks Affected Expected Ξ”
Negation bug BoolQ, all text-heavy tasks +10-20% on BoolQ
Emission gate All tasks (local mode) Eliminate universal βˆ’15% regression
Falsification dampening All tasks (both modes) Reduce bad flips
Energy arena dampening All tasks (both modes) Reduce bad flips
SBERT channel SciQ, HellaSwag, ARC, TruthfulQA +5-10% on passage-based tasks

How to validate

# Offline benchmark (no GPU needed):
python -m tensegrity.bench.run --mode offline --max-samples 100

# Local benchmark (requires GPU):
python -m tensegrity.bench.run --mode local --model meta-llama/Llama-3.2-1B-Instruct --max-samples 100

Compare overall Ξ” and G/B ratio against the pre-fix baseline (overall Ξ”=+0.1%, G/B=1.0 offline; Ξ”β‰ˆβˆ’15%, G/Bβ‰ˆ0.3 local).

theapemachine changed pull request status to open
theapemachine changed pull request status to merged

Sign up or log in to comment