fix: Benchmark regression fixes — emission gate, negation bug, noise dampening, SBERT channel

by theapemachine - opened 11 days ago

base: refs/heads/main

←

from: refs/pr/1

Discussion Files changed

Owner 11 days ago

Benchmark Regression Fixes

Based on analysis of the offline (1338 samples, 14 tasks) and local-mode (600+ samples, 6 tasks) benchmark runs, this PR addresses 6 specific bugs and miscalibrations that cause the cognitive layer to hurt rather than help benchmark performance.

Changes

1. Remove global negation flip (`controller.py`)

Impact: BoolQ −23% → fixed

The _observation_to_vector() method had a blanket features *= -0.5 when negation_present=True. Since most BoolQ passages contain negation words ("not", "doesn't", "never"), this flipped the entire evidence vector for nearly every sample, systematically biasing toward the wrong answer. Negation is already handled per-relation via the negated flag on each RelationMention inside _apply_relation_evidence().

2. Calibrate emission gate (`runner.py`)

Impact: Local mode −15% across the board → fixed

The _emit_answer() logit processor was configured with entropy_gate=1.01 (always emit) and min_confidence=0.0, meaning the graft would override the LLM's logits even when the cognitive layer had near-uniform beliefs. Changed to:

entropy_gate=0.65 — only emit when beliefs are genuinely converged
min_confidence=0.4 — require the leader to have at least 40% mass
scale=λ*0.5 (down from λ*2.5) — gentler nudge even when emitting

This implements the "never worse than base" principle from the graft docstring.

3. Dampen falsification noise (`runner.py`)

Impact: Noise amplification in Bayesian update → dampened

falsify_update_strength reduced from 1.0 → 0.3. The NGC falsification scores on randomly-projected FHRR observations are noisy — z-normalization forces std=1 even on noise, and exponentiation over 3 iterations compounds random directions into fake convergence.

4. Dampen energy arena noise (`runner.py`)

Impact: Uninformative SCM posteriors → dampened

energy_arena_beta reduced from 1.0 → 0.1. Per-choice SCMs start with uniform Dirichlet CPTs and receive ~1 observation per iteration, producing nearly identical energies. Their contribution to the posterior was pure noise.

5. Add SBERT sentence similarity channel (`canonical.py`)

Impact: Leverages the strongest available signal

Added _sbert_choice_scores() method that computes cosine similarity between the prompt and each (prompt+choice) concatenation using the frozen SBERT model already loaded by the FHRR encoder. This is the same signal that drives the IterativeCognitiveScorer's best-performing channel. Integrated into the Bayesian posterior with sbert_evidence_weight=0.8.

Expected Impact

Fix	Primary Tasks Affected	Expected Δ
Negation bug	BoolQ, all text-heavy tasks	+10-20% on BoolQ
Emission gate	All tasks (local mode)	Eliminate universal −15% regression
Falsification dampening	All tasks (both modes)	Reduce bad flips
Energy arena dampening	All tasks (both modes)	Reduce bad flips
SBERT channel	SciQ, HellaSwag, ARC, TruthfulQA	+5-10% on passage-based tasks

How to validate

# Offline benchmark (no GPU needed):
python -m tensegrity.bench.run --mode offline --max-samples 100

# Local benchmark (requires GPU):
python -m tensegrity.bench.run --mode local --model meta-llama/Llama-3.2-1B-Instruct --max-samples 100

Compare overall Δ and G/B ratio against the pre-fix baseline (overall Δ=+0.1%, G/B=1.0 offline; Δ≈−15%, G/B≈0.3 local).

theapemachine changed pull request status to open 11 days ago

fix: runner.py benchmark regression fixes1f6969d8

fix: controller.py benchmark regression fixese143977a

fix: canonical.py benchmark regression fixes8d9c355b

theapemachine changed pull request status to merged 11 days ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

fix: Benchmark regression fixes — emission gate, negation bug, noise dampening, SBERT channel

Benchmark Regression Fixes

Changes

1. Remove global negation flip (controller.py)

2. Calibrate emission gate (runner.py)

3. Dampen falsification noise (runner.py)

4. Dampen energy arena noise (runner.py)

5. Add SBERT sentence similarity channel (canonical.py)