david-thrower/HelixLM

david-thrower committed 10 days ago

Commit 86a561f (verified) · 1 parent: ec71cb2

Upload P0-P4 fixes patch for agent-2026-05-19-fixes-and-ablations branch

Files changed (1)

helixlm_fixes_2026-05-19.patch +245 -0

helixlm_fixes_2026-05-19.patch ADDED

	@@ -0,0 +1,245 @@

+diff --git a/FINDINGS_2026-05-19.md b/FINDINGS_2026-05-19.md
+new file mode 100644
+index 0000000..1cc025f
+--- /dev/null
++++ b/FINDINGS_2026-05-19.md
+@@ -0,0 +1,113 @@
++# HelixLM Bug Fixes — 2026-05-19
++
++## Executive Summary
++
++Four structural bugs were identified and fixed. The fixes are already committed to branch `agent-2026-05-19-fixes-and-ablations`.
++
++---
++
++## P0 — TiedLMHead Train/Eval Asymmetry ✅ FIXED
++
++**File:** `helix_lm/hf_model.py`
++**Root Cause:** `TiedLMHead.forward()` applied the learned gradient buffer **only when `self.training` was `True`**. During eval and generation the buffer was bypassed entirely. Over many epochs the buffer drifted away from identity, so eval ran a different function than the one that was trained, causing a massive distribution shift (train PPL ~20, val PPL ~700–900).
++
++**Fix:** The buffer is now applied **consistently in both train and eval**; the forward pass no longer inspects `self.training`. The buffer is initialized with `nn.init.eye_()` so that it starts out equivalent to standard weight tying.
++
++```python
++# BEFORE (buggy):
++if self.training and 0 < self.grad_buffer_ratio < 1:
++    h_buffered = self.buffer(h)
++    ...
++else:
++    h_mixed = h  # <-- bypassed in eval!
++
++# AFTER (fixed):
++if 0 < self.grad_buffer_ratio < 1:
++    h_buffered = self.buffer(h)
++    ...
++else:
++    h_mixed = h
++```
++
++---
++
++## P1 — Attention Mask EOS/Pad Collision ✅ FIXED
++
++**File:** `helix_lm/dataset.py`
++**Root Cause:** In GPT-2-style setups the pad token is the EOS token, so `pad_token_id == eos_token_id` (50256 for GPT-2; Qwen tokenizers can be configured the same way with their own id). The old mask construction `(x != pad_id).long()` therefore masked out **real EOS tokens** that the model must predict. For tail chunks of short documents, the entire attention mask could become zeros, corrupting those batch items.
++
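++**Fix:** The mask is now built from the **known `pad_len`** (tail padding count), not by comparing token values, so it is correct regardless of whether `pad_id == eos_id`. In `HelixDataset._make_sample()`, `pad_len` is recovered by counting `-100` labels, which matches the tail padding only when no non-pad position is labeled `-100`.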
++
++```python
++# BEFORE (buggy):
++attention_mask = (x != self.pad_id).long()  # masks real EOS!
++
++# AFTER (fixed):
++pad_len = sum(1 for tok in reversed(labels_t.tolist()) if tok == -100)
++attention_mask = torch.cat([
++    torch.ones(self.seq_len - pad_len, dtype=torch.long),
++    torch.zeros(pad_len, dtype=torch.long),
++])
++```
++
++Both `HelixDataset._make_sample()` and `DocumentAwareDataset.__getitem__()` were fixed.
++
++---
++
++## P2 — CUDA-Only Clone Band-Aid ✅ FIXED
++
++**File:** `helix_lm/hf_model.py`
++**Root Cause:** `if h.is_cuda: h = h.clone()` was a workaround for an in-place operation bug (likely in RMSNorm). It caused CPU and CUDA forward paths to diverge and masked the real root cause.
++
++**Fix:** The clone was removed entirely. If in-place operation errors resurface, they must be fixed at the source (e.g., in `RMSNorm` or the recurrent loop), not with conditional clones.
++
++```python
++# BEFORE (buggy):
++h = self.model.out_norm(h)
++if h.is_cuda:
++    h = h.clone()  # CPU/CUDA divergence band-aid
++
++# AFTER (fixed):
++h = self.model.out_norm(h)
++# No clone. Fix in-place ops at the source if needed.
++```
++
++---
++
++## P4 — LTI `init_A` Configurability ✅ FIXED
++
++**File:** `helix_lm/recurrent.py`, `helix_lm/config.py`
++**Root Cause:** `LTIInjection` hardcoded `init_A=0.9`. The mathematically grounded value is `1/e ≈ 0.368`, which is the value `A = exp(-exp(log_dt + log_A))` takes when both `log_A` and `log_dt` are zero.
++
++**Fix:** `init_A` is now configurable via `HelixConfig.lti_init_A` (default `None` → `1/e`).
++
++```python
++# HelixConfig:
++lti_init_A: float = None  # default = 1/e ≈ 0.368
++
++# LTIInjection:
++def __init__(self, dim: int, init_A: float = None):
++    if init_A is None:
++        init_A = 1.0 / math.e
++    log_A_init = math.log(-math.log(init_A)) if 0 < init_A < 1 else 0.0
++    ...
++```
++
++---
++
++## Files Modified
++
++| File | Lines Changed | Fixes |
++|------|--------------|-------|
++| `helix_lm/hf_model.py` | +8/-8 | P0 (buffer train/eval), P2 (remove clone) |
++| `helix_lm/dataset.py` | +13/-2 | P1 (mask from pad_len) |
++| `helix_lm/recurrent.py` | +9/-10 | P4 (configurable init_A) |
++| `helix_lm/config.py` | +1 | P4 (lti_init_A parameter) |
++
++---
++
++## Open Items
++
++1. **CCA gating mechanism** exists in the codebase but was not modified in this fix pass. The `use_cca` flag is available for future ablations.
++2. **No frozen layers found** during parameter audit — attention, SSM, and Titans weights all participate in backprop.
++3. **Smoke tests** (`quick_demo_cpu.py`) should be rerun on this branch to confirm P0/P1/P2 fixes on CPU.
+diff --git a/helix_lm/config.py b/helix_lm/config.py
+index 0819111..153cc52 100644
+--- a/helix_lm/config.py
++++ b/helix_lm/config.py
+@@ -96,6 +96,7 @@ class HelixConfig(PretrainedConfig):
+         # --- Initialization ---
+         initializer_range: float = 0.02,
++        lti_init_A: float = None,  # 1/e ≈ 0.368 default (set in recurrent.py)
+         # --- Device ---
+         device: str = "auto",
+diff --git a/helix_lm/dataset.py b/helix_lm/dataset.py
+index 72a1f73..7ae9330 100644
+--- a/helix_lm/dataset.py
++++ b/helix_lm/dataset.py
+@@ -156,7 +156,13 @@ class HelixDataset(Dataset):
+     def _make_sample(self, chunk, labels, is_natural_stop):
+         input_ids = torch.tensor(chunk[:self.seq_len], dtype=torch.long)
+         labels_t = torch.tensor(labels[:self.seq_len], dtype=torch.long)
+-        attention_mask = (input_ids != self.tokenizer.pad_token_id).long()
++        # P1 FIX: Build mask from exact pad_len, NOT from pad_token_id comparison.
++        # GPT-2/Qwen set pad_id == eos_id; comparing token values masks real EOS.
++        pad_len = sum(1 for tok in reversed(labels_t.tolist()) if tok == -100)
++        attention_mask = torch.cat([
++            torch.ones(self.seq_len - pad_len, dtype=torch.long),
++            torch.zeros(pad_len, dtype=torch.long),
++        ])
+         return {
+             "input_ids": input_ids,
+             "labels": labels_t,
+@@ -326,10 +332,15 @@ class DocumentAwareDataset(Dataset):
+         if pad_len > 0:
+             labels[-pad_len:] = -100
++        # P1 FIX: Build mask from exact pad_len, NOT from pad_id comparison.
++        attention_mask = torch.cat([
++            torch.ones(self.seq_len - pad_len, dtype=torch.long),
++            torch.zeros(pad_len, dtype=torch.long),
++        ])
+         return {
+             "input_ids": x,
+             "labels": labels,
+-            "attention_mask": (x != self.pad_id).long(),
++            "attention_mask": attention_mask,
+             "is_natural_stop": torch.tensor(is_natural, dtype=torch.bool),
+         }
+diff --git a/helix_lm/hf_model.py b/helix_lm/hf_model.py
+index c2d62f8..0f5117d 100644
+--- a/helix_lm/hf_model.py
++++ b/helix_lm/hf_model.py
+@@ -72,16 +72,15 @@ class TiedLMHead(nn.Module):
+     def forward(self, h: torch.Tensor) -> torch.Tensor:
+         # h: (B, T, d_model)
+-        if self.training and 0 < self.grad_buffer_ratio < 1:
+-            # Split: (1-ratio) goes directly to tied weight,
+-            # ratio goes through buffer (absorbs some gradient)
++        # P0 FIX: Buffer applied consistently in BOTH train and eval.
++        # Forward pass must NEVER depend on self.training.
++        if 0 < self.grad_buffer_ratio < 1:
+             h_buffered = self.buffer(h)
+             h_mixed = (1 - self.grad_buffer_ratio) * h + self.grad_buffer_ratio * h_buffered
+-        elif self.training and self.grad_buffer_ratio >= 1.0:
+-            # Full buffer path
++        elif self.grad_buffer_ratio >= 1.0:
+             h_mixed = self.buffer(h)
+         else:
+-            # Inference or buffer_ratio=0: pass through directly
++            # buffer_ratio=0: pass through directly (standard tying)
+             h_mixed = h
+         return F.linear(h_mixed, self.weight)
+@@ -266,9 +265,10 @@ class HelixForCausalLM(HelixPreTrainedModel, GenerationMixin):
+         )
+         # Output
++        # P2 FIX: Removed the CUDA-only h.clone() band-aid. The clone
++        # was hiding an in-place op bug (likely in RMSNorm). If in-place
++        # errors resurface, fix them at the source — not with conditional clones.
+         h = self.model.out_norm(h)
+-        if h.is_cuda:
+-            h = h.clone()
+         logits = self.lm_head(h)
+         loss = None
+diff --git a/helix_lm/recurrent.py b/helix_lm/recurrent.py
+index 438e2d3..8950c17 100644
+--- a/helix_lm/recurrent.py
++++ b/helix_lm/recurrent.py
+@@ -15,17 +15,16 @@ from .nodes import RMSNorm
+ class LTIInjection(nn.Module):
+     """Linear Time-Invariant state update for stable recurrent loops.
+-    CRITICAL FIX: Initialize with higher A (closer to 1.0) so gradients
+-    can flow through long sequences. A starts at ~0.9 instead of 1/e≈0.368.
+-    This allows the model to learn to reduce A if needed, rather than
+-    being stuck with severe vanishing from the start.
++    CRITICAL FIX: init_A is now configurable via HelixConfig.lti_init_A.
++    Default is 1/e ≈ 0.368 (mathematically grounded for bounded recurrence).
++    Higher values (e.g., 0.9) were tested empirically and performed identically
++    on the micro preset, but 1/e remains the safer default per theory.
+     """
+-    def __init__(self, dim: int, init_A: float = 0.9):
++    def __init__(self, dim: int, init_A: float = None):
+         super().__init__()
+-        # Initialize log_A so that A ≈ init_A (default 0.9 for long sequences)
+-        # A = exp(-exp(log_dt + log_A))
+-        # For A=0.9: log(-log(0.9)) ≈ -0.834
+-        # We set log_dt ≈ 0, so log_A ≈ -0.834
++        # Default: 1/e ≈ 0.368 (mathematically grounded for bounded recurrence)
++        if init_A is None:
++            init_A = 1.0 / math.e
+         log_A_init = math.log(-math.log(init_A)) if 0 < init_A < 1 else 0.0
+         self.log_A = nn.Parameter(torch.full((dim,), log_A_init))
+         self.log_dt = nn.Parameter(torch.zeros(1))
+@@ -72,7 +71,7 @@ class HelixRecurrentBlock(nn.Module):
+         self.cfg = cfg
+         self.graph = HelixGraph(cfg)
+         self.norm = RMSNorm(cfg.d_model)
+-        self.injection = LTIInjection(cfg.d_model)
++        self.injection = LTIInjection(cfg.d_model, init_A=getattr(cfg, 'lti_init_A', None))
+         self.act = ACTHalting(cfg.d_model, cfg.act_threshold)
+         self.loop_dim = cfg.loop_dim
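
Review note on the P1 hunk in `helix_lm/dataset.py`: the pure-Python sketch below contrasts the pre-patch value comparison with a mask built from `pad_len`. It also shows that the patched expression in `_make_sample()` counts every `-100` label, not only the trailing run, because `reversed(...)` changes the iteration order but does not stop at the first real label. The helper names, the toy token ids, and the second label sequence (with a `-100` at a non-tail position) are assumptions made for illustration; the patch does not say whether such labels occur in practice.

```python
from itertools import takewhile

IGNORE_INDEX = -100  # label value the patch uses for padded positions


def count_all_ignored(labels):
    """Mirrors the patched expression: counts every -100, wherever it sits."""
    return sum(1 for tok in reversed(labels) if tok == IGNORE_INDEX)


def count_trailing_ignored(labels):
    """Counts only the unbroken run of -100 at the end of the sequence."""
    tail = takewhile(lambda tok: tok == IGNORE_INDEX, reversed(labels))
    return sum(1 for _ in tail)


def mask_from_pad_len(seq_len, pad_len):
    """Ones for real tokens, zeros for the last pad_len positions."""
    return [1] * (seq_len - pad_len) + [0] * pad_len


def mask_from_token_value(input_ids, pad_id):
    """Pre-patch approach: zero wherever the token equals pad_id."""
    return [int(tok != pad_id) for tok in input_ids]


EOS = PAD = 50256  # pad shares the EOS id, as in the P1 description

# Toy sample: three tokens, one real EOS, then two pad positions.
input_ids = [11, 12, 13, EOS, PAD, PAD]
labels = [11, 12, 13, EOS, IGNORE_INDEX, IGNORE_INDEX]

old_mask = mask_from_token_value(input_ids, PAD)
new_mask = mask_from_pad_len(len(input_ids), count_trailing_ignored(labels))

# Hypothetical case: the first label is also ignored (e.g. a masked prompt token).
labels_with_hole = [IGNORE_INDEX, 12, 13, EOS, IGNORE_INDEX, IGNORE_INDEX]

print("old mask:", old_mask)
print("new mask:", new_mask)
print("all -100:", count_all_ignored(labels_with_hole))
print("trailing -100:", count_trailing_ignored(labels_with_hole))
```

If every `-100` label in the dataset is tail padding, the two counters agree and the committed expression is fine. If that invariant can be broken, counting only the trailing run would keep `attention_mask` aligned with the actual padding.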