CharlesCNorton committed
Commit 470f0a9 · 1 Parent(s): a6b5f0c

Add SmolLM2-360M architecture analysis, fix PositionExtractor tokenization


ARCHITECTURE ANALYSIS
---------------------
- Add SMOLLM2_ARCHITECTURE.md: comprehensive technical reference (457 lines)
- 361.82M params, hidden_dim=960, 32 transformer layers
- Grouped Query Attention: 15 query heads, 5 KV heads (3:1 ratio)
- SwiGLU MLP: gate/up (960->2560), down (2560->960)
- RoPE position encoding (theta=100k, max 8192 tokens)
- Weight inventory: per-layer breakdown, parameter distribution

- Document critical tokenization behavior:
  - Digits tokenized individually: token_id = 32 + digit_value
  - "47 + 86" -> ['4', '7', ' +', ' ', '8', '6'] (6 tokens, not 8)
  - Operator tokens: ' +'=1232, ' -'=731, ' *'=1672, ' >'=2986, ' <'=2067, ' =='=1758
  - Space token: 216

- Hidden state analysis: Layer 31 (final) has std=1.34, ideal for extraction
- Add analyze_smollm2.py and smollm2_analysis.json for reproducibility

POSITIONEXTRACTOR FIX (model.py)
--------------------------------
The previous implementation hardcoded position assumptions:
- Assumed 3 tokens for operand A (positions 0-2)
- Assumed 2 tokens for operator (positions 3-4)
- Assumed 3 tokens for operand B (positions 5-7)

This was wrong: "47 + 86" is 6 tokens with A at 0-1, op at 2, space at 3, B at 4-5

Fix implements dynamic token-based detection:
- DIGIT_TOKENS = set(range(32, 42)) for '0'-'9'
- OPERATOR_TOKENS dict maps token IDs to operation indices
- _find_operator_position() scans for known operator tokens
- _extract_digit_features() handles 1-3 digit operands with LEFT-PADDING
  (ensures the units digit is always aligned regardless of number length)
- Now requires token_ids parameter for accurate parsing
- Returns op_indices_from_tokens for potential supervision signal
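A minimal sketch of the dynamic detection logic described above, using the documented token IDs. Helper names are hypothetical: the real PositionExtractor operates on hidden states; these functions only illustrate the position arithmetic:

```python
# Illustrative sketch only (hypothetical helpers, not the project's model.py).
DIGIT_TOKENS = set(range(32, 42))     # token IDs for '0'-'9'
OPERATOR_TOKENS = {1232: 0, 731: 1, 1672: 2, 2986: 3, 2067: 4, 1758: 5}

def find_operator_position(token_ids):
    """Scan for the first known operator token; return (position, op_index)."""
    for pos, t in enumerate(token_ids):
        if t in OPERATOR_TOKENS:
            return pos, OPERATOR_TOKENS[t]
    raise ValueError("no operator token found")

def digit_positions(token_ids, op_pos):
    """Collect digit positions for each operand, left-padded to length 3
    (None = padding) so the units digit always sits in the last slot."""
    a = [p for p in range(op_pos) if token_ids[p] in DIGIT_TOKENS]
    b = [p for p in range(op_pos + 1, len(token_ids)) if token_ids[p] in DIGIT_TOKENS]
    def pad(ps):
        return [None] * (3 - len(ps)) + ps
    return pad(a), pad(b)

# "47 + 86": A at positions 0-1, ' +' at 2, space at 3, B at 4-5
toks = [36, 39, 1232, 216, 40, 38]
op_pos, op_idx = find_operator_position(toks)   # (2, 0) -> add
a_pos, b_pos = digit_positions(toks, op_pos)    # ([None, 0, 1], [None, 4, 5])
```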

ARITHMETICMODEL UPDATES (model.py)
----------------------------------
- get_hidden_states() now returns (hidden, mask, token_ids)
- forward() passes token_ids to PositionExtractor when position_extract=True
- Handles variable return signatures across extractor types:
  - Extractor: (result_bits, a_bits, b_bits, op_logits)
  - PositionExtractor: + op_indices_from_tokens
  - DigitExtractor: + a_digit_logits, b_digit_logits
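The indexed-output convention above can be sketched as follows. The tuples here are mocked stand-ins (the real forward() returns tensors), so only the unpacking pattern is the point:

```python
# Sketch of the shared indexing convention across extractor types.
def unpack(outputs):
    """First four slots are common to every extractor; extras vary by type."""
    result_bits, a_bits, b_bits, op_logits = outputs[:4]
    extras = outputs[4:]   # op_indices_from_tokens, or digit logits, if present
    return result_bits, a_bits, b_bits, op_logits, extras

base = ("result", "a", "b", "op")                 # Extractor
pos  = base + ("op_idx",)                         # PositionExtractor
dig  = base + ("a_digits", "b_digits")            # DigitExtractor

for outputs in (base, pos, dig):
    r, a, b, op, extras = unpack(outputs)         # same first four everywhere
```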

TRAIN.PY UPDATES
----------------
- evaluate_llm() uses indexed outputs for compatibility with all extractors
- Training loop uses outputs[0], outputs[1], outputs[2], outputs[3]
- Sample predictions updated similarly

README.MD UPDATES
-----------------
- Add "Target Model: SmolLM2-360M-Instruct" section with architecture table
- Link to SMOLLM2_ARCHITECTURE.md for full technical reference
- Update Interface Layers section with actual Extractor/MultiHeadBitExtractor code
- Update Trainable Parameters with accurate counts (~4.4M for full Extractor)
- Update Training Strategy with actual loss components and commands
- Update Stage 3 progress with training infrastructure table
- Update Files section: split Core/LLM Integration, add new files
- Add references: SmolLM2 model card, Transformer paper, RoPE paper

VERIFICATION
------------
All operator detection tests pass:
5 + 3 -> A=5, B=3, op=add [OK]
47 + 86 -> A=47, B=86, op=add [OK]
127 - 28 -> A=127, B=28, op=sub [OK]
12 * 11 -> A=12, B=11, op=mul [OK]
200 > 50 -> A=200, B=50, op=gt [OK]
3 < 100 -> A=3, B=100, op=lt [OK]
42 == 42 -> A=42, B=42, op=eq [OK]

README.md CHANGED

@@ -308,6 +308,25 @@ We solve this by embedding **frozen, proven-correct arithmetic circuits** direct
 
 The model learns **call dispatch**, not arithmetic. The arithmetic is already solved.
 
+### Target Model: SmolLM2-360M-Instruct
+
+We use HuggingFace's SmolLM2-360M-Instruct as our base model. See [`llm_integration/SMOLLM2_ARCHITECTURE.md`](llm_integration/SMOLLM2_ARCHITECTURE.md) for the complete technical analysis.
+
+| Property | Value |
+|----------|-------|
+| Parameters | 361.82M |
+| Hidden Dimension | **960** (matches extractor input) |
+| Layers | 32 transformer blocks |
+| Attention | 15 query heads, 5 KV heads (GQA) |
+| MLP | SwiGLU (960→2560→960) |
+| Position Encoding | RoPE (theta=100k, max 8192) |
+
+**Key insight**: The hidden dimension of 960 exactly matches our extractor requirements—no projection layer needed.
+
+**Tokenization**: Digits are tokenized individually (`"47 + 86"` → `['4', '7', ' +', ' ', '8', '6']`), with digit token IDs following `token_id = 32 + digit_value`. This enables position-based operand extraction.
+
+**Hidden State Extraction**: Layer 31 (final, pre-LM-head) provides well-normalized representations (std=1.34) ideal for bit extraction. All 33 hidden state outputs are available (embedding + 32 layers).
+
 ### Architecture
 
 Standard MLP block with parallel circuit path:

@@ -323,7 +342,7 @@ x ──┬── MLP path ────────────────┬
 Augmented MLP forward pass:
 
 ```python
-def forward(x):  # x: [batch, seq, d_model]
+def forward(x):  # x: [batch, seq, d_model=960]
     # Original MLP path (unchanged)
     mlp_out = self.down_proj(silu(self.gate_proj(x)) * self.up_proj(x))
 

@@ -370,56 +389,75 @@ Full adder = 2 half-adders + carry OR, ~4 threshold layers.
 
 ### Interface Layers (Trainable)
 
-**BitExtractor** — Maps embedding → two 8-bit operands:
+**Extractor** — Extracts operands and operation from LLM hidden states:
 
 ```python
-class BitExtractor(nn.Module):
-    def __init__(self, d_model):
-        self.proj = nn.Linear(d_model, 16)
+class Extractor(nn.Module):
+    """Attention pooling + per-bit extraction networks."""
+
+    def __init__(self, hidden_dim=960):
+        self.attention_pool = AttentionPooling(hidden_dim, num_heads=4)
+        self.a_extractor = MultiHeadBitExtractor(hidden_dim)  # 8 separate bit networks
+        self.b_extractor = MultiHeadBitExtractor(hidden_dim)
+        self.op_router = nn.Sequential(
+            nn.Linear(hidden_dim, 256), nn.GELU(),
+            nn.Linear(256, 6)  # 6 operations
+        )
 
-    def forward(self, x):
-        logits = self.proj(x)
-        bits = heaviside(logits)  # STE for training
-        return bits[..., :8], bits[..., 8:]
+    def forward(self, hidden_states, attention_mask):
+        pooled = self.attention_pool(hidden_states, attention_mask)  # (batch, 960)
+        a_bits, _ = self.a_extractor(pooled)   # (batch, 8)
+        b_bits, _ = self.b_extractor(pooled)   # (batch, 8)
+        op_logits = self.op_router(pooled)     # (batch, 6)
+        return a_bits, b_bits, op_logits
 ```
 
-**BitInjector** — Maps result bits → embedding delta:
+**MultiHeadBitExtractor** — 8 specialized networks, one per bit:
 
 ```python
-class BitInjector(nn.Module):
-    def __init__(self, d_model):
-        self.proj = nn.Linear(16, d_model)
-        self.scale = nn.Parameter(torch.tensor(0.1))
-
-    def forward(self, result_bits, flags):
-        combined = torch.cat([result_bits, flags], dim=-1)
-        return self.proj(combined) * self.scale
+class MultiHeadBitExtractor(nn.Module):
+    def __init__(self, hidden_dim=960):
+        self.bit_extractors = nn.ModuleList([
+            nn.Sequential(nn.Linear(hidden_dim, 128), nn.GELU(), nn.Linear(128, 1))
+            for _ in range(8)
+        ])
+
+    def forward(self, x):
+        logits = torch.cat([ext(x) for ext in self.bit_extractors], dim=-1)
+        soft = torch.sigmoid(logits)
+        hard = heaviside_ste(logits)
+        return hard - soft.detach() + soft, logits  # STE
 ```
 
-**Router** — Decides when to use circuits:
+**AttentionPooling** — Learns which token positions matter:
 
 ```python
-class Router(nn.Module):
-    def __init__(self, d_model):
-        self.net = nn.Sequential(
-            nn.Linear(d_model, 64), nn.ReLU(),
-            nn.Linear(64, 2), nn.Softmax(dim=-1)
-        )
+class AttentionPooling(nn.Module):
+    """CLS-token style pooling with learned attention."""
+
+    def __init__(self, hidden_dim=960, num_heads=4):
+        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
+        self.query = nn.Linear(hidden_dim, hidden_dim)
+        self.key = nn.Linear(hidden_dim, hidden_dim)
+        self.value = nn.Linear(hidden_dim, hidden_dim)
```
 
 ### Trainable Parameters
 
-For SmolLM2-360M (d_model=960), augmenting 11 layers:
-
-| Component | Params/Layer |
-|-----------|-------------|
-| BitExtractor | 15,376 |
-| BitInjector | 16,321 |
-| Router | 61,698 |
-| OpSelector | ~31,000 |
-| **Total** | ~124,395 |
-
-**11 layers × 124,395 = ~1.37M trainable parameters** (0.38% of model)
+For SmolLM2-360M (hidden_dim=960):
+
+| Component | Parameters | Description |
+|-----------|------------|-------------|
+| AttentionPooling | ~3.7M | 4-head attention over sequence |
+| MultiHeadBitExtractor (×2) | ~245K each | 8 per-bit MLPs for A and B |
+| OpRouter | ~246K | 960→256→6 MLP |
+| **Extractor Total** | ~4.4M | Full extraction module |
+
+**Alternative architectures**:
+- `PositionExtractor`: ~1.5M (position-specific, no attention)
+- `DigitExtractor`: ~1.2M (predicts digits 0-9 instead of bits)
+
+With `--unfreeze_layers 4`: Adds ~39.3M trainable params (top 4 transformer layers).
 
 ### Gradient Flow
 

@@ -438,20 +476,33 @@ class HeavisideSTE(torch.autograd.Function):
 
 ### Training Strategy
 
-1. **Data**: Generate 8-bit arithmetic problems exhaustively (256×256 = 65,536 unique)
-2. **Loss**: Cross-entropy on answer tokens only (prompt masked with -100)
-3. **Optimizer**: AdamW on interface params only, lr=1e-4
-4. **Curriculum**: Single-digit → two-digit → full 8-bit → adversarial (127+128, 255+1)
+1. **Data**: Random 8-bit arithmetic problems (operands 0-255, 6 operations)
+2. **Loss**: Multi-component BCE + CE
+   - `result_loss`: BCE on output bits vs expected
+   - `a_loss`, `b_loss`: BCE on extracted bits vs ground truth (2× weight)
+   - `op_loss`: CE on operation classification
+3. **Optimizer**: AdamW, lr=3e-4, gradient clipping at 1.0
+4. **Curriculum**: Epoch-based range expansion (0-9 → 0-99 → 0-255)
+5. **Batching**: 256-4096 samples per batch (VRAM-dependent)
+
+```bash
+# Example training commands
+python train.py --mode router --epochs 100                        # Sanity check
+python train.py --mode llm --epochs 100 --batch_size 256          # Frozen LLM
+python train.py --mode llm --unfreeze_layers 4 --batch_size 4096  # Fine-tune top layers
+```
 
 ### Inference
 
-At inference, Heaviside is true step function—no approximation. If BitExtractor correctly extracts operands, the circuit **will** output the correct result. Circuit computation adds ~5-10% latency overhead.
+At inference, Heaviside is true step function—no approximation. If the Extractor correctly identifies operands, the circuit **will** output the correct result.
 
 ### Target Performance
 
-| Model | Baseline | Target |
-|-------|----------|--------|
-| SmolLM2-360M | ~5-10% | >95% |
+| Condition | Configuration | Accuracy |
+|-----------|---------------|----------|
+| Control | Vanilla SmolLM2-360M | 11.90% |
+| Circuits only | Ground truth bits | 100.00% |
+| Experimental | LLM + Extractor + Circuits | **Target: 100%** |
 
 The interface generalizes to **all** 65,536 8-bit additions once trained—no memorization, the circuits compute.
 

@@ -535,19 +586,37 @@ Head-to-head on 50 random cases: SmolLM2 got 7/50 (14%), circuits got 50/50 (100
 The actual challenge: train an interface that extracts operands and operations from LLM hidden states (not from pre-formatted bit inputs).
 
 ```
-"What is 47 + 86?"
+"47 + 86"
    ↓
-[LLM hidden states]
+[SmolLM2 hidden states: (seq_len, 960)]
    ↓
-BitExtractor (must LEARN: "47" → 00101111, "86" → 01010110)
-OpRouter (must LEARN: "+" → add operation)
+Extractor (must LEARN: hidden → a_bits, b_bits, op_logits)
    ↓
 [Frozen threshold circuits]
    ↓
-[Result bits] → "133"
+[Result bits] → 133
```
 
-The `train_passthrough_*.py` files demonstrate that routing works when given labels, but this is trivial—the real test is learning to parse from natural language.
+**Training Infrastructure** (`train.py`):
+
+| Mode | Description | Status |
+|------|-------------|--------|
+| `--mode router` | Train OpRouter with ground truth bits | 100% achieved |
+| `--mode interface` | Train BitEncoder + OpRouter | Ready |
+| `--mode llm` | Train from LLM hidden states | Active development |
+
+**LLM Mode Options**:
+- `--unfreeze_layers N`: Fine-tune top N transformer layers
+- `--extract_layer N`: Extract from intermediate layer (-1 = final)
+- `--position_extract`: Position-specific extraction (uses token positions)
+- `--digit_pred`: Predict digits (0-9) instead of bits
+
+**Extraction Architectures** (`model.py`):
+- `Extractor`: Attention pooling + per-bit MLPs
+- `PositionExtractor`: Position-aware (operand A from positions 0-2, B from 5-7)
+- `DigitExtractor`: Predicts 3 digits per operand, converts to bits
+
+**Curriculum Learning**: Training progresses 0-9 → 0-99 → 0-255 over epochs.
 
 #### Proof of Concept Scope
 

@@ -555,7 +624,11 @@ The `train_passthrough_*.py` files demonstrate that routing works when given lab
 - **Six operations**: ADD, SUB, MUL, GT, LT, EQ
 - **Pure ALU profile** (no memory access)
 
-**Current status**: Circuit validation complete. LLM hidden state extraction in development.
+**Current Status**:
+- Circuit validation: Complete (100% on all operations)
+- LLM baseline: Measured (11.90%)
+- SmolLM2 architecture analysis: Complete (see `SMOLLM2_ARCHITECTURE.md`)
+- Extraction training: In progress
 
 ### Extension Roadmap
 

@@ -581,18 +654,26 @@ The following extensions are planned after proof-of-concept validation:
 
 ## Files
 
+### Core
+
 | File | Description |
 |------|-------------|
-| `neural_computer.safetensors` | 15,685 tensors, 43,366 parameters (pure ALU profile) |
-| `eval.py` | Unified evaluation suite (6,738 tests, GPU-batched) |
-| `build.py` | Build tools with configurable memory partitioning |
+| `neural_computer.safetensors` | Frozen threshold circuits (~8.29M params full, ~32K pure ALU) |
+| `eval.py` | Unified evaluation suite (GPU-batched, exhaustive testing) |
+| `build.py` | Circuit generator with configurable memory profiles |
 | `prune_weights.py` | Weight magnitude pruning (GPU-batched, binary search conflict resolution) |
-| `llm_integration/baseline.py` | SmolLM2-360M arithmetic baseline evaluation (11.90% fitness) |
-| `llm_integration/fitness.py` | Shared fitness function for randomized arithmetic tests |
-| `llm_integration/circuits.py` | Frozen threshold circuit wrapper with STE gradients |
-| `llm_integration/model.py` | Interface layer definitions (BitEncoder, OpRouter, BitDecoder) |
-| `llm_integration/train_passthrough.py` | Scaffolding: trains with pre-formatted bit inputs |
-| `llm_integration/train_passthrough_router.py` | Scaffolding: router-only with ground truth bits |
+
+### LLM Integration (`llm_integration/`)
+
+| File | Description |
+|------|-------------|
+| `SMOLLM2_ARCHITECTURE.md` | Complete technical analysis of SmolLM2-360M (layers, weights, tokenization) |
+| `baseline.py` | SmolLM2-360M vanilla arithmetic evaluation (11.90% baseline) |
+| `circuits.py` | Frozen threshold circuit wrapper with STE gradients |
+| `fitness.py` | Shared fitness function (randomized arithmetic, no answer supervision) |
+| `model.py` | Interface layers: `BitEncoder`, `OpRouter`, `Extractor`, `ArithmeticModel` |
+| `train.py` | Unified training: `--mode router`, `--mode interface`, `--mode llm` |
+| `trained/router.pt` | Trained OpRouter checkpoint (100% with ground truth bits) |
 
 ### Build Tool Usage
 

@@ -653,4 +734,6 @@ MIT
 3. Siegelmann & Sontag (1995). "On the Computational Power of Neural Nets"
 4. Bengio et al. (2013). "Estimating or Propagating Gradients Through Stochastic Neurons"
 5. Ma et al. (2024). "The Era of 1-bit LLMs" (BitNet b1.58)
-6. HuggingFace (2024). "SmolLM2: Small Language Models"
+6. HuggingFace (2024). "SmolLM2: Small Language Models" — [Model Card](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)
+7. Vaswani et al. (2017). "Attention Is All You Need" — Transformer architecture
+8. Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding" — RoPE
llm_integration/SMOLLM2_ARCHITECTURE.md ADDED
@@ -0,0 +1,456 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SmolLM2-360M-Instruct Architecture Analysis
2
+
3
+ Technical reference document for the 8bit-threshold-computer LLM integration project.
4
+
5
+ **Model**: `HuggingFaceTB/SmolLM2-360M-Instruct`
6
+ **Architecture**: LlamaForCausalLM (Llama 2 variant)
7
+ **Tokenizer**: GPT2TokenizerFast
8
+ **Analysis Date**: 2026-01-21
9
+
10
+ ---
11
+
12
+ ## 1. Executive Summary
13
+
14
+ SmolLM2-360M-Instruct is a 362M parameter causal language model using the Llama architecture. Key characteristics relevant to our bit extraction task:
15
+
16
+ - **Hidden dimension: 960** (matches our extractor input requirement)
17
+ - **32 transformer layers** providing multiple extraction points
18
+ - **Digit-level tokenization** for numbers (each digit is a separate token)
19
+ - **Grouped Query Attention (GQA)** with 15 query heads and 5 KV heads
20
+
21
+ ---
22
+
23
+ ## 2. Architecture Census
24
+
25
+ ### 2.1 Core Parameters
26
+
27
+ | Parameter | Value |
28
+ |-----------|-------|
29
+ | Total Parameters | 361,821,120 (361.82M) |
30
+ | Vocabulary Size | 49,152 |
31
+ | Hidden Dimension | 960 |
32
+ | Intermediate Dimension (MLP) | 2,560 |
33
+ | Number of Layers | 32 |
34
+ | Number of Attention Heads | 15 |
35
+ | Number of KV Heads | 5 (Grouped Query Attention) |
36
+ | Head Dimension | 64 |
37
+ | Max Sequence Length | 8,192 |
38
+ | Activation Function | SiLU |
39
+ | Normalization | RMSNorm (eps=1e-05) |
40
+ | Position Encoding | RoPE (theta=100,000) |
41
+ | Word Embedding Tying | Yes (embed_tokens = lm_head) |
42
+
43
+ ### 2.2 Architecture Diagram
44
+
45
+ ```
46
+ Input Token IDs
47
+ |
48
+ v
49
+ +------------------+
50
+ | Embedding Layer | (49152, 960)
51
+ +------------------+
52
+ |
53
+ v
54
+ +------------------+
55
+ | LlamaDecoderLayer| x 32
56
+ | +-------------+ |
57
+ | | RMSNorm | |
58
+ | +-------------+ |
59
+ | | Self-Attn | | Q: (960, 960), K: (960, 320), V: (960, 320), O: (960, 960)
60
+ | +-------------+ |
61
+ | | Residual | |
62
+ | +-------------+ |
63
+ | | RMSNorm | |
64
+ | +-------------+ |
65
+ | | MLP (SwiGLU)| | gate: (960, 2560), up: (960, 2560), down: (2560, 960)
66
+ | +-------------+ |
67
+ | | Residual | |
68
+ +------------------+
69
+ |
70
+ v
71
+ +------------------+
72
+ | Final RMSNorm | (960,)
73
+ +------------------+
74
+ |
75
+ v
76
+ +------------------+
77
+ | LM Head | (960, 49152) - tied with embeddings
78
+ +------------------+
79
+ |
80
+ v
81
+ Logits (batch, seq, 49152)
82
+ ```
83
+
84
+ ### 2.3 Parameter Distribution
85
+
86
+ | Component | Parameters | Percentage |
87
+ |-----------|-----------|------------|
88
+ | Embedding | 47,185,920 | 13.04% |
89
+ | All Attention Layers | 78,643,200 | 21.74% |
90
+ | All MLP Layers | 235,929,600 | 65.19% |
91
+ | All Layer Norms | 61,440 | 0.02% |
92
+ | Final Norm | 960 | 0.00% |
93
+
94
+ Per-layer breakdown (each of 32 layers):
95
+ - Attention: 2,457,600 params (0.68%)
96
+ - MLP: 7,372,800 params (2.04%)
97
+ - Norms: 1,920 params (0.00%)
98
+
99
+ ---
100
+
101
+ ## 3. Weight Inventory
102
+
103
+ ### 3.1 Embedding and Output Layers
104
+
105
+ | Parameter Name | Shape | Elements | Notes |
106
+ |---------------|-------|----------|-------|
107
+ | `model.embed_tokens.weight` | (49152, 960) | 47,185,920 | Token embeddings |
108
+ | `model.norm.weight` | (960,) | 960 | Final layer norm |
109
+ | `lm_head.weight` | (49152, 960) | (tied) | Tied to embed_tokens |
110
+
111
+ ### 3.2 Single Transformer Layer Structure
112
+
113
+ Each of the 32 layers (`model.layers.{0-31}`) contains:
114
+
115
+ **Attention Block:**
116
+ | Parameter | Shape | Elements |
117
+ |-----------|-------|----------|
118
+ | `self_attn.q_proj.weight` | (960, 960) | 921,600 |
119
+ | `self_attn.k_proj.weight` | (320, 960) | 307,200 |
120
+ | `self_attn.v_proj.weight` | (320, 960) | 307,200 |
121
+ | `self_attn.o_proj.weight` | (960, 960) | 921,600 |
122
+ | **Attention Total** | | **2,457,600** |
123
+
124
+ **MLP Block (SwiGLU):**
125
+ | Parameter | Shape | Elements |
126
+ |-----------|-------|----------|
127
+ | `mlp.gate_proj.weight` | (2560, 960) | 2,457,600 |
128
+ | `mlp.up_proj.weight` | (2560, 960) | 2,457,600 |
129
+ | `mlp.down_proj.weight` | (960, 2560) | 2,457,600 |
130
+ | **MLP Total** | | **7,372,800** |
131
+
132
+ **Normalization:**
133
+ | Parameter | Shape | Elements |
134
+ |-----------|-------|----------|
135
+ | `input_layernorm.weight` | (960,) | 960 |
136
+ | `post_attention_layernorm.weight` | (960,) | 960 |
137
+ | **Norms Total** | | **1,920** |
138
+
139
+ **Layer Total: 9,832,320 parameters**
140
+
141
+ ### 3.3 Grouped Query Attention (GQA) Details
142
+
143
+ SmolLM2 uses GQA with a 3:1 ratio:
144
+ - 15 query heads (Q dimension: 960 = 15 x 64)
145
+ - 5 key-value heads (KV dimension: 320 = 5 x 64)
146
+ - Each KV head is shared by 3 query heads
147
+ - This reduces KV cache memory by ~67% vs standard MHA
148
+
149
+ ---
150
+
151
+ ## 4. Tokenization Analysis
152
+
153
+ ### 4.1 Arithmetic Expression Tokenization
154
+
155
+ Test input: `"47 + 86"`
156
+
157
+ | Position | Token ID | Token | Description |
158
+ |----------|----------|-------|-------------|
159
+ | 0 | 36 | `'4'` | First digit of operand A |
160
+ | 1 | 39 | `'7'` | Second digit of operand A |
161
+ | 2 | 1232 | `' +'` | Space + plus sign |
162
+ | 3 | 216 | `' '` | Trailing space |
163
+ | 4 | 40 | `'8'` | First digit of operand B |
164
+ | 5 | 38 | `'6'` | Second digit of operand B |
165
+
166
+ ### 4.2 Digit Token Mappings
167
+
168
+ | Digit | Token ID |
169
+ |-------|----------|
170
+ | 0 | 32 |
171
+ | 1 | 33 |
172
+ | 2 | 34 |
173
+ | 3 | 35 |
174
+ | 4 | 36 |
175
+ | 5 | 37 |
176
+ | 6 | 38 |
177
+ | 7 | 39 |
178
+ | 8 | 40 |
179
+ | 9 | 41 |
180
+
181
+ Key observations:
182
+ - **Digits are tokenized individually** (no multi-digit tokens like "47")
183
+ - Digit tokens are sequential: ID = 32 + digit_value
184
+ - Space-prefixed operators exist (e.g., `' +'` = 1232)
185
+ - `'='` has token ID 45
186
+
187
+ ### 4.3 Implications for Bit Extraction
188
+
189
+ The digit-by-digit tokenization means:
190
+ 1. For `"47 + 86"`, operand A spans positions [0,1] and operand B spans positions [4,5]
191
+ 2. The model must learn to:
192
+ - Recognize digit boundaries
193
+ - Compose multi-digit numbers from sequential tokens
194
+ - Perform arithmetic across token positions
195
+ 3. Hidden states at digit positions contain the numerical representation
196
+
197
+ ---
198
+
199
+ ## 5. Hidden State Analysis
200
+
201
+ ### 5.1 Hidden State Output Structure
202
+
203
+ When running with `output_hidden_states=True`:
204
+ - Returns **33 hidden states** (embedding + 32 layer outputs)
205
+ - Each has shape: `(batch_size, seq_len, hidden_dim)`
206
+ - For `"47 + 86"`: `(1, 6, 960)`
207
+
208
+ ### 5.2 Hidden State Statistics by Layer
209
+
210
+ | Layer | Mean | Std Dev | Min | Max |
211
+ |-------|------|---------|-----|-----|
212
+ | Embedding | -0.001 | 0.105 | -0.44 | 1.77 |
213
+ | Layer 0 | -0.127 | 2.55 | -80.8 | 19.0 |
214
+ | Layer 1 | -0.171 | 3.70 | -161 | 39.7 |
215
+ | Layer 2 | -0.151 | 3.67 | -102 | 61.4 |
216
+ | Layer 3 | -1.13 | 327 | -21,722 | 11,856 |
217
+ | Layer 4-12 | ~-1.3 | ~327 | ~-21,700 | ~11,800 |
218
+ | Layer 13-26 | ~-1.5 | ~337 | ~-22,400 | ~12,100 |
219
+ | Layer 27-30 | ~-1.8 | ~310 | ~-20,000 | ~11,800 |
220
+ | Layer 31 | 0.017 | 1.34 | -18.9 | 34.3 |
221
+
222
+ Key observations:
223
+ 1. **Dramatic variance explosion at Layer 3**: Std dev jumps from ~4 to ~327
224
+ 2. **Stable middle layers (4-26)**: Consistent statistics, suggesting numerical computation
225
+ 3. **Compression at final layer**: Std dev drops to 1.34 at Layer 31 (pre-softmax normalization)
226
+ 4. **Layer 31 is well-scaled** for downstream processing
227
+
228
+ ### 5.3 Extraction Point Candidates
229
+
230
+ | Layer Range | Characteristics | Suitability |
231
+ |-------------|-----------------|-------------|
232
+ | 0-2 (Early) | Low variance, close to embeddings | Poor - minimal computation |
233
+ | 3-12 (Early-Mid) | High variance, initial processing | Moderate - may contain raw numerical features |
234
+ | 13-26 (Middle) | Stable high variance | Good - computation in progress |
235
+ | 27-30 (Late) | Variance compression begins | Good - refined representations |
236
+ | 31 (Final) | Well-normalized output | Best - final representation before LM head |
237
+
238
+ ---
239
+
240
+ ## 6. Relevance to 8bit-Threshold-Computer Project
241
+
242
+ ### 6.1 Hidden Dimension Match
243
+
244
+ **The hidden dimension of 960 exactly matches our extractor input requirement.** This is fortuitous as it means:
245
+ - No projection layer needed to interface with our bit extractor
246
+ - Direct extraction from any layer's hidden states
247
+ - Full utilization of the model's representational capacity
248
+
249
+ ### 6.2 Recommended Extraction Strategy
250
+
251
+ ```python
252
+ def extract_hidden_state(model, tokenizer, expression, layer=-1):
253
+ """
254
+ Extract hidden state for bit extraction.
255
+
256
+ Args:
257
+ layer: Which layer to extract from (default: final layer)
258
+ -1 = Layer 31 (final, pre-LM-head)
259
+
260
+ Returns:
261
+ Tensor of shape (960,) for the last token position
262
+ """
263
+ inputs = tokenizer(expression, return_tensors="pt")
264
+ outputs = model(**inputs, output_hidden_states=True)
265
+
266
+ # hidden_states[0] = embedding, hidden_states[1] = layer 0, ...
267
+ # hidden_states[32] = layer 31 (final)
268
+ hidden = outputs.hidden_states[layer] # (1, seq_len, 960)
269
+
270
+ # Extract last token position for autoregressive prediction
271
+ return hidden[0, -1, :] # (960,)
272
+ ```
273
+
274
+ ### 6.3 Token Position Analysis
275
+
276
+ For arithmetic expressions like `"A + B"`:
277
+
278
+ ```
279
+ Tokens: [d1] [d2] [ +] [ ] [d3] [d4]
280
+ Positions: 0 1 2 3 4 5
281
+
282
+ Operand A: positions 0 to (plus_pos - 1)
283
+ Operator: position where ' +' token appears
284
+ Operand B: positions (plus_pos + 2) to end
285
+ ```
286
+
287
+ Strategy for operand extraction:
288
+ 1. Find the `' +'` token (ID 1232) position
289
+ 2. Collect hidden states at digit positions before it (operand A)
290
+ 3. Collect hidden states at digit positions after it (operand B)
291
+ 4. Consider pooling (mean, max) or concatenating digit hidden states
292
+
293
+ ### 6.4 Attention Pattern Utilization
294
+
295
+ With GQA (15 query heads, 5 KV heads), we can analyze attention patterns to:
296
+ 1. Identify which positions attend to operand digits
297
+ 2. Determine if the model explicitly links corresponding digit positions
298
+ 3. Find heads that specialize in numerical reasoning
299
+
300
+ ```python
301
+ def get_attention_weights(model, tokenizer, expression):
302
+ inputs = tokenizer(expression, return_tensors="pt")
303
+ outputs = model(**inputs, output_attentions=True)
304
+ # attentions: tuple of (batch, num_heads, seq_len, seq_len) per layer
305
+ return outputs.attentions
306
+ ```
307
+
308
+ ### 6.5 Extraction Interface Specification
309
+
310
+ For integration with the threshold computer:
311
+
312
+ ```python
313
+ class SmolLM2Extractor:
314
+ """Interface between SmolLM2 and threshold-based bit extraction."""
315
+
316
+ def __init__(self, model, tokenizer, extraction_layer=31):
317
+ self.model = model
318
+ self.tokenizer = tokenizer
319
+ self.layer = extraction_layer + 1 # +1 because index 0 is embedding
320
+
321
+ def get_hidden_state(self, text: str) -> torch.Tensor:
322
+ """
323
+ Returns: Tensor of shape (960,) ready for bit extractor
324
+ """
325
+ tokens = self.tokenizer(text, return_tensors="pt")
326
+ with torch.no_grad():
327
+ outputs = self.model(**tokens, output_hidden_states=True)
328
+ return outputs.hidden_states[self.layer][0, -1, :]
329
+
330
+ def get_all_position_states(self, text: str) -> torch.Tensor:
331
+ """
332
+ Returns: Tensor of shape (seq_len, 960) for all positions
333
+ """
334
+ tokens = self.tokenizer(text, return_tensors="pt")
335
+ with torch.no_grad():
336
+ outputs = self.model(**tokens, output_hidden_states=True)
337
+ return outputs.hidden_states[self.layer][0]
338
+ ```
339
+
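One detail worth flagging in the class above: `hidden_states` has 33 entries (the embedding output plus one per transformer layer), so the `+1` offset maps `extraction_layer=31` to the final entry:

```python
# hidden_states layout: index 0 = embedding output, indices 1..32 = layers 0..31.
NUM_LAYERS = 32
extraction_layer = 31            # final transformer layer
index = extraction_layer + 1     # offset past the embedding entry
assert index == NUM_LAYERS       # hidden_states[32], the last of 33 entries
print(index)  # 32
```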
340
+ ---
341
+
342
+ ## 7. Complete Weight Inventory Table
343
+
344
+ ### 7.1 All Named Parameters
345
+
346
+ ```
347
+ EMBEDDING (47,185,920 params - 13.04%)
348
+ model.embed_tokens.weight (49152, 960) 47,185,920
349
+
350
+ LAYER 0 (9,832,320 params - 2.72%)
351
+ Attention (2,457,600):
352
+ model.layers.0.self_attn.q_proj.weight (960, 960) 921,600
353
+ model.layers.0.self_attn.k_proj.weight (320, 960) 307,200
354
+ model.layers.0.self_attn.v_proj.weight (320, 960) 307,200
355
+ model.layers.0.self_attn.o_proj.weight (960, 960) 921,600
356
+ MLP (7,372,800):
357
+ model.layers.0.mlp.gate_proj.weight (2560, 960) 2,457,600
358
+ model.layers.0.mlp.up_proj.weight (2560, 960) 2,457,600
359
+ model.layers.0.mlp.down_proj.weight (960, 2560) 2,457,600
360
+ Norms (1,920):
361
+ model.layers.0.input_layernorm.weight (960,) 960
362
+ model.layers.0.post_attention_layernorm.weight (960,) 960
363
+
364
+ [Layers 1-31 follow identical structure, each with 9,832,320 params]
365
+
366
+ FINAL NORM (960 params - 0.00%)
367
+ model.norm.weight (960,) 960
368
+
369
+ LM HEAD (tied with embed_tokens)
370
+ lm_head.weight (49152, 960) [shared]
371
+ ```
372
+
373
+ ### 7.2 Summary by Component Type
374
+
375
+ | Component Type | Count | Params Each | Total Params |
376
+ |----------------|-------|-------------|--------------|
377
+ | Embedding | 1 | 47,185,920 | 47,185,920 |
378
+ | Q Projection | 32 | 921,600 | 29,491,200 |
379
+ | K Projection | 32 | 307,200 | 9,830,400 |
380
+ | V Projection | 32 | 307,200 | 9,830,400 |
381
+ | O Projection | 32 | 921,600 | 29,491,200 |
382
+ | Gate Projection | 32 | 2,457,600 | 78,643,200 |
383
+ | Up Projection | 32 | 2,457,600 | 78,643,200 |
384
+ | Down Projection | 32 | 2,457,600 | 78,643,200 |
385
+ | Input LayerNorm | 32 | 960 | 30,720 |
386
+ | Post-Attn LayerNorm | 32 | 960 | 30,720 |
387
+ | Final LayerNorm | 1 | 960 | 960 |
388
+ | **Total** | | | **361,821,120** |
389
+
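The totals in the table can be cross-checked arithmetically from the architecture parameters (hidden=960, intermediate=2560, vocab=49152, 32 layers, KV dim 5 x 64 = 320):

```python
hidden, inter, vocab, layers = 960, 2560, 49152, 32
kv_dim = 5 * 64  # 5 KV heads x head_dim 64 = 320

embedding = vocab * hidden                        # 47,185,920 (tied with lm_head)
attn = 2 * hidden * hidden + 2 * kv_dim * hidden  # q/o: 921,600 each; k/v: 307,200 each
mlp = 3 * hidden * inter                          # gate/up/down: 2,457,600 each
norms = 2 * hidden                                # input + post-attention RMSNorm
per_layer = attn + mlp + norms                    # 9,832,320

total = embedding + layers * per_layer + hidden   # + final norm
print(total)  # 361821120
```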
390
+ ---
391
+
392
+ ## 8. Configuration Reference
393
+
394
+ Complete model configuration from HuggingFace:
395
+
396
+ ```python
397
+ {
398
+ "architectures": ["LlamaForCausalLM"],
399
+ "attention_bias": False,
400
+ "attention_dropout": 0.0,
401
+ "bos_token_id": 1,
402
+ "eos_token_id": 2,
403
+ "pad_token_id": 2,
404
+ "head_dim": 64,
405
+ "hidden_act": "silu",
406
+ "hidden_size": 960,
407
+ "initializer_range": 0.02,
408
+ "intermediate_size": 2560,
409
+ "max_position_embeddings": 8192,
410
+ "mlp_bias": False,
411
+ "model_type": "llama",
412
+ "num_attention_heads": 15,
413
+ "num_hidden_layers": 32,
414
+ "num_key_value_heads": 5,
415
+ "pretraining_tp": 1,
416
+ "rms_norm_eps": 1e-05,
417
+ "rope_interleaved": False,
418
+ "rope_theta": 100000,
419
+ "tie_word_embeddings": True,
420
+ "use_cache": True,
421
+ "vocab_size": 49152
422
+ }
423
+ ```
424
+
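A few relationships among these fields, stated as a quick consistency sketch:

```python
cfg = {"hidden_size": 960, "num_attention_heads": 15,
       "num_key_value_heads": 5, "head_dim": 64}

assert cfg["hidden_size"] == cfg["num_attention_heads"] * cfg["head_dim"]  # 15 * 64 = 960
assert cfg["num_attention_heads"] % cfg["num_key_value_heads"] == 0        # GQA ratio 3:1
kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
print(kv_dim)  # 320, the k_proj/v_proj output dimension
```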
425
+ ---
426
+
427
+ ## 9. Appendix: PyTorch Model Structure
428
+
429
+ ```
430
+ LlamaForCausalLM(
431
+ (model): LlamaModel(
432
+ (embed_tokens): Embedding(49152, 960, padding_idx=2)
433
+ (layers): ModuleList(
434
+ (0-31): 32 x LlamaDecoderLayer(
435
+ (self_attn): LlamaAttention(
436
+ (q_proj): Linear(in_features=960, out_features=960, bias=False)
437
+ (k_proj): Linear(in_features=960, out_features=320, bias=False)
438
+ (v_proj): Linear(in_features=960, out_features=320, bias=False)
439
+ (o_proj): Linear(in_features=960, out_features=960, bias=False)
440
+ )
441
+ (mlp): LlamaMLP(
442
+ (gate_proj): Linear(in_features=960, out_features=2560, bias=False)
443
+ (up_proj): Linear(in_features=960, out_features=2560, bias=False)
444
+ (down_proj): Linear(in_features=2560, out_features=960, bias=False)
445
+ (act_fn): SiLUActivation()
446
+ )
447
+ (input_layernorm): LlamaRMSNorm((960,), eps=1e-05)
448
+ (post_attention_layernorm): LlamaRMSNorm((960,), eps=1e-05)
449
+ )
450
+ )
451
+ (norm): LlamaRMSNorm((960,), eps=1e-05)
452
+ (rotary_emb): LlamaRotaryEmbedding()
453
+ )
454
+ (lm_head): Linear(in_features=960, out_features=49152, bias=False)
455
+ )
456
+ ```
llm_integration/analyze_smollm2.py ADDED
@@ -0,0 +1,232 @@
1
+ """
2
+ SmolLM2-360M-Instruct Architecture Analysis
3
+ For 8bit-threshold-computer LLM Integration Project
4
+ """
5
+
6
+ import torch
7
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
8
+ from collections import defaultdict
9
+ import json
10
+
11
+ def analyze_smollm2():
12
+ model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
13
+
14
+ print("=" * 80)
15
+ print("SmolLM2-360M-Instruct Architecture Analysis")
16
+ print("=" * 80)
17
+
18
+ # Load config first
19
+ print("\n[1] Loading model configuration...")
20
+ config = AutoConfig.from_pretrained(model_name)
21
+ print(f"Config loaded: {type(config).__name__}")
22
+
23
+ # Load tokenizer
24
+ print("\n[2] Loading tokenizer...")
25
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
26
+ print(f"Tokenizer loaded: {type(tokenizer).__name__}")
27
+
28
+ # Load model with hidden states output
29
+ print("\n[3] Loading model with output_hidden_states=True...")
30
+ model = AutoModelForCausalLM.from_pretrained(
31
+ model_name,
32
+ output_hidden_states=True,
33
+ torch_dtype=torch.float32
34
+ )
35
+ model.eval()
36
+ print(f"Model loaded: {type(model).__name__}")
37
+
38
+ # ========================================================================
39
+ # ARCHITECTURE CENSUS
40
+ # ========================================================================
41
+ print("\n" + "=" * 80)
42
+ print("ARCHITECTURE CENSUS")
43
+ print("=" * 80)
44
+
45
+ print("\n--- Model Configuration ---")
46
+ config_dict = config.to_dict()
47
+ for key, value in sorted(config_dict.items()):
48
+ print(f" {key}: {value}")
49
+
50
+ print("\n--- Key Architecture Parameters ---")
51
+ print(f" Model type: {config.model_type}")
52
+ print(f" Vocabulary size: {config.vocab_size}")
53
+ print(f" Hidden size: {config.hidden_size}")
54
+ print(f" Intermediate size: {config.intermediate_size}")
55
+ print(f" Number of hidden layers: {config.num_hidden_layers}")
56
+ print(f" Number of attention heads: {config.num_attention_heads}")
57
+ print(f" Number of KV heads: {getattr(config, 'num_key_value_heads', config.num_attention_heads)}")
58
+ print(f" Head dimension: {config.hidden_size // config.num_attention_heads}")
59
+ print(f" Max position embeddings: {config.max_position_embeddings}")
60
+ print(f" RMS norm epsilon: {getattr(config, 'rms_norm_eps', 'N/A')}")
61
+ print(f" Rope theta: {getattr(config, 'rope_theta', 'N/A')}")
62
+ print(f" Tie word embeddings: {getattr(config, 'tie_word_embeddings', 'N/A')}")
63
+
64
+ # ========================================================================
65
+ # WEIGHT INVENTORY
66
+ # ========================================================================
67
+ print("\n" + "=" * 80)
68
+ print("WEIGHT INVENTORY")
69
+ print("=" * 80)
70
+
71
+ total_params = 0
72
+ param_groups = defaultdict(list)
73
+
74
+ for name, param in model.named_parameters():
75
+ total_params += param.numel()
76
+
77
+ # Group by component
78
+ if "embed_tokens" in name:
79
+ group = "Embedding"
80
+ elif "lm_head" in name:
81
+ group = "LM Head"
82
+ elif "norm" in name and "layers" not in name:
83
+ group = "Final Norm"
84
+ elif "layers" in name:
85
+ layer_num = name.split(".")[2]
86
+ if "self_attn" in name:
87
+ group = f"Layer {layer_num} - Attention"
88
+ elif "mlp" in name:
89
+ group = f"Layer {layer_num} - MLP"
90
+ elif "norm" in name:
91
+ group = f"Layer {layer_num} - Norms"
92
+ else:
93
+ group = f"Layer {layer_num} - Other"
94
+ else:
95
+ group = "Other"
96
+
97
+ param_groups[group].append({
98
+ "name": name,
99
+ "shape": tuple(param.shape),
100
+ "numel": param.numel(),
101
+ "dtype": str(param.dtype)
102
+ })
103
+
104
+ print(f"\n--- Total Parameters: {total_params:,} ---")
105
+ print(f" ({total_params / 1e6:.2f}M parameters)")
106
+
107
+ # Print by group
108
+ for group_name in sorted(param_groups.keys(), key=lambda g: (int(g.split()[1]), g) if g.startswith("Layer") else (-1, g)):
109
+ params = param_groups[group_name]
110
+ group_total = sum(p["numel"] for p in params)
111
+ print(f"\n### {group_name} ({group_total:,} params, {group_total/total_params*100:.2f}%)")
112
+ for p in params:
113
+ print(f" {p['name']}")
114
+ print(f" Shape: {p['shape']}, Elements: {p['numel']:,}, Dtype: {p['dtype']}")
115
+
116
+ # ========================================================================
117
+ # TOKENIZATION ANALYSIS
118
+ # ========================================================================
119
+ print("\n" + "=" * 80)
120
+ print("TOKENIZATION ANALYSIS")
121
+ print("=" * 80)
122
+
123
+ test_input = "47 + 86"
124
+ print(f"\n--- Test Input: '{test_input}' ---")
125
+
126
+ tokens = tokenizer(test_input, return_tensors="pt")
127
+ input_ids = tokens["input_ids"][0]
128
+
129
+ print(f"\nInput IDs: {input_ids.tolist()}")
130
+ print(f"Number of tokens: {len(input_ids)}")
131
+
132
+ print("\nToken breakdown:")
133
+ for i, token_id in enumerate(input_ids):
134
+ token_str = tokenizer.decode([token_id])
135
+ print(f" Position {i}: ID={token_id.item():5d}, Token='{token_str}'")
136
+
137
+ # Additional tokenization tests
138
+ print("\n--- Additional Tokenization Tests ---")
139
+ test_cases = ["0", "1", "47", "86", "133", " + ", "="]
140
+ for tc in test_cases:
141
+ ids = tokenizer.encode(tc, add_special_tokens=False)
142
+ decoded = [tokenizer.decode([i]) for i in ids]
143
+ print(f" '{tc}' -> IDs: {ids}, Tokens: {decoded}")
144
+
145
+ # ========================================================================
146
+ # HIDDEN STATE ANALYSIS
147
+ # ========================================================================
148
+ print("\n" + "=" * 80)
149
+ print("HIDDEN STATE ANALYSIS")
150
+ print("=" * 80)
151
+
152
+ print(f"\n--- Running inference on '{test_input}' ---")
153
+
154
+ with torch.no_grad():
155
+ outputs = model(**tokens)
156
+
157
+ hidden_states = outputs.hidden_states
158
+ print(f"\nNumber of hidden state outputs: {len(hidden_states)}")
159
+ print("(This includes embedding output + each layer's output)")
160
+
161
+ print("\nHidden state shapes at each layer:")
162
+ for i, hs in enumerate(hidden_states):
163
+ layer_name = "Embedding" if i == 0 else f"Layer {i-1}"
164
+ print(f" {layer_name}: {tuple(hs.shape)}")
165
+ if i == 0:
166
+ print(f" (batch_size=1, seq_len={hs.shape[1]}, hidden_dim={hs.shape[2]})")
167
+
168
+ # Analyze hidden state statistics at different layers
169
+ print("\n--- Hidden State Statistics (per layer) ---")
170
+ for i, hs in enumerate(hidden_states):
171
+ layer_name = "Embedding" if i == 0 else f"Layer {i-1}"
172
+ hs_flat = hs.view(-1)
173
+ print(f" {layer_name}:")
174
+ print(f" Mean: {hs_flat.mean().item():.6f}")
175
+ print(f" Std: {hs_flat.std().item():.6f}")
176
+ print(f" Min: {hs_flat.min().item():.6f}")
177
+ print(f" Max: {hs_flat.max().item():.6f}")
178
+
179
+ # ========================================================================
180
+ # MODEL STRUCTURE DEEP DIVE
181
+ # ========================================================================
182
+ print("\n" + "=" * 80)
183
+ print("MODEL STRUCTURE DEEP DIVE")
184
+ print("=" * 80)
185
+
186
+ print("\n--- Model Architecture String ---")
187
+ print(model)
188
+
189
+ # ========================================================================
190
+ # SUMMARY DATA FOR REPORT
191
+ # ========================================================================
192
+ summary = {
193
+ "model_name": model_name,
194
+ "total_params": total_params,
195
+ "config": {
196
+ "vocab_size": config.vocab_size,
197
+ "hidden_size": config.hidden_size,
198
+ "intermediate_size": config.intermediate_size,
199
+ "num_hidden_layers": config.num_hidden_layers,
200
+ "num_attention_heads": config.num_attention_heads,
201
+ "num_kv_heads": getattr(config, 'num_key_value_heads', config.num_attention_heads),
202
+ "head_dim": config.hidden_size // config.num_attention_heads,
203
+ "max_position_embeddings": config.max_position_embeddings,
204
+ "rms_norm_eps": getattr(config, 'rms_norm_eps', None),
205
+ "rope_theta": getattr(config, 'rope_theta', None),
206
+ "tie_word_embeddings": getattr(config, 'tie_word_embeddings', None),
207
+ },
208
+ "tokenization": {
209
+ "test_input": test_input,
210
+ "token_ids": input_ids.tolist(),
211
+ "num_tokens": len(input_ids),
212
+ "tokens": [tokenizer.decode([tid]) for tid in input_ids]
213
+ },
214
+ "hidden_states": {
215
+ "num_outputs": len(hidden_states),
216
+ "shape": list(hidden_states[0].shape)
217
+ },
218
+ "param_groups": {k: {"count": len(v), "total": sum(p["numel"] for p in v)} for k, v in param_groups.items()}
219
+ }
220
+
221
+ # Save summary as JSON for report generation
222
+ with open("D:/8bit-threshold-computer/llm_integration/smollm2_analysis.json", "w") as f:
223
+ json.dump(summary, f, indent=2)
224
+
225
+ print("\n" + "=" * 80)
226
+ print("Analysis complete. Summary saved to smollm2_analysis.json")
227
+ print("=" * 80)
228
+
229
+ return summary, model, tokenizer, hidden_states, param_groups
230
+
231
+ if __name__ == "__main__":
232
+ summary, model, tokenizer, hidden_states, param_groups = analyze_smollm2()
llm_integration/model.py CHANGED
@@ -351,76 +351,158 @@ class Extractor(nn.Module):
351
 
352
  class PositionExtractor(nn.Module):
353
  """
354
- Position-specific extraction.
355
- Extracts operand A from first token positions, operand B from later positions.
356
- For "47 + 86": positions 0-2 for A, position 3-4 for op, positions 5-7 for B.
357
  """
358
359
  def __init__(self, hidden_dim: int = 960, intermediate_dim: int = 256):
360
  super().__init__()
 
361
 
362
  self.a_extractor = nn.Sequential(
363
- nn.Linear(hidden_dim * 3, intermediate_dim),
364
  nn.GELU(),
365
- nn.Linear(intermediate_dim, 8),
366
  )
367
 
368
  self.b_extractor = nn.Sequential(
369
- nn.Linear(hidden_dim * 3, intermediate_dim),
370
  nn.GELU(),
371
- nn.Linear(intermediate_dim, 8),
372
  )
373
 
374
- self.op_router = nn.Sequential(
375
- nn.Linear(hidden_dim * 2, intermediate_dim),
376
  nn.GELU(),
377
- nn.Linear(intermediate_dim, len(OPERATIONS)),
378
  )
379
 
380
- def forward(self, hidden: torch.Tensor, mask: torch.Tensor):
381
  """
382
  Args:
383
  hidden: [batch, seq_len, hidden_dim]
384
- mask: [batch, seq_len]
 
385
 
386
  Returns:
387
- a_bits, b_bits, op_logits
388
  """
389
- batch_size, seq_len, hidden_dim = hidden.shape
 
390
 
391
- seq_lens = mask.sum(dim=1).long()
 
392
 
393
  a_features = []
394
  b_features = []
395
  op_features = []
 
396
 
397
  for i in range(batch_size):
398
- slen = seq_lens[i].item()
399
- start = seq_len - slen
400
 
401
- a_pos = hidden[i, start:start+3, :].reshape(-1)
402
- if a_pos.shape[0] < hidden_dim * 3:
403
- a_pos = F.pad(a_pos, (0, hidden_dim * 3 - a_pos.shape[0]))
404
 
405
- op_pos = hidden[i, start+3:start+5, :].reshape(-1)
406
- if op_pos.shape[0] < hidden_dim * 2:
407
- op_pos = F.pad(op_pos, (0, hidden_dim * 2 - op_pos.shape[0]))
408
 
409
- b_pos = hidden[i, start+5:start+8, :].reshape(-1)
410
- if b_pos.shape[0] < hidden_dim * 3:
411
- b_pos = F.pad(b_pos, (0, hidden_dim * 3 - b_pos.shape[0]))
412
 
413
- a_features.append(a_pos)
414
- b_features.append(b_pos)
415
- op_features.append(op_pos)
 
416
 
417
  a_features = torch.stack(a_features)
418
  b_features = torch.stack(b_features)
419
  op_features = torch.stack(op_features)
 
420
 
421
  a_logits = self.a_extractor(a_features)
422
  b_logits = self.b_extractor(b_features)
423
- op_logits = self.op_router(op_features)
424
 
425
  a_soft = torch.sigmoid(a_logits)
426
  b_soft = torch.sigmoid(b_logits)
@@ -429,7 +511,7 @@ class PositionExtractor(nn.Module):
429
  a_bits = a_hard - a_soft.detach() + a_soft
430
  b_bits = b_hard - b_soft.detach() + b_soft
431
 
432
- return a_bits, b_bits, op_logits
433
 
434
 
435
  class DigitExtractor(nn.Module):
@@ -589,8 +671,15 @@ class ArithmeticModel(nn.Module):
589
  print(f" Extractor params: {trainable_ext:,}", flush=True)
590
  print(f" Total trainable: {total_trainable:,}", flush=True)
591
 
592
- def get_hidden_states(self, texts: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
593
- """Get hidden states from specified layer."""
 
594
  inputs = self.tokenizer(
595
  texts,
596
  return_tensors='pt',
@@ -607,8 +696,9 @@ class ArithmeticModel(nn.Module):
607
 
608
  hidden = outputs.hidden_states[self.extract_layer].float()
609
  mask = inputs.attention_mask.float()
 
610
 
611
- return hidden, mask
612
 
613
  def forward(self, texts: list[str]):
614
  """
@@ -617,16 +707,25 @@ class ArithmeticModel(nn.Module):
617
  Returns:
618
  result_bits, a_bits, b_bits, op_logits
619
  If digit_pred: also returns a_digit_logits, b_digit_logits
 
620
  """
621
- hidden, mask = self.get_hidden_states(texts)
622
 
623
- extractor_out = self.extractor(hidden, mask)
624
 
625
  if self.digit_pred:
626
  a_bits, b_bits, op_logits, a_digit_logits, b_digit_logits = extractor_out
627
  else:
628
  a_bits, b_bits, op_logits = extractor_out
629
  a_digit_logits, b_digit_logits = None, None
 
630
 
631
  op_probs = torch.softmax(op_logits, dim=-1)
632
 
@@ -634,6 +733,8 @@ class ArithmeticModel(nn.Module):
634
 
635
  if self.digit_pred:
636
  return result_bits, a_bits, b_bits, op_logits, a_digit_logits, b_digit_logits
637
  return result_bits, a_bits, b_bits, op_logits
638
 
639
  def trainable_parameters(self):
 
351
 
352
  class PositionExtractor(nn.Module):
353
  """
354
+ Position-specific extraction with dynamic operator detection.
355
+
356
+ Tokenization pattern for "A op B":
357
+ [A_digits...] [operator] [space] [B_digits...]
358
+
359
+ Examples:
360
+ "5 + 3" -> ['5', ' +', ' ', '3'] (positions: A=0, op=1, B=3)
361
+ "47 + 86" -> ['4', '7', ' +', ' ', '8', '6'] (positions: A=0-1, op=2, B=4-5)
362
+ "127 + 128" -> ['1','2','7',' +', ' ','1','2','8'] (positions: A=0-2, op=3, B=5-7)
363
+
364
+ Token IDs (SmolLM2):
365
+ Digits '0'-'9': 32-41
366
+ Operators: ' +'=1232, ' -'=731, ' *'=1672, ' >'=2986, ' <'=2067, ' =='=1758
367
+ Space: 216
368
  """
369
 
370
+ DIGIT_TOKENS = set(range(32, 42))
371
+ OPERATOR_TOKENS = {
372
+ 1232: 0, # ' +' -> add
373
+ 731: 1, # ' -' -> sub
374
+ 1672: 2, # ' *' -> mul
375
+ 2986: 3, # ' >' -> gt
376
+ 2067: 4, # ' <' -> lt
377
+ 1758: 5, # ' ==' -> eq
378
+ }
379
+ SPACE_TOKEN = 216
380
+ MAX_DIGITS = 3
381
+
382
  def __init__(self, hidden_dim: int = 960, intermediate_dim: int = 256):
383
  super().__init__()
384
+ self.hidden_dim = hidden_dim
385
 
386
  self.a_extractor = nn.Sequential(
387
+ nn.Linear(hidden_dim * self.MAX_DIGITS, intermediate_dim),
388
  nn.GELU(),
389
+ nn.Linear(intermediate_dim, intermediate_dim // 2),
390
+ nn.GELU(),
391
+ nn.Linear(intermediate_dim // 2, 8),
392
  )
393
 
394
  self.b_extractor = nn.Sequential(
395
+ nn.Linear(hidden_dim * self.MAX_DIGITS, intermediate_dim),
396
  nn.GELU(),
397
+ nn.Linear(intermediate_dim, intermediate_dim // 2),
398
+ nn.GELU(),
399
+ nn.Linear(intermediate_dim // 2, 8),
400
  )
401
 
402
+ self.op_extractor = nn.Sequential(
403
+ nn.Linear(hidden_dim, intermediate_dim // 2),
404
  nn.GELU(),
405
+ nn.Linear(intermediate_dim // 2, len(OPERATIONS)),
406
  )
407
 
408
+ def _find_operator_position(self, token_ids: torch.Tensor) -> tuple[int, int]:
409
+ """
410
+ Find operator token position and its operation index.
411
+
412
+ Args:
413
+ token_ids: [seq_len] tensor of token IDs
414
+
415
+ Returns:
416
+ (position, op_index) or (-1, -1) if not found
417
+ """
418
+ for pos, tid in enumerate(token_ids.tolist()):
419
+ if tid in self.OPERATOR_TOKENS:
420
+ return pos, self.OPERATOR_TOKENS[tid]
421
+ return -1, -1
422
+
423
+ def _extract_digit_features(self, hidden: torch.Tensor, start: int, end: int) -> torch.Tensor:
424
+ """
425
+ Extract and pad digit hidden states to fixed size.
426
+
427
+ Args:
428
+ hidden: [seq_len, hidden_dim]
429
+ start: start position (inclusive)
430
+ end: end position (exclusive)
431
+
432
+ Returns:
433
+ [hidden_dim * MAX_DIGITS] flattened features, zero-padded on the LEFT
434
+ (so units digit is always at the same position regardless of number length)
435
+ """
436
+ n_digits = end - start
437
+ features = torch.zeros(self.MAX_DIGITS * self.hidden_dim, device=hidden.device)
438
+
439
+ if n_digits > 0 and n_digits <= self.MAX_DIGITS:
440
+ digit_hidden = hidden[start:end, :].reshape(-1)
441
+ pad_size = (self.MAX_DIGITS - n_digits) * self.hidden_dim
442
+ features[pad_size:] = digit_hidden
443
+
444
+ return features
445
+
446
+ def forward(self, hidden: torch.Tensor, mask: torch.Tensor, token_ids: torch.Tensor = None):
447
  """
448
  Args:
449
  hidden: [batch, seq_len, hidden_dim]
450
+ mask: [batch, seq_len] attention mask
451
+ token_ids: [batch, seq_len] token IDs (required for operator detection)
452
 
453
  Returns:
454
+ a_bits: [batch, 8]
455
+ b_bits: [batch, 8]
456
+ op_logits: [batch, 6]
457
  """
458
+ if token_ids is None:
459
+ raise ValueError("PositionExtractor requires token_ids for operator detection")
460
 
461
+ batch_size, seq_len, hidden_dim = hidden.shape
462
+ device = hidden.device
463
 
464
  a_features = []
465
  b_features = []
466
  op_features = []
467
+ op_indices = []
468
 
469
  for i in range(batch_size):
470
+ seq_mask = mask[i].bool()
471
+ valid_len = seq_mask.sum().item()
472
+ start_pos = seq_len - valid_len
473
+
474
+ valid_tokens = token_ids[i, start_pos:]
475
+ valid_hidden = hidden[i, start_pos:, :]
476
+
477
+ op_pos, op_idx = self._find_operator_position(valid_tokens)
478
 
479
+ if op_pos == -1:
480
+ a_feat = torch.zeros(self.MAX_DIGITS * hidden_dim, device=device)
481
+ b_feat = torch.zeros(self.MAX_DIGITS * hidden_dim, device=device)
482
+ op_feat = torch.zeros(hidden_dim, device=device)
483
+ op_idx = 0
484
+ else:
485
+ a_feat = self._extract_digit_features(valid_hidden, 0, op_pos)
486
 
487
+ op_feat = valid_hidden[op_pos, :]
488
 
489
+ b_start = op_pos + 2 if (op_pos + 1 < valid_len and
490
+ valid_tokens[op_pos + 1].item() == self.SPACE_TOKEN) else op_pos + 1
491
+ b_feat = self._extract_digit_features(valid_hidden, b_start, valid_len)
492
 
493
+ a_features.append(a_feat)
494
+ b_features.append(b_feat)
495
+ op_features.append(op_feat)
496
+ op_indices.append(op_idx)
497
 
498
  a_features = torch.stack(a_features)
499
  b_features = torch.stack(b_features)
500
  op_features = torch.stack(op_features)
501
+ op_indices_tensor = torch.tensor(op_indices, device=device, dtype=torch.long)
502
 
503
  a_logits = self.a_extractor(a_features)
504
  b_logits = self.b_extractor(b_features)
505
+ op_logits = self.op_extractor(op_features)
506
 
507
  a_soft = torch.sigmoid(a_logits)
508
  b_soft = torch.sigmoid(b_logits)
 
511
  a_bits = a_hard - a_soft.detach() + a_soft
512
  b_bits = b_hard - b_soft.detach() + b_soft
513
 
514
+ return a_bits, b_bits, op_logits, op_indices_tensor
515
 
516
 
517
  class DigitExtractor(nn.Module):
 
671
  print(f" Extractor params: {trainable_ext:,}", flush=True)
672
  print(f" Total trainable: {total_trainable:,}", flush=True)
673
 
674
+ def get_hidden_states(self, texts: list[str]) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
675
+ """
676
+ Get hidden states from specified layer.
677
+
678
+ Returns:
679
+ hidden: [batch, seq_len, hidden_dim] hidden states
680
+ mask: [batch, seq_len] attention mask
681
+ token_ids: [batch, seq_len] input token IDs
682
+ """
683
  inputs = self.tokenizer(
684
  texts,
685
  return_tensors='pt',
 
696
 
697
  hidden = outputs.hidden_states[self.extract_layer].float()
698
  mask = inputs.attention_mask.float()
699
+ token_ids = inputs.input_ids
700
 
701
+ return hidden, mask, token_ids
702
 
703
  def forward(self, texts: list[str]):
704
  """
 
707
  Returns:
708
  result_bits, a_bits, b_bits, op_logits
709
  If digit_pred: also returns a_digit_logits, b_digit_logits
710
+ If position_extract: also returns op_indices (ground truth from tokenization)
711
  """
712
+ hidden, mask, token_ids = self.get_hidden_states(texts)
713
 
714
+ if self.position_extract:
715
+ extractor_out = self.extractor(hidden, mask, token_ids)
716
+ else:
717
+ extractor_out = self.extractor(hidden, mask)
718
 
719
  if self.digit_pred:
720
  a_bits, b_bits, op_logits, a_digit_logits, b_digit_logits = extractor_out
721
+ op_indices_from_tokens = None
722
+ elif self.position_extract:
723
+ a_bits, b_bits, op_logits, op_indices_from_tokens = extractor_out
724
+ a_digit_logits, b_digit_logits = None, None
725
  else:
726
  a_bits, b_bits, op_logits = extractor_out
727
  a_digit_logits, b_digit_logits = None, None
728
+ op_indices_from_tokens = None
729
 
730
  op_probs = torch.softmax(op_logits, dim=-1)
731
 
 
733
 
734
  if self.digit_pred:
735
  return result_bits, a_bits, b_bits, op_logits, a_digit_logits, b_digit_logits
736
+ if self.position_extract:
737
+ return result_bits, a_bits, b_bits, op_logits, op_indices_from_tokens
738
  return result_bits, a_bits, b_bits, op_logits
739
 
740
  def trainable_parameters(self):
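The LEFT-padding contract of `_extract_digit_features` can be illustrated without tensors; this standalone sketch mirrors the logic in the diff above, with labels standing in for per-digit hidden states:

```python
# With MAX_DIGITS = 3, the units digit always lands in the last slot,
# regardless of whether the operand has 1, 2, or 3 digits.
MAX_DIGITS = 3

def left_pad_slots(digit_states):
    """digit_states: list of per-digit feature vectors (here, plain labels)."""
    pad = [None] * (MAX_DIGITS - len(digit_states))
    return pad + digit_states

print(left_pad_slots(["5"]))            # [None, None, '5']
print(left_pad_slots(["4", "7"]))       # [None, '4', '7']
print(left_pad_slots(["1", "2", "7"]))  # ['1', '2', '7']
```

This alignment is what lets the `a_extractor`/`b_extractor` MLPs learn position-stable digit features across 1-3 digit operands.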
llm_integration/smollm2_analysis.json ADDED
@@ -0,0 +1,439 @@
1
+ {
2
+ "model_name": "HuggingFaceTB/SmolLM2-360M-Instruct",
3
+ "total_params": 361821120,
4
+ "config": {
5
+ "vocab_size": 49152,
6
+ "hidden_size": 960,
7
+ "intermediate_size": 2560,
8
+ "num_hidden_layers": 32,
9
+ "num_attention_heads": 15,
10
+ "num_kv_heads": 5,
11
+ "head_dim": 64,
12
+ "max_position_embeddings": 8192,
13
+ "rms_norm_eps": 1e-05,
14
+ "rope_theta": 100000,
15
+ "tie_word_embeddings": true
16
+ },
17
+ "tokenization": {
18
+ "test_input": "47 + 86",
19
+ "token_ids": [
20
+ 36,
21
+ 39,
22
+ 1232,
23
+ 216,
24
+ 40,
25
+ 38
26
+ ],
27
+ "num_tokens": 6,
28
+ "tokens": [
29
+ "4",
30
+ "7",
31
+ " +",
32
+ " ",
33
+ "8",
34
+ "6"
35
+ ]
36
+ },
37
+ "hidden_states": {
38
+ "num_outputs": 33,
39
+ "shape": [
40
+ 1,
41
+ 6,
42
+ 960
43
+ ]
44
+ },
45
+ "param_groups": {
46
+ "Embedding": {
47
+ "count": 1,
48
+ "total": 47185920
49
+ },
50
+ "Layer 0 - Attention": {
51
+ "count": 4,
52
+ "total": 2457600
53
+ },
54
+ "Layer 0 - MLP": {
55
+ "count": 3,
56
+ "total": 7372800
57
+ },
58
+ "Layer 0 - Norms": {
59
+ "count": 2,
60
+ "total": 1920
61
+ },
62
+ "Layer 1 - Attention": {
63
+ "count": 4,
64
+ "total": 2457600
65
+ },
66
+ "Layer 1 - MLP": {
67
+ "count": 3,
68
+ "total": 7372800
69
+ },
70
+ "Layer 1 - Norms": {
71
+ "count": 2,
72
+ "total": 1920
73
+ },
74
+ "Layer 2 - Attention": {
75
+ "count": 4,
76
+ "total": 2457600
77
+ },
78
+ "Layer 2 - MLP": {
79
+ "count": 3,
80
+ "total": 7372800
81
+ },
82
+ "Layer 2 - Norms": {
83
+ "count": 2,
84
+ "total": 1920
85
+ },
86
+ "Layer 3 - Attention": {
87
+ "count": 4,
88
+ "total": 2457600
89
+ },
90
+ "Layer 3 - MLP": {
91
+ "count": 3,
92
+ "total": 7372800
93
+ },
94
+ "Layer 3 - Norms": {
95
+ "count": 2,
96
+ "total": 1920
97
+ },
98
+ "Layer 4 - Attention": {
99
+ "count": 4,
100
+ "total": 2457600
101
+ },
102
+ "Layer 4 - MLP": {
103
+ "count": 3,
104
+ "total": 7372800
105
+ },
106
+ "Layer 4 - Norms": {
107
+ "count": 2,
108
+ "total": 1920
109
+ },
110
+ "Layer 5 - Attention": {
111
+ "count": 4,
112
+ "total": 2457600
113
+ },
114
+ "Layer 5 - MLP": {
115
+ "count": 3,
116
+ "total": 7372800
117
+ },
118
+ "Layer 5 - Norms": {
119
+ "count": 2,
120
+ "total": 1920
121
+ },
122
+ "Layer 6 - Attention": {
123
+ "count": 4,
124
+ "total": 2457600
125
+ },
126
+ "Layer 6 - MLP": {
127
+ "count": 3,
128
+ "total": 7372800
129
+ },
130
+ "Layer 6 - Norms": {
131
+ "count": 2,
132
+ "total": 1920
133
+ },
134
+ "Layer 7 - Attention": {
135
+ "count": 4,
136
+ "total": 2457600
137
+ },
138
+ "Layer 7 - MLP": {
139
+ "count": 3,
140
+ "total": 7372800
141
+ },
142
+ "Layer 7 - Norms": {
143
+ "count": 2,
144
+ "total": 1920
145
+ },
146
+ "Layer 8 - Attention": {
147
+ "count": 4,
148
+ "total": 2457600
149
+ },
150
+ "Layer 8 - MLP": {
151
+ "count": 3,
152
+ "total": 7372800
153
+ },
154
+ "Layer 8 - Norms": {
155
+ "count": 2,
156
+ "total": 1920
157
+ },
158
+ "Layer 9 - Attention": {
159
+ "count": 4,
160
+ "total": 2457600
161
+ },
162
+ "Layer 9 - MLP": {
163
+ "count": 3,
164
+ "total": 7372800
165
+ },
166
+ "Layer 9 - Norms": {
167
+ "count": 2,
168
+ "total": 1920
169
+ },
170
+ "Layer 10 - Attention": {
171
+ "count": 4,
172
+ "total": 2457600
173
+ },
174
+ "Layer 10 - MLP": {
175
+ "count": 3,
176
+ "total": 7372800
177
+ },
178
+ "Layer 10 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 11 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 11 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 11 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 12 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 12 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 12 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 13 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 13 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 13 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 14 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 14 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 14 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 15 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 15 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 15 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 16 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 16 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 16 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 17 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 17 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 17 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 18 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 18 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 18 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 19 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 19 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 19 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 20 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 20 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 20 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 21 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 21 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 21 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 22 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 22 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 22 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 23 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 23 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 23 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 24 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 24 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 24 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 25 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 25 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 25 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 26 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 26 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 26 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 27 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 27 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 27 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 28 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 28 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 28 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 29 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 29 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 29 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 30 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 30 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 30 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Layer 31 - Attention": {
+ "count": 4,
+ "total": 2457600
+ },
+ "Layer 31 - MLP": {
+ "count": 3,
+ "total": 7372800
+ },
+ "Layer 31 - Norms": {
+ "count": 2,
+ "total": 1920
+ },
+ "Final Norm": {
+ "count": 1,
+ "total": 960
+ }
+ }
+ }
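The per-layer totals in the inventory above can be cross-checked against the architecture dimensions stated in this commit (hidden_dim=960, 15 query heads and 5 KV heads with head_dim=64, SwiGLU intermediate size 2560). A minimal sketch of that arithmetic follows; the variable names are illustrative, not the actual state_dict keys:

```python
# Reproduce the per-layer weight counts from SmolLM2-360M dimensions.
hidden = 960
n_q_heads, n_kv_heads, head_dim = 15, 5, 64
intermediate = 2560

# Attention: 4 weight tensors (q_proj, k_proj, v_proj, o_proj).
# K/V projections are smaller under GQA (5 KV heads vs. 15 query heads).
q = hidden * n_q_heads * head_dim    # 960 x 960  = 921,600
k = hidden * n_kv_heads * head_dim   # 960 x 320  = 307,200
v = hidden * n_kv_heads * head_dim   # 960 x 320  = 307,200
o = n_q_heads * head_dim * hidden    # 960 x 960  = 921,600
attention = q + k + v + o            # 2,457,600

# SwiGLU MLP: 3 weight tensors (gate_proj, up_proj, down_proj).
mlp = 2 * (hidden * intermediate) + (intermediate * hidden)  # 7,372,800

# Two RMSNorms per layer (pre-attention and pre-MLP), one scale vector each.
norms = 2 * hidden                   # 1,920

print(attention, mlp, norms)  # 2457600 7372800 1920
```

This confirms the JSON's per-layer figures: 2,457,600 attention parameters, 7,372,800 MLP parameters, and 1,920 norm parameters per transformer layer.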
llm_integration/train.py CHANGED
@@ -398,7 +398,9 @@ def evaluate_llm(model, n_samples: int = 500):
         text, a, b, op, expected = generate_problem()
 
         with torch.no_grad():
-            result_bits, a_bits, b_bits, op_logits = model([text])
+            outputs = model([text])
+            result_bits = outputs[0]
+            op_logits = outputs[3]
 
         pred_result = bits_to_int(result_bits[0])
         pred_op = OPERATIONS[op_logits[0].argmax().item()]
@@ -502,7 +504,8 @@ def train_llm(epochs: int = 100, batch_size: int = 256, lr: float = 3e-4,
 
         optimizer.zero_grad()
 
-        pred_bits, a_bits, b_bits, op_logits = model(batch_texts)
+        outputs = model(batch_texts)
+        pred_bits, a_bits, b_bits, op_logits = outputs[0], outputs[1], outputs[2], outputs[3]
 
         loss, losses = compute_llm_loss(
             pred_bits, a_bits, b_bits, op_logits,
@@ -556,7 +559,8 @@ def train_llm(epochs: int = 100, batch_size: int = 256, lr: float = 3e-4,
         for _ in range(10):
             text, a, b, op, expected = generate_problem()
             with torch.no_grad():
-                result_bits, a_bits, b_bits, op_logits = model([text])
+                outputs = model([text])
+                result_bits, a_bits, b_bits, op_logits = outputs[0], outputs[1], outputs[2], outputs[3]
             pred = bits_to_int(result_bits[0])
             pred_a = bits_to_int(a_bits[0])
             pred_b = bits_to_int(b_bits[0])
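The callers above decode predictions with `bits_to_int`, whose implementation is not shown in this diff. A hedged sketch of what such a helper might look like, assuming the model emits one logit per bit (sigmoid above 0.5 means the bit is set) ordered least-significant bit first:

```python
# Assumed sketch of bits_to_int: the real train.py helper may differ in
# bit ordering or thresholding.
import torch

def bits_to_int(bit_logits: torch.Tensor) -> int:
    """Decode a vector of per-bit logits into an integer, LSB first."""
    bits = (torch.sigmoid(bit_logits) > 0.5).long()
    return int(sum(int(b) << i for i, b in enumerate(bits)))

# Positive logits at positions 0 and 2 -> binary 101 -> 5.
print(bits_to_int(torch.tensor([5.0, -5.0, 5.0])))  # 5
```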