priyadip committed on
Commit dc138e1 · 0 Parent(s):

Fix: js in gr.Blocks(), event delegation for card clicks, SVG loss curve

Files changed (7):
  1. README.md +91 -0
  2. app.py +800 -0
  3. inference.py +250 -0
  4. requirements.txt +2 -0
  5. training.py +261 -0
  6. transformer.py +516 -0
  7. vocab.py +146 -0
README.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ title: Transformer Visualizer EN→BN
+ emoji: 🔬
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 6.9.0
+ app_file: app.py
+ pinned: true
+ license: mit
+ ---
+
+ # 🔬 Transformer Visualizer — English → Bengali
+
+ **See every single calculation inside a Transformer, live.**
+
+ ## What this Space does
+
+ Type any English sentence and watch every number flow through the Transformer architecture step by step — from raw token IDs all the way to Bengali output.
+
+ ---
+
+ ## 🗂️ Tabs
+
+ ### 🏗️ Architecture
+ - Full SVG diagram of encoder + decoder
+ - Color-coded: self-attention / cross-attention / masked attention / FFN
+ - Explains the K, V flow from encoder to decoder
+
+ ### 🏋️ Train Model
+ - Trains a small Transformer on 30 English→Bengali sentence pairs
+ - Live loss curve rendered as pure SVG
+ - Configurable epochs
+
+ ### 🔬 Training Step
+ Shows a **single training forward pass** with teacher forcing:
+
+ 1. **Tokenization** — English + Bengali → token ID arrays
+ 2. **Embedding** — `token_id → vector × √d_model`
+ 3. **Positional Encoding** — `sin(pos/10000^(2i/d))` / `cos(...)` matrix shown
+ 4. **Encoder**:
+    - Q, K, V projection matrices shown
+    - `scores = Q·Kᵀ / √d_k` with actual numbers
+    - Softmax attention weights (heatmap)
+    - Residual + LayerNorm
+    - FFN: `max(0, xW₁+b₁)W₂+b₂`
+ 5. **Decoder**:
+    - Masked self-attention with causal mask matrix
+    - Cross-attention: Q from decoder, K/V from encoder
+ 6. **Loss** — label-smoothed cross-entropy, gradient norms, Adam update
+
+ ### ⚡ Inference
+ Shows **auto-regressive decoding**:
+
+ - No ground truth needed
+ - Tokens are generated one at a time
+ - Top-5 candidates + probabilities at every step
+ - Cross-attention heatmap: which Bengali token attends to which English word
+ - Greedy vs. beam search comparison
+
+ ---
+
+ ## 📁 File Structure
+
+ ```
+ app.py           — Gradio UI + HTML/CSS/JS rendering
+ transformer.py   — Full Transformer with CalcLog hooks
+ training.py      — Training loop + single-step visualization
+ inference.py     — Greedy & beam search with logging
+ vocab.py         — English/Bengali vocabularies + parallel corpus
+ requirements.txt
+ ```
+
+ ---
+
+ ## ⚙️ Model Config
+
+ | Parameter | Value |
+ |-----------|-------|
+ | d_model | 64 |
+ | num_heads | 4 |
+ | num_layers | 2 |
+ | d_ff | 128 |
+ | vocab (EN) | ~100 |
+ | vocab (BN) | ~90 |
+ | Optimizer | Adam |
+ | Loss | Label-smoothed CE |
+
+ ---
+
+ *Built for educational purposes — every matrix operation is logged and displayed.*
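The encoder attention math listed under Training Step (`scores = Q·Kᵀ / √d_k`, then softmax) can be sketched in a few lines of NumPy. This is a minimal single-head sketch for illustration; the Space's actual implementation lives in transformer.py, which is not shown in this diff.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """scores = Q·Kᵀ / √d_k, row-wise softmax, then weighted sum of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # [T_q, T_k]
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query positions, d_k = 8
K = rng.normal(size=(5, 8))   # 5 key positions
V = rng.normal(size=(5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# out is [3, 8]; each row of w is a probability distribution over the 5 source positions
```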
app.py ADDED
@@ -0,0 +1,800 @@
+ """
+ app.py
+ Gradio Space: Interactive Transformer Visualizer — English → Bengali
+ """
+
+ import gradio as gr
+ import torch
+ import json
+ import os
+ import numpy as np
+ from pathlib import Path
+
+ from transformer import Transformer, CalcLog
+ from training import build_model, run_training, visualize_training_step, collate_batch
+ from inference import visualize_inference
+ from vocab import get_vocabs, PARALLEL_DATA, PAD_IDX
+
+ # ─────────────────────────────────────────────
+ # Global state
+ # ─────────────────────────────────────────────
+ DEVICE = "cpu"
+ src_v, tgt_v = get_vocabs()
+ MODEL: Transformer | None = None  # built lazily on first use
+ LOSS_HISTORY = []
+ IS_TRAINED = False
+
+
+ def get_or_init_model():
+     global MODEL
+     if MODEL is None:
+         MODEL = build_model(len(src_v), len(tgt_v), DEVICE)
+     return MODEL
+
+
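The sinusoidal positional encoding the Space visualizes (`sin(pos/10000^(2i/d))` / `cos(...)`) can be generated as a matrix like this — a standalone NumPy sketch, independent of the transformer.py implementation imported above:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]               # [max_len, 1]
    i = np.arange(0, d_model, 2)[None, :]           # even dimensions 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)   # [max_len, d_model/2]
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(10, 64)   # matches the Space's d_model = 64
# row 0 alternates 0, 1, 0, 1, ... since sin(0) = 0 and cos(0) = 1
```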
+ # ─────────────────────────────────────────────
+ # HTML renderer for calc log
+ # ─────────────────────────────────────────────
+
+ def render_matrix_html(val, max_rows=6, max_cols=8):
+     """Convert a nested list / scalar to an HTML matrix table."""
+     if isinstance(val, (int, float)):
+         return f'<span class="scalar-val">{val:.5f}</span>'
+     if isinstance(val, dict):
+         rows = "".join(
+             f'<tr><td class="dict-key">{k}</td><td class="dict-val">{v}</td></tr>'
+             for k, v in val.items()
+         )
+         return f'<table class="dict-table">{rows}</table>'
+     if isinstance(val, list):
+         # 0-D or scalar list
+         if len(val) == 0:
+             return "<em>empty</em>"
+         # 1-D
+         if not isinstance(val[0], list):
+             clipped = val[:max_cols * 2]
+             cells = "".join(
+                 f'<td class="mat-cell">{v:.4f}</td>'
+                 if isinstance(v, float) else f'<td class="mat-cell">{v}</td>'
+                 for v in clipped
+             )
+             suffix = f'<td class="mat-more">…+{len(val)-len(clipped)}</td>' if len(val) > len(clipped) else ""
+             return f'<table class="matrix-1d"><tr>{cells}{suffix}</tr></table>'
+         # 2-D
+         rows_html = ""
+         display_rows = val[:max_rows]
+         for row in display_rows:
+             display_cols = row[:max_cols]
+             cells = "".join(
+                 f'<td class="mat-cell" style="--v:{min(max(float(c), -1), 1):.3f}">'
+                 f'{float(c):.3f}</td>'
+                 if isinstance(c, (int, float)) else f'<td class="mat-cell">{c}</td>'
+                 for c in display_cols
+             )
+             suffix = '<td class="mat-more">…</td>' if len(row) > max_cols else ""
+             rows_html += f"<tr>{cells}{suffix}</tr>"
+         if len(val) > max_rows:
+             rows_html += f'<tr><td colspan="{max_cols+1}" class="mat-more">…{len(val)-max_rows} more rows</td></tr>'
+         return f'<table class="matrix-2d">{rows_html}</table>'
+     return f'<code>{str(val)[:200]}</code>'
+
+
+ def calc_log_to_html(steps):
+     """Turn CalcLog steps into a rich HTML accordion."""
+     if not steps:
+         return "<p style='color:#888'>No calculation log yet.</p>"
+
+     cards = []
+     for i, step in enumerate(steps):
+         name = step.get("name", f"step_{i}")
+         formula = step.get("formula", "")
+         note = step.get("note", "")
+         shape = step.get("shape")
+         val = step.get("value")
+
+         shape_badge = f'<span class="shape-badge">{shape}</span>' if shape else ""
+         formula_html = f'<div class="formula">⟨ {formula} ⟩</div>' if formula else ""
+         note_html = f'<div class="step-note">ℹ {note}</div>' if note else ""
+         matrix_html = render_matrix_html(val) if val is not None else ""
+
+         # Color category by name prefix.
+         # CROSS/MASK are checked before the generic ATTN match so that
+         # cross-attention and mask steps keep their own colors.
+         cat = "default"
+         n = name.upper()
+         if "EMBED" in n or "TOKEN" in n: cat = "embed"
+         elif "PE" in n or "POSITIONAL" in n: cat = "pe"
+         elif "CROSS" in n: cat = "cross"
+         elif "MASK" in n: cat = "mask"
+         elif "SOFTMAX" in n or "ATTN" in n or "_Q" in n or "_K" in n or "_V" in n: cat = "attn"
+         elif "FFN" in n or "LINEAR" in n or "RELU" in n: cat = "ffn"
+         elif "NORM" in n or "RESIDUAL" in n: cat = "norm"
+         elif "LOSS" in n or "GRAD" in n or "OPTIM" in n: cat = "loss"
+         elif "INFERENCE" in n or "GREEDY" in n or "BEAM" in n: cat = "infer"
+
+         cards.append(f"""
+ <div class="calc-card cat-{cat}" data-idx="{i}">
+   <div class="calc-header">
+     <span class="step-num">#{i+1}</span>
+     <span class="step-name cat-label-{cat}">{name.replace('_', ' ')}</span>
+     {shape_badge}
+     <span class="toggle-arrow">▶</span>
+   </div>
+   <div class="calc-body" style="display:none">
+     {formula_html}
+     {note_html}
+     <div class="matrix-wrap">{matrix_html}</div>
+   </div>
+ </div>""")
+
+     return "\n".join(cards)
+
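transformer.py's CalcLog class is imported but not shown in this diff; for reference, a minimal stand-in that produces step dicts with the keys calc_log_to_html reads (name, formula, note, shape, value) might look like the sketch below. The class name `MiniCalcLog` and the `log` signature are assumptions for illustration, not the actual API.

```python
class MiniCalcLog:
    """Hypothetical stand-in: collects step dicts shaped like those the UI renders."""

    def __init__(self):
        self.steps = []

    def log(self, name, value=None, formula="", note="", shape=None):
        # One dict per logged operation, keyed the way calc_log_to_html expects.
        self.steps.append({
            "name": name,
            "value": value,
            "formula": formula,
            "note": note,
            "shape": shape,
        })

log = MiniCalcLog()
log.log("ENC_SCORES", value=[[0.1, 0.9]], formula="Q·Kᵀ/√d_k", shape="[1,2]")
```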
+
+ # ─────────────────────────────────────────────
+ # Attention heatmap HTML
+ # ─────────────────────────────────────────────
+
+ def attention_heatmap_html(weights, row_labels, col_labels, title="Attention"):
+     """weights: 2-D list [tgt, src] of attention weights in [0, 1]."""
+     if not weights:
+         return ""
+     rows_html = ""
+     for i, row in enumerate(weights):
+         cells = ""
+         for j, w in enumerate(row):
+             alpha = min(float(w), 1.0)
+             row_lbl = row_labels[i] if i < len(row_labels) else i
+             col_lbl = col_labels[j] if j < len(col_labels) else j
+             cells += f'<td class="heat-cell" style="--a:{alpha:.3f}" title="{row_lbl}→{col_lbl}: {alpha:.3f}">{alpha:.2f}</td>'
+         lbl = row_labels[i] if i < len(row_labels) else str(i)
+         rows_html += f'<tr><td class="heat-label">{lbl}</td>{cells}</tr>'
+     header = '<tr><td></td>' + "".join(f'<td class="heat-col-label">{c}</td>' for c in col_labels) + '</tr>'
+     return f"""
+ <div class="heatmap-container">
+   <div class="heatmap-title">{title}</div>
+   <table class="heatmap">{header}{rows_html}</table>
+ </div>"""
+
155
+ # ─────────────────────────────────────────────
156
+ # Decoding steps HTML
157
+ # ─────────────────────────────────────────────
158
+
159
+ def decode_steps_html(step_logs, src_tokens):
160
+ if not step_logs:
161
+ return ""
162
+ html = '<div class="decode-steps"><div class="decode-title">🔁 Auto-regressive Decoding Steps</div>'
163
+ for s in step_logs:
164
+ step = s.get("step", 0)
165
+ tokens_so_far = s.get("tokens_so_far", [])
166
+ top5 = s.get("top5", [])
167
+ chosen = s.get("chosen_token", "?")
168
+ prob = s.get("chosen_prob", 0)
169
+
170
+ bars = ""
171
+ if top5:
172
+ max_p = max(t["prob"] for t in top5) or 1
173
+ for t in top5:
174
+ pct = t["prob"] / max_p * 100
175
+ is_chosen = "chosen" if t["token"] == chosen else ""
176
+ bars += f"""<div class="bar-row {is_chosen}">
177
+ <span class="bar-label">{t['token']}</span>
178
+ <div class="bar" style="width:{pct:.1f}%"></div>
179
+ <span class="bar-prob">{t['prob']:.3f}</span>
180
+ </div>"""
181
+
182
+ cross_heat = ""
183
+ if s.get("cross_attn") and src_tokens:
184
+ attn_mat = s["cross_attn"] # [num_heads][T_q][T_src]
185
+ if attn_mat and attn_mat[0]:
186
+ # Take head-0, last decoded position → [T_src] floats
187
+ last_pos_attn = attn_mat[0][-1] # [T_src]
188
+ last_row = [last_pos_attn] # [[T_src]] — 2D for heatmap
189
+ cross_heat = attention_heatmap_html(
190
+ last_row, [chosen], src_tokens,
191
+ title=f"Cross-Attn: '{chosen}' → English"
192
+ )
193
+
194
+ html += f"""
195
+ <div class="decode-step">
196
+ <div class="decode-step-header">
197
+ <span class="step-badge">Step {step+1}</span>
198
+ <span class="step-ctx">Context: {' '.join(tokens_so_far)}</span>
199
+ <span class="step-arrow">→</span>
200
+ <span class="step-chosen">'{chosen}'</span>
201
+ <span class="step-prob">{prob:.3f}</span>
202
+ </div>
203
+ <div class="step-bars">{bars}</div>
204
+ {cross_heat}
205
+ </div>"""
206
+ html += "</div>"
207
+ return html
208
+
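decode_steps_html above consumes step logs with `step` / `tokens_so_far` / `top5` / `chosen_token` / `chosen_prob` keys. A toy greedy decoding loop producing logs of that shape can be sketched as follows — the vocabulary and the `next_logits` scoring function are invented purely for illustration and stand in for the real model in inference.py:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy Bengali vocabulary, invented for this sketch.
VOCAB = ["<sos>", "আমি", "ভাত", "খাই", "<eos>"]

def next_logits(context):
    # Hypothetical scoring: simply prefer the token after the last one generated.
    last = VOCAB.index(context[-1])
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0 for i in range(len(VOCAB))]

def greedy_decode(max_steps=3):
    tokens, logs = ["<sos>"], []
    for step in range(max_steps):
        probs = softmax(next_logits(tokens))
        ranked = sorted(zip(VOCAB, probs), key=lambda t: -t[1])[:5]
        chosen, p = ranked[0]
        logs.append({
            "step": step,
            "tokens_so_far": list(tokens),
            "top5": [{"token": t, "prob": pr} for t, pr in ranked],
            "chosen_token": chosen,
            "chosen_prob": p,
        })
        tokens.append(chosen)
        if chosen == "<eos>":
            break
    return tokens, logs

tokens, logs = greedy_decode()
```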
+
+ # ─────────────────────────────────────────────
+ # Architecture SVG
+ # ─────────────────────────────────────────────
+
+ ARCH_SVG = """
+ <div id="arch-diagram">
+ <svg viewBox="0 0 820 900" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:820px;margin:auto;display:block">
+   <defs>
+     <marker id="arr" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
+       <path d="M0,0 L0,6 L8,3 z" fill="#64ffda"/>
+     </marker>
+     <filter id="glow">
+       <feGaussianBlur stdDeviation="2" result="blur"/>
+       <feMerge><feMergeNode in="blur"/><feMergeNode in="SourceGraphic"/></feMerge>
+     </filter>
+   </defs>
+
+   <!-- Background -->
+   <rect width="820" height="900" fill="#0a0f1e" rx="12"/>
+
+   <!-- Title -->
+   <text x="410" y="35" text-anchor="middle" fill="#64ffda" font-size="16" font-family="monospace" font-weight="bold">Transformer Architecture — English → Bengali</text>
+
+   <!-- ── ENCODER (left) ── -->
+   <rect x="40" y="60" width="330" height="720" rx="10" fill="#0d1b2a" stroke="#1e4d6b" stroke-width="1.5"/>
+   <text x="205" y="90" text-anchor="middle" fill="#4fc3f7" font-size="13" font-weight="bold">ENCODER</text>
+
+   <!-- Input Embedding -->
+   <rect x="70" y="110" width="270" height="40" rx="6" fill="#1a3a5c" stroke="#4fc3f7" stroke-width="1.5"/>
+   <text x="205" y="135" text-anchor="middle" fill="#e0f7fa" font-size="11">Input Embedding + Positional Encoding</text>
+
+   <!-- Encoder Layer Box -->
+   <rect x="60" y="175" width="290" height="340" rx="8" fill="#112233" stroke="#1e4d6b" stroke-width="1" stroke-dasharray="4"/>
+   <text x="100" y="198" fill="#607d8b" font-size="10">Encoder Layer × N</text>
+
+   <!-- Multi-Head Self-Attention -->
+   <rect x="80" y="210" width="250" height="50" rx="6" fill="#1b3a4b" stroke="#26c6da" stroke-width="1.5"/>
+   <text x="205" y="232" text-anchor="middle" fill="#e0f7fa" font-size="11" font-weight="bold">Multi-Head Self-Attention</text>
+   <text x="205" y="248" text-anchor="middle" fill="#80deea" font-size="9">Q = K = V = encoder input</text>
+
+   <!-- Add & Norm 1 -->
+   <rect x="80" y="278" width="250" height="30" rx="5" fill="#1a2a3a" stroke="#607d8b" stroke-width="1"/>
+   <text x="205" y="298" text-anchor="middle" fill="#b0bec5" font-size="10">Add &amp; Norm</text>
+
+   <!-- FFN -->
+   <rect x="80" y="328" width="250" height="50" rx="6" fill="#1b3a4b" stroke="#26c6da" stroke-width="1.5"/>
+   <text x="205" y="350" text-anchor="middle" fill="#e0f7fa" font-size="11" font-weight="bold">Feed-Forward Network</text>
+   <text x="205" y="366" text-anchor="middle" fill="#80deea" font-size="9">FFN(x) = max(0, xW₁+b₁)W₂+b₂</text>
+
+   <!-- Add & Norm 2 -->
+   <rect x="80" y="396" width="250" height="30" rx="5" fill="#1a2a3a" stroke="#607d8b" stroke-width="1"/>
+   <text x="205" y="416" text-anchor="middle" fill="#b0bec5" font-size="10">Add &amp; Norm</text>
+
+   <!-- Encoder output arrow down -->
+   <line x1="205" y1="455" x2="205" y2="550" stroke="#64ffda" stroke-width="1.5" marker-end="url(#arr)"/>
+   <text x="215" y="510" fill="#64ffda" font-size="9">K, V to</text>
+   <text x="215" y="522" fill="#64ffda" font-size="9">decoder</text>
+
+   <!-- Encoder output box -->
+   <rect x="70" y="555" width="270" height="40" rx="6" fill="#0d2b1a" stroke="#00e676" stroke-width="1.5"/>
+   <text x="205" y="580" text-anchor="middle" fill="#a5d6a7" font-size="11">Encoder Output (K, V)</text>
+
+   <!-- ── DECODER (right) ── -->
+   <rect x="450" y="60" width="330" height="720" rx="10" fill="#1a0d2a" stroke="#4a1b6b" stroke-width="1.5"/>
+   <text x="615" y="90" text-anchor="middle" fill="#ce93d8" font-size="13" font-weight="bold">DECODER</text>
+
+   <!-- Target Embedding -->
+   <rect x="480" y="110" width="270" height="40" rx="6" fill="#3a1a5c" stroke="#ce93d8" stroke-width="1.5"/>
+   <text x="615" y="135" text-anchor="middle" fill="#f3e5f5" font-size="11">Target Embedding + Positional Encoding</text>
+
+   <!-- Decoder Layer Box -->
+   <rect x="470" y="175" width="290" height="460" rx="8" fill="#1a1133" stroke="#4a1b6b" stroke-width="1" stroke-dasharray="4"/>
+   <text x="510" y="198" fill="#607d8b" font-size="10">Decoder Layer × N</text>
+
+   <!-- Masked MHA -->
+   <rect x="490" y="210" width="250" height="50" rx="6" fill="#2b1b3a" stroke="#ab47bc" stroke-width="1.5"/>
+   <text x="615" y="232" text-anchor="middle" fill="#f3e5f5" font-size="11" font-weight="bold">Masked Multi-Head Self-Attention</text>
+   <text x="615" y="248" text-anchor="middle" fill="#ce93d8" font-size="9">Q = K = V = decoder input (causal mask)</text>
+
+   <!-- Add & Norm D1 -->
+   <rect x="490" y="278" width="250" height="30" rx="5" fill="#2a1a3a" stroke="#607d8b" stroke-width="1"/>
+   <text x="615" y="298" text-anchor="middle" fill="#b0bec5" font-size="10">Add &amp; Norm</text>
+
+   <!-- Cross-Attention -->
+   <rect x="490" y="328" width="250" height="60" rx="6" fill="#1b2b4b" stroke="#29b6f6" stroke-width="2" filter="url(#glow)"/>
+   <text x="615" y="350" text-anchor="middle" fill="#e1f5fe" font-size="11" font-weight="bold">Cross-Attention</text>
+   <text x="615" y="366" text-anchor="middle" fill="#81d4fa" font-size="9">Q = decoder | K, V = encoder</text>
+   <text x="615" y="380" text-anchor="middle" fill="#29b6f6" font-size="9" font-weight="bold">← KEY CONNECTION</text>
+
+   <!-- Add & Norm D2 -->
+   <rect x="490" y="408" width="250" height="30" rx="5" fill="#2a1a3a" stroke="#607d8b" stroke-width="1"/>
+   <text x="615" y="428" text-anchor="middle" fill="#b0bec5" font-size="10">Add &amp; Norm</text>
+
+   <!-- FFN Decoder -->
+   <rect x="490" y="458" width="250" height="50" rx="6" fill="#2b1b3a" stroke="#ab47bc" stroke-width="1.5"/>
+   <text x="615" y="480" text-anchor="middle" fill="#f3e5f5" font-size="11" font-weight="bold">Feed-Forward Network</text>
+   <text x="615" y="496" text-anchor="middle" fill="#ce93d8" font-size="9">FFN(x) = max(0, xW₁+b₁)W₂+b₂</text>
+
+   <!-- Add & Norm D3 -->
+   <rect x="490" y="526" width="250" height="30" rx="5" fill="#2a1a3a" stroke="#607d8b" stroke-width="1"/>
+   <text x="615" y="546" text-anchor="middle" fill="#b0bec5" font-size="10">Add &amp; Norm</text>
+
+   <!-- Output Linear + Softmax -->
+   <rect x="480" y="600" width="270" height="40" rx="6" fill="#2b1b0a" stroke="#ffb300" stroke-width="1.5"/>
+   <text x="615" y="625" text-anchor="middle" fill="#fff8e1" font-size="11">Linear + Softmax → Bengali Token</text>
+
+   <!-- Cross-attention arrow from encoder to decoder -->
+   <path d="M340,590 Q410,480 490,368" stroke="#29b6f6" stroke-width="2" fill="none"
+         stroke-dasharray="6,3" marker-end="url(#arr)"/>
+   <text x="390" y="500" fill="#29b6f6" font-size="9" transform="rotate(-50,390,500)">K, V flow</text>
+
+   <!-- Input arrow -->
+   <line x1="205" y1="840" x2="205" y2="780" stroke="#4fc3f7" stroke-width="1.5" marker-end="url(#arr)"/>
+   <text x="205" y="858" text-anchor="middle" fill="#4fc3f7" font-size="11">English Input</text>
+
+   <line x1="615" y1="840" x2="615" y2="660" stroke="#ce93d8" stroke-width="1.5" marker-end="url(#arr)"/>
+   <text x="615" y="858" text-anchor="middle" fill="#ce93d8" font-size="11">Bengali Output</text>
+
+   <!-- Legend -->
+   <rect x="60" y="870" width="700" height="20" rx="4" fill="#0a1520" stroke="#1e2d3d" stroke-width="1"/>
+   <circle cx="80" cy="880" r="4" fill="#26c6da"/><text x="88" y="884" fill="#80deea" font-size="8">Self-Attention</text>
+   <circle cx="160" cy="880" r="4" fill="#29b6f6"/><text x="168" y="884" fill="#81d4fa" font-size="8">Cross-Attention</text>
+   <circle cx="250" cy="880" r="4" fill="#ab47bc"/><text x="258" y="884" fill="#ce93d8" font-size="8">Masked Attn</text>
+   <circle cx="350" cy="880" r="4" fill="#00e676"/><text x="358" y="884" fill="#a5d6a7" font-size="8">Enc→Dec K,V</text>
+   <circle cx="450" cy="880" r="4" fill="#ffb300"/><text x="458" y="884" fill="#fff8e1" font-size="8">Output Layer</text>
+ </svg>
+ </div>
+ """
+
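The causal mask referenced in the masked self-attention block above lets position t attend only to positions ≤ t. A standalone sketch of how such a mask is built (the real construction lives in transformer.py, not shown here):

```python
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    """Lower-triangular [T, T] boolean mask: True where attention is allowed."""
    return np.tril(np.ones((T, T), dtype=bool))

mask = causal_mask(4)
# Before the softmax, scores are typically set to -inf where the mask is False,
# so those (future) positions receive exactly zero attention weight.
```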
+
+ # ─────────────────────────────────────────────
+ # CSS + JS
+ # ─────────────────────────────────────────────
+
+ CUSTOM_CSS = """
+ /* ── fonts ── */
+ @import url('https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@300;400;600&family=Syne:wght@400;700;800&display=swap');
+
+ :root {
+     --bg: #07090f;
+     --bg2: #0d1120;
+     --bg3: #111827;
+     --card: #141c2e;
+     --border: #1e2d45;
+     --accent: #64ffda;
+     --accent2: #29b6f6;
+     --accent3: #ce93d8;
+     --accent4: #ffb300;
+     --text: #e2e8f0;
+     --muted: #64748b;
+     --embed: #4fc3f7;
+     --pe: #26c6da;
+     --attn: #f06292;
+     --ffn: #aed581;
+     --norm: #90a4ae;
+     --loss: #ef9a9a;
+     --infer: #80cbc4;
+     --cross: #29b6f6;
+     --mask: #ffb300;
+ }
+
+ body, .gradio-container { background: var(--bg) !important; color: var(--text) !important; font-family: 'JetBrains Mono', monospace !important; }
+
+ h1, h2, h3 { font-family: 'Syne', sans-serif !important; }
+
+ /* ── tabs ── */
+ .tab-nav button { background: var(--bg3) !important; color: var(--muted) !important; border: 1px solid var(--border) !important; font-family: 'JetBrains Mono', monospace !important; letter-spacing: 1px; }
+ .tab-nav button.selected { background: var(--card) !important; color: var(--accent) !important; border-color: var(--accent) !important; box-shadow: 0 0 8px rgba(100,255,218,0.2); }
+
+ /* ── inputs ── */
+ input[type=text], textarea { background: var(--bg3) !important; color: var(--text) !important; border: 1px solid var(--border) !important; border-radius: 6px !important; font-family: 'JetBrains Mono', monospace !important; }
+ input[type=text]:focus, textarea:focus { border-color: var(--accent) !important; box-shadow: 0 0 6px rgba(100,255,218,0.2) !important; }
+
+ button.primary { background: linear-gradient(135deg, #0d3d30, #0d3d4d) !important; color: var(--accent) !important; border: 1px solid var(--accent) !important; font-family: 'JetBrains Mono', monospace !important; font-weight: 600 !important; letter-spacing: 1px; transition: all 0.2s; }
+ button.primary:hover { background: linear-gradient(135deg, #1a5c4a, #1a5c6d) !important; box-shadow: 0 0 12px rgba(100,255,218,0.3) !important; }
+
+ /* ── calc cards ── */
+ .calc-card { border-radius: 8px; margin: 4px 0; border: 1px solid var(--border); background: var(--card); overflow: hidden; }
+ .calc-header { display: flex; align-items: center; gap: 8px; padding: 8px 12px; cursor: pointer; user-select: none; transition: background 0.15s; }
+ .calc-header:hover { background: rgba(255,255,255,0.03); }
+ .calc-body { padding: 10px 14px; background: var(--bg2); border-top: 1px solid var(--border); }
+ .step-num { color: var(--muted); font-size: 11px; min-width: 28px; }
+ .step-name { font-weight: 600; font-size: 12px; flex: 1; }
+ .toggle-arrow { color: var(--muted); font-size: 10px; transition: transform 0.2s; }
+ .toggle-arrow.open { transform: rotate(90deg); }
+ .shape-badge { background: var(--bg3); color: var(--muted); font-size: 10px; padding: 1px 6px; border-radius: 4px; border: 1px solid var(--border); }
+ .formula { color: var(--accent); font-size: 11px; font-style: italic; margin-bottom: 4px; background: rgba(100,255,218,0.05); padding: 4px 8px; border-radius: 4px; border-left: 2px solid var(--accent); }
+ .step-note { color: var(--muted); font-size: 11px; margin-bottom: 6px; }
+
+ /* category colors */
+ .cat-label-embed { color: var(--embed); }
+ .cat-label-pe { color: var(--pe); }
+ .cat-label-attn { color: var(--attn); }
+ .cat-label-ffn { color: var(--ffn); }
+ .cat-label-norm { color: var(--norm); }
+ .cat-label-loss { color: var(--loss); }
+ .cat-label-infer { color: var(--infer); }
+ .cat-label-cross { color: var(--cross); }
+ .cat-label-mask { color: var(--mask); }
+ .cat-label-default{ color: var(--text); }
+
+ .cat-embed { border-left: 3px solid var(--embed); }
+ .cat-pe { border-left: 3px solid var(--pe); }
+ .cat-attn { border-left: 3px solid var(--attn); }
+ .cat-ffn { border-left: 3px solid var(--ffn); }
+ .cat-norm { border-left: 3px solid var(--norm); }
+ .cat-loss { border-left: 3px solid var(--loss); }
+ .cat-infer { border-left: 3px solid var(--infer); }
+ .cat-cross { border-left: 3px solid var(--cross); }
+ .cat-mask { border-left: 3px solid var(--mask); }
+ .cat-default{ border-left: 3px solid var(--border); }
+
+ /* ── matrix tables ── */
+ .matrix-wrap { overflow-x: auto; }
+ .matrix-2d, .matrix-1d { border-collapse: collapse; font-size: 10px; font-family: 'JetBrains Mono', monospace; }
+ .mat-cell {
+     padding: 2px 5px; text-align: right; min-width: 48px;
+     background: color-mix(in srgb, #29b6f6 calc((var(--v,0) + 1) * 30%), #0d1120 calc(100% - (var(--v,0) + 1) * 30%));
+     color: #e2e8f0; border: 1px solid rgba(255,255,255,0.05);
+ }
+ .mat-more { color: var(--muted); font-style: italic; font-size: 9px; padding: 2px 6px; }
+ .dict-table { font-size: 11px; width: 100%; }
+ .dict-key { color: var(--accent); padding: 2px 8px 2px 0; }
+ .dict-val { color: var(--text); padding: 2px; }
+ .scalar-val { color: var(--accent4); font-size: 13px; font-weight: 600; }
+
+ /* ── heatmap ── */
+ .heatmap-container { margin: 8px 0; }
+ .heatmap-title { color: var(--accent2); font-size: 11px; margin-bottom: 4px; font-weight: 600; }
+ .heatmap { border-collapse: collapse; font-size: 10px; }
+ .heat-cell {
+     width: 36px; height: 24px; text-align: center;
+     background: rgba(41, 182, 246, calc(var(--a, 0)));
+     border: 1px solid rgba(255,255,255,0.04);
+     color: color-mix(in srgb, #fff calc(var(--a,0)*100%), #4a5568 calc(100% - var(--a,0)*100%));
+     font-size: 9px; cursor: default;
+ }
+ .heat-cell:hover { outline: 1px solid var(--accent); }
+ .heat-label { color: var(--accent3); font-size: 10px; padding-right: 6px; white-space: nowrap; }
+ .heat-col-label { color: var(--embed); font-size: 9px; text-align: center; padding-bottom: 2px; }
+
+ /* ── decode steps ── */
+ .decode-steps { margin-top: 12px; }
+ .decode-title { color: var(--accent); font-size: 13px; font-weight: 700; margin-bottom: 10px; padding-bottom: 4px; border-bottom: 1px solid var(--border); }
+ .decode-step { border: 1px solid var(--border); border-radius: 8px; margin: 6px 0; padding: 10px; background: var(--card); }
+ .decode-step-header { display: flex; align-items: center; gap: 8px; flex-wrap: wrap; margin-bottom: 8px; }
+ .step-badge { background: var(--accent); color: var(--bg); font-size: 10px; font-weight: 700; padding: 2px 8px; border-radius: 20px; }
+ .step-ctx { color: var(--muted); font-size: 11px; }
+ .step-arrow { color: var(--accent4); }
+ .step-chosen { color: var(--accent3); font-size: 13px; font-weight: 700; }
+ .step-prob { color: var(--accent4); font-size: 11px; }
+ .step-bars { margin: 4px 0; }
+ .bar-row { display: flex; align-items: center; gap: 6px; margin: 2px 0; }
+ .bar-row.chosen .bar-label { color: var(--accent3); font-weight: 700; }
+ .bar-row.chosen .bar { background: var(--accent3) !important; }
+ .bar-label { width: 100px; text-align: right; font-size: 11px; color: var(--text); white-space: nowrap; overflow: hidden; text-overflow: ellipsis; }
+ .bar { height: 14px; background: var(--accent2); border-radius: 2px; transition: width 0.4s; min-width: 2px; }
+ .bar-prob { font-size: 10px; color: var(--muted); }
+
+ /* ── loss chart ── */
+ #loss-chart-container { background: var(--bg2); border: 1px solid var(--border); border-radius: 8px; padding: 12px; margin-top: 8px; }
+
+ /* ── arch diagram ── */
+ #arch-diagram { background: var(--bg2); border: 1px solid var(--border); border-radius: 10px; padding: 12px; margin: 8px 0; }
+
+ /* ── result banner ── */
+ .result-banner { background: linear-gradient(135deg, #0d3d30, #1a1a3d); border: 1px solid var(--accent); border-radius: 10px; padding: 16px 20px; margin: 10px 0; }
+ .result-en { color: var(--embed); font-size: 14px; margin-bottom: 4px; }
+ .result-bn { color: var(--accent3); font-size: 22px; font-weight: 700; letter-spacing: 1px; }
+ .result-label { color: var(--muted); font-size: 10px; text-transform: uppercase; letter-spacing: 1px; }
+
+ /* ── misc ── */
+ .gradio-html { background: transparent !important; }
+ .panel { background: var(--card) !important; border: 1px solid var(--border) !important; border-radius: 10px !important; }
+ .log-container { max-height: 600px; overflow-y: auto; padding: 8px; scrollbar-width: thin; scrollbar-color: var(--border) transparent; }
+ """
+
+ CUSTOM_JS = """
+ // Card toggle
+ window._toggleCard = function(header) {
+     const body = header.nextElementSibling;
+     const arrow = header.querySelector('.toggle-arrow');
+     if (!body) return;
+     const open = body.style.display === 'block';
+     body.style.display = open ? 'none' : 'block';
+     if (arrow) arrow.classList.toggle('open', !open);
+ };
+ window._expandAll = function() {
+     document.querySelectorAll('.calc-body').forEach(b => b.style.display = 'block');
+     document.querySelectorAll('.toggle-arrow').forEach(a => a.classList.add('open'));
+ };
+ window._collapseAll = function() {
+     document.querySelectorAll('.calc-body').forEach(b => b.style.display = 'none');
+     document.querySelectorAll('.toggle-arrow').forEach(a => a.classList.remove('open'));
+ };
+ window._filterCards = function(cat) {
+     document.querySelectorAll('.calc-card').forEach(c => {
+         c.style.display = (!cat || c.classList.contains('cat-' + cat)) ? '' : 'none';
+     });
+ };
+ // Event delegation — works even if Gradio strips onclick attrs
+ document.addEventListener('click', function(e) {
+     const header = e.target.closest('.calc-header');
+     if (header) { window._toggleCard(header); return; }
+     const btn = e.target.closest('[data-ga]');
+     if (btn) {
+         const a = btn.dataset.ga;
+         if (a === 'expand') window._expandAll();
+         else if (a === 'collapse') window._collapseAll();
+         else if (a.startsWith('filter:')) window._filterCards(a.slice(7));
+     }
+ }, true);
+ """
+
524
+ # ─────────────────────────────────────────────
525
+ # Pure-SVG loss curve (no JS/canvas needed)
526
+ # ─────────────────────────────────────────────
527
+
528
+ def _loss_svg(losses):
529
+ if not losses:
530
+ return ""
531
+ W, H = 580, 200
532
+ pl, pr, pt, pb = 52, 16, 16, 36
533
+ pw, ph = W - pl - pr, H - pt - pb
534
+ mn, mx = min(losses), max(losses)
535
+ rng = mx - mn or 1
536
+ n = len(losses)
537
+
538
+ def px(i): return pl + (i / max(n - 1, 1)) * pw
539
+ def py(v): return pt + ph - ((v - mn) / rng) * ph
540
+
541
+ # Grid + Y labels
542
+ grid = ""
543
+ for k in range(5):
544
+ v = mn + (k / 4) * rng
545
+ y = py(v)
546
+ grid += f'<line x1="{pl}" y1="{y:.1f}" x2="{pl+pw}" y2="{y:.1f}" stroke="#1e2d45" stroke-width="0.5"/>'
547
+ grid += f'<text x="{pl-4}" y="{y+4:.1f}" text-anchor="end" fill="#64748b" font-size="9" font-family="monospace">{v:.3f}</text>'
548
+
549
+ # Polyline points
550
+ pts = " ".join(f"{px(i):.1f},{py(v):.1f}" for i, v in enumerate(losses))
551
+ fill_pts = f"{pl:.1f},{pt+ph:.1f} {pts} {pl+pw:.1f},{pt+ph:.1f}"
552
+
553
+ # X labels
554
+ xlabels = ""
555
+ for idx in ([0, n//4, n//2, 3*n//4, n-1] if n > 4 else range(n)):
556
+ xlabels += f'<text x="{px(idx):.1f}" y="{H-4}" text-anchor="middle" fill="#64748b" font-size="9" font-family="monospace">E{idx+1}</text>'
557
+
558
+ return f"""
559
+ <div style="background:#0d1120;border:1px solid #1e2d45;border-radius:8px;padding:12px;margin-top:8px">
560
+ <div style="color:#64ffda;font-size:13px;font-weight:700;margin-bottom:8px">📉 Training Loss Curve</div>
561
+ <svg width="{W}" height="{H}" style="display:block;max-width:100%">
562
+ <defs>
563
+ <linearGradient id="lcg" x1="0" y1="0" x2="{W}" y2="0" gradientUnits="userSpaceOnUse">
564
+ <stop offset="0%" stop-color="#64ffda"/><stop offset="100%" stop-color="#29b6f6"/>
565
+ </linearGradient>
566
+ </defs>
567
+ <rect width="{W}" height="{H}" fill="#0d1120"/>
568
+ {grid}
569
+ <polygon points="{fill_pts}" fill="rgba(100,255,218,0.08)"/>
570
+ <polyline points="{pts}" fill="none" stroke="url(#lcg)" stroke-width="2.5" stroke-linejoin="round"/>
571
+ {xlabels}
572
+ </svg>
573
+ </div>"""
574
+
+
+# ─────────────────────────────────────────────
+# Gradio callbacks
+# ─────────────────────────────────────────────
+
+def do_train(epochs_str, progress=gr.Progress()):
+    global MODEL, LOSS_HISTORY, IS_TRAINED
+    try:
+        epochs = int(epochs_str)
+    except (TypeError, ValueError):
+        epochs = 30
+
+    losses = []
+    def cb(ep, total, loss):
+        losses.append(loss)
+        progress((ep/total), desc=f"Epoch {ep}/{total} — loss {loss:.4f}")
+
+    MODEL, LOSS_HISTORY = run_training(epochs=epochs, device=DEVICE, progress_cb=cb)
+    IS_TRAINED = True
+
+    chart_html = _loss_svg(LOSS_HISTORY)
+    return (
+        f"✅ Trained {epochs} epochs. Final loss: {LOSS_HISTORY[-1]:.4f}",
+        chart_html
+    )
+
+
+def do_training_viz(en_sentence, bn_sentence):
+    model = get_or_init_model()
+    if not en_sentence.strip():
+        return "<p style='color:red'>Please enter an English sentence.</p>", "", ""
+    if not bn_sentence.strip():
+        bn_sentence = "আমি তোমাকে ভালোবাসি"
+
+    result = visualize_training_step(model, en_sentence.strip(), bn_sentence.strip(), DEVICE)
+
+    # Attention heatmap (cross-attn layer 0, head 0)
+    meta = result.get("meta", {})
+    attn_html = ""
+    src_tokens = result.get("src_tokens", [])
+    tgt_tokens = result.get("tgt_tokens", [])
+
+    result_banner = f"""
+    <div class="result-banner">
+      <div class="result-label">English Input</div>
+      <div class="result-en">"{en_sentence}"</div>
+      <div class="result-label" style="margin-top:8px">Bengali (Teacher-forced)</div>
+      <div class="result-bn">{bn_sentence}</div>
+      <div style="margin-top:8px;color:var(--loss);font-size:13px">
+        📉 Loss: <strong>{result['loss']:.4f}</strong>
+      </div>
+    </div>"""
+
+    calc_html = f"""
+    <div style="margin-bottom:8px;display:flex;gap:6px;flex-wrap:wrap">
+      <button data-ga="expand" style="background:var(--card);color:var(--accent);border:1px solid var(--border);padding:3px 10px;border-radius:4px;cursor:pointer;font-size:11px">Expand All</button>
+      <button data-ga="collapse" style="background:var(--card);color:var(--muted);border:1px solid var(--border);padding:3px 10px;border-radius:4px;cursor:pointer;font-size:11px">Collapse All</button>
+      {"".join(f'<button data-ga="filter:{cat}" style="background:var(--card);color:var(--cat-{cat},var(--text));border:1px solid var(--border);padding:3px 10px;border-radius:4px;cursor:pointer;font-size:10px">{cat}</button>' for cat in ['embed','pe','attn','ffn','norm','loss','cross','mask'])}
+      <button data-ga="filter:" style="background:var(--card);color:var(--muted);border:1px solid var(--border);padding:3px 10px;border-radius:4px;cursor:pointer;font-size:10px">show all</button>
+    </div>
+    <div class="log-container">
+      {calc_log_to_html(result.get('calc_log', []))}
+    </div>"""
+
+    return result_banner, calc_html, attn_html
+
+
+def do_inference_viz(en_sentence, decode_method):
+    model = get_or_init_model()
+    if not en_sentence.strip():
+        return "<p style='color:red'>Please enter an English sentence.</p>", "", ""
+
+    result = visualize_inference(model, en_sentence.strip(), DEVICE, decode_method)
+
+    result_banner = f"""
+    <div class="result-banner">
+      <div class="result-label">English Input</div>
+      <div class="result-en">"{en_sentence}"</div>
+      <div class="result-label" style="margin-top:8px">Bengali Translation ({decode_method})</div>
+      <div class="result-bn">{result['translation'] or '(no output)'}</div>
+      <div style="margin-top:6px;color:var(--muted);font-size:11px">
+        Tokens: {' → '.join(result['output_tokens'])}
+      </div>
+    </div>"""
+
+    decode_html = decode_steps_html(result.get("step_logs", []), result.get("src_tokens", []))
+
+    calc_html = f"""
+    <div style="margin-bottom:8px;display:flex;gap:6px;flex-wrap:wrap">
+      <button data-ga="expand" style="background:var(--card);color:var(--accent);border:1px solid var(--border);padding:3px 10px;border-radius:4px;cursor:pointer;font-size:11px">Expand All</button>
+      <button data-ga="collapse" style="background:var(--card);color:var(--muted);border:1px solid var(--border);padding:3px 10px;border-radius:4px;cursor:pointer;font-size:11px">Collapse All</button>
+    </div>
+    <div class="log-container">
+      {calc_log_to_html(result.get('calc_log', []))}
+    </div>"""
+
+    return result_banner + decode_html, calc_html, ""
+
+
+# ─────────────────────────────────────────────
+# Build UI
+# ─────────────────────────────────────────────
+
+def build_ui():
+    _theme = gr.themes.Base(primary_hue="teal", secondary_hue="purple", neutral_hue="slate")
+    with gr.Blocks(
+        title="Transformer Visualizer — EN→BN",
+        css=CUSTOM_CSS,
+        js=f"() => {{ {CUSTOM_JS} }}",
+        theme=_theme,
+    ) as demo:
+
+        gr.HTML("""
+        <div style="text-align:center;padding:24px 0 12px;border-bottom:1px solid #1e2d45;margin-bottom:16px">
+          <div style="font-family:'Syne',sans-serif;font-size:28px;font-weight:800;
+                      background:linear-gradient(135deg,#64ffda,#29b6f6,#ce93d8);
+                      -webkit-background-clip:text;-webkit-text-fill-color:transparent;letter-spacing:2px">
+            TRANSFORMER VISUALIZER
+          </div>
+          <div style="color:#64748b;font-size:12px;letter-spacing:3px;margin-top:4px;font-family:'JetBrains Mono',monospace">
+            ENGLISH → BENGALI · EVERY CALCULATION EXPOSED
+          </div>
+        </div>
+        """)
+
+        with gr.Tabs():
+
+            # ── TAB 0: Architecture ──────────────────
+            with gr.Tab("🏗️ Architecture"):
+                gr.HTML(ARCH_SVG)
+                gr.HTML("""
+                <div style="display:grid;grid-template-columns:1fr 1fr;gap:12px;margin-top:12px">
+                  <div style="background:#141c2e;border:1px solid #1e2d45;border-radius:8px;padding:14px">
+                    <div style="color:#4fc3f7;font-weight:700;margin-bottom:8px">📌 Encoder Flow</div>
+                    <div style="color:#94a3b8;font-size:12px;line-height:1.8">
+                      1. English tokens → Embedding (d_model=64)<br>
+                      2. + Positional Encoding (sin/cos)<br>
+                      3. Multi-Head Self-Attention (4 heads)<br>
+                      4. Add &amp; LayerNorm<br>
+                      5. Feed-Forward (64→128→64)<br>
+                      6. Add &amp; LayerNorm<br>
+                      7. Repeat × 2 layers<br>
+                      8. Output K, V for decoder
+                    </div>
+                  </div>
+                  <div style="background:#141c2e;border:1px solid #1e2d45;border-radius:8px;padding:14px">
+                    <div style="color:#ce93d8;font-weight:700;margin-bottom:8px">📌 Decoder Flow</div>
+                    <div style="color:#94a3b8;font-size:12px;line-height:1.8">
+                      1. Bengali tokens → Embedding<br>
+                      2. + Positional Encoding<br>
+                      3. Masked MHA (future tokens blocked)<br>
+                      4. Add &amp; LayerNorm<br>
+                      5. Cross-Attention: Q←decoder, K,V←encoder<br>
+                      6. Add &amp; LayerNorm<br>
+                      7. Feed-Forward<br>
+                      8. Linear → Softmax → Bengali token
+                    </div>
+                  </div>
+                </div>
+                """)
+
+
+            # ── TAB 1: Train ─────────────────────────
+            with gr.Tab("🏋️ Train Model"):
+                with gr.Row():
+                    with gr.Column(scale=1):
+                        gr.HTML('<div style="color:#64ffda;font-size:13px;font-weight:700;margin-bottom:8px">Quick Train</div>')
+                        epochs_in = gr.Textbox(value="50", label="Epochs", max_lines=1)
+                        train_btn = gr.Button("▶ Train on 30 parallel sentences", variant="primary")
+                        train_status = gr.HTML()
+                    with gr.Column(scale=2):
+                        loss_chart = gr.HTML()
+                train_btn.click(do_train, inputs=[epochs_in], outputs=[train_status, loss_chart])
+
+            # ── TAB 2: Training Step Viz ──────────────
+            with gr.Tab("🔬 Training Step"):
+                gr.HTML('<div style="color:#ef9a9a;font-size:12px;margin-bottom:12px">📚 Shows <strong>teacher-forcing</strong>: ground-truth Bengali tokens are fed to decoder, loss + gradients computed.</div>')
+                with gr.Row():
+                    en_in_t = gr.Textbox(label="English Sentence", placeholder="i love you", value="i love you")
+                    bn_in_t = gr.Textbox(label="Bengali (ground truth)", placeholder="আমি তোমাকে ভালোবাসি", value="আমি তোমাকে ভালোবাসি")
+                run_train_viz = gr.Button("🔬 Run Training Step & Show All Calculations", variant="primary")
+                result_html_t = gr.HTML()
+                with gr.Row():
+                    with gr.Column(scale=2):
+                        calc_html_t = gr.HTML()
+                    with gr.Column(scale=1):
+                        attn_html_t = gr.HTML()
+                run_train_viz.click(do_training_viz,
+                                    inputs=[en_in_t, bn_in_t],
+                                    outputs=[result_html_t, calc_html_t, attn_html_t])
+
+            # ── TAB 3: Inference Viz ──────────────────
+            with gr.Tab("⚡ Inference"):
+                gr.HTML('<div style="color:#80cbc4;font-size:12px;margin-bottom:12px">🤖 Shows <strong>auto-regressive decoding</strong>: model generates Bengali token by token, no ground truth needed.</div>')
+                with gr.Row():
+                    en_in_i = gr.Textbox(label="English Sentence", placeholder="i love you", value="i love you")
+                    decode_radio = gr.Radio(["greedy", "beam"], value="greedy", label="Decode Method")
+                run_infer = gr.Button("⚡ Translate & Show All Calculations", variant="primary")
+                result_html_i = gr.HTML()
+                with gr.Row():
+                    with gr.Column(scale=2):
+                        calc_html_i = gr.HTML()
+                    with gr.Column(scale=1):
+                        attn_html_i = gr.HTML()
+                run_infer.click(do_inference_viz,
+                                inputs=[en_in_i, decode_radio],
+                                outputs=[result_html_i, calc_html_i, attn_html_i])
+
+            # ── TAB 4: Examples ──────────────────────
+            with gr.Tab("📖 Examples"):
+                gr.HTML("""
+                <div style="background:#141c2e;border:1px solid #1e2d45;border-radius:8px;padding:16px">
+                  <div style="color:#64ffda;font-weight:700;margin-bottom:12px">Try these sentences:</div>
+                  <div style="display:grid;grid-template-columns:1fr 1fr;gap:8px">
+                """ + "".join(
+                    f'<div style="background:#0d1120;border:1px solid #1e2d45;border-radius:6px;padding:8px">'
+                    f'<div style="color:#4fc3f7;font-size:12px">{en}</div>'
+                    f'<div style="color:#ce93d8;font-size:13px;font-weight:600">{bn}</div>'
+                    f'</div>'
+                    for en, bn in PARALLEL_DATA[:12]
+                ) + "</div></div>")
+
+    return demo
+
+
+demo = build_ui()
+demo.launch(server_name="0.0.0.0")
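The pixel mapping inside `_loss_svg` can be checked in isolation. A minimal sketch follows, with the same constants and helper names as the function; the sample `losses` values are made up. The x axis spreads epochs across the plot width, and the y axis is inverted because SVG's origin sits at the top-left corner:

```python
# Map loss values to SVG pixel coordinates, as _loss_svg does above.
W, H = 580, 200
pl, pr, pt, pb = 52, 16, 16, 36
pw, ph = W - pl - pr, H - pt - pb          # 512 x 148 drawable plot area

losses = [2.0, 1.5, 1.0, 0.5]              # hypothetical loss history
mn, mx = min(losses), max(losses)
rng = mx - mn or 1
n = len(losses)

def px(i):
    # epoch index i -> x pixel, spread linearly across the plot width
    return pl + (i / max(n - 1, 1)) * pw

def py(v):
    # loss value v -> y pixel; subtracted from pt + ph because SVG y grows downward
    return pt + ph - ((v - mn) / rng) * ph

print(px(0), px(n - 1))   # first point at the left edge, last at the right
print(py(mx), py(mn))     # max loss at the top, min loss at the bottom
```

So the first epoch always lands on the left border of the plot area and the largest loss on its top border, which is why the curve fills the frame regardless of how many epochs were run.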
inference.py ADDED
@@ -0,0 +1,250 @@
+ """
2
+ inference.py
3
+ Inference (translation) for English→Bengali with full calculation logging.
4
+ Supports greedy decoding and beam search, showing every step.
5
+ """
6
+
7
+ import torch
8
+ import torch.nn.functional as F
9
+ import numpy as np
10
+ import math
11
+ from typing import Dict, List, Tuple, Optional
12
+
13
+ from transformer import Transformer, CalcLog
14
+ from vocab import get_vocabs, PAD_IDX, BOS_IDX, EOS_IDX
15
+
16
+
17
+ # ─────────────────────────────────────────────
18
+ # Greedy decoding with full logging
19
+ # ─────────────────────────────────────────────
20
+
21
+ def greedy_decode(
22
+ model: Transformer,
23
+ src: torch.Tensor,
24
+ max_len: int = 20,
25
+ device: str = "cpu",
26
+ log: Optional[CalcLog] = None,
27
+ ) -> Tuple[List[int], List[Dict]]:
28
+ model.eval()
29
+ src_v, tgt_v = get_vocabs()
30
+
31
+ with torch.no_grad():
32
+ src_mask = model.make_src_mask(src)
33
+
34
+ # ── Encode once ──────────────────────
35
+ src_emb = model.src_embed(src) * math.sqrt(model.d_model)
36
+ enc_x = model.src_pe(src_emb, log=log)
37
+
38
+ enc_attn_weights = []
39
+ for i, layer in enumerate(model.encoder_layers):
40
+ enc_x, ew = layer(enc_x, src_mask=src_mask,
41
+ log=log if i == 0 else None, layer_idx=i)
42
+ enc_attn_weights.append(ew.cpu().numpy())
43
+
44
+ if log:
45
+ log.log("INFERENCE_ENCODER_done", enc_x[0, :, :8],
46
+ note="Encoder finished. Output K,V will be reused for every decoder step.")
47
+
48
+ # ── Auto-regressive decode ────────────
49
+ generated = [BOS_IDX]
50
+ step_logs = []
51
+
+        for step in range(max_len):
+            tgt_so_far = torch.tensor([generated], dtype=torch.long, device=device)
+            tgt_mask = model.make_tgt_mask(tgt_so_far)
+
+            tgt_emb = model.tgt_embed(tgt_so_far) * math.sqrt(model.d_model)
+            dec_x = model.tgt_pe(tgt_emb)
+
+            step_dec_cross = []
+            for i, layer in enumerate(model.decoder_layers):
+                do_log = (log is not None) and (step < 3) and (i == 0)
+                if do_log:
+                    log.log(f"INFERENCE_step{step}_dec_input", dec_x[0, :, :8],
+                            note=f"Decoder input at step {step}: tokens so far = "
+                                 f"{tgt_v.tokens(generated)}")
+                dec_x, mw, cw = layer(
+                    dec_x, enc_x,
+                    tgt_mask=tgt_mask, src_mask=src_mask,
+                    log=log if do_log else None,
+                    layer_idx=i,
+                )
+                step_dec_cross.append(cw.cpu().numpy())
+
+            # Only look at last position
+            last_logits = model.output_linear(dec_x[:, -1, :])  # (1, V)
+            probs = F.softmax(last_logits, dim=-1)[0]
+
+            # Top-5 predictions
+            top5_probs, top5_ids = probs.topk(5)
+            top5 = [
+                {"token": tgt_v.idx2token.get(idx.item(), "?"),
+                 "id": idx.item(),
+                 "prob": round(prob.item(), 4)}
+                for prob, idx in zip(top5_probs, top5_ids)
+            ]
+
+            # Greedy: pick highest
+            next_token = top5_ids[0].item()
+
+            step_info = {
+                "step": step,
+                "tokens_so_far": tgt_v.tokens(generated),
+                "top5": top5,
+                "chosen_token": tgt_v.idx2token.get(next_token, "?"),
+                "chosen_id": next_token,
+                "chosen_prob": round(top5_probs[0].item(), 4),
+                "cross_attn": step_dec_cross[0][0].tolist()
+                              if step_dec_cross else None,
+            }
+            step_logs.append(step_info)
+
+            if log and step < 3:
+                log.log(f"INFERENCE_step{step}_top5", top5,
+                        formula="P(next_token) = softmax(W_out · dec_out[-1])",
+                        note=f"Step {step}: top-5 candidates. Chosen: {step_info['chosen_token']} ({step_info['chosen_prob']:.4f})")
+
+            generated.append(next_token)
+            if next_token == EOS_IDX:
+                break
+
+    return generated, step_logs
+
+
+# ─────────────────────────────────────────────
+# Beam search
+# ─────────────────────────────────────────────
+
+def beam_search(
+    model: Transformer,
+    src: torch.Tensor,
+    beam_size: int = 3,
+    max_len: int = 20,
+    device: str = "cpu",
+    log: Optional[CalcLog] = None,
+) -> Tuple[List[int], List[Dict]]:
+    model.eval()
+    src_v, tgt_v = get_vocabs()
+
+    with torch.no_grad():
+        src_mask = model.make_src_mask(src)
+
+        # Encode
+        src_emb = model.src_embed(src) * math.sqrt(model.d_model)
+        enc_x = model.src_pe(src_emb)
+        for i, layer in enumerate(model.encoder_layers):
+            enc_x, _ = layer(enc_x, src_mask=src_mask)
+
+        # Beams: list of (score, token_ids)
+        beams = [(0.0, [BOS_IDX])]
+        completed = []
+        beam_step_logs = []
+
+        for step in range(max_len):
+            if not beams:
+                break
+            candidates = []
+
+            for beam_idx, (score, tokens) in enumerate(beams):
+                tgt_t = torch.tensor([tokens], dtype=torch.long, device=device)
+                tgt_mask = model.make_tgt_mask(tgt_t)
+                tgt_emb = model.tgt_embed(tgt_t) * math.sqrt(model.d_model)
+                dec_x = model.tgt_pe(tgt_emb)
+                for i, layer in enumerate(model.decoder_layers):
+                    dec_x, _, _ = layer(dec_x, enc_x,
+                                        tgt_mask=tgt_mask, src_mask=src_mask)
+                last_logits = model.output_linear(dec_x[:, -1, :])
+                log_probs = F.log_softmax(last_logits, dim=-1)[0]
+                top_lp, top_id = log_probs.topk(beam_size)
+                for lp, tid in zip(top_lp, top_id):
+                    new_score = score + lp.item()
+                    new_tokens = tokens + [tid.item()]
+                    candidates.append((new_score, new_tokens))
+
+            # Keep top beam_size
+            candidates.sort(key=lambda x: x[0], reverse=True)
+            beams = []
+            step_info = {"step": step, "beams": []}
+            for sc, toks in candidates[:beam_size * 2]:
+                if toks[-1] == EOS_IDX:
+                    completed.append((sc / len(toks), toks))
+                else:
+                    beams.append((sc, toks))
+                step_info["beams"].append({
+                    "score": round(sc, 4),
+                    "tokens": tgt_v.tokens(toks),
+                    "text": tgt_v.decode(toks),
+                })
+                if len(beams) == beam_size:
+                    break
+            beam_step_logs.append(step_info)
+
+            if len(completed) >= beam_size:
+                break
+
+    if completed:
+        best = max(completed, key=lambda x: x[0])
+        return best[1], beam_step_logs
+    elif beams:
+        return beams[0][1] + [EOS_IDX], beam_step_logs
+    else:
+        return [BOS_IDX, EOS_IDX], beam_step_logs
+
+
+# ─────────────────────────────────────────────
+# Full inference pipeline with visualization
+# ─────────────────────────────────────────────
+
+def visualize_inference(
+    model: Transformer,
+    en_sentence: str,
+    device: str = "cpu",
+    decode_method: str = "greedy",
+) -> Dict:
+    src_v, tgt_v = get_vocabs()
+    log = CalcLog()
+
+    src_ids = src_v.encode(en_sentence)
+    log.log("INFERENCE_TOKENIZATION", {
+        "sentence": en_sentence,
+        "tokens": en_sentence.lower().split(),
+        "ids": src_ids,
+    }, formula="word → vocab_id lookup",
+       note="No ground-truth Bengali needed — model generates from scratch")
+
+    src = torch.tensor([src_ids], dtype=torch.long, device=device)
+
+    if decode_method == "beam":
+        output_ids, step_logs = beam_search(model, src, beam_size=3,
+                                            device=device, log=log)
+        log.log("BEAM_SEARCH_complete", {
+            "method": "beam search (beam=3)",
+            "note": "Explores multiple hypotheses simultaneously — generally better quality"
+        })
+    else:
+        output_ids, step_logs = greedy_decode(model, src, device=device, log=log)
+        log.log("GREEDY_complete", {
+            "method": "greedy decoding",
+            "note": "Always picks highest probability token — fast but can miss optimal sequences"
+        })
+
+    translation = tgt_v.decode(output_ids)
+    output_tokens = tgt_v.tokens(output_ids)
+
+    log.log("FINAL_TRANSLATION", {
+        "input": en_sentence,
+        "output_ids": output_ids,
+        "output_tokens": output_tokens,
+        "translation": translation,
+    }, note="Complete English→Bengali translation")
+
+    return {
+        "en_sentence": en_sentence,
+        "translation": translation,
+        "output_tokens": output_tokens,
+        "output_ids": output_ids,
+        "src_tokens": src_v.tokens(src_ids),
+        "step_logs": step_logs,
+        "calc_log": log.to_dict(),
+        "decode_method": decode_method,
+    }
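The difference between the two decoders in this file comes down to scoring: greedy takes the argmax of one softmax at a time, while beam search ranks whole hypotheses by summed log-probabilities. A minimal dependency-free sketch (the logits and continuation probabilities are made-up toy values, not model output):

```python
import math

# Toy logits over a 4-token vocabulary, standing in for
# model.output_linear(dec_x[:, -1, :]) in the code above.
logits = [2.0, 1.0, 0.5, -1.0]

# softmax: P(i) = exp(z_i) / sum_j exp(z_j)
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy decoding commits to the single highest-probability token...
greedy_choice = max(range(len(probs)), key=probs.__getitem__)

# ...while beam search compares cumulative log-probs of whole sequences,
# so a locally weaker first token can win on a stronger continuation.
hyp_a = math.log(probs[0]) + math.log(0.10)   # best first token, weak continuation
hyp_b = math.log(probs[1]) + math.log(0.60)   # weaker first token, strong continuation

print(greedy_choice)   # index 0, the argmax token
print(hyp_b > hyp_a)   # True: beam search would prefer hypothesis B
```

This is also why `beam_search` length-normalizes completed hypotheses (`sc / len(toks)`): raw summed log-probs systematically favor shorter sequences.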
requirements.txt ADDED
@@ -0,0 +1,2 @@
+torch>=2.0.0
+numpy>=1.24.0
training.py ADDED
@@ -0,0 +1,261 @@
+ """
2
+ training.py
3
+ Training loop for English→Bengali transformer with full calculation capture.
4
+ """
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.optim as optim
9
+ import numpy as np
10
+ import math
11
+ from typing import Dict, List, Tuple, Optional
12
+
13
+ from transformer import Transformer, CalcLog
14
+ from vocab import get_vocabs, PARALLEL_DATA, PAD_IDX, BOS_IDX, EOS_IDX
15
+
16
+
17
+ # ─────────────────────────────────────────────
18
+ # Data helpers
19
+ # ─────────────────────────────────────────────
20
+
21
+ def collate_batch(pairs: List[Tuple[str, str]], src_v, tgt_v, device: str = "cpu"):
22
+ src_seqs, tgt_seqs = [], []
23
+ for en, bn in pairs:
24
+ src_seqs.append(src_v.encode(en))
25
+ tgt_seqs.append(tgt_v.encode(bn))
26
+
27
+ def pad(seqs):
28
+ max_len = max(len(s) for s in seqs)
29
+ padded = [s + [PAD_IDX] * (max_len - len(s)) for s in seqs]
30
+ return torch.tensor(padded, dtype=torch.long, device=device)
31
+
32
+ return pad(src_seqs), pad(tgt_seqs)
33
+
34
+
35
+ # ─────────────────────────────────────────────
36
+ # Label-smoothed cross-entropy
37
+ # ─────────────────────────────────────────────
38
+
39
+ class LabelSmoothingLoss(nn.Module):
40
+ def __init__(self, vocab_size: int, pad_idx: int, smoothing: float = 0.1):
41
+ super().__init__()
42
+ self.vocab_size = vocab_size
43
+ self.pad_idx = pad_idx
44
+ self.smoothing = smoothing
45
+ self.confidence = 1.0 - smoothing
46
+
47
+ def forward(self, logits: torch.Tensor, target: torch.Tensor,
48
+ log: Optional[CalcLog] = None) -> torch.Tensor:
49
+ B, T, V = logits.shape
50
+ logits_flat = logits.reshape(-1, V)
51
+ target_flat = target.reshape(-1)
52
+
53
+ log_probs = torch.log_softmax(logits_flat, dim=-1)
54
+
55
+ with torch.no_grad():
56
+ smooth_dist = torch.full_like(log_probs, self.smoothing / (V - 2))
57
+ smooth_dist.scatter_(1, target_flat.unsqueeze(1), self.confidence)
58
+ smooth_dist[:, self.pad_idx] = 0
59
+ mask = (target_flat == self.pad_idx)
60
+ smooth_dist[mask] = 0
61
+
62
+ loss = -(smooth_dist * log_probs).sum(dim=-1)
63
+ non_pad = (~mask).sum()
64
+ loss = loss.sum() / non_pad.clamp(min=1)
65
+
66
+ if log:
67
+ probs_sample = torch.exp(log_probs[:4])
68
+ log.log("LOSS_log_probs_sample", probs_sample,
69
+ formula="log P(token) = log_softmax(logits)",
70
+ note="Softmax probabilities for first 4 target positions")
71
+ log.log("LOSS_smooth_dist_sample", smooth_dist[:4],
72
+ formula=f"smooth: correct={self.confidence:.2f}, others={self.smoothing/(V-2):.5f}",
73
+ note="Label-smoothed target distribution")
74
+ log.log("LOSS_value", loss.item(),
75
+ formula="L = -Σ smooth_dist · log_probs / n_tokens",
76
+ note=f"Label-smoothed cross-entropy loss = {loss.item():.4f}")
77
+
78
+ return loss
79
+
80
+
+# ─────────────────────────────────────────────
+# Build model
+# ─────────────────────────────────────────────
+
+def build_model(src_vocab_size: int, tgt_vocab_size: int,
+                device: str = "cpu") -> Transformer:
+    model = Transformer(
+        src_vocab_size=src_vocab_size,
+        tgt_vocab_size=tgt_vocab_size,
+        d_model=64,
+        num_heads=4,
+        num_layers=2,
+        d_ff=128,
+        max_len=32,
+        dropout=0.1,
+        pad_idx=PAD_IDX,
+    ).to(device)
+    return model
+
+
+# ─────────────────────────────────────────────
+# Single training step (with full logging)
+# ─────────────────────────────────────────────
+
+def training_step(
+    model: Transformer,
+    src: torch.Tensor,
+    tgt: torch.Tensor,
+    criterion: LabelSmoothingLoss,
+    optimizer: optim.Optimizer,
+    log: CalcLog,
+    step_num: int = 0,
+) -> Dict:
+    model.train()
+    log.clear()
+
+    # Teacher forcing: decoder input = [BOS, token_1, ..., token_{T-1}]
+    tgt_input = tgt[:, :-1]
+    tgt_target = tgt[:, 1:]
+
+    log.log("TRAINING_SETUP", {
+        "mode": "TRAINING",
+        "step": step_num,
+        "src_shape": list(src.shape),
+        "tgt_input_shape": list(tgt_input.shape),
+        "tgt_target_shape": list(tgt_target.shape),
+    }, formula="Teacher Forcing: feed ground-truth Bengali tokens as decoder input",
+       note="During training, decoder sees actual Bengali tokens (not its own predictions)")
+
+    log.log("SRC_sentence_ids", src[0].tolist(),
+            note="Source (English) token IDs fed to encoder")
+    log.log("TGT_input_ids", tgt_input[0].tolist(),
+            note="Target input to decoder (shifted right — starts with <BOS>)")
+    log.log("TGT_target_ids", tgt_target[0].tolist(),
+            note="What decoder must predict (shifted left — ends with <EOS>)")
+
+    # Forward
+    logits, meta = model(src, tgt_input, log=log)
+
+    # Loss
+    loss = criterion(logits, tgt_target, log=log)
+
+    log.log("LOSS_final", loss.item(),
+            formula="Total loss = label-smoothed cross-entropy averaged over tokens",
+            note=f"Loss = {loss.item():.4f} (lower = better prediction)")
+
+    # Backward
+    optimizer.zero_grad()
+    loss.backward()
+
+    # Gradient stats
+    grad_norms = {}
+    for name, param in model.named_parameters():
+        if param.grad is not None:
+            gn = param.grad.norm().item()
+            grad_norms[name] = round(gn, 6)
+
+    log.log("GRADIENTS_norm_sample", dict(list(grad_norms.items())[:8]),
+            formula="∂L/∂W via backpropagation (chain rule)",
+            note="Gradient norms for first 8 parameter tensors")
+
+    # Gradient clipping
+    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+    optimizer.step()
+
+    log.log("OPTIMIZER_step", {
+        "algorithm": "Adam",
+        "lr": optimizer.param_groups[0]["lr"],
+        "note": "W = W - lr × (m̂ / (√v̂ + ε)) (Adam update rule)",
+    }, formula="Adam: adaptive learning rate with momentum",
+       note="Weights updated — model slightly improved")
+
+    return {
+        "loss": loss.item(),
+        "calc_log": log.to_dict(),
+        "meta": {k: v.tolist() if hasattr(v, "tolist") else v
+                 for k, v in meta.items() if k != "enc_attn"},
+    }
+
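The input/target shift at the top of `training_step` is easy to see on a concrete sequence. A minimal sketch with made-up token ids (only `BOS`/`EOS` mirror the real special tokens):

```python
# Teacher forcing splits one target sequence into decoder input and
# prediction target: the input drops the last token, the target drops
# the first, so position t of the input lines up with token t+1.
BOS, EOS = 1, 2
tgt = [BOS, 11, 12, 13, EOS]   # hypothetical Bengali token ids

tgt_input = tgt[:-1]    # [BOS, 11, 12, 13]: what the decoder sees
tgt_target = tgt[1:]    # [11, 12, 13, EOS]: what it must predict

print(tgt_input)
print(tgt_target)
```

Both slices have the same length, so the logits at every decoder position have exactly one ground-truth token to be scored against.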
+
+# ─────────────────────────────────────────────
+# Full training run (quick demo)
+# ─────────────────────────────────────────────
+
+def run_training(
+    epochs: int = 30,
+    device: str = "cpu",
+    progress_cb=None,
+) -> Tuple[Transformer, List[float]]:
+    src_v, tgt_v = get_vocabs()
+    model = build_model(len(src_v), len(tgt_v), device)
+    criterion = LabelSmoothingLoss(len(tgt_v), PAD_IDX, smoothing=0.1)
+    optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-9)
+    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
+
+    losses = []
+    src_batch, tgt_batch = collate_batch(PARALLEL_DATA, src_v, tgt_v, device)
+
+    for epoch in range(1, epochs + 1):
+        model.train()
+        tgt_input = tgt_batch[:, :-1]
+        tgt_target = tgt_batch[:, 1:]
+        logits, _ = model(src_batch, tgt_input, log=None)
+        loss = criterion(logits, tgt_target)
+        optimizer.zero_grad()
+        loss.backward()
+        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+        optimizer.step()
+        scheduler.step(loss.item())
+        losses.append(loss.item())
+        if progress_cb:
+            progress_cb(epoch, epochs, loss.item())
+
+    return model, losses
+
+
+# ─────────────────────────────────────────────
+# Single-sample step for visualization
+# ─────────────────────────────────────────────
+
+def visualize_training_step(
+    model: Transformer,
+    en_sentence: str,
+    bn_sentence: str,
+    device: str = "cpu",
+) -> Dict:
+    src_v, tgt_v = get_vocabs()
+    log = CalcLog()
+
+    src_ids = src_v.encode(en_sentence)
+    tgt_ids = tgt_v.encode(bn_sentence)
+
+    log.log("TOKENIZATION_EN", {
+        "sentence": en_sentence,
+        "tokens": en_sentence.lower().split(),
+        "ids": src_ids,
+        "vocab_size": len(src_v),
+    }, formula="token_id = vocab[word]",
+       note="English → token IDs (BOS prepended, EOS appended)")
+
+    log.log("TOKENIZATION_BN", {
+        "sentence": bn_sentence,
+        "tokens": bn_sentence.split(),
+        "ids": tgt_ids,
+        "vocab_size": len(tgt_v),
+    }, note="Bengali → token IDs (teacher-forced during training)")
+
+    src = torch.tensor([src_ids], dtype=torch.long, device=device)
+    tgt = torch.tensor([tgt_ids], dtype=torch.long, device=device)
+
+    criterion = LabelSmoothingLoss(len(tgt_v), PAD_IDX)
+    optimizer = optim.Adam(model.parameters(), lr=1e-3)
+
+    result = training_step(model, src, tgt, criterion, optimizer, log)
+
+    result["src_tokens"] = src_v.tokens(src_ids)
+    result["tgt_tokens"] = tgt_v.tokens(tgt_ids)
+    result["en_sentence"] = en_sentence
+    result["bn_sentence"] = bn_sentence
+    return result
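The smoothed target row that `LabelSmoothingLoss` builds with `full_like` + `scatter_` can be reproduced without torch. A minimal pure-Python sketch; the vocabulary size and token indices are toy values: the true token gets `confidence`, every other non-pad token shares the `smoothing` mass, and the pad column is zeroed:

```python
# Build the label-smoothed target distribution for one non-pad position,
# mirroring LabelSmoothingLoss.forward above.
V = 6                # toy vocabulary size
pad_idx = 0
target = 3           # correct token id at this position
smoothing = 0.1
confidence = 1.0 - smoothing

# V - 2 excludes the pad column and the true-token column from the
# shared smoothing mass, exactly as in self.smoothing / (V - 2).
row = [smoothing / (V - 2)] * V
row[target] = confidence     # scatter_ step: confidence on the true token
row[pad_idx] = 0.0           # never assign probability to <PAD>

print(row)
print(sum(row))   # ≈ 1.0: still (numerically) a probability distribution
```

Compared with a one-hot target, this keeps the loss from pushing the true-token probability to exactly 1, which on a 30-sentence dataset noticeably reduces overconfident memorization.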
transformer.py ADDED
@@ -0,0 +1,516 @@
+ """
2
+ transformer.py
3
+ Full Transformer implementation for English → Bengali translation
4
+ with complete calculation tracking at every step.
5
+ """
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ import numpy as np
11
+ import math
12
+ from typing import Optional, Tuple, Dict, List
13
+
14
+
15
+ # ─────────────────────────────────────────────
16
+ # Calculation Logger
17
+ # ─────────────────────────────────────────────
18
+
19
+ class CalcLog:
20
+ """Captures every intermediate tensor for visualization."""
21
+ def __init__(self):
22
+ self.steps: List[Dict] = []
23
+
24
+ def log(self, name: str, data, formula: str = "", note: str = ""):
25
+ entry = {
26
+ "name": name,
27
+ "formula": formula,
28
+ "note": note,
29
+ "shape": None,
30
+ "value": None,
31
+ }
32
+ if isinstance(data, torch.Tensor):
33
+ entry["shape"] = list(data.shape)
34
+ entry["value"] = data.detach().cpu().numpy().tolist()
35
+ elif isinstance(data, np.ndarray):
36
+ entry["shape"] = list(data.shape)
37
+ entry["value"] = data.tolist()
38
+ else:
39
+ entry["value"] = data
40
+ self.steps.append(entry)
41
+ return data
42
+
43
+ def clear(self):
44
+ self.steps = []
45
+
46
+ def to_dict(self):
47
+ return self.steps
48
+
49
+
50
+# ─────────────────────────────────────────────
+# Positional Encoding
+# ─────────────────────────────────────────────
+
+class PositionalEncoding(nn.Module):
+    def __init__(self, d_model: int, max_len: int = 512, dropout: float = 0.1):
+        super().__init__()
+        self.d_model = d_model
+        self.dropout = nn.Dropout(dropout)
+
+        pe = torch.zeros(max_len, d_model)
+        position = torch.arange(0, max_len).unsqueeze(1).float()
+        div_term = torch.exp(
+            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
+        )
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)
+
+    def forward(self, x: torch.Tensor, log: Optional[CalcLog] = None) -> torch.Tensor:
+        seq_len = x.size(1)
+        pe_slice = self.pe[:, :seq_len, :]
+
+        if log:
+            log.log("PE_matrix", pe_slice[0, :seq_len, :8],
+                    formula="PE(pos,2i)=sin(pos/10000^(2i/d)), PE(pos,2i+1)=cos(...)",
+                    note=f"Showing first 8 dims for {seq_len} positions")
+            log.log("Embedding_before_PE", x[0, :, :8],
+                    note="Token embeddings (first 8 dims)")
+
+        x = x + pe_slice
+        if log:
+            log.log("Embedding_after_PE", x[0, :, :8],
+                    formula="X = Embedding + PE",
+                    note="After adding positional encoding")
+        return self.dropout(x)
+
+
+# ─────────────────────────────────────────────
+# Scaled Dot-Product Attention
+# ─────────────────────────────────────────────
+
+def scaled_dot_product_attention(
+    Q: torch.Tensor,
+    K: torch.Tensor,
+    V: torch.Tensor,
+    mask: Optional[torch.Tensor] = None,
+    log: Optional[CalcLog] = None,
+    head_idx: int = 0,
+    layer_idx: int = 0,
+    attn_type: str = "self",
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    d_k = Q.size(-1)
+    prefix = f"L{layer_idx}_H{head_idx}_{attn_type}"
+
+    # Raw scores
+    scores = torch.matmul(Q, K.transpose(-2, -1))
+    if log:
+        log.log(f"{prefix}_Q", Q[0],
+                formula="Q = X · Wq",
+                note=f"Query matrix head {head_idx}")
+        log.log(f"{prefix}_K", K[0],
+                formula="K = X · Wk",
+                note=f"Key matrix head {head_idx}")
+        log.log(f"{prefix}_V", V[0],
+                formula="V = X · Wv",
+                note=f"Value matrix head {head_idx}")
+        log.log(f"{prefix}_QKt", scores[0],
+                formula="scores = Q · Kᵀ",
+                note="Raw attention scores (before scaling)")
+
+    # Scale
+    scale = math.sqrt(d_k)
+    scores = scores / scale
+    if log:
+        log.log(f"{prefix}_QKt_scaled", scores[0],
+                formula=f"scores = Q·Kᵀ / √{d_k} = Q·Kᵀ / {scale:.3f}",
+                note="Scaled scores — prevents vanishing gradients")
+
+    # Mask
+    # masks arrive as (B,1,1,T) or (B,1,T,T) from make_src/tgt_mask;
+    # scores here are 3-D (B,T_q,T_k) because we loop per-head,
+    # so squeeze the head dim to avoid (B,B,...) broadcasting.
+    if mask is not None:
+        if mask.dim() == 4:
+            mask = mask.squeeze(1)  # (B,1,T,T) or (B,1,1,T) → (B,T,T) or (B,1,T)
+        scores = scores.masked_fill(mask == 0, float("-inf"))
+        if log:
+            log.log(f"{prefix}_mask", mask[0].float(),
+                    formula="mask[i,j]=0 → score=-inf (future token blocked)",
+                    note="Causal mask (training decoder) or padding mask")
+            log.log(f"{prefix}_scores_masked", scores[0],
+                    note="Scores after masking (-inf will become 0 after softmax)")
+
+    # Softmax
+    attn_weights = F.softmax(scores, dim=-1)
+    # replace nan from -inf rows with 0 (edge case)
+    attn_weights = torch.nan_to_num(attn_weights, nan=0.0)
+    if log:
+        log.log(f"{prefix}_softmax", attn_weights[0],
+                formula="α = softmax(scores, dim=-1)",
+                note="Attention weights — each row sums to 1.0")
+
+    # Weighted sum
+    output = torch.matmul(attn_weights, V)
+    if log:
+        log.log(f"{prefix}_output", output[0],
+                formula="Attention = α · V",
+                note="Weighted sum of values")
+
+    return output, attn_weights
+
+
+# ─────────────────────────────────────────────
+# Multi-Head Attention
+# ─────────────────────────────────────────────
+
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_model: int, num_heads: int):
+        super().__init__()
+        assert d_model % num_heads == 0
+        self.d_model = d_model
+        self.num_heads = num_heads
+        self.d_k = d_model // num_heads
+
+        self.W_q = nn.Linear(d_model, d_model, bias=False)
+        self.W_k = nn.Linear(d_model, d_model, bias=False)
+        self.W_v = nn.Linear(d_model, d_model, bias=False)
+        self.W_o = nn.Linear(d_model, d_model, bias=False)
+
+    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
+        B, T, D = x.shape
+        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
+        # → (B, num_heads, T, d_k)
+
+    def forward(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        mask: Optional[torch.Tensor] = None,
+        log: Optional[CalcLog] = None,
+        layer_idx: int = 0,
+        attn_type: str = "self",
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        B = query.size(0)
+        prefix = f"L{layer_idx}_{attn_type}_MHA"
+
+        # Linear projections
+        Q = self.W_q(query)
+        K = self.W_k(key)
+        V = self.W_v(value)
+
+        if log:
+            log.log(f"{prefix}_Wq", self.W_q.weight[:4, :4],
+                    formula="Wq shape: (d_model, d_model)",
+                    note="Query weight matrix (first 4×4 shown)")
+            log.log(f"{prefix}_Q_full", Q[0, :, :8],
+                    formula="Q = input · Wq",
+                    note="Full Q projection (first 8 dims shown)")
+
+        # Split into heads
+        Q = self.split_heads(Q)  # (B, h, T, d_k)
+        K = self.split_heads(K)
+        V = self.split_heads(V)
+
+        if log:
+            log.log(f"{prefix}_Q_head0", Q[0, 0, :, :],
+                    formula=f"Split: (B,T,D) → (B,{self.num_heads},T,{self.d_k})",
+                    note=f"Head 0 queries — d_k={self.d_k}")
+
+        # Per-head attention (log only first 2 heads to avoid bloat)
+        all_attn = []
+        all_weights = []
+        for h in range(self.num_heads):
+            h_log = log if h < 2 else None
+            out_h, w_h = scaled_dot_product_attention(
+                Q[:, h], K[:, h], V[:, h],
+                mask=mask,
+                log=h_log,
+                head_idx=h,
+                layer_idx=layer_idx,
+                attn_type=attn_type,
+            )
+            all_attn.append(out_h)
+            all_weights.append(w_h)
+
+        # Concat heads
+        concat = torch.stack(all_attn, dim=1)         # (B, h, T, d_k)
+        concat = concat.transpose(1, 2).contiguous()  # (B, T, h, d_k)
+        concat = concat.view(B, -1, self.d_model)     # (B, T, D)
+
+        if log:
+            log.log(f"{prefix}_concat", concat[0, :, :8],
+                    formula="concat = [head_1; head_2; ...; head_h]",
+                    note="Concatenated heads (first 8 dims)")
+
+        # Final projection
+        output = self.W_o(concat)
+        if log:
+            log.log(f"{prefix}_output", output[0, :, :8],
+                    formula="MHA_out = concat · Wo",
+                    note="Final multi-head attention output")
+
+        # Stack all attention weights: (B, h, T_q, T_k)
+        attn_weights = torch.stack(all_weights, dim=1)
+        return output, attn_weights
+
+
+# ─────────────────────────────────────────────
+# Feed-Forward Network
+# ─────────────────────────────────────────────
+
+class FeedForward(nn.Module):
+    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
+        super().__init__()
+        self.linear1 = nn.Linear(d_model, d_ff)
+        self.linear2 = nn.Linear(d_ff, d_model)
+        self.dropout = nn.Dropout(dropout)
+
+    def forward(self, x: torch.Tensor, log: Optional[CalcLog] = None,
+                layer_idx: int = 0, loc: str = "enc") -> torch.Tensor:
+        prefix = f"L{layer_idx}_{loc}_FFN"
+        h = self.linear1(x)
+        if log:
+            log.log(f"{prefix}_linear1", h[0, :, :8],
+                    formula="h = X · W1 + b1",
+                    note="First linear (d_model→d_ff), showing first 8 dims")
+        h = F.relu(h)
+        if log:
+            log.log(f"{prefix}_relu", h[0, :, :8],
+                    formula="h = ReLU(h) = max(0, h)",
+                    note="Negative values zeroed out")
+        h = self.dropout(h)
+        out = self.linear2(h)
+        if log:
+            log.log(f"{prefix}_linear2", out[0, :, :8],
+                    formula="out = h · W2 + b2",
+                    note="Second linear (d_ff→d_model)")
+        return out
+
+
+# ─────────────────────────────────────────────
+# Layer Norm + Residual
+# ─────────────────────────────────────────────
+
+class AddNorm(nn.Module):
+    def __init__(self, d_model: int, eps: float = 1e-6):
+        super().__init__()
+        self.norm = nn.LayerNorm(d_model, eps=eps)
+
+    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor,
+                log: Optional[CalcLog] = None, tag: str = "") -> torch.Tensor:
+        residual = x + sublayer_out
+        out = self.norm(residual)
+        if log:
+            log.log(f"{tag}_residual", residual[0, :, :8],
+                    formula="residual = x + sublayer(x)",
+                    note="Residual (skip) connection")
+            log.log(f"{tag}_layernorm", out[0, :, :8],
+                    formula="LayerNorm(x) = γ·(x−μ)/σ + β",
+                    note="Layer normalization output")
+        return out
+
+
+# ─────────────────────────────────────────────
+# Encoder Layer
+# ─────────────────────────────────────────────
+
+class EncoderLayer(nn.Module):
+    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
+        super().__init__()
+        self.self_attn = MultiHeadAttention(d_model, num_heads)
+        self.ffn = FeedForward(d_model, d_ff, dropout)
+        self.add_norm1 = AddNorm(d_model)
+        self.add_norm2 = AddNorm(d_model)
+
+    def forward(self, x: torch.Tensor, src_mask: Optional[torch.Tensor] = None,
+                log: Optional[CalcLog] = None, layer_idx: int = 0):
+        attn_out, attn_w = self.self_attn(
+            x, x, x, mask=src_mask, log=log,
+            layer_idx=layer_idx, attn_type="enc_self"
+        )
+        x = self.add_norm1(x, attn_out, log=log, tag=f"L{layer_idx}_enc_self")
+        ffn_out = self.ffn(x, log=log, layer_idx=layer_idx, loc="enc")
+        x = self.add_norm2(x, ffn_out, log=log, tag=f"L{layer_idx}_enc_ffn")
+        return x, attn_w
+
+
+# ─────────────────────────────────────────────
+# Decoder Layer
+# ─────────────────────────────────────────────
+
+class DecoderLayer(nn.Module):
+    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
+        super().__init__()
+        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
+        self.cross_attn = MultiHeadAttention(d_model, num_heads)
+        self.ffn = FeedForward(d_model, d_ff, dropout)
+        self.add_norm1 = AddNorm(d_model)
+        self.add_norm2 = AddNorm(d_model)
+        self.add_norm3 = AddNorm(d_model)
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        enc_out: torch.Tensor,
+        tgt_mask: Optional[torch.Tensor] = None,
+        src_mask: Optional[torch.Tensor] = None,
+        log: Optional[CalcLog] = None,
+        layer_idx: int = 0,
+    ):
+        # 1. Masked self-attention
+        m_attn_out, m_attn_w = self.masked_self_attn(
+            x, x, x, mask=tgt_mask, log=log,
+            layer_idx=layer_idx, attn_type="dec_masked"
+        )
+        x = self.add_norm1(x, m_attn_out, log=log, tag=f"L{layer_idx}_dec_masked")
+
+        # 2. Cross-attention: Q from decoder, K/V from encoder
+        if log:
+            log.log(f"L{layer_idx}_cross_Q_source", x[0, :, :8],
+                    note="Cross-attn Q comes from DECODER (Bengali context)")
+            log.log(f"L{layer_idx}_cross_KV_source", enc_out[0, :, :8],
+                    note="Cross-attn K,V come from ENCODER (English context)")
+
+        c_attn_out, c_attn_w = self.cross_attn(
+            query=x, key=enc_out, value=enc_out,
+            mask=src_mask, log=log,
+            layer_idx=layer_idx, attn_type="dec_cross"
+        )
+        x = self.add_norm2(x, c_attn_out, log=log, tag=f"L{layer_idx}_dec_cross")
+
+        # 3. FFN
+        ffn_out = self.ffn(x, log=log, layer_idx=layer_idx, loc="dec")
+        x = self.add_norm3(x, ffn_out, log=log, tag=f"L{layer_idx}_dec_ffn")
+
+        return x, m_attn_w, c_attn_w
+
+
+# ─────────────────────────────────────────────
+# Full Transformer
+# ─────────────────────────────────────────────
+
+class Transformer(nn.Module):
+    def __init__(
+        self,
+        src_vocab_size: int,
+        tgt_vocab_size: int,
+        d_model: int = 128,
+        num_heads: int = 4,
+        num_layers: int = 2,
+        d_ff: int = 256,
+        max_len: int = 64,
+        dropout: float = 0.1,
+        pad_idx: int = 0,
+    ):
+        super().__init__()
+        self.d_model = d_model
+        self.pad_idx = pad_idx
+        self.num_layers = num_layers
+
+        self.src_embed = nn.Embedding(src_vocab_size, d_model, padding_idx=pad_idx)
+        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model, padding_idx=pad_idx)
+        self.src_pe = PositionalEncoding(d_model, max_len, dropout)
+        self.tgt_pe = PositionalEncoding(d_model, max_len, dropout)
+
+        self.encoder_layers = nn.ModuleList(
+            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
+        )
+        self.decoder_layers = nn.ModuleList(
+            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
+        )
+
+        self.output_linear = nn.Linear(d_model, tgt_vocab_size)
+        self._init_weights()
+
+    def _init_weights(self):
+        for p in self.parameters():
+            if p.dim() > 1:
+                nn.init.xavier_uniform_(p)
+
+    # ── mask helpers ──────────────────────────
+
+    def make_src_mask(self, src: torch.Tensor) -> torch.Tensor:
+        # (B, 1, 1, T_src) — 1 where not pad
+        return (src != self.pad_idx).unsqueeze(1).unsqueeze(2)
+
+    def make_tgt_mask(self, tgt: torch.Tensor) -> torch.Tensor:
+        T = tgt.size(1)
+        pad_mask = (tgt != self.pad_idx).unsqueeze(1).unsqueeze(2)       # (B,1,1,T)
+        causal = torch.tril(torch.ones(T, T, device=tgt.device)).bool()  # (T,T)
+        return pad_mask & causal  # (B,1,T,T)
+
+    # ── forward ───────────────────────────────
+
+    def forward(
+        self,
+        src: torch.Tensor,
+        tgt: torch.Tensor,
+        log: Optional[CalcLog] = None,
+    ) -> Tuple[torch.Tensor, Dict]:
+        src_mask = self.make_src_mask(src)
+        tgt_mask = self.make_tgt_mask(tgt)
+
+        # ── Encoder ──────────────────────────
+        src_emb = self.src_embed(src) * math.sqrt(self.d_model)
+        if log:
+            log.log("SRC_tokens", src[0],
+                    note="Source token IDs (English)")
+            log.log("SRC_embedding_raw", src_emb[0, :, :8],
+                    formula=f"emb = Embedding(token_id) × √{self.d_model}",
+                    note="Token embeddings (first 8 dims)")
+
+        enc_x = self.src_pe(src_emb, log=log)
+
+        enc_attn_weights = []
+        for i, layer in enumerate(self.encoder_layers):
+            enc_x, ew = layer(enc_x, src_mask=src_mask, log=log, layer_idx=i)
+            enc_attn_weights.append(ew.detach().cpu().numpy())
+
+        if log:
+            log.log("ENCODER_output", enc_x[0, :, :8],
+                    note="Final encoder output — passed as K,V to every decoder cross-attention")
+
+        # ── Decoder ──────────────────────────
+        tgt_emb = self.tgt_embed(tgt) * math.sqrt(self.d_model)
+        if log:
+            log.log("TGT_tokens", tgt[0],
+                    note="Target token IDs (Bengali, teacher-forced in training)")
+            log.log("TGT_embedding_raw", tgt_emb[0, :, :8],
+                    formula=f"emb = Embedding(token_id) × √{self.d_model}",
+                    note="Bengali token embeddings")
+
+        dec_x = self.tgt_pe(tgt_emb, log=log)
+
+        dec_self_attn_w = []
+        dec_cross_attn_w = []
+        for i, layer in enumerate(self.decoder_layers):
+            dec_x, mw, cw = layer(
+                dec_x, enc_x,
+                tgt_mask=tgt_mask, src_mask=src_mask,
+                log=log, layer_idx=i,
+            )
+            dec_self_attn_w.append(mw.detach().cpu().numpy())
+            dec_cross_attn_w.append(cw.detach().cpu().numpy())
+
+        # ── Output projection ─────────────────
+        logits = self.output_linear(dec_x)  # (B, T, vocab)
+        if log:
+            log.log("LOGITS", logits[0, :, :16],
+                    formula="logits = dec_out · W_out (first 16 vocab entries shown)",
+                    note=f"Raw scores over vocab of {logits.size(-1)} Bengali tokens")
+
+            # probability logging stays inside the guard so forward() works when log is None
+            probs = F.softmax(logits[0], dim=-1)
+            log.log("SOFTMAX_probs", probs[:, :16],
+                    formula="P(token) = exp(logit) / Σ exp(logits)",
+                    note="Probability distribution over Bengali vocabulary")
+
+        meta = {
+            "enc_attn": enc_attn_weights,
+            "dec_self_attn": dec_self_attn_w,
+            "dec_cross_attn": dec_cross_attn_w,
+            "src_mask": src_mask.cpu().numpy(),
+            "tgt_mask": tgt_mask.cpu().numpy(),
+        }
+        return logits, meta
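
The core calculation that `scaled_dot_product_attention` logs can be sanity-checked in isolation. Below is a minimal standalone sketch (not part of the commit — the helper name `sdpa` and the toy shapes are illustrative) showing that the softmax rows sum to 1 and that a causal mask zeroes attention to future positions:

```python
import math
import torch
import torch.nn.functional as F

def sdpa(Q, K, V, mask=None):
    # scores = Q·Kᵀ / √d_k, masked positions set to -inf before softmax
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    w = F.softmax(scores, dim=-1)
    return w @ V, w

torch.manual_seed(0)
T, d_k = 4, 8
Q, K, V = (torch.randn(1, T, d_k) for _ in range(3))

# causal mask: position i may only attend to positions ≤ i
causal = torch.tril(torch.ones(T, T)).bool()
out, w = sdpa(Q, K, V, mask=causal)

print(out.shape)                                     # torch.Size([1, 4, 8])
print(torch.allclose(w.sum(-1), torch.ones(1, T)))   # True: each row sums to 1
print((w[0].triu(1) == 0).all().item())              # True: no weight on future tokens
```

The same invariants are what the "Softmax attention weights (heatmap)" view in the Training Step tab visualizes.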
vocab.py ADDED
@@ -0,0 +1,146 @@
+"""
+vocab.py
+Simple word-level tokenizer and toy vocabularies for the English→Bengali demo.
+"""
+
+import json
+import re
+from pathlib import Path
+from typing import List, Dict, Tuple
+
+
+# ── Special tokens ───────────────────────────────────────────────────────────
+
+PAD_TOKEN = "<pad>"
+BOS_TOKEN = "<bos>"
+EOS_TOKEN = "<eos>"
+UNK_TOKEN = "<unk>"
+
+PAD_IDX = 0
+BOS_IDX = 1
+EOS_IDX = 2
+UNK_IDX = 3
+
+
+# ── English word-level vocab ─────────────────────────────────────────────────
+
+EN_WORDS = [
+    "i", "you", "he", "she", "we", "they", "it",
+    "love", "like", "eat", "drink", "go", "come", "see", "know",
+    "want", "need", "have", "am", "is", "are", "was", "were",
+    "do", "does", "did", "will", "can", "may", "should",
+    "a", "an", "the", "this", "that", "my", "your", "his", "her",
+    "good", "bad", "happy", "sad", "big", "small", "new", "old",
+    "food", "water", "home", "work", "school", "book", "name",
+    "rice", "fish", "milk", "tea", "coffee",
+    "hello", "bye", "yes", "no", "please", "thank", "thanks",
+    "how", "what", "where", "when", "why", "who",
+    "today", "tomorrow", "now", "always", "never", "very",
+    "bengal", "india", "english", "bengali",
+    "beautiful", "wonderful", "great", "nice", "fine",
+]
+
+# ── Bengali word-level vocab ──────────────────────────────────────────────────
+
+BN_WORDS = [
+    "আমি", "তুমি", "তুই", "সে", "আমরা", "তারা", "এটা",
+    "ভালোবাসি", "পছন্দ", "খাই", "পান", "যাই", "আসি", "দেখি", "জানি",
+    "চাই", "দরকার", "আছে", "হই", "হয়", "ছিলাম", "ছিল",
+    "করি", "করে", "করেছি", "করব", "পারি", "পারে", "উচিত",
+    "একটা", "একটি", "এই", "সেই", "আমার", "তোমার", "তার",
+    "ভালো", "খারাপ", "খুশি", "দুঃখী", "বড়", "ছোট", "নতুন", "পুরনো",
+    "খাবার", "পানি", "বাড়ি", "কাজ", "স্কুল", "বই", "নাম",
+    "ভাত", "মাছ", "দুধ", "চা", "কফি",
+    "হ্যালো", "বিদায়", "হ্যাঁ", "না", "দয়া", "ধন্যবাদ",
+    "কেমন", "কি", "কোথায়", "কখন", "কেন", "কে",
+    "আজ", "আগামীকাল", "এখন", "সবসময়", "কখনো", "খুব",
+    "বাংলা", "ভারত", "ইংরেজি",
+    "সুন্দর", "চমৎকার", "দারুণ",
+    "তোমাকে", "আপনাকে", "তাকে", "আমাকে",
+    "করছি", "করছে", "হচ্ছে", "পড়ি", "লিখি", "বলি",
+    "আছ", "সকাল", "জানে", "দেখে",
+]
+
+
+class Vocab:
+    def __init__(self, words: List[str], name: str = "vocab"):
+        self.name = name
+        self.token2idx: Dict[str, int] = {
+            PAD_TOKEN: PAD_IDX,
+            BOS_TOKEN: BOS_IDX,
+            EOS_TOKEN: EOS_IDX,
+            UNK_TOKEN: UNK_IDX,
+        }
+        for w in words:
+            if w not in self.token2idx:
+                self.token2idx[w] = len(self.token2idx)
+        self.idx2token: Dict[int, str] = {v: k for k, v in self.token2idx.items()}
+
+    def __len__(self):
+        return len(self.token2idx)
+
+    def encode(self, sentence: str) -> List[int]:
+        tokens = sentence.lower().strip().split()
+        ids = [self.token2idx.get(t, UNK_IDX) for t in tokens]
+        return [BOS_IDX] + ids + [EOS_IDX]
+
+    def decode(self, ids: List[int], skip_special: bool = True) -> str:
+        skip = {PAD_IDX, BOS_IDX, EOS_IDX} if skip_special else set()
+        return " ".join(
+            self.idx2token.get(i, UNK_TOKEN)
+            for i in ids
+            if i not in skip
+        )
+
+    def tokens(self, ids: List[int]) -> List[str]:
+        return [self.idx2token.get(i, UNK_TOKEN) for i in ids]
+
+
+# ── Toy parallel corpus ──────────────────────────────────────────────────────
+
+PARALLEL_DATA: List[Tuple[str, str]] = [
+    ("i love you", "আমি তোমাকে ভালোবাসি"),
+    ("i like food", "আমি খাবার পছন্দ"),
+    ("you are good", "তুমি ভালো"),
+    ("he is happy", "সে খুশি"),
+    ("i want water", "আমি পানি চাই"),
+    ("she is beautiful", "সে সুন্দর"),
+    ("i eat rice", "আমি ভাত খাই"),
+    ("i drink tea", "আমি চা পান"),
+    ("i know you", "আমি তোমাকে জানি"),
+    ("we are happy", "আমরা খুশি"),
+    ("i see you", "আমি তোমাকে দেখি"),
+    ("you are beautiful", "তুমি সুন্দর"),
+    ("i love bengali", "আমি বাংলা ভালোবাসি"),
+    ("hello how are you", "হ্যালো কেমন আছ"),  # আছ added to vocab
+    ("thank you very much", "ধন্যবাদ খুব"),
+    ("i need you", "আমি তোমাকে দরকার"),
+    ("he knows bengali", "সে বাংলা জানে"),  # জানে added to vocab
+    ("she drinks milk", "সে দুধ পান"),
+    ("we go home", "আমরা বাড়ি যাই"),
+    ("i am happy today", "আমি আজ খুশি"),
+    ("i love my home", "আমি আমার বাড়ি ভালোবাসি"),
+    ("she sees you", "সে তোমাকে দেখে"),
+    ("they eat rice", "তারা ভাত খাই"),
+    ("i want tea", "আমি চা চাই"),
+    ("this is good", "এটা ভালো"),
+    ("he is sad", "সে দুঃখী"),
+    ("i like school", "আমি স্কুল পছন্দ"),
+    ("you know me", "তুমি আমাকে জানে"),
+    ("good morning", "ভালো সকাল"),
+    ("i love india", "আমি ভারত ভালোবাসি"),
+]
+
+
+def build_vocabs() -> Tuple[Vocab, Vocab]:
+    src_v = Vocab(EN_WORDS, "english")
+    tgt_v = Vocab(BN_WORDS, "bengali")
+    return src_v, tgt_v
+
+
+# singleton
+_src_vocab, _tgt_vocab = build_vocabs()
+
+
+def get_vocabs() -> Tuple[Vocab, Vocab]:
+    return _src_vocab, _tgt_vocab
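
The `encode`/`decode` round trip is easiest to see with a toy vocabulary. A minimal self-contained restatement of the logic above (the three-word vocab here is illustrative, not the module's `EN_WORDS`):

```python
# Same scheme as vocab.py: ids 0-3 reserved for <pad>/<bos>/<eos>/<unk>,
# words appended in order after the specials.
PAD_IDX, BOS_IDX, EOS_IDX, UNK_IDX = 0, 1, 2, 3
token2idx = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3,
             "i": 4, "love": 5, "you": 6}
idx2token = {v: k for k, v in token2idx.items()}

def encode(sentence):
    # lowercase, whitespace-split, map unknown words to <unk>, wrap in <bos>/<eos>
    ids = [token2idx.get(t, UNK_IDX) for t in sentence.lower().strip().split()]
    return [BOS_IDX] + ids + [EOS_IDX]

def decode(ids, skip_special=True):
    skip = {PAD_IDX, BOS_IDX, EOS_IDX} if skip_special else set()
    return " ".join(idx2token.get(i, "<unk>") for i in ids if i not in skip)

ids = encode("I love you")
print(ids)                      # [1, 4, 5, 6, 2]
print(decode(ids))              # i love you
print(encode("i love pizza"))   # [1, 4, 5, 3, 2] — out-of-vocab word becomes <unk>
```

This is the same ID sequence the Training Step tab shows in its Tokenization stage before the embedding lookup.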