---
title: Transformer Visualizer EN→BN
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: true
license: mit
---

# 🔬 Transformer Visualizer — English → Bengali

**See every single calculation inside a Transformer, live.**

## What this Space does

Type any English sentence and watch every number flow through the Transformer architecture step by step — from raw token IDs all the way to Bengali output.

---

## 🗂️ Tabs

### 🏗️ Architecture
- Full SVG diagram of encoder + decoder
- Color-coded: self-attention / cross-attention / masked attention / FFN
- Explains how K and V flow from the encoder into the decoder's cross-attention

### 🏋️ Train Model
- Trains a small Transformer on 30 English→Bengali sentence pairs
- Live loss curve rendered on canvas
- Configurable epochs

### 🔬 Training Step
Shows a **single training forward pass** with teacher forcing:

1. **Tokenization** — English + Bengali → token ID arrays
2. **Embedding** — `token_id → vector × √d_model`
3. **Positional Encoding** — `sin(pos/10000^(2i/d))` / `cos(...)` matrix shown
4. **Encoder**:
   - Q, K, V projection matrices shown
   - `scores = Q·Kᵀ / √d_k` with actual numbers
   - Softmax attention weights (heatmap)
   - Residual + LayerNorm
   - FFN: `max(0, xW₁+b₁)W₂+b₂`
5. **Decoder**:
   - Masked self-attention with causal mask matrix
   - Cross-attention: Q from decoder, K/V from encoder
6. **Loss** — label-smoothed cross-entropy, gradient norms, Adam update
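
The math above can be sketched in a few lines of NumPy. This is a standalone illustration of the formulas the tab visualizes, not the Space's own code: `d_k = 16` assumes `d_model / num_heads = 64 / 4` from the config table, and the Q/K/V "projections" are just slices rather than learned matrices.

```python
import numpy as np

d_model, d_k = 64, 16  # d_model / num_heads = 64 / 4 (from the config table)

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """scores = Q·Kᵀ / √d_k, optionally causally masked, then softmax · V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to future positions
    weights = softmax(scores)
    return weights @ V, weights

def label_smoothed_ce(probs, target, eps=0.1):
    """Step 6: cross-entropy against a smoothed one-hot target."""
    V = probs.shape[-1]
    smooth = np.full(V, eps / (V - 1))
    smooth[target] = 1.0 - eps
    return -(smooth * np.log(probs + 1e-9)).sum()

seq_len = 5
x = positional_encoding(seq_len, d_model)
Q = K = V = x[:, :d_k]                                    # stand-in projections
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # decoder's causal mask
out, w = attention(Q, K, V, mask=causal)
loss = label_smoothed_ce(w[0], target=0)
```

Each row of `w` sums to 1, and the causal mask guarantees position *t* never attends to positions after *t* — exactly the triangular heatmap the tab renders.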

### ⚡ Inference
Shows **auto-regressive decoding**:

- No ground truth needed
- Token generated one at a time
- Top-5 candidates + probabilities at every step
- Cross-attention heatmap: which Bengali token attends to which English word
- Greedy vs Beam Search comparison
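
The greedy loop above can be sketched as follows. `toy_logits` is a hypothetical stand-in for the real decoder forward pass (the Space's actual implementation lives in `inference.py`), and the `BOS`/`EOS` ids and vocab size are assumptions:

```python
import numpy as np

BOS, EOS, VOCAB_BN = 1, 2, 90  # hypothetical special-token ids / vocab size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_logits(src_ids, out_ids):
    """Stand-in for the decoder forward pass (deterministic, for demo only)."""
    rng = np.random.default_rng(sum(src_ids) + sum(out_ids))
    return rng.normal(size=VOCAB_BN)

def greedy_decode(src_ids, max_len=10):
    out = [BOS]
    for _ in range(max_len):
        probs = softmax(toy_logits(src_ids, out))
        top5 = np.argsort(probs)[::-1][:5]     # top-5 candidates at this step
        print([(int(t), round(float(probs[t]), 3)) for t in top5])
        nxt = int(top5[0])                      # greedy: always take the argmax
        out.append(nxt)
        if nxt == EOS:                          # stop once end-of-sequence appears
            break
    return out

tokens = greedy_decode([4, 7, 9])
```

Beam search differs only in keeping the *k* highest-scoring partial sequences per step instead of the single argmax, which is what the comparison view contrasts.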

---

## 📁 File Structure

```
app.py          — Gradio UI + HTML/CSS/JS rendering
transformer.py  — Full Transformer with CalcLog hooks
training.py     — Training loop + single-step visualization
inference.py    — Greedy & beam search with logging
vocab.py        — English/Bengali vocabularies + parallel corpus
requirements.txt
```

---

## ⚙️ Model Config

| Parameter | Value |
|-----------|-------|
| d_model | 64 |
| num_heads | 4 |
| num_layers | 2 |
| d_ff | 128 |
| vocab (EN) | ~100 |
| vocab (BN) | ~90 |
| Optimizer | Adam |
| Loss | Label-smoothed CE |

---

*Built for educational purposes — every matrix operation is logged and displayed.*