Spaces: Build error

Update src/streamlit_app.py

src/streamlit_app.py (+172 -4) CHANGED
@@ -30,7 +30,6 @@
 # print("✅ Downloaded and saved GPT-2 to models")
 
 
-
 import streamlit as st
 st.set_page_config(page_title="GPT-2 Attention Explorer", layout="wide")
 
@@ -43,8 +42,8 @@ import pandas as pd
 
 @st.cache_resource
 def load_model():
-    tokenizer = GPT2TokenizerFast.from_pretrained("models")
-    model = GPT2Model.from_pretrained("models", output_attentions=True, attn_implementation="eager")
     model.eval()
     return tokenizer, model
 
@@ -57,11 +56,82 @@ with st.expander("📊 GPT-2 Model Architecture Summary"):
 - **Vocabulary size (V):** `50257`
 - **Embedding dimension (d):** `768`
 - **Max Position Length (L):** `1024`
 - **Transformer Layers:** `12`
 - **Attention Heads per Layer:** `12`
 - **Per-head Dimension (dₖ):** `64`
 - **Feedforward Hidden Layer Size:** `3072`
 - **Total Parameters:** ~117 million
 """)
 
 
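As a side note on the per-head dimension listed above (`768 ÷ 12 = 64`), here is a small NumPy sketch (hypothetical shapes for illustration, not code from this app) of how a hidden state is split across heads:

```python
import numpy as np

# Hypothetical illustration: split a (seq_len, d_model) hidden state into
# 12 heads of 64 dimensions each, as in GPT-2 small (768 // 12 == 64).
d_model, n_heads = 768, 12
d_head = d_model // n_heads  # 64
seq_len = 5

hidden = np.zeros((seq_len, d_model))

# (seq_len, d_model) -> (n_heads, seq_len, d_head)
per_head = hidden.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(per_head.shape)  # (12, 5, 64)
```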
@@ -520,7 +590,105 @@ print(decoded)
 | `'Ġthe'` | `' the'` |
 | `'Ġmat'` | `' mat'` |
 
-
 
 
 """)
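On the `'Ġthe'` → `' the'` rows above: GPT-2's byte-level BPE remaps every byte to a printable character, and for the low bytes 0-32 the remapping is exactly a +256 shift, which is why the space byte surfaces as `'Ġ'`. A minimal sketch of that convention (simplified; the real `bytes_to_unicode` table covers all 256 bytes):

```python
# For bytes 0..32, GPT-2's byte encoder shifts the code point up by 256 so
# the byte has a visible character; space (0x20) becomes chr(0x120) == 'Ġ'.
space_marker = chr(0x20 + 0x100)
print(space_marker)               # 'Ġ'
print('Ġthe'.replace('Ġ', ' '))   # ' the'
```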
 
 @st.cache_resource
 def load_model():
+    tokenizer = GPT2TokenizerFast.from_pretrained("./models")
+    model = GPT2Model.from_pretrained("./models", output_attentions=True, attn_implementation="eager")
     model.eval()
     return tokenizer, model
 
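Since `load_model()` requests `output_attentions=True`, each layer exposes per-head attention matrices whose rows sum to 1 (these are what the explorer visualizes). A NumPy sketch of how one such matrix arises, with assumed toy shapes rather than the model's internals:

```python
import numpy as np

# Toy single-head attention: softmax(Q @ K.T / sqrt(d_k)) over the key axis.
d_k, seq = 64, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq, d_k))
K = rng.standard_normal((seq, d_k))

scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.shape)  # (4, 4); each row sums to 1
```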
 - **Vocabulary size (V):** `50257`
 - **Embedding dimension (d):** `768`
 - **Max Position Length (L):** `1024`
+  - This is sometimes also called:
+    - `n_positions` in the config
+    - max sequence length
+    - context length
+    - max context window
 - **Transformer Layers:** `12`
 - **Attention Heads per Layer:** `12`
 - **Per-head Dimension (dₖ):** `64`
 - **Feedforward Hidden Layer Size:** `3072`
 - **Total Parameters:** ~117 million
+
+---
+
+## Question: Does "Transformer Layers: 12" mean each layer has 12 attention heads?
+
+## 🧠 Quick Answer:
+
+> ✅ **No**, 12 Transformer layers ≠ 12 heads per layer.
+> 🔁 In **GPT-2 (small)** both just happen to be **12** — a design coincidence, not a definition.
+
+---
+
+## 🔍 Breakdown of GPT-2’s Architecture
+
+| Component                     | GPT-2 (small) default |
+| ----------------------------- | --------------------- |
+| Embedding size (`d_model`)    | 768                   |
+| **Transformer layers**        | 12                    |
+| **Attention heads per layer** | 12                    |
+| Hidden feedforward size       | 3072                  |
+| Max position embeddings       | 1024                  |
+
+---
+
+### ✅ So in GPT-2:
+
+* Each of the **12 transformer layers** has:
+  * **multi-head attention** with **12 heads per layer**
+  * each head of dimension `64` (`768 ÷ 12 = 64`)
+
+---
+
+## 📌 Why this Confusion Happens
+
+The number of **layers** and the number of **heads per layer** are:
+
+* configured independently in the model
+* but **coincidentally** both set to 12 in GPT-2 small
+
+In other models:
+
+| Model        | Layers | Heads per Layer |
+| ------------ | ------ | --------------- |
+| GPT-2 Medium | 24     | 16              |
+| GPT-2 Large  | 36     | 20              |
+| GPT-3        | 96     | 96              |
+| LLaMA 2 7B   | 32     | 32              |
+
+So again:
+
+> 🔁 **12 layers ≠ 12 heads** in general — it's just a configuration choice in GPT-2 small.
+
+---
+
+## 💡 Want a table in your app to explain this too?
+
+I can give you a section like:
+
+> "🧩 Layers vs Heads — What's the Difference?"
+
+Let me know and I’ll drop in that Streamlit code too.
+
+
+
 """)
 
 
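The layers-vs-heads table can be sanity-checked in a few lines; note the `d_model` values for the larger GPT-2 variants (1024 and 1280) are added here for illustration and are not part of the table above:

```python
# Layers and heads per layer are independent hyperparameters; note that the
# per-head dimension (d_model // heads) stays 64 across the GPT-2 sizes.
models = {
    # name: (layers, heads_per_layer, d_model)
    "GPT-2 small":  (12, 12, 768),
    "GPT-2 Medium": (24, 16, 1024),
    "GPT-2 Large":  (36, 20, 1280),
}
for name, (layers, heads, d_model) in models.items():
    print(f"{name}: {layers} layers, {heads} heads, d_head={d_model // heads}")
```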
 | `'Ġthe'` | `' the'` |
 | `'Ġmat'` | `' mat'` |
 
+
+---
+
+## ✅ What is `@` in Python?
+
+In Python 3.5+, the `@` operator means:
+
+> **Matrix multiplication** (also called **dot product** or **tensor contraction**, depending on context)
+
+---
+
+### ✅ Equivalent to:
+
+```python
+A @ B  # equivalent to np.matmul(A, B)
+```
+
+Or, if both are 1-D/2-D NumPy arrays:
+
+```python
+A @ B  # equivalent to np.dot(A, B)
+```
+
+---
+
+## 🔍 In your case:
+
+```python
+Output = W_qkv @ x + b
+```
+
+### Let’s say:
+
+* `x` has shape **(3,)**
+* `W_qkv` has shape **(6, 3)**
+* `b` has shape **(6,)**
+
+---
+
+### Then:
+
+* `W_qkv @ x` → matrix–vector multiplication → shape **(6,)**
+* adding `b` → element-wise vector addition → final shape **(6,)**
+
+---
+
+### So this line:
+
+```python
+Output = W_qkv @ x + b
+```
+
+means:
+
+1. Multiply the **input vector `x`** by the **projection matrix `W_qkv`**
+2. Add the **bias vector `b`**
+3. The result is the combined **[Q | K | V]** output
+
+---
+
+## ✅ Example:
+
+```python
+x = np.array([1, 2, 3])
+W_qkv = np.array([
+    [0.1, 0.2, 0.3],  # Q1
+    [0.4, 0.5, 0.6],  # Q2
+    [0.7, 0.8, 0.9],  # K1
+    [1.0, 1.1, 1.2],  # K2
+    [1.3, 1.4, 1.5],  # V1
+    [1.6, 1.7, 1.8],  # V2
+])
+b = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06])
+
+output = W_qkv @ x + b
+```
+
+Manually:
+
+* `W_qkv @ x` = `[1.4, 3.2, 5.0, 6.8, 8.6, 10.4]`
+* after adding `b` → `[1.41, 3.22, 5.03, 6.84, 8.65, 10.46]`
+
+---
+
+## ✅ Summary
+
+| Expression    | Meaning                       |
+| ------------- | ----------------------------- |
+| `@`           | Matrix multiplication (`dot`) |
+| `W @ x + b`   | Linear transformation         |
+| Shape `W @ x` | `(m, n) @ (n,) = (m,)`        |
+
+Would you like to include this in your Streamlit visualizer as an expandable note or equation section?
+
+
 
 
 """)
+
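The worked `W_qkv @ x + b` example above can be run end-to-end to confirm the hand computation:

```python
import numpy as np

# Same numbers as the worked example: a (6, 3) projection applied to a
# 3-vector, plus a 6-vector bias.
x = np.array([1, 2, 3])
W_qkv = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
    [1.6, 1.7, 1.8],
])
b = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06])

output = W_qkv @ x + b
print(np.round(output, 2))  # ≈ [1.41, 3.22, 5.03, 6.84, 8.65, 10.46]
```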