Spaces: Build error

Update src/streamlit_app.py

src/streamlit_app.py (+172 -4) CHANGED
@@ -30,7 +30,6 @@
 # print("✅ Downloaded and saved GPT-2 to models")
 
 
-
 import streamlit as st
 st.set_page_config(page_title="GPT-2 Attention Explorer", layout="wide")
 
@@ -43,8 +42,8 @@ import pandas as pd
 
 @st.cache_resource
 def load_model():
-    tokenizer = GPT2TokenizerFast.from_pretrained("models")
-    model = GPT2Model.from_pretrained("models", output_attentions=True, attn_implementation="eager")
     model.eval()
     return tokenizer, model
 
@@ -57,11 +56,82 @@ with st.expander("📊 GPT-2 Model Architecture Summary"):
 - **Vocabulary size (V):** `50257`
 - **Embedding dimension (d):** `768`
 - **Max Position Length (L):** `1024`
 - **Transformer Layers:** `12`
 - **Attention Heads per Layer:** `12`
 - **Per-head Dimension (dₖ):** `64`
 - **Feedforward Hidden Layer Size:** `3072`
 - **Total Parameters:** ~117 million
 """)
 
 
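As a side note on the per-head dimension listed above (`768 ÷ 12 = 64`), here is a small NumPy sketch (hypothetical shapes for illustration, not code from this app) of how a hidden state is split across heads:

```python
import numpy as np

# Hypothetical illustration: split a (seq_len, d_model) hidden state into
# 12 heads of 64 dimensions each, as in GPT-2 small (768 // 12 == 64).
d_model, n_heads = 768, 12
d_head = d_model // n_heads  # 64
seq_len = 5

hidden = np.zeros((seq_len, d_model))

# (seq_len, d_model) -> (n_heads, seq_len, d_head)
per_head = hidden.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(per_head.shape)  # (12, 5, 64)
```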
@@ -520,7 +590,105 @@ print(decoded)
 | `'Ġthe'` | `' the'` |
 | `'Ġmat'` | `' mat'` |
 
-
 
 
 """)
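On the `'Ġthe'` → `' the'` rows above: GPT-2's byte-level BPE remaps every byte to a printable character, and for the low bytes 0-32 the remapping is exactly a +256 shift, which is why the space byte surfaces as `'Ġ'`. A minimal sketch of that convention (simplified; the real `bytes_to_unicode` table covers all 256 bytes):

```python
# For bytes 0..32, GPT-2's byte encoder shifts the code point up by 256 so
# the byte has a visible character; space (0x20) becomes chr(0x120) == 'Ġ'.
space_marker = chr(0x20 + 0x100)
print(space_marker)               # 'Ġ'
print('Ġthe'.replace('Ġ', ' '))   # ' the'
```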
 
 @st.cache_resource
 def load_model():
+    tokenizer = GPT2TokenizerFast.from_pretrained("./models")
+    model = GPT2Model.from_pretrained("./models", output_attentions=True, attn_implementation="eager")
     model.eval()
     return tokenizer, model
 
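Since `load_model()` requests `output_attentions=True`, each layer exposes per-head attention matrices whose rows sum to 1 (these are what the explorer visualizes). A NumPy sketch of how one such matrix arises, with assumed toy shapes rather than the model's internals:

```python
import numpy as np

# Toy single-head attention: softmax(Q @ K.T / sqrt(d_k)) over the key axis.
d_k, seq = 64, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq, d_k))
K = rng.standard_normal((seq, d_k))

scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.shape)  # (4, 4); each row sums to 1
```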
 - **Vocabulary size (V):** `50257`
 - **Embedding dimension (d):** `768`
 - **Max Position Length (L):** `1024`
+  - This is sometimes also called:
+    - `n_positions` in the config
+    - max sequence length
+    - context length
+    - max context window
 - **Transformer Layers:** `12`
 - **Attention Heads per Layer:** `12`
 - **Per-head Dimension (dₖ):** `64`
 - **Feedforward Hidden Layer Size:** `3072`
 - **Total Parameters:** ~117 million
+
+---
+
+## Question: Does "Transformer Layers: 12" mean each layer has 12 attention heads?
+
+## 🧠 Quick Answer:
+
+> ✅ **No**, 12 Transformer layers ≠ 12 heads per layer.
+> 🔁 In **GPT-2 (small)** both just happen to be **12** — a design coincidence, not a definition.
+
+---
+
+## 🔍 Breakdown of GPT-2’s Architecture
+
+| Component                     | GPT-2 (small) default |
+| ----------------------------- | --------------------- |
+| Embedding size (`d_model`)    | 768                   |
+| **Transformer layers**        | 12                    |
+| **Attention heads per layer** | 12                    |
+| Hidden feedforward size       | 3072                  |
+| Max position embeddings       | 1024                  |
+
+---
+
+### ✅ So in GPT-2:
+
+* Each of the **12 transformer layers** has:
+  * **multi-head attention** with **12 heads per layer**
+  * each head of dimension `64` (`768 ÷ 12 = 64`)
+
+---
+
+## 📌 Why this Confusion Happens
+
+The number of **layers** and the number of **heads per layer** are:
+
+* configured independently in the model
+* but **coincidentally** both set to 12 in GPT-2 small
+
+In other models:
+
+| Model        | Layers | Heads per Layer |
+| ------------ | ------ | --------------- |
+| GPT-2 Medium | 24     | 16              |
+| GPT-2 Large  | 36     | 20              |
+| GPT-3        | 96     | 96              |
+| LLaMA 2 7B   | 32     | 32              |
+
+So again:
+
+> 🔁 **12 layers ≠ 12 heads** in general — it's just a configuration choice in GPT-2 small.
+
+---
+
+## 💡 Want a table in your app to explain this too?
+
+I can give you a section like:
+
+> "🧩 Layers vs Heads — What's the Difference?"
+
+Let me know and I’ll drop in that Streamlit code too.
+
+
+
 """)
 
 
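The layers-vs-heads table can be sanity-checked in a few lines; note the `d_model` values for the larger GPT-2 variants (1024 and 1280) are added here for illustration and are not part of the table above:

```python
# Layers and heads per layer are independent hyperparameters; note that the
# per-head dimension (d_model // heads) stays 64 across the GPT-2 sizes.
models = {
    # name: (layers, heads_per_layer, d_model)
    "GPT-2 small":  (12, 12, 768),
    "GPT-2 Medium": (24, 16, 1024),
    "GPT-2 Large":  (36, 20, 1280),
}
for name, (layers, heads, d_model) in models.items():
    print(f"{name}: {layers} layers, {heads} heads, d_head={d_model // heads}")
```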
 | `'Ġthe'` | `' the'` |
 | `'Ġmat'` | `' mat'` |
 
+
+---
+
+## ✅ What is `@` in Python?
+
+In Python 3.5+, the `@` operator means:
+
+> **Matrix multiplication** (also called **dot product** or **tensor contraction**, depending on context)
+
+---
+
+### ✅ Equivalent to:
+
+```python
+A @ B  # equivalent to np.matmul(A, B)
+```
+
+Or, if both are 1-D/2-D NumPy arrays:
+
+```python
+A @ B  # equivalent to np.dot(A, B)
+```
+
+---
+
+## 🔍 In your case:
+
+```python
+Output = W_qkv @ x + b
+```
+
+### Let’s say:
+
+* `x` has shape **(3,)**
+* `W_qkv` has shape **(6, 3)**
+* `b` has shape **(6,)**
+
+---
+
+### Then:
+
+* `W_qkv @ x` → matrix–vector multiplication → shape **(6,)**
+* adding `b` → element-wise vector addition → final shape **(6,)**
+
+---
+
+### So this line:
+
+```python
+Output = W_qkv @ x + b
+```
+
+means:
+
+1. Multiply the **input vector `x`** by the **projection matrix `W_qkv`**
+2. Add the **bias vector `b`**
+3. The result is the combined **[Q | K | V]** output
+
+---
+
+## ✅ Example:
+
+```python
+x = np.array([1, 2, 3])
+W_qkv = np.array([
+    [0.1, 0.2, 0.3],  # Q1
+    [0.4, 0.5, 0.6],  # Q2
+    [0.7, 0.8, 0.9],  # K1
+    [1.0, 1.1, 1.2],  # K2
+    [1.3, 1.4, 1.5],  # V1
+    [1.6, 1.7, 1.8],  # V2
+])
+b = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06])
+
+output = W_qkv @ x + b
+```
+
+Manually:
+
+* `W_qkv @ x` = `[1.4, 3.2, 5.0, 6.8, 8.6, 10.4]`
+* after adding `b` → `[1.41, 3.22, 5.03, 6.84, 8.65, 10.46]`
+
+---
+
+## ✅ Summary
+
+| Expression    | Meaning                       |
+| ------------- | ----------------------------- |
+| `@`           | Matrix multiplication (`dot`) |
+| `W @ x + b`   | Linear transformation         |
+| Shape `W @ x` | `(m, n) @ (n,) = (m,)`        |
+
+Would you like to include this in your Streamlit visualizer as an expandable note or equation section?
+
+
 
 
 """)
+
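The worked `W_qkv @ x + b` example above can be run end-to-end to confirm the hand computation:

```python
import numpy as np

# Same numbers as the worked example: a (6, 3) projection applied to a
# 3-vector, plus a 6-vector bias.
x = np.array([1, 2, 3])
W_qkv = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
    [1.6, 1.7, 1.8],
])
b = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06])

output = W_qkv @ x + b
print(np.round(output, 2))  # ≈ [1.41, 3.22, 5.03, 6.84, 8.65, 10.46]
```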