Update src/streamlit_app.py
src/streamlit_app.py  CHANGED  (+130 -1)
@@ -7,6 +7,8 @@ import types
 import torch # now safe to import
 import streamlit as st
 import numpy as np
+import matplotlib.pyplot as plt
+import numpy as np
 
 # Prevent Streamlit from trying to walk torch.classes' non-standard __path__
 if isinstance(getattr(sys.modules.get("torch"), "classes", None), types.ModuleType):
@@ -28,7 +30,7 @@ embedding_dim = st.slider("Embedding Dimension (even only)", min_value=4, max_va
 # --- Load tokenizer ---
 
 # Set custom cache directory within your app's working directory (which is writable on Spaces)
-os.environ['TRANSFORMERS_CACHE'] = './hf_cache'
+# os.environ['TRANSFORMERS_CACHE'] = './hf_cache'
 
 # Load the tokenizer using the custom cache path
 # tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", cache_dir="./hf_cache")
@@ -294,3 +296,130 @@ We then compare this with reference positional encodings to estimate token posit
 | **PE Recovery** | Recover position using similarity |
 
 """, unsafe_allow_html=True)
+
+
+st.markdown("### 🤖 Transformer Internals: Key Concepts")
+
+with st.expander("🔁 Multi-Head Attention: Q, K, V Projections"):
+    st.markdown(r"""
+Each token embedding $\mathbf{x}_i$ is linearly projected into:
+- Query vector: $Q_i = \mathbf{x}_i W^Q$
+- Key vector: $K_i = \mathbf{x}_i W^K$
+- Value vector: $V_i = \mathbf{x}_i W^V$
+
+All of shape: $\mathbb{R}^{d_{model} \times d_{head}}$
+
+Multiple such projections (heads) run in parallel:
+
+$$
+\text{MultiHead}(X) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O
+$$
+
+Each head does:
+$$
+\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V
+$$
+""", unsafe_allow_html=True)
+
+with st.expander("🧠 Contextualized Representations"):
+    st.markdown(r"""
+The attention mechanism lets each token **attend to others**, allowing the output for each token to contain **context**.
+
+For example:
+- Token "fun" gets influenced by "is" and "learning"
+- The output is no longer static but dynamic, depending on sentence context
+
+This is what makes Transformers powerful for understanding relationships between tokens.
+""")
+
+with st.expander("🛠 Feed-Forward Neural Network (FFN)"):
+    st.markdown(r"""
+After attention, each token’s vector goes through a two-layer feed-forward network applied independently:
+
+$$
+\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2
+$$
+
+This allows deeper transformations on each token representation.
+""")
+
+with st.expander("📊 Softmax Over Vocabulary"):
+    st.markdown(r"""
+The final output layer transforms each token representation to **logits** for the full vocabulary.
+
+Then, softmax is applied to convert them into probabilities:
+
+$$
+P(w_i \mid \text{context}) = \frac{\exp(\text{logit}_i)}{\sum_j \exp(\text{logit}_j)}
+$$
+
+The token with the highest probability is typically selected as the **predicted next word**.
+""")
+
+with st.expander("🔮 Predicted Next Token"):
+    st.markdown(r"""
+By chaining all steps (embedding → attention → FFN → softmax), the model predicts the **next token**:
+
+E.g.,
+Input: `"Learning is"`
+Predicted next token: `"fun"`
+
+This is how autoregressive models like GPT-2 **generate text** one token at a time.
+""")
+
+st.markdown("### 🎨 Visualizations: Transformer Internals")
+
+# ---- 1. Attention Heatmap ----
+with st.expander("🔁 Multi-Head Attention Score Heatmap (QKᵀ / √d)"):
+    st.markdown("""
+This heatmap shows how the attention mechanism scores each query against all keys.
+Brighter color = higher attention weight.
+
+$$
+\\text{Attention}(Q, K, V) = \\text{softmax}\\left( \\frac{QK^T}{\\sqrt{d_k}} \\right)V
+$$
+""", unsafe_allow_html=True)
+
+    tokens = ["Learning", "is", "fun"]
+    Q = np.array([[1, 0], [0.5, 0.5], [0, 1]])
+    K = np.array([[1, 0], [0.5, 0.5], [0, 1]])
+    scores = np.dot(Q, K.T) / np.sqrt(2)
+    softmax_scores = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
+
+    fig1, ax1 = plt.subplots()
+    cax = ax1.matshow(softmax_scores, cmap="Blues")
+    fig1.colorbar(cax)
+    ax1.set_xticks(np.arange(len(tokens)))
+    ax1.set_xticklabels(tokens)
+    ax1.set_yticks(np.arange(len(tokens)))
+    ax1.set_yticklabels(tokens)
+    ax1.set_xlabel("Key Tokens (K)")
+    ax1.set_ylabel("Query Tokens (Q)")
+    ax1.set_title("Attention Score Heatmap")
+    st.pyplot(fig1)
+
+# ---- 2. Softmax Curve ----
+with st.expander("📊 Softmax Curve for Vocabulary Logits"):
+    st.markdown("""
+This curve shows how softmax converts logits into probabilities.
+Higher logits result in higher predicted probabilities:
+
+$$
+\\text{Softmax}(x_i) = \\frac{e^{x_i}}{\\sum_j e^{x_j}}
+$$
+""", unsafe_allow_html=True)
+
+    x = np.linspace(-4, 4, 100)
+    logits = np.vstack([x, x + 1, x - 1])
+    exps = np.exp(logits)
+    softmax = exps / np.sum(exps, axis=0)
+
+    fig2, ax2 = plt.subplots()
+    ax2.plot(x, softmax[0], label='Token A')
+    ax2.plot(x, softmax[1], label='Token B')
+    ax2.plot(x, softmax[2], label='Token C')
+    ax2.set_title("Softmax Output vs Logit Value")
+    ax2.set_xlabel("Logit")
+    ax2.set_ylabel("Probability")
+    ax2.legend()
+    st.pyplot(fig2)
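The FFN expander added in this change states the formula $\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$ but the commit includes no code demo for it. Below is a minimal NumPy sketch of the position-wise feed-forward step; the dimensions and random weights are toy values chosen here for illustration only and are not taken from the app.

```python
import numpy as np

# Toy sizes for illustration only (assumptions, not values from the app)
d_model, d_ff = 4, 8
rng = np.random.default_rng(0)

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied row by row (per token)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Three toy token vectors, e.g. standing in for "Learning", "is", "fun"
X = rng.normal(size=(3, d_model))
print(ffn(X).shape)  # -> (3, 4): each token vector is transformed independently
```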
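Similarly, the "Predicted Next Token" expander describes greedy next-token prediction, while the commit only plots a softmax curve. The sketch below shows that step with Hugging Face transformers; it assumes the gpt2 checkpoint is available, and GPT2LMHeadModel is used here purely for illustration (the app itself only references GPT2TokenizerFast in a comment).

```python
# Illustrative only: greedy next-token prediction with GPT-2 (not part of the committed app code)
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Learning is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]        # logits for the position after "is"
probs = torch.softmax(next_token_logits, dim=-1)
next_id = int(torch.argmax(probs))       # greedy choice over the vocabulary
print(tokenizer.decode([next_id]))
```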