# LLM Token & Attention Explorer with Streamlit
# Features: Tokenization, OpenAI Embeddings, Positional Encoding, Final Tensor, Multi-Head Attention Simulation

import streamlit as st
import numpy as np
import tiktoken
import os
from openai import OpenAI
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

st.set_page_config(page_title="LLM Token Explorer", layout="centered")
st.title("🧠 LLM Attention Explorer: Tokens, Embeddings, Positional Encoding, and Multi-Head Visualization")

# Introductory Explanations
with st.expander("ℹ️ About This App", expanded=True):
    st.markdown("""
    This interactive app lets you explore how Large Language Models (LLMs) like GPT-3/4 work internally.
    You'll learn about tokenization, embeddings, positional encoding, and multi-head self-attention through
    real-time visualizations and simulations.
    """)

with st.expander("🧾 What is a Token?"):
    st.markdown("""
    A token is a basic unit of text. It could be as small as a character or as large as a word depending on the tokenizer.
    GPT models use subword tokenization (like Byte-Pair Encoding), meaning common patterns get their own token.
    For example:
    - "apple" → might be 1 token
    - "unhappiness" → might be split into ["un", "happiness"]
    """)
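# The subword split described above can be checked directly with tiktoken.
# A minimal standalone sketch (run outside the app; assumes tiktoken is installed):
#
# ```python
# import tiktoken
#
# # Encode a word, then decode each ID back to its subword piece.
# enc = tiktoken.get_encoding("cl100k_base")
# ids = enc.encode("unhappiness")
# pieces = [enc.decode([i]) for i in ids]
# print(pieces)  # the exact split depends on the tokenizer's learned merges
# ```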

with st.expander("📌 What Are Embeddings?"):
    st.markdown("""
    Embeddings are high-dimensional vectors that represent the meaning of each token.
    Similar tokens (like 'cat' and 'dog') have embeddings that are close in space.
    They're used by the model to perform mathematical operations on language.
    """)
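# "Close in space" is usually measured with cosine similarity. A toy numpy
# sketch with made-up 3-dim vectors (real embeddings have ~1,500 dims; the
# names and values below are purely illustrative):
#
# ```python
# import numpy as np
#
# def cosine_similarity(a, b):
#     # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite.
#     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
#
# # Toy vectors: "cat" and "dog" point in similar directions, "car" does not.
# cat = np.array([0.9, 0.8, 0.1])
# dog = np.array([0.8, 0.9, 0.2])
# car = np.array([0.1, 0.2, 0.9])
#
# print(cosine_similarity(cat, dog))  # close to 1.0
# print(cosine_similarity(cat, car))  # noticeably smaller
# ```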

with st.expander("📐 Why Positional Encoding?"):
    st.markdown("""
    Since transformers process all tokens in parallel and not sequentially, they need to know token positions.
    Positional encodings are added to token embeddings to give each token a unique place in the sequence.
    """)
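# The sinusoidal scheme sketched above can be written compactly. A vectorized
# numpy illustration (not the app's own helper, which appears further down):
#
# ```python
# import numpy as np
#
# def sinusoidal_pe(seq_len: int, dim: int) -> np.ndarray:
#     """PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(...)."""
#     pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
#     i = np.arange(0, dim, 2)[None, :]                  # even dims: (1, dim/2)
#     angles = pos * np.exp(-i * np.log(10000.0) / dim)  # (seq_len, dim/2)
#     pe = np.zeros((seq_len, dim))
#     pe[:, 0::2] = np.sin(angles)
#     pe[:, 1::2] = np.cos(angles)
#     return pe
#
# pe = sinusoidal_pe(4, 8)  # position 0 is all sin(0)=0 / cos(0)=1
# ```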

with st.expander("🧠 What is Self-Attention?"):
    st.markdown("""
    Self-attention allows the model to weigh the importance of each token in a sentence when encoding a specific token.
    For example, in "The cat sat because it was tired", attention helps "it" focus more on "cat" than other words.
    """)
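# A minimal numpy sketch of the scaled dot-product attention described above
# (toy dimensions, random inputs; purely illustrative):
#
# ```python
# import numpy as np
#
# def scaled_dot_product_attention(q, k, v):
#     # Score every query position against every key position.
#     scores = q @ k.T / np.sqrt(k.shape[-1])
#     # Numerically stable softmax over the key axis.
#     w = np.exp(scores - scores.max(axis=-1, keepdims=True))
#     w /= w.sum(axis=-1, keepdims=True)
#     return w @ v, w
#
# rng = np.random.default_rng(0)
# x = rng.standard_normal((5, 8))  # 5 tokens, 8-dim vectors
# out, weights = scaled_dot_product_attention(x, x, x)
# # Each row of `weights` is a probability distribution over the 5 tokens.
# ```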

with st.expander("🔍 Understanding Multi-Head Attention"):
    st.markdown("""
    Each attention head learns different aspects of language.
    For example:
    - One head might learn grammar structure.
    - Another might learn long-distance relationships.

    Heads run in parallel and their outputs are concatenated to form a rich representation of each token.
    """)
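# The split/concatenate flow can be sketched with shapes alone — a toy numpy
# illustration of slicing an embedding into heads and merging the per-head
# outputs back to full width (hypothetical dimensions):
#
# ```python
# import numpy as np
#
# seq_len, embed_dim, num_heads = 5, 32, 4
# head_dim = embed_dim // num_heads  # each head works on a slice of the embedding
#
# x = np.random.randn(seq_len, embed_dim)
# # (seq_len, embed_dim) -> (num_heads, seq_len, head_dim)
# heads = x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
# # ... each head would run attention independently on its slice ...
# # Concatenating the head outputs restores the full embedding width.
# merged = heads.transpose(1, 0, 2).reshape(seq_len, embed_dim)
# ```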

api_key = os.getenv("OPENAI_API_KEY")
st.text(f"OpenAI key found: {'Yes' if api_key else 'No'}")
if not api_key:
    # The OpenAI client raises at construction time without a key, so stop early.
    st.warning("Set the OPENAI_API_KEY environment variable to enable embedding lookups.")
    st.stop()
client = OpenAI(api_key=api_key)

st.header("✍️ Input Text")
input_text = st.text_area("Enter your text:", height=150)

tokenizer_name = st.selectbox("Choose tokenizer:", ["cl100k_base", "p50k_base", "r50k_base", "gpt2"])

if input_text:
    st.subheader("🔤 Tokenization")
    enc = tiktoken.get_encoding(tokenizer_name)
    tokens = enc.encode(input_text)
    token_strings = [enc.decode([t]) for t in tokens]

    with st.expander("🧾 Token IDs", expanded=True):
        st.write(tokens)
    with st.expander("📖 Decoded Tokens", expanded=True):
        st.write(token_strings)

    st.info(f"Token count: {len(tokens)}")

    fig, ax = plt.subplots()
    ax.bar(range(len(tokens)), tokens, tick_label=token_strings)
    ax.set_xlabel("Token")
    ax.set_ylabel("Token ID")
    ax.set_title("Token IDs for Input Text")
    plt.xticks(rotation=45, ha='right')
    st.pyplot(fig)

    st.subheader("🔗 OpenAI Token Embeddings")
    # One batched request for all tokens instead of one API call per token;
    # results come back in input order.
    response = client.embeddings.create(input=token_strings, model="text-embedding-ada-002")
    embeddings = [item.embedding for item in response.data]

    for tok, embedding in zip(token_strings, embeddings):
        with st.expander(f"🔸 '{tok}' Embedding", expanded=True):
            st.write(embedding)
            fig, ax = plt.subplots(figsize=(8, 1))
            sns.heatmap(np.array(embedding).reshape(1, -1), cmap="viridis", cbar=True, ax=ax)
            ax.set_title("Embedding Heatmap")
            ax.axis('off')
            st.pyplot(fig)

    st.success("Generated embeddings for all tokens.")

    st.subheader("📐 Positional Encoding")
    def get_positional_encoding(seq_len, dim):
        PE = np.zeros((seq_len, dim))
        for pos in range(seq_len):
            for i in range(0, dim, 2):
                div_term = np.exp(i * -np.log(10000.0) / dim)
                PE[pos, i] = np.sin(pos * div_term)
                if i+1 < dim:
                    PE[pos, i+1] = np.cos(pos * div_term)
        return PE

    dim = len(embeddings[0])
    PE = get_positional_encoding(len(tokens), dim)

    with st.expander("📐 Positional Encoding Matrix", expanded=True):
        st.write(PE)

    st.subheader("🧮 Final Input Tensor (Embedding + PE)")
    embedded = np.array(embeddings)
    combined = embedded + PE
    with st.expander("🧾 Final Tensor", expanded=True):
        st.write(combined)

    st.subheader("🧠 Simulated Multi-Head Self-Attention")
    if st.button("Simulate Attention"):
        embed_dim = 32
        num_heads = 4
        head_dim = embed_dim // num_heads

        x = np.random.randn(len(tokens), embed_dim)
        W_q, W_k, W_v = [np.random.randn(embed_dim, embed_dim) for _ in range(3)]

        Q = x @ W_q
        K = x @ W_k
        V = x @ W_v

        def split_heads(t):
            return t.reshape(len(tokens), num_heads, head_dim).transpose(1, 0, 2)

        Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

        def attention(q, k, v):
            scores = q @ k.T / np.sqrt(k.shape[-1])
            weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
            weights /= np.sum(weights, axis=-1, keepdims=True)
            return weights @ v, weights

        outputs = []
        for i in range(num_heads):
            out, weights = attention(Qh[i], Kh[i], Vh[i])
            with st.expander(f"Head {i+1}"):
                st.write("Q:", Qh[i])
                st.write("K:", Kh[i])
                st.write("V:", Vh[i])
                st.write("Attention Weights:", weights)
                fig, ax = plt.subplots()
                sns.heatmap(weights, cmap="Blues", ax=ax)
                ax.set_title("Attention Weights Heatmap")
                st.pyplot(fig)
            outputs.append(out)

        final = np.concatenate(outputs, axis=-1)
        with st.expander("🧩 Concatenated Output"):
            st.write(final)

with st.expander("📊 Transformer and GPT Model Component Comparison (Table)", expanded=True):
    st.markdown("""
    | Parameter                         | Original Transformer (2017) | GPT-2 (2019)           | GPT-3 (2020)            | GPT-4 (2023, est.)         |
    |----------------------------------|------------------------------|-------------------------|--------------------------|----------------------------|
    | **Max Context Length (tokens)**  | 512                          | 1024                    | 2048                     | 8192 / 32,768              |
    | **Vocab Size**                   | ~37,000 (BPE)                | 50,257                  | 50,257                   | ~100,000 (multimodal-aware) |
    | **Embedding Dimension (D)**      | 512 (base), 1024 (big)       | 768 – 1600              | 12,288                   | 12,288+                    |
    | **Layers / Transformer Blocks**  | 6 (base and big)             | 12 – 48 (XL)            | 96                       | ~120 – 160 (est.)          |
    | **Self-Attention Heads**         | 8 (base), 16 (big)           | 12 – 25                 | 96                       | 120 – 128+ (est.)          |
    | **Dim per Attention Head**       | 64                           | 64                      | 128                      | ~128                       |
    | **Batch Size (training)**        | ~25k tokens                  | ~512 – 2048 tokens      | ~3.2M tokens             | Multi-million tokens (est.) |
    | **Tensor Shape**                 | [Batch, Tokens, Dim]         | Same                    | Same                     | Same                       |
    | **Parameters (Total)**           | ~65M                         | 124M – 1.5B             | 175B                     | ~500B – 1T+ (speculative)  |

    **Explanations:**
    - **Context Length**: Max number of tokens the model can see at once.
    - **Embedding Dim**: Size of token vectors.
    - **Layers**: Depth of the network (attention + FFN).
    - **Heads**: Parallel attention mechanisms.
    - **Dim per Head**: Each head gets a slice of the full embedding.
    - **Tensor Shape**: Internal model shape: [Batch, Tokens, Embedding].
    """)