Commit · 3be1eeb
Parent(s): 4145e94
feat: Deep-dive expansion of LLM Fundamentals, Transformers, and Fine-Tuning modules
- LLM Fundamentals: 8 concept sections, 6 code examples, 6 interview Qs
- Transformers: MoE, FlashAttention, GQA/MQA, RoPE, PyTorch MHA code
- Fine-Tuning: QLoRA math, SFTTrainer, DPO training, adapter merging, 5 Qs
GenAI-AgenticAI/app.js CHANGED (+327 −71)
|
@@ -19,46 +19,99 @@ const MODULE_CONTENT = {
|
|
| 19 |
'llm-fundamentals': {
|
| 20 |
concepts: `
|
| 21 |
<div class="section">
|
| 22 |
-
<h2>LLM Fundamentals —
|
| 23 |
-
<h3>🧠 What is a Language Model?</h3>
|
| 24 |
<div class="info-box">
|
| 25 |
<div class="box-title">⚡ The Core Idea</div>
|
| 26 |
<div class="box-content">
|
| 27 |
A language model is a probability distribution over sequences of tokens: <strong>P(token_n | token_1, token_2, ..., token_n-1)</strong>. LLMs are trained to predict the next token. During inference, they sample repeatedly from this distribution to generate text. Everything — creativity, reasoning, hallucination — emerges from this single objective.
|
| 28 |
</div>
|
| 29 |
</div>
|
| 30 |
-
|
| 31 |
-
<
|
|
| 32 |
<table>
|
| 38 |
</table>
|
| 51 |
</div>
|
| 52 |
</div>`,
|
| 53 |
code: `
|
| 54 |
<div class="section">
|
| 55 |
-
<h2>💻 LLM Fundamentals — Code Examples</h2>
|
| 56 |
-
|
|
|
|
| 57 |
<div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI
|
| 58 |
|
| 59 |
-
client = OpenAI()
|
| 60 |
|
| 61 |
-
<span class="comment"># Basic
|
| 62 |
response = client.chat.completions.create(
|
| 63 |
model=<span class="string">"gpt-4o"</span>,
|
| 64 |
messages=[
|
|
@@ -68,66 +121,213 @@ response = client.chat.completions.create(
|
|
| 68 |
temperature=<span class="number">0.7</span>,
|
| 69 |
max_tokens=<span class="number">512</span>
|
| 70 |
)
|
| 71 |
-
<span class="function">print</span>(response.choices[<span class="number">0</span>].message.content)
|
| 72 |
-
|
| 73 |
-
|
| 74 |
stream = client.chat.completions.create(
|
| 75 |
model=<span class="string">"gpt-4o"</span>,
|
| 76 |
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Write a haiku about neural nets"</span>}],
|
| 77 |
stream=<span class="keyword">True</span>
|
| 78 |
)
|
|
| 79 |
<span class="keyword">for</span> chunk <span class="keyword">in</span> stream:
|
| 83 |
<div class="code-block"><span class="keyword">import</span> tiktoken
|
| 84 |
|
| 85 |
enc = tiktoken.encoding_for_model(<span class="string">"gpt-4o"</span>)
|
| 86 |
text = <span class="string">"The transformer architecture changed everything."</span>
|
| 87 |
tokens = enc.encode(text)
|
| 88 |
-
<span class="function">print</span>(<span class="string">f"Token count: {len(tokens)}"</span>)
|
| 89 |
-
<span class="function">print</span>(<span class="string">f"Tokens: {[enc.decode([t]) for t in tokens]}"</span>)
|
| 90 |
-
|
| 91 |
interview: `
|
| 92 |
-
|
| 93 |
-
<h2>🎯 LLM Interview Questions</h2>
|
| 94 |
-
<div class="interview-box"><strong>Q1: What happens when temperature = 0?</strong><p><strong>Answer:</strong> The model becomes <strong>deterministic</strong>, always picking the highest-probability token (greedy decoding). Use for tasks requiring consistency (
|
| 95 |
-
<div class="interview-box"><strong>Q2: Why do LLMs hallucinate?</strong><p><strong>Answer:</strong> LLMs don't "know" facts — they model <strong>token probabilities</strong>. When asked about something rare or
|
| 96 |
-
<div class="interview-box"><strong>Q3: What's the difference between context window and memory?</strong><p><strong>Answer:</strong> Context window is the tokens the model can process in a <strong>single inference pass</strong> — it's stateless. There is no persistent memory between calls. "Memory" in frameworks like LangChain is implemented externally by storing past conversation turns in a database
|
| 97 |
-
<div class="interview-box"><strong>Q4: What is RLHF and
|
| 98 |
-
|
| 99 |
},
|
| 100 |
'transformers': {
|
| 101 |
concepts: `
|
| 102 |
<div class="section">
|
| 103 |
-
<h2>Transformer Architecture —
|
| 104 |
<div class="info-box">
|
| 105 |
-
<div class="box-title">⚡ "Attention Is All You Need" (2017)</div>
|
| 106 |
-
<div class="box-content">
|
| 107 |
</div>
|
| 108 |
-
|
| 109 |
-
<
|
| 110 |
<table>
|
| 111 |
<tr><th>Component</th><th>Role</th><th>Analogy</th></tr>
|
| 112 |
-
<tr><td>Query (Q)</td><td>What this token is looking for</td><td>Search query</td></tr>
|
| 113 |
-
<tr><td>Key (K)</td><td>What each token offers</td><td>
|
| 114 |
-
<tr><td>Value (V)</td><td>Actual content to retrieve</td><td>
|
| 115 |
-
<tr><td>
|
|
| 116 |
</table>
|
| 117 |
-
<h3>Multi-Head Attention</h3>
|
| 118 |
-
<p>Run h independent attention heads in parallel, each learning different types of relationships (syntax, semantics, coreference). Concatenate outputs and project. GPT-4 likely uses ~96 heads. Each head specializes: head 1 may track subject-verb agreement, head 2 may track pronoun references.</p>
|
| 119 |
-
<h3>Positional Encoding</h3>
|
| 120 |
-
<p>Attention has no notion of order (it's a set operation). Positional encodings inject position information. Original Transformers used sinusoidal functions. Modern LLMs use <strong>RoPE (Rotary Position Embedding)</strong> — LLaMA, Mistral, Gemma all use RoPE, which enables better length generalization.</p>
|
| 121 |
-
<h3>Decoder-Only vs Encoder-Decoder</h3>
|
| 122 |
-
<div class="comparison">
|
| 123 |
-
<div class="comparison-bad"><strong>Decoder-Only (GPT-style)</strong><br>Causal (left-to-right) attention. Can only see past tokens. Optimized for text generation. Examples: GPT-4, LLaMA, Gemma, Mistral.</div>
|
| 124 |
-
<div class="comparison-good"><strong>Encoder-Decoder (T5-style)</strong><br>Encoder sees full input. Decoder generates output attending to encoder. Better for seq2seq tasks (translation, summarization). Examples: T5, BART, mT5.</div>
|
| 125 |
-
</div>
|
| 126 |
</div>`,
|
| 127 |
code: `
|
| 128 |
<div class="section">
|
| 129 |
-
<h2>💻 Transformer Architecture — Code</h2>
|
| 130 |
-
|
|
|
|
| 131 |
<div class="code-block"><span class="keyword">import</span> numpy <span class="keyword">as</span> np
|
| 132 |
|
| 133 |
<span class="keyword">def</span> <span class="function">scaled_dot_product_attention</span>(Q, K, V, mask=<span class="keyword">None</span>):
|
|
@@ -143,8 +343,42 @@ Q = np.random.randn(<span class="number">3</span>, <span class="number">4</span>
|
|
| 143 |
K = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
|
| 144 |
V = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
|
| 145 |
output, attn_weights = scaled_dot_product_attention(Q, K, V)
|
| 146 |
-
<span class="function">print</span>(<span class="string">
|
| 147 |
-
|
|
| 148 |
<div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer
|
| 149 |
|
| 150 |
model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>, output_attentions=<span class="keyword">True</span>)
|
|
@@ -154,21 +388,43 @@ inputs = tokenizer(<span class="string">"The cat sat on the"</span>, return_tens
|
|
| 154 |
outputs = model(**inputs)
|
| 155 |
|
| 156 |
<span class="comment"># outputs.attentions: tuple of (batch, heads, seq, seq) per layer</span>
|
| 157 |
-
|
| 158 |
-
<span class="function">print</span>(<span class="string">f"Layers: {len(outputs.attentions)}, Heads: {
|
|
| 159 |
</div>`,
|
| 160 |
interview: `
|
| 161 |
<div class="section">
|
| 162 |
-
<h2>🎯 Transformer Interview Questions</h2>
|
| 163 |
-
<div class="interview-box"><strong>Q1: Why divide by √d_k in attention?</strong><p><strong>Answer:</strong> For large d_k, dot products grow large in magnitude, pushing softmax into regions with
|
| 164 |
-
<div class="interview-box"><strong>Q2: What is KV Cache and why is it
|
| 165 |
-
<div class="interview-box"><strong>Q3: What's the difference between MHA
|
| 166 |
-
<div class="interview-box"><strong>Q4: What is RoPE and why
|
|
| 167 |
</div>`
|
| 168 |
},
|
| 169 |
'huggingface': {
|
| 170 |
concepts: `
|
| 171 |
-
|
| 172 |
<h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
|
| 173 |
<div class="info-box">
|
| 174 |
<div class="box-title">⚡ The GitHub of AI</div>
|
|
| 19 |
'llm-fundamentals': {
|
| 20 |
concepts: `
|
| 21 |
<div class="section">
|
| 22 |
+
<h2>🧠 LLM Fundamentals — Complete Deep Dive</h2>
|
|
| 23 |
<div class="info-box">
|
| 24 |
<div class="box-title">⚡ The Core Idea</div>
|
| 25 |
<div class="box-content">
|
| 26 |
A language model is a probability distribution over sequences of tokens: <strong>P(token_n | token_1, token_2, ..., token_n-1)</strong>. LLMs are trained to predict the next token. During inference, they sample repeatedly from this distribution to generate text. Everything — creativity, reasoning, hallucination — emerges from this single objective.
|
| 27 |
</div>
|
| 28 |
</div>
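The sample-and-append loop described above can be sketched with a toy bigram "model" — the vocabulary and probabilities below are invented for illustration, and a real LLM conditions on the whole sequence, not just the last token:

```python
import random

# Toy next-token distributions keyed by the previous token
# (values are illustrative, not from any real model).
NEXT = {
    "the": {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"ran": 0.6, "<eos>": 0.4},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def generate(start, max_tokens=10, seed=0):
    """Sample the next token, append it, repeat — the whole of inference."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(max_tokens):
        dist = NEXT[tokens[-1]]
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "<eos>":        # end-of-sequence token stops generation
            break
        tokens.append(tok)
    return tokens
```

Everything an LLM does is this loop, with the lookup table replaced by a neural network over the full context.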
|
| 29 |
+
|
| 30 |
+
<h3>1. How Language Models Actually Work</h3>
|
| 31 |
+
<p>An LLM is fundamentally a <strong>next-token predictor</strong>. Given a sequence of tokens, it outputs a probability distribution over the entire vocabulary (~32K-128K tokens). The training objective is to minimize <strong>cross-entropy loss</strong> between the predicted distribution and the actual next token across billions of text examples. The model learns grammar, facts, reasoning patterns, and even code — all as statistical regularities in token sequences.</p>
|
| 32 |
+
<div class="callout insight">
|
| 33 |
+
<div class="callout-title">🔑 Key Insight: Emergent Abilities</div>
|
| 34 |
+
<p>Below ~10B parameters, models just predict tokens. Above ~50B, new abilities <strong>emerge</strong> that weren't explicitly trained: chain-of-thought reasoning, few-shot learning, code generation, translation between languages never paired in training data. This is why scale matters and is the foundation of the "scaling laws" (Chinchilla, Kaplan et al.).</p>
|
| 35 |
+
</div>
|
| 36 |
+
|
| 37 |
+
<h3>2. Tokenization — The Hidden Layer</h3>
|
| 38 |
+
<p>Text is never fed directly to an LLM. It's first converted to <strong>tokens</strong> (sub-word units). Understanding tokenization is critical because:</p>
|
| 39 |
<table>
|
| 40 |
+
<tr><th>Aspect</th><th>Why It Matters</th><th>Example</th></tr>
|
| 41 |
+
<tr><td>Cost</td><td>API pricing is per-token, not per-word</td><td>"unbelievable" = 3 tokens = 3x cost vs 1 word</td></tr>
|
| 42 |
+
<tr><td>Context limits</td><td>128K tokens ≠ 128K words (~75K words)</td><td>1 token ≈ 0.75 English words on average</td></tr>
|
| 43 |
+
<tr><td>Non-English penalty</td><td>Languages like Hindi/Chinese use 2-3x more tokens per word</td><td>"नमस्ते" might be 6 tokens vs "hello" = 1 token</td></tr>
|
| 44 |
+
<tr><td>Code tokenization</td><td>Whitespace and syntax consume tokens</td><td>4 spaces of indentation = 1 token wasted per line</td></tr>
|
| 45 |
+
<tr><td>Number handling</td><td>Numbers tokenize unpredictably</td><td>"1234567" might split as ["123", "45", "67"] — why LLMs are bad at math</td></tr>
|
| 46 |
</table>
|
| 47 |
+
<p><strong>Algorithms:</strong> BPE (GPT, LLaMA) — merges frequent byte pairs iteratively. WordPiece (BERT) — maximizes likelihood. SentencePiece/Unigram (T5) — statistical segmentation. Modern LLMs use vocabularies of 32K-128K tokens.</p>
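The BPE merge step described above can be sketched in a few lines (function names and the tiny corpus are illustrative; production tokenizers repeat this merge thousands of times over byte-level symbols):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus (symbol tuple -> freq)."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """One BPE step: fuse every occurrence of `pair` into a new symbol."""
    out = {}
    for symbols, freq in corpus.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        out[tuple(merged)] = freq
    return out

# Tiny corpus: words split into characters, with frequencies
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lot"): 2}
pair = most_frequent_pair(corpus)     # ("l", "o") is most frequent here
corpus = merge_pair(corpus, pair)     # "low" becomes ("lo", "w")
```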
|
| 48 |
+
|
| 49 |
+
<h3>3. Inference Parameters — Controlling Output</h3>
|
| 50 |
+
<table>
|
| 51 |
+
<tr><th>Parameter</th><th>What it controls</th><th>Range</th><th>When to change</th></tr>
|
| 52 |
+
<tr><td><strong>Temperature</strong></td><td>Sharpens/flattens the probability distribution</td><td>0.0 – 2.0</td><td>0 for extraction/code, 0.7 for chat, 1.2+ for creative writing</td></tr>
|
| 53 |
+
<tr><td><strong>Top-p (nucleus)</strong></td><td>Cumulative probability cutoff — only consider tokens within top-p mass</td><td>0.7 – 1.0</td><td>Use 0.9 as default; lower for focused, higher for diverse</td></tr>
|
| 54 |
+
<tr><td><strong>Top-k</strong></td><td>Hard limit on candidate tokens</td><td>10 – 100</td><td>Rarely needed if using top-p; useful as safety net</td></tr>
|
| 55 |
+
<tr><td><strong>Frequency penalty</strong></td><td>Penalizes repeated tokens proportionally</td><td>0.0 – 2.0</td><td>Increase to reduce repetitive output</td></tr>
|
| 56 |
+
<tr><td><strong>Presence penalty</strong></td><td>Flat penalty for any repeated token</td><td>0.0 – 2.0</td><td>Increase to encourage topic diversity</td></tr>
|
| 57 |
+
<tr><td><strong>Max tokens</strong></td><td>Generation length limit</td><td>1 – 128K</td><td>Set to expected output length + margin; never use -1 for safety</td></tr>
|
| 58 |
+
<tr><td><strong>Stop sequences</strong></td><td>Strings that stop generation</td><td>Any text</td><td>Essential for structured output: stop at "}" for JSON</td></tr>
|
| 59 |
+
</table>
|
| 60 |
+
<div class="callout warning">
|
| 61 |
+
<div class="callout-title">⚠️ Common Mistake</div>
|
| 62 |
+
<p>Don't tune temperature and top_p at the same time — they interact. Use <strong>either</strong> temperature OR top-p for sampling control, not both; OpenAI recommends changing one and leaving the other at its default. (At temperature=0 decoding is effectively greedy, so top_p has no effect at all.)</p>
|
| 63 |
</div>
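Temperature and top-p are both transformations of the model's output distribution. A minimal NumPy sketch of what each does (the logit values are invented for illustration):

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Temperature rescales logits before softmax: T->0 approaches greedy
    argmax, T=1 keeps the trained distribution, T>1 flattens it."""
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    z -= z.max()                              # numerical stability
    p = np.exp(z)
    return p / p.sum()

def top_p_filter(probs, top_p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches top_p, then renormalize."""
    order = np.argsort(probs)[::-1]           # most probable first
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = [2.0, 1.0, 0.1, -1.0]
p_cold = apply_temperature(logits, 0.1)   # near one-hot: ~greedy
p_hot = apply_temperature(logits, 2.0)    # much flatter distribution
```

This makes the interaction concrete: once temperature has sharpened the distribution to near one-hot, nucleus filtering has nothing left to do — hence the advice to tune one at a time.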
|
| 64 |
+
|
| 65 |
+
<h3>4. Context Window — The LLM's Working Memory</h3>
|
| 66 |
+
<p>The context window determines how many tokens the model can process in a single call (input + output combined).</p>
|
| 67 |
+
<table>
|
| 68 |
+
<tr><th>Model</th><th>Context Window</th><th>Approx. Pages</th></tr>
|
| 69 |
+
<tr><td>GPT-4o</td><td>128K tokens</td><td>~200 pages</td></tr>
|
| 70 |
+
<tr><td>Claude 3.5 Sonnet</td><td>200K tokens</td><td>~350 pages</td></tr>
|
| 71 |
+
<tr><td>Gemini 1.5 Pro</td><td>2M tokens</td><td>~3,000 pages</td></tr>
|
| 72 |
+
<tr><td>LLaMA 3.1</td><td>128K tokens</td><td>~200 pages</td></tr>
|
| 73 |
+
<tr><td>Mistral Large</td><td>128K tokens</td><td>~200 pages</td></tr>
|
| 74 |
+
</table>
|
| 75 |
+
<p><strong>"Lost in the Middle"</strong> (Liu et al., 2023): Performance degrades for information placed in the middle of very long contexts. Models attend most to the <strong>beginning and end</strong> of prompts. Strategy: put the most important content at the start or end; use retrieval to avoid stuffing the entire context.</p>
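One common mitigation is to reorder retrieved chunks so the best-ranked land at the start and end of the prompt. A minimal sketch of that interleaving (the chunk names are placeholders):

```python
def reorder_for_long_context(chunks_ranked):
    """Place the highest-ranked chunks at the prompt's start and end,
    letting the weakest sink to the middle ('lost in the middle' fix)."""
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked):   # input is best-first
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5"]          # r1 = most relevant
ordered = reorder_for_long_context(ranked)       # r1 first, r2 last
```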
|
| 76 |
+
|
| 77 |
+
<h3>5. Pre-training Pipeline</h3>
|
| 78 |
+
<p>Training an LLM from scratch involves: (1) <strong>Data collection</strong> — crawl the web (Common Crawl, ~1 trillion tokens), books, code (GitHub), conversations. (2) <strong>Data cleaning</strong> — deduplication, quality filtering, toxicity removal, PII scrubbing. (3) <strong>Tokenizer training</strong> — build BPE vocabulary from the corpus. (4) <strong>Pre-training</strong> — next-token prediction on massive GPU clusters (thousands of A100s/H100s for weeks). Cost: $2M-$100M+ for frontier models.</p>
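The pre-training objective in step (4) is just cross-entropy over next tokens. A minimal NumPy sketch (shapes and values are illustrative):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy between predicted distributions and the actual
    next tokens — the pre-training objective.
    Assumed shapes: logits [seq_len, vocab], targets [seq_len]."""
    z = logits - logits.max(axis=-1, keepdims=True)              # stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab = 5
uniform = np.zeros((3, vocab))          # a model that knows nothing
loss = next_token_loss(uniform, np.array([1, 4, 2]))
# for uniform predictions, the loss equals ln(vocab)
```

Training is simply driving this number down across trillions of tokens; everything else (data cleaning, tokenizer, GPU clusters) is in service of this one scalar.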
|
| 79 |
+
|
| 80 |
+
<h3>6. Alignment: RLHF, DPO, and Constitutional AI</h3>
|
| 81 |
+
<p>A base model predicts tokens but doesn't follow instructions or refuse harmful content. Alignment methods bridge this gap:</p>
|
| 82 |
+
<table>
|
| 83 |
+
<tr><th>Method</th><th>How It Works</th><th>Used By</th></tr>
|
| 84 |
+
<tr><td><strong>SFT (Supervised Fine-Tuning)</strong></td><td>Train on (instruction, response) pairs from human annotators</td><td>All models (Step 1)</td></tr>
|
| 85 |
+
<tr><td><strong>RLHF</strong></td><td>Train a reward model on human preferences, then optimize policy via PPO</td><td>GPT-4, Claude (early)</td></tr>
|
| 86 |
+
<tr><td><strong>DPO (Direct Preference Optimization)</strong></td><td>Skip the reward model — directly optimize from preference pairs, simpler and more stable</td><td>LLaMA 3, Zephyr, Gemma</td></tr>
|
| 87 |
+
<tr><td><strong>Constitutional AI</strong></td><td>Model critiques and revises its own outputs against a set of principles</td><td>Claude (Anthropic)</td></tr>
|
| 88 |
+
<tr><td><strong>RLAIF</strong></td><td>Use an AI model (not humans) to generate preference data</td><td>Gemini, some open models</td></tr>
|
| 89 |
+
</table>
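The DPO objective from the table is simple enough to write out directly. A sketch of the per-pair loss, assuming you already have summed log-probabilities of each response under the policy and the frozen reference model:

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probs of the
    chosen/rejected responses under the trained policy (pi_*) and the
    frozen reference model (ref_*); beta controls KL strength."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# When the policy equals the reference, the loss sits at log(2)
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)
```

No reward model, no PPO rollouts — the gradient pushes the policy to prefer the chosen response relative to the reference, which is why DPO is so much simpler to run than RLHF.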
|
| 90 |
+
|
| 91 |
+
<h3>7. The Modern LLM Landscape (2024-2025)</h3>
|
| 92 |
+
<table>
|
| 93 |
+
<tr><th>Provider</th><th>Flagship Model</th><th>Strengths</th><th>Best For</th></tr>
|
| 94 |
+
<tr><td>OpenAI</td><td>GPT-4o, o1, o3</td><td>Best all-around, strong coding, reasoning chains (o1/o3)</td><td>General purpose, production</td></tr>
|
| 95 |
+
<tr><td>Anthropic</td><td>Claude 3.5 Sonnet</td><td>Best for long documents, coding, safety-conscious</td><td>Enterprise, agents, analysis</td></tr>
|
| 96 |
+
<tr><td>Google</td><td>Gemini 1.5 Pro/2.0</td><td>Massive context (2M), multi-modal, grounding</td><td>Document processing, multi-modal</td></tr>
|
| 97 |
+
<tr><td>Meta</td><td>LLaMA 3.1/3.2</td><td>Best open-source, fine-tunable, commercially free</td><td>Self-hosting, fine-tuning</td></tr>
|
| 98 |
+
<tr><td>Mistral</td><td>Mistral Large, Mixtral</td><td>Strong open models, MoE efficiency</td><td>European market, cost-effective</td></tr>
|
| 99 |
+
<tr><td>DeepSeek</td><td>DeepSeek V3, R1</td><td>Exceptional reasoning, competitive with o1</td><td>Math, coding, research</td></tr>
|
| 100 |
+
</table>
|
| 101 |
+
|
| 102 |
+
<h3>8. Scaling Laws — Why Bigger Models Get Smarter</h3>
|
| 103 |
+
<p><strong>Chinchilla Scaling Law</strong> (Hoffmann et al., 2022): For a compute-optimal model, training tokens should scale proportionally to model parameters. A 70B model should be trained on ~1.4 trillion tokens. Key insight: many earlier models were <strong>undertrained</strong> (too many params, not enough data). LLaMA showed smaller, well-trained models can match larger undertrained ones.</p>
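The proportionality above is often summarized as the "20 tokens per parameter" rule of thumb, which reproduces the 70B figure in the text:

```python
def chinchilla_optimal_tokens_b(params_b):
    """Compute-optimal training tokens (billions) ~= 20 x parameters
    (billions) — the Chinchilla rule of thumb."""
    return 20.0 * params_b

# 70B params -> 1,400B tokens = ~1.4 trillion, matching the text above
tokens_b = chinchilla_optimal_tokens_b(70)
```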
|
| 104 |
</div>`,
|
| 105 |
code: `
|
| 106 |
<div class="section">
|
| 107 |
+
<h2>💻 LLM Fundamentals — Comprehensive Code Examples</h2>
|
| 108 |
+
|
| 109 |
+
<h3>1. OpenAI API — Complete Patterns</h3>
|
| 110 |
<div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI
|
| 111 |
|
| 112 |
+
client = OpenAI() <span class="comment"># Uses OPENAI_API_KEY env var</span>
|
| 113 |
|
| 114 |
+
<span class="comment"># ─── Basic Chat Completion ───</span>
|
| 115 |
response = client.chat.completions.create(
|
| 116 |
model=<span class="string">"gpt-4o"</span>,
|
| 117 |
messages=[
|
|
| 121 |
temperature=<span class="number">0.7</span>,
|
| 122 |
max_tokens=<span class="number">512</span>
|
| 123 |
)
|
| 124 |
+
<span class="function">print</span>(response.choices[<span class="number">0</span>].message.content)
|
| 125 |
+
|
| 126 |
+
<span class="comment"># ─── Multi-turn Conversation ───</span>
|
| 127 |
+
messages = [
|
| 128 |
+
{<span class="string">"role"</span>: <span class="string">"system"</span>, <span class="string">"content"</span>: <span class="string">"You are a Python tutor."</span>},
|
| 129 |
+
{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"What is a decorator?"</span>},
|
| 130 |
+
{<span class="string">"role"</span>: <span class="string">"assistant"</span>, <span class="string">"content"</span>: <span class="string">"A decorator is a function that wraps another function..."</span>},
|
| 131 |
+
{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Show me an example with arguments."</span>}
|
| 132 |
+
]
|
| 133 |
+
resp = client.chat.completions.create(model=<span class="string">"gpt-4o"</span>, messages=messages)</div>
|
| 134 |
+
|
| 135 |
+
<h3>2. Streaming Responses</h3>
|
| 136 |
+
<div class="code-block"><span class="comment"># Streaming for real-time output — essential for UX</span>
|
| 137 |
stream = client.chat.completions.create(
|
| 138 |
model=<span class="string">"gpt-4o"</span>,
|
| 139 |
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Write a haiku about neural nets"</span>}],
|
| 140 |
stream=<span class="keyword">True</span>
|
| 141 |
)
|
| 142 |
+
|
| 143 |
+
full_response = <span class="string">""</span>
|
| 144 |
<span class="keyword">for</span> chunk <span class="keyword">in</span> stream:
|
| 145 |
+
token = chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">or</span> <span class="string">""</span>
|
| 146 |
+
full_response += token
|
| 147 |
+
<span class="function">print</span>(token, end=<span class="string">""</span>, flush=<span class="keyword">True</span>)
|
| 148 |
+
|
| 149 |
+
<span class="comment"># Async streaming (for FastAPI/web apps)</span>
|
| 150 |
+
<span class="keyword">async def</span> <span class="function">stream_chat</span>(prompt):
|
| 151 |
+
<span class="keyword">from</span> openai <span class="keyword">import</span> AsyncOpenAI
|
| 152 |
+
aclient = AsyncOpenAI()
|
| 153 |
+
stream = <span class="keyword">await</span> aclient.chat.completions.create(
|
| 154 |
+
model=<span class="string">"gpt-4o"</span>, stream=<span class="keyword">True</span>,
|
| 155 |
+
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: prompt}]
|
| 156 |
+
)
|
| 157 |
+
<span class="keyword">async for</span> chunk <span class="keyword">in</span> stream:
|
| 158 |
+
<span class="keyword">yield</span> chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">or</span> <span class="string">""</span></div>
|
| 159 |
+
|
| 160 |
+
<h3>3. Token Counting & Cost Estimation</h3>
|
| 161 |
<div class="code-block"><span class="keyword">import</span> tiktoken
|
| 162 |
|
| 163 |
+
<span class="comment"># Count tokens for any model</span>
|
| 164 |
enc = tiktoken.encoding_for_model(<span class="string">"gpt-4o"</span>)
|
| 165 |
text = <span class="string">"The transformer architecture changed everything."</span>
|
| 166 |
tokens = enc.encode(text)
|
| 167 |
+
<span class="function">print</span>(<span class="string">f"Token count: {len(tokens)}"</span>)
|
| 168 |
+
<span class="function">print</span>(<span class="string">f"Tokens: {[enc.decode([t]) for t in tokens]}"</span>)
|
| 169 |
+
|
| 170 |
+
<span class="comment"># Cost estimation helper</span>
|
| 171 |
+
<span class="keyword">def</span> <span class="function">estimate_cost</span>(text, model=<span class="string">"gpt-4o"</span>):
|
| 172 |
+
enc = tiktoken.encoding_for_model(model)
|
| 173 |
+
token_count = <span class="function">len</span>(enc.encode(text))
|
| 174 |
+
prices = {
|
| 175 |
+
<span class="string">"gpt-4o"</span>: (<span class="number">2.50</span>, <span class="number">10.00</span>), <span class="comment"># (input, output) per 1M tokens</span>
|
| 176 |
+
<span class="string">"gpt-4o-mini"</span>: (<span class="number">0.15</span>, <span class="number">0.60</span>),
|
| 177 |
+
<span class="string">"claude-3-5-sonnet"</span>: (<span class="number">3.00</span>, <span class="number">15.00</span>),
|
| 178 |
+
}
|
| 179 |
+
input_price = prices.get(model, (<span class="number">1</span>,<span class="number">1</span>))[<span class="number">0</span>]
|
| 180 |
+
cost = (token_count / <span class="number">1_000_000</span>) * input_price
|
| 181 |
+
<span class="keyword">return</span> <span class="string">f"{token_count} tokens = $\\{cost:.4f}"</span>
|
| 182 |
+
|
| 183 |
+
<span class="function">print</span>(estimate_cost(<span class="string">"Explain AI in 500 words"</span>))</div>
|
| 184 |
+
|
| 185 |
+
<h3>4. Structured Output (JSON Mode)</h3>
|
| 186 |
+
<div class="code-block"><span class="comment"># Force JSON output — essential for pipelines</span>
|
| 187 |
+
response = client.chat.completions.create(
|
| 188 |
+
model=<span class="string">"gpt-4o"</span>,
|
| 189 |
+
response_format={<span class="string">"type"</span>: <span class="string">"json_object"</span>},
|
| 190 |
+
messages=[{
|
| 191 |
+
<span class="string">"role"</span>: <span class="string">"user"</span>,
|
| 192 |
+
<span class="string">"content"</span>: <span class="string">"Extract entities from: 'Elon Musk founded SpaceX in 2002'. Return JSON with fields: persons, orgs, dates."</span>
|
| 193 |
+
}]
|
| 194 |
+
)
|
| 195 |
+
<span class="keyword">import</span> json
|
| 196 |
+
data = json.loads(response.choices[<span class="number">0</span>].message.content)
|
| 197 |
+
<span class="function">print</span>(data) <span class="comment"># {"persons": ["Elon Musk"], "orgs": ["SpaceX"], "dates": ["2002"]}</span>
|
| 198 |
+
|
| 199 |
+
<span class="comment"># Pydantic structured output (newest API)</span>
|
| 200 |
+
<span class="keyword">from</span> pydantic <span class="keyword">import</span> BaseModel
|
| 201 |
+
|
| 202 |
+
<span class="keyword">class</span> <span class="function">Entity</span>(BaseModel):
|
| 203 |
+
persons: list[str]
|
| 204 |
+
organizations: list[str]
|
| 205 |
+
dates: list[str]
|
| 206 |
+
|
| 207 |
+
completion = client.beta.chat.completions.parse(
|
| 208 |
+
model=<span class="string">"gpt-4o"</span>,
|
| 209 |
+
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Extract from: 'Google was founded in 1998'"</span>}],
|
| 210 |
+
response_format=Entity
|
| 211 |
+
)
|
| 212 |
+
entity = completion.choices[<span class="number">0</span>].message.parsed <span class="comment"># Typed Entity object!</span></div>
|
| 213 |
+
|
| 214 |
+
<h3>5. Multi-Provider Pattern (Anthropic & Google)</h3>
|
| 215 |
+
<div class="code-block"><span class="comment"># ─── Anthropic (Claude) ───</span>
|
| 216 |
+
<span class="keyword">import</span> anthropic
|
| 217 |
+
|
| 218 |
+
claude = anthropic.Anthropic()
|
| 219 |
+
msg = claude.messages.create(
|
| 220 |
+
model=<span class="string">"claude-3-5-sonnet-20241022"</span>,
|
| 221 |
+
max_tokens=<span class="number">1024</span>,
|
| 222 |
+
system=<span class="string">"You are an expert ML engineer."</span>,
|
| 223 |
+
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Explain LoRA"</span>}]
|
| 224 |
+
)
|
| 225 |
+
<span class="function">print</span>(msg.content[<span class="number">0</span>].text)
|
| 226 |
+
|
| 227 |
+
<span class="comment"># ─── Google Gemini ───</span>
|
| 228 |
+
<span class="keyword">import</span> google.generativeai <span class="keyword">as</span> genai
|
| 229 |
+
|
| 230 |
+
genai.configure(api_key=<span class="string">"YOUR_KEY"</span>)
|
| 231 |
+
model = genai.GenerativeModel(<span class="string">"gemini-1.5-pro"</span>)
|
| 232 |
+
response = model.generate_content(<span class="string">"Explain transformers"</span>)
|
| 233 |
+
<span class="function">print</span>(response.text)</div>
|
| 234 |
+
|
| 235 |
+
<h3>6. Comparing Models Programmatically</h3>
|
| 236 |
+
<div class="code-block"><span class="keyword">import</span> time
|
| 237 |
+
|
| 238 |
+
<span class="keyword">def</span> <span class="function">benchmark_model</span>(model_name, prompt, client):
|
| 239 |
+
start = time.time()
|
| 240 |
+
resp = client.chat.completions.create(
|
| 241 |
+
model=model_name,
|
| 242 |
+
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: prompt}],
|
| 243 |
+
max_tokens=<span class="number">300</span>
|
| 244 |
+
)
|
| 245 |
+
elapsed = time.time() - start
|
| 246 |
+
tokens = resp.usage.total_tokens
|
| 247 |
+
<span class="keyword">return</span> {
|
| 248 |
+
<span class="string">"model"</span>: model_name,
|
| 249 |
+
<span class="string">"tokens"</span>: tokens,
|
| 250 |
+
<span class="string">"latency"</span>: <span class="string">f"{elapsed:.2f}s"</span>,
|
| 251 |
+
<span class="string">"tokens_per_sec"</span>: <span class="function">round</span>(resp.usage.completion_tokens / elapsed),
|
| 252 |
+
<span class="string">"response"</span>: resp.choices[<span class="number">0</span>].message.content[:<span class="number">100</span>]
|
| 253 |
+
}
|
| 254 |
+
|
| 255 |
+
<span class="comment"># Compare GPT-4o vs GPT-4o-mini</span>
|
| 256 |
+
prompt = <span class="string">"What is the capital of France? Explain its history in 3 sentences."</span>
|
| 257 |
+
<span class="keyword">for</span> model <span class="keyword">in</span> [<span class="string">"gpt-4o"</span>, <span class="string">"gpt-4o-mini"</span>]:
|
| 258 |
+
result = benchmark_model(model, prompt, client)
|
| 259 |
+
<span class="function">print</span>(result)</div>
|
| 260 |
+
</div>`,
|
| 261 |
interview: `
|
| 262 |
+
<div class="section">
|
| 263 |
+
<h2>🎯 LLM Fundamentals — In-Depth Interview Questions</h2>
|
| 264 |
+
<div class="interview-box"><strong>Q1: What happens when temperature = 0?</strong><p><strong>Answer:</strong> The model becomes <strong>deterministic</strong>, always picking the highest-probability token (greedy decoding). Use for tasks requiring consistency (code generation, extraction, classification). Side effects: can get stuck in repetitive loops; technically temperature=0 is greedy, temperature=1 is the trained distribution, above 1 is "hotter" (more random). For near-deterministic with slight randomness, use temperature=0.1 with top_p=1.</p></div>
|
| 265 |
+
<div class="interview-box"><strong>Q2: Why do LLMs hallucinate, and what are the solutions?</strong><p><strong>Answer:</strong> LLMs don't "know" facts — they model <strong>token probabilities</strong>. When asked about something rare or unseen, the model generates statistically plausible text rather than admitting ignorance. Solutions: (1) <strong>RAG</strong> — ground answers in retrieved documents; (2) <strong>Lower temperature</strong> — reduce sampling randomness; (3) <strong>Structured output forcing</strong> — constrain output to valid formats; (4) <strong>Self-consistency</strong> — sample N times, pick the majority answer; (5) <strong>Calibrated prompting</strong> — explicitly instruct "say I don't know if unsure."</p></div>
<div class="interview-box"><strong>Q3: What's the difference between context window and memory?</strong><p><strong>Answer:</strong> Context window is the tokens the model can process in a <strong>single inference pass</strong> — it's completely stateless. There is no persistent memory between API calls. "Memory" in frameworks like LangChain is implemented externally by: (1) storing past conversation turns in a database, (2) summarizing old turns to compress history, (3) reinserting relevant history into the new prompt. Every call is independent — the model has zero recall of previous calls.</p></div>
<div class="interview-box"><strong>Q4: What is RLHF vs DPO and which is better?</strong><p><strong>Answer:</strong> <strong>RLHF</strong>: Train a separate reward model on human preferences, then optimize the LLM policy via PPO (reinforcement learning). Complex, unstable, expensive. <strong>DPO</strong> (Direct Preference Optimization): Skip the reward model entirely — directly optimize from preference pairs using a closed-form solution. Simpler, more stable, cheaper. DPO is now preferred for open-source models (LLaMA 3, Gemma). RLHF is still used by OpenAI/Anthropic where they have massive human labeling infrastructure.</p></div>
<div class="interview-box"><strong>Q5: What is the "Lost in the Middle" phenomenon?</strong><p><strong>Answer:</strong> Research by Liu et al. (2023) showed that LLMs perform significantly worse when relevant information is placed in the <strong>middle</strong> of long contexts compared to the beginning or end. The model's attention mechanism attends most to recent tokens (recency bias) and the very first tokens (primacy bias). Practical implication: place the most critical context at the <strong>start or end</strong> of your prompt, never buried in the middle of a long document.</p></div>
<div class="interview-box"><strong>Q6: Explain the difference between GPT-4o and o1/o3 models.</strong><p><strong>Answer:</strong> GPT-4o is a standard auto-regressive LLM — it generates tokens left-to-right in one pass. o1/o3 are <strong>reasoning models</strong> that use "chain-of-thought before answering" — they generate internal reasoning tokens (hidden from the user) before producing the final answer. This makes them dramatically better at math, logic, and coding, but 3-10x slower and more expensive. Use GPT-4o for speed-sensitive tasks (chat, extraction), o1/o3 for complex reasoning (math proofs, hard coding, multi-step analysis).</p></div>
</div>`
},
'transformers': {
concepts: `
<div class="section">
<h2>🔗 Transformer Architecture — Complete Deep Dive</h2>
<div class="info-box">
<div class="box-title">⚡ "Attention Is All You Need" (Vaswani et al., 2017)</div>
<div class="box-content">The Transformer replaced RNNs with pure attention mechanisms. The key insight: instead of processing tokens sequentially, process all tokens <strong>in parallel</strong>, computing relevance scores between every pair. This enabled massive parallelization on GPUs and is the foundation of every modern LLM, from GPT-4 to LLaMA to Gemini.</div>
</div>

<h3>1. Self-Attention — The Core Mechanism</h3>
<p>For each token, compute 3 vectors via learned linear projections: <strong>Query (Q)</strong>, <strong>Key (K)</strong>, <strong>Value (V)</strong>. The attention formula:</p>
<div class="formula">Attention(Q, K, V) = softmax(QK<sup>T</sup> / √d<sub>k</sub>) × V</div>
<table>
<tr><th>Component</th><th>Role</th><th>Analogy</th></tr>
<tr><td><strong>Query (Q)</strong></td><td>What this token is looking for</td><td>Search query on Google</td></tr>
<tr><td><strong>Key (K)</strong></td><td>What each token offers/advertises</td><td>Page titles in a search index</td></tr>
<tr><td><strong>Value (V)</strong></td><td>Actual content to retrieve</td><td>Page content returned</td></tr>
<tr><td><strong>√d<sub>k</sub> scaling</strong></td><td>Prevents softmax saturation for large dims</td><td>Normalization for numerical stability</td></tr>
</table>
<p><strong>Why it works:</strong> Self-attention lets every token attend to every other token in O(1) hops (vs O(n) for RNNs). In "The cat sat on the <strong>mat</strong>", the word "mat" can directly attend to "cat" and "sat" to understand context, without information passing through intermediate words.</p>
<h3>2. Multi-Head Attention (MHA)</h3>
<p>Run <strong>h independent attention heads</strong> in parallel, each learning different relationship types, then concatenate their outputs and project. GPT-4 likely uses ~96 heads. Each head specializes: head 1 may track subject-verb agreement, head 2 pronoun coreference, head 3 positional patterns.</p>
<div class="formula">MultiHead(Q,K,V) = Concat(head<sub>1</sub>, ..., head<sub>h</sub>) × W<sub>O</sub><br>where head<sub>i</sub> = Attention(QW<sub>i</sub><sup>Q</sup>, KW<sub>i</sub><sup>K</sup>, VW<sub>i</sub><sup>V</sup>)</div>

<h3>3. Modern Attention Variants</h3>
<table>
<tr><th>Variant</th><th>Key Idea</th><th>Used By</th><th>KV Cache Savings</th></tr>
<tr><td><strong>MHA</strong> (Multi-Head)</td><td>Separate Q, K, V per head</td><td>GPT-2, BERT</td><td>1x (baseline)</td></tr>
<tr><td><strong>GQA</strong> (Grouped Query)</td><td>Share K,V across groups of Q heads</td><td>LLaMA 3, Gemma, Mistral</td><td>4-8x smaller</td></tr>
<tr><td><strong>MQA</strong> (Multi-Query)</td><td>Single K,V shared across ALL Q heads</td><td>PaLM, Falcon</td><td>32-96x smaller</td></tr>
<tr><td><strong>Sliding Window</strong></td><td>Attend only to nearby tokens (window)</td><td>Mistral, Mixtral</td><td>Fixed memory regardless of length</td></tr>
</table>
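The GQA row above can be made concrete with a NumPy sketch (illustrative, not any library's implementation): fewer K/V heads are stored in the cache and simply repeated to serve groups of query heads.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """GQA sketch: n_q query heads share n_kv (< n_q) K/V heads.
    Q: (n_q, T, d); K, V: (n_kv, T, d) with n_q % n_kv == 0."""
    n_q, n_kv = Q.shape[0], K.shape[0]
    group = n_q // n_kv
    K = np.repeat(K, group, axis=0)   # each cached K/V head serves `group` query heads
    V = np.repeat(V, group, axis=0)
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

Q = np.random.randn(8, 4, 16)   # 8 query heads
K = np.random.randn(2, 4, 16)   # only 2 KV heads cached -> 4x smaller KV cache
V = np.random.randn(2, 4, 16)
print(grouped_query_attention(Q, K, V).shape)  # (8, 4, 16)
```

Only the 2 K/V heads ever need to live in the KV cache; the repetition is free relative to storing 8 full heads.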
<h3>4. Positional Encoding (RoPE)</h3>
<p><strong>RoPE (Rotary Position Embedding)</strong> encodes position by rotating Q and K vectors in complex space. Advantages: (1) Relative position naturally emerges from dot products, (2) Enables length extrapolation beyond training length (with techniques like YaRN, NTK-aware scaling), (3) No additional parameters. Used by LLaMA, Mistral, Gemma, Qwen — virtually all modern open LLMs.</p>
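A simplified RoPE sketch (pairing dimensions as 2D planes, base `theta=10000` as in common implementations) that demonstrates the key property numerically: the Q·K score depends only on the relative offset between positions, not on their absolute values.

```python
import numpy as np

def rope_rotate(vec, pos, theta=10000.0):
    """RoPE sketch for one head of even dim d: rotate consecutive
    (even, odd) pairs by a position-dependent angle per frequency."""
    d = vec.shape[-1]
    half = d // 2
    freqs = theta ** (-np.arange(half) * 2.0 / d)  # one frequency per 2D plane
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin                # standard 2D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8)
k = np.random.randn(8)
# Same relative offset (2) at very different absolute positions -> same score
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)
s2 = rope_rotate(q, 105) @ rope_rotate(k, 103)
print(np.isclose(s1, s2))  # True
```

This relative-position property is exactly why RoPE extrapolates better than absolute positional encodings.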
<h3>5. Transformer Block Architecture</h3>
<p>Each transformer block has: (1) <strong>Multi-Head Attention</strong> → (2) <strong>Residual Connection + LayerNorm</strong> → (3) <strong>Feed-Forward Network (FFN)</strong> with hidden dim 4x model dim → (4) <strong>Residual Connection + LayerNorm</strong>. Modern models use <strong>Pre-LayerNorm</strong> (normalize before attention, not after) and <strong>SwiGLU</strong> activation in the FFN instead of ReLU for better performance.</p>
<h3>6. FlashAttention — Memory-Efficient Attention</h3>
<p>Standard attention requires O(n²) memory for the attention matrix. <strong>FlashAttention</strong> (Dao et al., 2022) computes exact attention without materializing the full matrix by using tiling and kernel fusion. Result: 2-4x faster inference and roughly 20x less memory for long sequences. FlashAttention 2 adds further optimizations. Essential for context windows >8K tokens.</p>
<h3>7. Mixture-of-Experts (MoE)</h3>
<p>Instead of one massive FFN, use <strong>N expert FFNs</strong> and a router that selects top-k experts per token. Only the selected experts are activated (sparse computation). Mixtral 8x7B has 8 experts and activates 2 per token — 47B total params but only 13B active per token. Result: 3-4x more efficient than dense models of the same quality.</p>
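A toy top-2 router in NumPy (illustrative names; real MoE layers add load-balancing losses and capacity limits on top of this):

```python
import numpy as np

def top2_router(x, W_router):
    """MoE router sketch: pick the top-2 experts per token and
    softmax over just their logits to get mixing weights."""
    logits = x @ W_router                       # (tokens, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -2:]  # indices of the 2 best experts
    gate_logits = np.take_along_axis(logits, top2, axis=-1)
    g = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    gates = g / g.sum(-1, keepdims=True)
    return top2, gates                          # only these 2 experts run per token

x = np.random.randn(4, 16)          # 4 tokens
W = np.random.randn(16, 8)          # router for 8 experts (Mixtral-style)
experts, gates = top2_router(x, W)
print(experts.shape, gates.shape)   # (4, 2) (4, 2)
```

Each token's output would be `gates[t, 0] * expert_a(x[t]) + gates[t, 1] * expert_b(x[t])`; the other 6 expert FFNs are skipped entirely for that token.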
<h3>8. Decoder-Only vs Encoder-Decoder</h3>
<table>
<tr><th>Architecture</th><th>Attention Type</th><th>Best For</th><th>Examples</th></tr>
<tr><td><strong>Decoder-Only</strong></td><td>Causal (left-to-right only)</td><td>Text generation, chat, code</td><td>GPT-4, LLaMA, Gemma, Mistral</td></tr>
<tr><td><strong>Encoder-Only</strong></td><td>Bidirectional (sees all tokens)</td><td>Classification, NER, embeddings</td><td>BERT, RoBERTa, DeBERTa</td></tr>
<tr><td><strong>Encoder-Decoder</strong></td><td>Encoder bidirectional, decoder causal</td><td>Translation, summarization</td><td>T5, BART, mT5, Flan-T5</td></tr>
</table>
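The causal attention used by decoder-only models can be demonstrated with a toy mask (NumPy sketch): zero out every position above the diagonal before the softmax, so token i can never attend to tokens after it.

```python
import numpy as np

T = 5
scores = np.random.randn(T, T)             # raw attention scores for 5 tokens
mask = np.tril(np.ones((T, T)))            # lower triangle = allowed positions
masked = np.where(mask == 0, -1e9, scores) # -inf-like score for future tokens
w = np.exp(masked - masked.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)              # row-wise softmax
print(np.allclose(np.triu(w, k=1), 0))     # True: no attention to the future
```

Encoder-only models simply skip this mask, which is why BERT sees both directions but cannot generate text left-to-right.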
</div>`,
code: `
<div class="section">
<h2>💻 Transformer Architecture — Code Examples</h2>

<h3>1. Self-Attention from Scratch (NumPy)</h3>
<div class="code-block"><span class="keyword">import</span> numpy <span class="keyword">as</span> np
|
| 332 |
|
| 333 |
<span class="keyword">def</span> <span class="function">scaled_dot_product_attention</span>(Q, K, V, mask=<span class="keyword">None</span>):
    d_k = Q.shape[-<span class="number">1</span>]
    scores = Q @ K.T / np.sqrt(d_k)
    <span class="keyword">if</span> mask <span class="keyword">is not None</span>:
        scores = np.where(mask == <span class="number">0</span>, -<span class="number">1e9</span>, scores)
    weights = np.exp(scores - scores.max(axis=-<span class="number">1</span>, keepdims=<span class="keyword">True</span>))
    weights = weights / weights.sum(axis=-<span class="number">1</span>, keepdims=<span class="keyword">True</span>)
    <span class="keyword">return</span> weights @ V, weights

Q = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
K = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
V = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
output, attn_weights = scaled_dot_product_attention(Q, K, V)
<span class="function">print</span>(<span class="string">"Attention weights (each row sums to 1):"</span>)
<span class="function">print</span>(attn_weights)</div>

<h3>2. PyTorch Multi-Head Attention</h3>
<div class="code-block"><span class="keyword">import</span> torch
|
| 351 |
+
<span class="keyword">import</span> torch.nn <span class="keyword">as</span> nn
|
| 352 |
+
|
| 353 |
+
<span class="keyword">class</span> <span class="function">MultiHeadAttention</span>(nn.Module):
|
| 354 |
+
<span class="keyword">def</span> <span class="function">__init__</span>(self, d_model=<span class="number">512</span>, n_heads=<span class="number">8</span>):
|
| 355 |
+
<span class="keyword">super</span>().__init__()
|
| 356 |
+
self.n_heads = n_heads
|
| 357 |
+
self.d_k = d_model // n_heads
|
| 358 |
+
self.W_q = nn.Linear(d_model, d_model)
|
| 359 |
+
self.W_k = nn.Linear(d_model, d_model)
|
| 360 |
+
self.W_v = nn.Linear(d_model, d_model)
|
| 361 |
+
self.W_o = nn.Linear(d_model, d_model)
|
| 362 |
+
|
| 363 |
+
<span class="keyword">def</span> <span class="function">forward</span>(self, x, mask=<span class="keyword">None</span>):
|
| 364 |
+
B, T, C = x.shape
|
| 365 |
+
Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
|
| 366 |
+
K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
|
| 367 |
+
V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
|
| 368 |
+
|
| 369 |
+
scores = (Q @ K.transpose(-<span class="number">2</span>, -<span class="number">1</span>)) / (self.d_k ** <span class="number">0.5</span>)
|
| 370 |
+
<span class="keyword">if</span> mask <span class="keyword">is not None</span>:
|
| 371 |
+
scores = scores.masked_fill(mask == <span class="number">0</span>, -<span class="number">1e9</span>)
|
| 372 |
+
attn = torch.softmax(scores, dim=-<span class="number">1</span>)
|
| 373 |
+
out = (attn @ V).transpose(<span class="number">1</span>, <span class="number">2</span>).contiguous().view(B, T, C)
|
| 374 |
+
<span class="keyword">return</span> self.W_o(out)
|
| 375 |
+
|
| 376 |
+
<span class="comment"># Usage</span>
|
| 377 |
+
mha = MultiHeadAttention(d_model=<span class="number">512</span>, n_heads=<span class="number">8</span>)
|
| 378 |
+
x = torch.randn(<span class="number">2</span>, <span class="number">10</span>, <span class="number">512</span>) <span class="comment"># batch=2, seq=10, dim=512</span>
|
| 379 |
+
output = mha(x) <span class="comment"># (2, 10, 512)</span></div>
|
| 380 |
+
|
| 381 |
+
<h3>3. Inspecting Attention Patterns</h3>
|
| 382 |
<div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer
|
| 383 |
|
| 384 |
model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>, output_attentions=<span class="keyword">True</span>)
|
|
|
|
| 388 |
outputs = model(**inputs)
|
| 389 |
|
| 390 |
<span class="comment"># outputs.attentions: tuple of (batch, heads, seq, seq) per layer</span>
|
| 391 |
+
attn = outputs.attentions[<span class="number">0</span>] <span class="comment"># Layer 0: shape (1, 12, 6, 6)</span>
|
| 392 |
+
<span class="function">print</span>(<span class="string">f"Layers: {len(outputs.attentions)}, Heads: {attn.shape[1]}"</span>)
|
| 393 |
+
<span class="function">print</span>(<span class="string">f"Token 'the' attends most to: {attn[0, 0, -1].argmax()}"</span>)</div>
|
| 394 |
+
|
| 395 |
+
<h3>4. FlashAttention Usage</h3>
|
| 396 |
+
<div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM
|
| 397 |
+
<span class="keyword">import</span> torch
|
| 398 |
+
|
| 399 |
+
<span class="comment"># Enable FlashAttention 2 (requires compatible GPU)</span>
|
| 400 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 401 |
+
<span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
|
| 402 |
+
torch_dtype=torch.bfloat16,
|
| 403 |
+
attn_implementation=<span class="string">"flash_attention_2"</span>, <span class="comment"># 2-4x faster!</span>
|
| 404 |
+
device_map=<span class="string">"auto"</span>
|
| 405 |
+
)
|
| 406 |
+
|
| 407 |
+
<span class="comment"># Or use SDPA (PyTorch native, works everywhere)</span>
|
| 408 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 409 |
+
<span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
|
| 410 |
+
attn_implementation=<span class="string">"sdpa"</span>, <span class="comment"># Scaled Dot Product Attention</span>
|
| 411 |
+
device_map=<span class="string">"auto"</span>
|
| 412 |
+
)</div>
|
| 413 |
</div>`,
|
| 414 |
interview: `
<div class="section">
<h2>🎯 Transformer Architecture — In-Depth Interview Questions</h2>
<div class="interview-box"><strong>Q1: Why divide by √d_k in attention?</strong><p><strong>Answer:</strong> For large d_k, dot products grow large in magnitude (variance ≈ d_k), pushing softmax into regions with extremely small gradients (saturated). Dividing by √d_k normalizes variance to 1, keeping gradients healthy. Without it, training becomes unstable — same principle as Xavier/He weight initialization.</p></div>
<div class="interview-box"><strong>Q2: What is KV Cache and why is it critical for inference?</strong><p><strong>Answer:</strong> During autoregressive generation, Key and Value matrices for past tokens are <strong>cached in GPU memory</strong> so they don't need recomputation on each new token. Without KV cache: generating token n requires reprocessing all n-1 previous tokens — O(n²) total work. With KV cache: each new token only computes its own Q, K, V and attends to cached K, V — O(n) total. A 7B model with 8K context uses ~4GB just for KV cache. This is why <strong>GPU memory</strong> (not compute) is the real bottleneck for long-context inference.</p></div>
<div class="interview-box"><strong>Q3: What's the difference between MHA, GQA, and MQA?</strong><p><strong>Answer:</strong> <strong>MHA</strong> (Multi-Head Attention): Separate K,V per head — maximum expressivity but largest KV cache. <strong>GQA</strong> (Grouped Query Attention): K,V shared across groups of Q heads (e.g., 8 Q heads share 1 KV pair). 4-8x smaller KV cache with minimal quality loss. Used by LLaMA-3, Mistral, Gemma. <strong>MQA</strong> (Multi-Query Attention): ALL Q heads share a SINGLE K,V pair. Maximum KV cache savings (32-96x) but slightly lower quality. Used by PaLM, Falcon. Industry has settled on GQA as the best tradeoff.</p></div>
<div class="interview-box"><strong>Q4: What is RoPE and why did it replace sinusoidal encoding?</strong><p><strong>Answer:</strong> RoPE (Rotary Position Embedding) encodes position by <strong>rotating</strong> Q and K vectors in 2D complex planes. Key advantages: (1) Relative position naturally emerges from dot products — Attention(q_m, k_n) depends only on m-n, not absolute positions. (2) No additional learned parameters. (3) Better length generalization — techniques like YaRN and NTK-aware scaling allow extending context beyond training length. Sinusoidal encoding struggled with extrapolation and required absolute position awareness.</p></div>
<div class="interview-box"><strong>Q5: What is FlashAttention and how does it achieve speedup?</strong><p><strong>Answer:</strong> Standard attention materializes the full N×N attention matrix in GPU HBM (High Bandwidth Memory). FlashAttention uses <strong>tiling</strong> — it breaks Q, K, V into blocks, computes attention within SRAM (fast on-chip memory), and never writes the full attention matrix to HBM. This reduces memory IO from O(N²) to O(N²/M) where M is SRAM size. Result: exact same output, but 2-4x faster and uses O(N) memory instead of O(N²). It's a pure systems optimization — no approximation.</p></div>
<div class="interview-box"><strong>Q6: Explain Mixture-of-Experts (MoE) and its tradeoffs.</strong><p><strong>Answer:</strong> MoE replaces the single FFN in each transformer block with N parallel expert FFNs plus a learned router. For each token, the router selects top-k experts (usually k=2). Only selected experts are activated — the rest are skipped. <strong>Benefits:</strong> Train a model with 8x more parameters at ~2x the compute cost of a dense model. <strong>Tradeoffs:</strong> (1) All parameters must fit in memory even though only k are active. (2) Load balancing — if the router always picks the same experts, the others waste space; solved with an auxiliary loss. (3) Harder to fine-tune — expert specialization can be disrupted. Example: Mixtral 8x7B = 47B params but only 13B active per token.</p></div>
</div>`
},
'huggingface': {
concepts: `
<div class="section">
<h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
<div class="info-box">
<div class="box-title">⚡ The GitHub of AI</div>