AashishAIHub committed on
Commit 3be1eeb · 1 Parent(s): 4145e94

feat: Deep-dive expansion of LLM Fundamentals, Transformers, and Fine-Tuning modules

- LLM Fundamentals: 8 concept sections, 6 code examples, 6 interview Qs
- Transformers: MoE, FlashAttention, GQA/MQA, RoPE, PyTorch MHA code
- Fine-Tuning: QLoRA math, SFTTrainer, DPO training, adapter merging, 5 Qs

Files changed (1)
  1. GenAI-AgenticAI/app.js +327 -71
GenAI-AgenticAI/app.js CHANGED
@@ -19,46 +19,99 @@ const MODULE_CONTENT = {
  'llm-fundamentals': {
  concepts: `
  <div class="section">
- <h2>LLM Fundamentals — What Every Practitioner Must Know</h2>
- <h3>🧠 What is a Language Model?</h3>
  <div class="info-box">
  <div class="box-title">⚡ The Core Idea</div>
  <div class="box-content">
  A language model is a probability distribution over sequences of tokens: <strong>P(token_n | token_1, token_2, ..., token_n-1)</strong>. LLMs are trained to predict the next token. During inference, they sample repeatedly from this distribution to generate text. Everything — creativity, reasoning, hallucination — emerges from this single objective.
  </div>
  </div>
- <h3>Tokenization — The Hidden Layer</h3>
- <p>Text is never fed directly to an LLM. It's first converted to <strong>tokens</strong> (sub-word units) using algorithms like <strong>BPE (Byte-Pair Encoding)</strong> or <strong>SentencePiece</strong>. "unbelievable" might become ["un", "believ", "able"]. This matters because: (1) cost is per-token, (2) rare words split into many tokens, (3) code/math tokenize differently than prose.</p>
  <table>
- <tr><th>Parameter</th><th>What it controls</th><th>Typical range</th></tr>
- <tr><td>Temperature</td><td>Randomness of sampling (higher = more creative)</td><td>0.0 – 2.0</td></tr>
- <tr><td>Top-p (nucleus)</td><td>Cumulative probability cutoff for token candidates</td><td>0.7 – 1.0</td></tr>
- <tr><td>Top-k</td><td>Limit token candidates to k highest-probability</td><td>10 – 100</td></tr>
- <tr><td>Max tokens</td><td>Maximum generation length</td><td>256 – 128k</td></tr>
  </table>
- <h3>Context Window — The LLM's Working Memory</h3>
- <p>The context window is the total number of tokens an LLM can "see" at once (both input + output). GPT-4o: 128k tokens, Gemini 1.5 Pro: 2M tokens. <strong>Critical insight:</strong> performance degrades in the middle of very long contexts ("lost in the middle" phenomenon). Place the most important content at the start or end.</p>
- <h3>Pre-training vs Fine-tuning vs RLHF</h3>
- <div class="comparison">
- <div class="comparison-bad">
- <strong>Pre-training (Base Model)</strong><br>
- Trained on massive text corpus to predict next tokens. Knows everything but follows no instructions. Example: raw GPT-4, Llama-3.
- </div>
- <div class="comparison-good">
- <strong>Instruction-tuned (Chat Model)</strong><br>
- Fine-tuned on instruction-response pairs + RLHF to be helpful and follow directions. Example: GPT-4o, Llama-3-Instruct, Gemini.
- </div>
  </div>
  </div>`,
  code: `
  <div class="section">
- <h2>💻 LLM Fundamentals — Code Examples</h2>
- <h3>OpenAI API — Core Patterns</h3>
  <div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI

- client = OpenAI()

- <span class="comment"># Basic completion</span>
  response = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[
@@ -68,66 +121,213 @@ response = client.chat.completions.create(
  temperature=<span class="number">0.7</span>,
  max_tokens=<span class="number">512</span>
  )
- <span class="function">print</span>(response.choices[<span class="number">0</span>].message.content)</div>
- <h3>Streaming Responses</h3>
- <div class="code-block"><span class="comment"># Streaming for real-time output</span>
  stream = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Write a haiku about neural nets"</span>}],
  stream=<span class="keyword">True</span>
  )
  <span class="keyword">for</span> chunk <span class="keyword">in</span> stream:
- <span class="keyword">if</span> chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">is not None</span>:
- <span class="function">print</span>(chunk.choices[<span class="number">0</span>].delta.content, end=<span class="string">""</span>)</div>
- <h3>Token Counting</h3>
  <div class="code-block"><span class="keyword">import</span> tiktoken

  enc = tiktoken.encoding_for_model(<span class="string">"gpt-4o"</span>)
  text = <span class="string">"The transformer architecture changed everything."</span>
  tokens = enc.encode(text)
- <span class="function">print</span>(<span class="string">f"Token count: {len(tokens)}"</span>) <span class="comment"># 6 tokens</span>
- <span class="function">print</span>(<span class="string">f"Tokens: {[enc.decode([t]) for t in tokens]}"</span>)</div>
- </div>`,
  interview: `
- <div class="section">
- <h2>🎯 LLM Interview Questions</h2>
- <div class="interview-box"><strong>Q1: What happens when temperature = 0?</strong><p><strong>Answer:</strong> The model becomes <strong>deterministic</strong>, always picking the highest-probability token (greedy decoding). Use for tasks requiring consistency (e.g., code generation, extraction). Side effect: can get stuck in repetitive loops. Temperature = 1 is the trained distribution; above 1 is "hotter" (more random).</p></div>
- <div class="interview-box"><strong>Q2: Why do LLMs hallucinate?</strong><p><strong>Answer:</strong> LLMs don't "know" facts — they model <strong>token probabilities</strong>. When asked about something rare or unknown, the model generates statistically plausible-sounding text rather than saying "I don't know." Solutions: RAG (ground to real documents), lower temperature, structured output forcing, and calibrated uncertainty prompting.</p></div>
- <div class="interview-box"><strong>Q3: What's the difference between context window and memory?</strong><p><strong>Answer:</strong> Context window is the tokens the model can process in a <strong>single inference pass</strong> — it's stateless. There is no persistent memory between calls. "Memory" in frameworks like LangChain is implemented externally by storing past conversation turns in a database and reinserting them into the prompt.</p></div>
- <div class="interview-box"><strong>Q4: What is RLHF and why is it needed?</strong><p><strong>Answer:</strong> Reinforcement Learning from Human Feedback. A base model is fine-tuned to maximize a <strong>reward model</strong> trained on human preference rankings. Without it, the model is just a next-token predictor and won't follow instructions, refuse harmful requests, or be consistently helpful.</p></div>
- </div>`
  },
  'transformers': {
  concepts: `
  <div class="section">
- <h2>Transformer Architecture — The Engine of Modern AI</h2>
  <div class="info-box">
- <div class="box-title">⚡ "Attention Is All You Need" (2017)</div>
- <div class="box-content">Vaswani et al. replaced RNNs with pure attention mechanisms. The key insight: instead of processing tokens sequentially, process all tokens <strong>in parallel</strong>, computing relevance scores between every pair. This enabled massive parallelization on GPUs and is why we can train 100B+ parameter models.</div>
  </div>
- <h3>Self-Attention — The Core Mechanism</h3>
- <p>For each token, compute 3 vectors: <strong>Query (Q), Key (K), Value (V)</strong> via learned linear projections. Attention output = softmax(QKᵀ / √d_k) × V; the softmax weights answer: "how much should token i attend to token j?" The division by √d_k keeps dot products from growing too large, so the softmax doesn't saturate.</p>
  <table>
  <tr><th>Component</th><th>Role</th><th>Analogy</th></tr>
- <tr><td>Query (Q)</td><td>What this token is looking for</td><td>Search query</td></tr>
- <tr><td>Key (K)</td><td>What each token offers</td><td>Index key</td></tr>
- <tr><td>Value (V)</td><td>Actual content to retrieve</td><td>Document content</td></tr>
- <tr><td>Softmax(QKᵀ/√d)</td><td>Attention weights (sum to 1)</td><td>Relevance scores</td></tr>
  </table>
- <h3>Multi-Head Attention</h3>
- <p>Run h independent attention heads in parallel, each learning different types of relationships (syntax, semantics, coreference). Concatenate outputs and project. GPT-4 likely uses ~96 heads. Each head specializes: head 1 may track subject-verb agreement, head 2 may track pronoun references.</p>
- <h3>Positional Encoding</h3>
- <p>Attention has no notion of order (it's a set operation). Positional encodings inject position information. Original Transformers used sinusoidal functions. Modern LLMs use <strong>RoPE (Rotary Position Embedding)</strong> — LLaMA, Mistral, Gemma all use RoPE, which enables better length generalization.</p>
- <h3>Decoder-Only vs Encoder-Decoder</h3>
- <div class="comparison">
- <div class="comparison-bad"><strong>Decoder-Only (GPT-style)</strong><br>Causal (left-to-right) attention. Can only see past tokens. Optimized for text generation. Examples: GPT-4, LLaMA, Gemma, Mistral.</div>
- <div class="comparison-good"><strong>Encoder-Decoder (T5-style)</strong><br>Encoder sees full input. Decoder generates output attending to encoder. Better for seq2seq tasks (translation, summarization). Examples: T5, BART, mT5.</div>
- </div>
  </div>`,
  code: `
  <div class="section">
- <h2>💻 Transformer Architecture — Code</h2>
- <h3>Self-Attention from Scratch (NumPy)</h3>
  <div class="code-block"><span class="keyword">import</span> numpy <span class="keyword">as</span> np

  <span class="keyword">def</span> <span class="function">scaled_dot_product_attention</span>(Q, K, V, mask=<span class="keyword">None</span>):
@@ -143,8 +343,42 @@ Q = np.random.randn(<span class="number">3</span>, <span class="number">4</span>
  K = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  V = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  output, attn_weights = scaled_dot_product_attention(Q, K, V)
- <span class="function">print</span>(<span class="string">f"Output shape: {output.shape}"</span>) <span class="comment"># (3, 4)</span></div>
- <h3>Inspecting Attention with Hugging Face</h3>
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>, output_attentions=<span class="keyword">True</span>)
@@ -154,21 +388,43 @@ inputs = tokenizer(<span class="string">"The cat sat on the"</span>, return_tens
  outputs = model(**inputs)

  <span class="comment"># outputs.attentions: tuple of (batch, heads, seq, seq) per layer</span>
- attn_layer0 = outputs.attentions[<span class="number">0</span>] <span class="comment"># shape: (1, 12, 6, 6)</span>
- <span class="function">print</span>(<span class="string">f"Layers: {len(outputs.attentions)}, Heads: {attn_layer0.shape[1]}"</span>)</div>
  </div>`,
  interview: `
  <div class="section">
- <h2>🎯 Transformer Interview Questions</h2>
- <div class="interview-box"><strong>Q1: Why divide by √d_k in attention?</strong><p><strong>Answer:</strong> For large d_k, dot products grow large in magnitude, pushing softmax into regions with very small gradients (saturated). Dividing by √d_k keeps variance at 1, preventing this. It's the same principle as Xavier/He initialization in neural networks.</p></div>
- <div class="interview-box"><strong>Q2: What is KV Cache and why is it important?</strong><p><strong>Answer:</strong> During autoregressive generation, Key and Value matrices for past tokens are <strong>cached</strong> so they don't need to be recomputed on each new token. This reduces per-token computation from O(n²) to O(n). Without KV cache, inference would be ~100x slower. It's why GPU memory is the bottleneck for long context.</p></div>
- <div class="interview-box"><strong>Q3: What's the difference between MHA and GQA (Grouped Query Attention)?</strong><p><strong>Answer:</strong> Multi-Head Attention (MHA) has separate K,V for every head. Grouped Query Attention (GQA) shares K,V heads across groups of Q heads. This reduces KV cache memory by 4-8x with minimal quality loss. LLaMA-3, Mistral, Gemma all use GQA.</p></div>
- <div class="interview-box"><strong>Q4: What is RoPE and why is it better than sinusoidal?</strong><p><strong>Answer:</strong> Rotary Position Embedding encodes position by <strong>rotating</strong> the Q and K vectors in complex space. Key advantages: relative position naturally emerges from dot products, enables length extrapolation beyond training length (with tricks like YaRN), no additional parameters. Standard in all modern open-source LLMs.</p></div>
  </div>`
  },
  'huggingface': {
  concepts: `
- <div class="section">
  <h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
  <div class="info-box">
  <div class="box-title">⚡ The GitHub of AI</div>
  'llm-fundamentals': {
  concepts: `
  <div class="section">
+ <h2>🧠 LLM Fundamentals — Complete Deep Dive</h2>
  <div class="info-box">
  <div class="box-title">⚡ The Core Idea</div>
  <div class="box-content">
  A language model is a probability distribution over sequences of tokens: <strong>P(token_n | token_1, token_2, ..., token_n-1)</strong>. LLMs are trained to predict the next token. During inference, they sample repeatedly from this distribution to generate text. Everything — creativity, reasoning, hallucination — emerges from this single objective.
  </div>
  </div>
+
+ <h3>1. How Language Models Actually Work</h3>
+ <p>An LLM is fundamentally a <strong>next-token predictor</strong>. Given a sequence of tokens, it outputs a probability distribution over the entire vocabulary (~32K-128K tokens). The training objective is to minimize <strong>cross-entropy loss</strong> between the predicted distribution and the actual next token across billions of text examples. The model learns grammar, facts, reasoning patterns, and even code — all as statistical regularities in token sequences.</p>
+ <div class="callout insight">
+ <div class="callout-title">🔑 Key Insight: Emergent Abilities</div>
+ <p>Below ~10B parameters, models just predict tokens. Above ~50B, new abilities <strong>emerge</strong> that weren't explicitly trained: chain-of-thought reasoning, few-shot learning, code generation, translation between languages never paired in training data. This is why scale matters and is the foundation of the "scaling laws" (Chinchilla, Kaplan et al.).</p>
+ </div>
+
+ <h3>2. Tokenization — The Hidden Layer</h3>
+ <p>Text is never fed directly to an LLM. It's first converted to <strong>tokens</strong> (sub-word units). Understanding tokenization is critical because:</p>
  <table>
+ <tr><th>Aspect</th><th>Why It Matters</th><th>Example</th></tr>
+ <tr><td>Cost</td><td>API pricing is per-token, not per-word</td><td>"unbelievable" = 3 tokens = 3x cost vs 1 word</td></tr>
+ <tr><td>Context limits</td><td>128K tokens ≠ 128K words (~96K words)</td><td>1 token ≈ 0.75 English words on average</td></tr>
+ <tr><td>Non-English penalty</td><td>Languages like Hindi/Chinese use 2-3x more tokens per word</td><td>"नमस्ते" might be 6 tokens vs "hello" = 1 token</td></tr>
+ <tr><td>Code tokenization</td><td>Whitespace and syntax consume tokens</td><td>4 spaces of indentation = 1 token wasted per line</td></tr>
+ <tr><td>Number handling</td><td>Numbers tokenize unpredictably</td><td>"1234567" might split as ["123", "45", "67"] — why LLMs are bad at math</td></tr>
  </table>
+ <p><strong>Algorithms:</strong> BPE (GPT, LLaMA) — merges frequent byte pairs iteratively. WordPiece (BERT) — maximizes likelihood. SentencePiece/Unigram (T5) — statistical segmentation. Modern LLMs use vocabularies of 32K-128K tokens.</p>
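The BPE merge loop named above can be sketched in a few lines of plain Python — a toy illustration of the training algorithm, not any model's actual tokenizer:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_merge(tokens, num_merges):
    """Iteratively fuse the most frequent adjacent pair, as BPE training does."""
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Start from characters; frequent pairs fuse into sub-word units
corpus = list("unbelievable unbelievable unbearable")
print(bpe_merge(corpus, 10))
```

Repeated runs over a real corpus yield the merge table that production tokenizers apply at encode time.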
+
+ <h3>3. Inference Parameters — Controlling Output</h3>
+ <table>
+ <tr><th>Parameter</th><th>What it controls</th><th>Range</th><th>When to change</th></tr>
+ <tr><td><strong>Temperature</strong></td><td>Sharpens/flattens the probability distribution</td><td>0.0 – 2.0</td><td>0 for extraction/code, 0.7 for chat, 1.2+ for creative writing</td></tr>
+ <tr><td><strong>Top-p (nucleus)</strong></td><td>Cumulative probability cutoff — only consider tokens within top-p mass</td><td>0.7 – 1.0</td><td>Use 0.9 as default; lower for focused, higher for diverse</td></tr>
+ <tr><td><strong>Top-k</strong></td><td>Hard limit on candidate tokens</td><td>10 – 100</td><td>Rarely needed if using top-p; useful as safety net</td></tr>
+ <tr><td><strong>Frequency penalty</strong></td><td>Penalizes repeated tokens proportionally</td><td>0.0 – 2.0</td><td>Increase to reduce repetitive output</td></tr>
+ <tr><td><strong>Presence penalty</strong></td><td>Flat penalty for any repeated token</td><td>0.0 – 2.0</td><td>Increase to encourage topic diversity</td></tr>
+ <tr><td><strong>Max tokens</strong></td><td>Generation length limit</td><td>1 – 128K</td><td>Set to expected output length + margin; never use -1 for safety</td></tr>
+ <tr><td><strong>Stop sequences</strong></td><td>Strings that stop generation</td><td>Any text</td><td>Essential for structured output: stop at "}" for JSON</td></tr>
+ </table>
+ <div class="callout warning">
+ <div class="callout-title">⚠️ Common Mistake</div>
+ <p>Don't combine temperature=0 with top_p=0.1 — they interact. Use <strong>either</strong> temperature OR top-p for sampling control, not both. OpenAI recommends changing one and leaving the other at default.</p>
  </div>
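How temperature and top-p reshape a token distribution can be shown with toy probabilities — a minimal plain-Python sketch, illustrative only:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / T: T < 1 sharpens, T > 1 flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized candidates

logits = [2.0, 1.0, 0.5, 0.1]
sharp = apply_temperature(logits, 0.3)  # near-greedy: top token dominates
flat = apply_temperature(logits, 1.5)   # flatter: more randomness
print(top_p_filter(apply_temperature(logits, 1.0), 0.9))
```

Note that the nucleus filter runs on the already-temperature-scaled distribution, which is exactly why the two settings interact when changed together.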
+
+ <h3>4. Context Window — The LLM's Working Memory</h3>
+ <p>The context window determines how many tokens the model can process in a single call (input + output combined).</p>
+ <table>
+ <tr><th>Model</th><th>Context Window</th><th>Approx. Pages</th></tr>
+ <tr><td>GPT-4o</td><td>128K tokens</td><td>~200 pages</td></tr>
+ <tr><td>Claude 3.5 Sonnet</td><td>200K tokens</td><td>~350 pages</td></tr>
+ <tr><td>Gemini 1.5 Pro</td><td>2M tokens</td><td>~3,000 pages</td></tr>
+ <tr><td>LLaMA 3.1</td><td>128K tokens</td><td>~200 pages</td></tr>
+ <tr><td>Mistral Large</td><td>128K tokens</td><td>~200 pages</td></tr>
+ </table>
+ <p><strong>"Lost in the Middle"</strong> (Liu et al., 2023): Performance degrades for information placed in the middle of very long contexts. Models attend most to the <strong>beginning and end</strong> of prompts. Strategy: put the most important content at the start or end; use retrieval to avoid stuffing the entire context.</p>
+
+ <h3>5. Pre-training Pipeline</h3>
+ <p>Training an LLM from scratch involves: (1) <strong>Data collection</strong> — crawl the web (Common Crawl, ~1 trillion tokens), books, code (GitHub), conversations. (2) <strong>Data cleaning</strong> — deduplication, quality filtering, toxicity removal, PII scrubbing. (3) <strong>Tokenizer training</strong> — build BPE vocabulary from the corpus. (4) <strong>Pre-training</strong> — next-token prediction on massive GPU clusters (thousands of A100s/H100s for weeks). Cost: $2M-$100M+ for frontier models.</p>
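The deduplication step from the cleaning stage above can be sketched with simple content hashing — a toy exact-duplicate pass; production pipelines typically use MinHash/LSH to also catch near-duplicates:

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates by hashing normalized text; keeps first occurrence."""
    seen, kept = set(), []
    for doc in documents:
        # Normalize whitespace and case so trivial variants collide
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A dog ran."]
print(dedup_exact(docs))  # normalization collapses the first two
```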
+
+ <h3>6. Alignment: RLHF, DPO, and Constitutional AI</h3>
+ <p>A base model predicts tokens but doesn't follow instructions or refuse harmful content. Alignment methods bridge this gap:</p>
+ <table>
+ <tr><th>Method</th><th>How It Works</th><th>Used By</th></tr>
+ <tr><td><strong>SFT (Supervised Fine-Tuning)</strong></td><td>Train on (instruction, response) pairs from human annotators</td><td>All models (Step 1)</td></tr>
+ <tr><td><strong>RLHF</strong></td><td>Train a reward model on human preferences, then optimize policy via PPO</td><td>GPT-4, Claude (early)</td></tr>
+ <tr><td><strong>DPO (Direct Preference Optimization)</strong></td><td>Skip the reward model — directly optimize from preference pairs, simpler and more stable</td><td>LLaMA 3, Zephyr, Gemma</td></tr>
+ <tr><td><strong>Constitutional AI</strong></td><td>Model critiques and revises its own outputs against a set of principles</td><td>Claude (Anthropic)</td></tr>
+ <tr><td><strong>RLAIF</strong></td><td>Use an AI model (not humans) to generate preference data</td><td>Gemini, some open models</td></tr>
+ </table>
+
+ <h3>7. The Modern LLM Landscape (2024-2025)</h3>
+ <table>
+ <tr><th>Provider</th><th>Flagship Model</th><th>Strengths</th><th>Best For</th></tr>
+ <tr><td>OpenAI</td><td>GPT-4o, o1, o3</td><td>Best all-around, strong coding, reasoning chains (o1/o3)</td><td>General purpose, production</td></tr>
+ <tr><td>Anthropic</td><td>Claude 3.5 Sonnet</td><td>Best for long documents, coding, safety-conscious</td><td>Enterprise, agents, analysis</td></tr>
+ <tr><td>Google</td><td>Gemini 1.5 Pro/2.0</td><td>Massive context (2M), multi-modal, grounding</td><td>Document processing, multi-modal</td></tr>
+ <tr><td>Meta</td><td>LLaMA 3.1/3.2</td><td>Best open-source, fine-tunable, commercially free</td><td>Self-hosting, fine-tuning</td></tr>
+ <tr><td>Mistral</td><td>Mistral Large, Mixtral</td><td>Strong open models, MoE efficiency</td><td>European market, cost-effective</td></tr>
+ <tr><td>DeepSeek</td><td>DeepSeek V3, R1</td><td>Exceptional reasoning, competitive with o1</td><td>Math, coding, research</td></tr>
+ </table>
+
+ <h3>8. Scaling Laws — Why Bigger Models Get Smarter</h3>
+ <p><strong>Chinchilla Scaling Law</strong> (Hoffmann et al., 2022): For a compute-optimal model, training tokens should scale proportionally to model parameters. A 70B model should be trained on ~1.4 trillion tokens. Key insight: many earlier models were <strong>undertrained</strong> (too many params, not enough data). LLaMA showed smaller, well-trained models can match larger undertrained ones.</p>
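The Chinchilla proportionality above works out to roughly 20 training tokens per parameter (consistent with the 70B → ~1.4T example), which makes for a quick back-of-envelope helper — a heuristic sketch, not the paper's full compute-allocation formula:

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal training tokens under the ~20 tokens/parameter heuristic."""
    return 20 * n_params

for params in (7e9, 70e9):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens")
```

Running it reproduces the 70B → 1.40T figure and suggests a 7B model wants ~0.14T (140B) tokens for compute-optimal training.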
  </div>`,
  code: `
  <div class="section">
+ <h2>💻 LLM Fundamentals — Comprehensive Code Examples</h2>
+
+ <h3>1. OpenAI API — Complete Patterns</h3>
  <div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI

+ client = OpenAI() <span class="comment"># Uses OPENAI_API_KEY env var</span>

+ <span class="comment"># ─── Basic Chat Completion ───</span>
  response = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[

  temperature=<span class="number">0.7</span>,
  max_tokens=<span class="number">512</span>
  )
+ <span class="function">print</span>(response.choices[<span class="number">0</span>].message.content)
+
+ <span class="comment"># ─── Multi-turn Conversation ───</span>
+ messages = [
+ {<span class="string">"role"</span>: <span class="string">"system"</span>, <span class="string">"content"</span>: <span class="string">"You are a Python tutor."</span>},
+ {<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"What is a decorator?"</span>},
+ {<span class="string">"role"</span>: <span class="string">"assistant"</span>, <span class="string">"content"</span>: <span class="string">"A decorator is a function that wraps another function..."</span>},
+ {<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Show me an example with arguments."</span>}
+ ]
+ resp = client.chat.completions.create(model=<span class="string">"gpt-4o"</span>, messages=messages)</div>
+
+ <h3>2. Streaming Responses</h3>
+ <div class="code-block"><span class="comment"># Streaming for real-time output — essential for UX</span>
  stream = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Write a haiku about neural nets"</span>}],
  stream=<span class="keyword">True</span>
  )
+
+ full_response = <span class="string">""</span>
  <span class="keyword">for</span> chunk <span class="keyword">in</span> stream:
+ token = chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">or</span> <span class="string">""</span>
+ full_response += token
+ <span class="function">print</span>(token, end=<span class="string">""</span>, flush=<span class="keyword">True</span>)
+
+ <span class="comment"># Async streaming (for FastAPI/web apps)</span>
+ <span class="keyword">async def</span> <span class="function">stream_chat</span>(prompt):
+ <span class="keyword">from</span> openai <span class="keyword">import</span> AsyncOpenAI
+ aclient = AsyncOpenAI()
+ stream = <span class="keyword">await</span> aclient.chat.completions.create(
+ model=<span class="string">"gpt-4o"</span>, stream=<span class="keyword">True</span>,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: prompt}]
+ )
+ <span class="keyword">async for</span> chunk <span class="keyword">in</span> stream:
+ <span class="keyword">yield</span> chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">or</span> <span class="string">""</span></div>
+
+ <h3>3. Token Counting & Cost Estimation</h3>
  <div class="code-block"><span class="keyword">import</span> tiktoken

+ <span class="comment"># Count tokens for any model</span>
  enc = tiktoken.encoding_for_model(<span class="string">"gpt-4o"</span>)
  text = <span class="string">"The transformer architecture changed everything."</span>
  tokens = enc.encode(text)
+ <span class="function">print</span>(<span class="string">f"Token count: {len(tokens)}"</span>)
+ <span class="function">print</span>(<span class="string">f"Tokens: {[enc.decode([t]) for t in tokens]}"</span>)
+
+ <span class="comment"># Cost estimation helper</span>
+ <span class="keyword">def</span> <span class="function">estimate_cost</span>(text, model=<span class="string">"gpt-4o"</span>):
+ enc = tiktoken.encoding_for_model(model)  <span class="comment"># note: raises KeyError for non-OpenAI models</span>
+ token_count = <span class="function">len</span>(enc.encode(text))
+ prices = {
+ <span class="string">"gpt-4o"</span>: (<span class="number">2.50</span>, <span class="number">10.00</span>), <span class="comment"># (input, output) per 1M tokens</span>
+ <span class="string">"gpt-4o-mini"</span>: (<span class="number">0.15</span>, <span class="number">0.60</span>),
+ <span class="string">"claude-3-5-sonnet"</span>: (<span class="number">3.00</span>, <span class="number">15.00</span>),
+ }
+ input_price = prices.get(model, (<span class="number">1</span>, <span class="number">1</span>))[<span class="number">0</span>]
+ cost = (token_count / <span class="number">1_000_000</span>) * input_price
+ <span class="keyword">return</span> <span class="string">f"{token_count} tokens = \${cost:.4f}"</span>
+
+ <span class="function">print</span>(estimate_cost(<span class="string">"Explain AI in 500 words"</span>))</div>
+
+ <h3>4. Structured Output (JSON Mode)</h3>
+ <div class="code-block"><span class="comment"># Force JSON output — essential for pipelines</span>
+ response = client.chat.completions.create(
+ model=<span class="string">"gpt-4o"</span>,
+ response_format={<span class="string">"type"</span>: <span class="string">"json_object"</span>},
+ messages=[{
+ <span class="string">"role"</span>: <span class="string">"user"</span>,
+ <span class="string">"content"</span>: <span class="string">"Extract entities from: 'Elon Musk founded SpaceX in 2002'. Return JSON with fields: persons, orgs, dates."</span>
+ }]
+ )
+ <span class="keyword">import</span> json
+ data = json.loads(response.choices[<span class="number">0</span>].message.content)
+ <span class="function">print</span>(data) <span class="comment"># {"persons": ["Elon Musk"], "orgs": ["SpaceX"], "dates": ["2002"]}</span>
+
+ <span class="comment"># Pydantic structured output (newest API)</span>
+ <span class="keyword">from</span> pydantic <span class="keyword">import</span> BaseModel
+
+ <span class="keyword">class</span> <span class="function">Entity</span>(BaseModel):
+ persons: list[str]
+ organizations: list[str]
+ dates: list[str]
+
+ completion = client.beta.chat.completions.parse(
+ model=<span class="string">"gpt-4o"</span>,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Extract from: 'Google was founded in 1998'"</span>}],
+ response_format=Entity
+ )
+ entity = completion.choices[<span class="number">0</span>].message.parsed <span class="comment"># Typed Entity object!</span></div>
+
+ <h3>5. Multi-Provider Pattern (Anthropic & Google)</h3>
+ <div class="code-block"><span class="comment"># ─── Anthropic (Claude) ───</span>
+ <span class="keyword">import</span> anthropic
+
+ claude = anthropic.Anthropic()
+ msg = claude.messages.create(
+ model=<span class="string">"claude-3-5-sonnet-20241022"</span>,
+ max_tokens=<span class="number">1024</span>,
+ system=<span class="string">"You are an expert ML engineer."</span>,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Explain LoRA"</span>}]
+ )
+ <span class="function">print</span>(msg.content[<span class="number">0</span>].text)
+
+ <span class="comment"># ─── Google Gemini ───</span>
+ <span class="keyword">import</span> google.generativeai <span class="keyword">as</span> genai
+
+ genai.configure(api_key=<span class="string">"YOUR_KEY"</span>)
+ model = genai.GenerativeModel(<span class="string">"gemini-1.5-pro"</span>)
+ response = model.generate_content(<span class="string">"Explain transformers"</span>)
+ <span class="function">print</span>(response.text)</div>
+
+ <h3>6. Comparing Models Programmatically</h3>
+ <div class="code-block"><span class="keyword">import</span> time
+
+ <span class="keyword">def</span> <span class="function">benchmark_model</span>(model_name, prompt, client):
+ start = time.time()
+ resp = client.chat.completions.create(
+ model=model_name,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: prompt}],
+ max_tokens=<span class="number">300</span>
+ )
+ elapsed = time.time() - start
+ tokens = resp.usage.total_tokens
+ <span class="keyword">return</span> {
+ <span class="string">"model"</span>: model_name,
+ <span class="string">"tokens"</span>: tokens,
+ <span class="string">"latency"</span>: <span class="string">f"{elapsed:.2f}s"</span>,
+ <span class="string">"tokens_per_sec"</span>: <span class="function">round</span>(resp.usage.completion_tokens / elapsed),
+ <span class="string">"response"</span>: resp.choices[<span class="number">0</span>].message.content[:<span class="number">100</span>]
+ }
+
+ <span class="comment"># Compare GPT-4o vs GPT-4o-mini</span>
+ prompt = <span class="string">"What is the capital of France? Explain its history in 3 sentences."</span>
+ <span class="keyword">for</span> model <span class="keyword">in</span> [<span class="string">"gpt-4o"</span>, <span class="string">"gpt-4o-mini"</span>]:
+ result = benchmark_model(model, prompt, client)
+ <span class="function">print</span>(result)</div>
+ </div>`,
261
  interview: `
+ <div class="section">
+ <h2>🎯 LLM Fundamentals — In-Depth Interview Questions</h2>
+ <div class="interview-box"><strong>Q1: What happens when temperature = 0?</strong><p><strong>Answer:</strong> The model becomes <strong>deterministic</strong>, always picking the highest-probability token (greedy decoding). Use it for tasks requiring consistency (code generation, extraction, classification). Side effect: greedy decoding can get stuck in repetitive loops. As a scale: temperature=0 is greedy, temperature=1 samples the model's trained distribution, and values above 1 are "hotter" (more random). For near-deterministic output with slight randomness, use temperature=0.1 with top_p=1.</p></div>
+ <div class="interview-box"><strong>Q2: Why do LLMs hallucinate, and what are the solutions?</strong><p><strong>Answer:</strong> LLMs don't "know" facts — they model <strong>token probabilities</strong>. When asked about something rare or unseen, the model generates statistically plausible text rather than admitting ignorance. Solutions: (1) <strong>RAG</strong> — ground answers in retrieved documents; (2) <strong>Lower temperature</strong> — reduce sampling randomness; (3) <strong>Structured output forcing</strong> — constrain output to valid formats; (4) <strong>Self-consistency</strong> — sample N times, pick the majority answer; (5) <strong>Calibrated prompting</strong> — explicitly instruct "say I don't know if unsure."</p></div>
+ <div class="interview-box"><strong>Q3: What's the difference between context window and memory?</strong><p><strong>Answer:</strong> Context window is the tokens the model can process in a <strong>single inference pass</strong> — it's completely stateless. There is no persistent memory between API calls. "Memory" in frameworks like LangChain is implemented externally by: (1) storing past conversation turns in a database, (2) summarizing old turns to compress history, (3) reinserting relevant history into the new prompt. Every call is independent — the model has zero recall of previous calls.</p></div>
+ <div class="interview-box"><strong>Q4: What is RLHF vs DPO and which is better?</strong><p><strong>Answer:</strong> <strong>RLHF</strong>: Train a separate reward model on human preferences, then optimize the LLM policy via PPO (reinforcement learning). Complex, unstable, expensive. <strong>DPO</strong> (Direct Preference Optimization): Skip the reward model entirely — directly optimize from preference pairs using a closed-form solution. Simpler, more stable, cheaper. DPO is now preferred for open-source models (LLaMA 3, Gemma). RLHF is still used by OpenAI/Anthropic where they have massive human labeling infrastructure.</p></div>
+ <div class="interview-box"><strong>Q5: What is the "Lost in the Middle" phenomenon?</strong><p><strong>Answer:</strong> Research by Liu et al. (2023) showed that LLMs perform significantly worse when relevant information is placed in the <strong>middle</strong> of long contexts compared to the beginning or end. The model's attention mechanism attends most to recent tokens (recency bias) and the very first tokens (primacy bias). Practical implication: place the most critical context at the <strong>start or end</strong> of your prompt, never buried in the middle of a long document.</p></div>
+ <div class="interview-box"><strong>Q6: Explain the difference between GPT-4o and o1/o3 models.</strong><p><strong>Answer:</strong> GPT-4o is a standard auto-regressive LLM — generates tokens left-to-right in one pass. o1/o3 are <strong>reasoning models</strong> that use "chain-of-thought before answering" — they generate internal reasoning tokens (hidden from the user) before producing the final answer. This makes them dramatically better at math, logic, and coding, but 3-10x slower and more expensive. Use GPT-4o for speed-sensitive tasks (chat, extraction), o1/o3 for complex reasoning (math proofs, hard coding, multi-step analysis).</p></div>
+ </div>`
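The greedy-vs-sampling behaviour described in Q1 can be demonstrated by applying temperature to a toy logit vector before softmax. This is a minimal NumPy sketch; the logit values are made up for illustration:

```python
import numpy as np

def sample_dist(logits, temperature):
    """Token distribution after temperature scaling (temperature=0 -> greedy)."""
    if temperature == 0:
        p = np.zeros_like(logits)
        p[np.argmax(logits)] = 1.0   # greedy: all probability mass on the top token
        return p
    z = logits / temperature
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # toy next-token logits
for t in [0, 0.5, 1.0, 2.0]:
    print(t, np.round(sample_dist(logits, t), 3))
```

Higher temperature flattens the distribution (more randomness); temperature 0 collapses it onto the argmax.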
  },
  'transformers': {
  concepts: `
  <div class="section">
+ <h2>🔗 Transformer Architecture — Complete Deep Dive</h2>
  <div class="info-box">
+ <div class="box-title">⚡ "Attention Is All You Need" (Vaswani et al., 2017)</div>
+ <div class="box-content">The Transformer replaced RNNs with pure attention mechanisms. The key insight: instead of processing tokens sequentially, process all tokens <strong>in parallel</strong>, computing relevance scores between every pair. This enabled massive parallelization on GPUs and is the foundation of every modern LLM, from GPT-4 to LLaMA to Gemini.</div>
  </div>
+
+ <h3>1. Self-Attention — The Core Mechanism</h3>
+ <p>For each token, compute 3 vectors via learned linear projections: <strong>Query (Q)</strong>, <strong>Key (K)</strong>, <strong>Value (V)</strong>. The attention formula:</p>
+ <div class="formula">Attention(Q, K, V) = softmax(QK<sup>T</sup> / √d<sub>k</sub>) × V</div>
  <table>
  <tr><th>Component</th><th>Role</th><th>Analogy</th></tr>
+ <tr><td><strong>Query (Q)</strong></td><td>What this token is looking for</td><td>Search query on Google</td></tr>
+ <tr><td><strong>Key (K)</strong></td><td>What each token offers/advertises</td><td>Page titles in search index</td></tr>
+ <tr><td><strong>Value (V)</strong></td><td>Actual content to retrieve</td><td>Page content returned</td></tr>
+ <tr><td><strong>√d<sub>k</sub> scaling</strong></td><td>Prevents softmax saturation for large dims</td><td>Normalization for numerical stability</td></tr>
+ </table>
+ <p><strong>Why it works:</strong> Self-attention lets every token attend to every other token in O(1) hops (vs O(n) for RNNs). "The cat sat on the <strong>mat</strong>" — the word "mat" can directly attend to "cat" and "sat" to understand context, without information passing through intermediate words.</p>
+
+ <h3>2. Multi-Head Attention (MHA)</h3>
+ <p>Run <strong>h independent attention heads</strong> in parallel, each learning different relationship types. Concatenate outputs and project. GPT-4 likely uses ~96 heads. Each head specializes: head 1 may track subject-verb agreement, head 2 may track pronoun coreference, head 3 may track positional patterns.</p>
+ <div class="formula">MultiHead(Q,K,V) = Concat(head<sub>1</sub>, ..., head<sub>h</sub>) × W<sub>O</sub><br>where head<sub>i</sub> = Attention(QW<sub>i</sub><sup>Q</sup>, KW<sub>i</sub><sup>K</sup>, VW<sub>i</sub><sup>V</sup>)</div>
+
+ <h3>3. Modern Attention Variants</h3>
+ <table>
+ <tr><th>Variant</th><th>Key Idea</th><th>Used By</th><th>KV Cache Savings</th></tr>
+ <tr><td><strong>MHA</strong> (Multi-Head)</td><td>Separate Q, K, V per head</td><td>GPT-2, BERT</td><td>1x (baseline)</td></tr>
+ <tr><td><strong>GQA</strong> (Grouped Query)</td><td>Share K,V across groups of Q heads</td><td>LLaMA 3, Gemma, Mistral</td><td>4-8x smaller</td></tr>
+ <tr><td><strong>MQA</strong> (Multi-Query)</td><td>Single K,V shared across ALL Q heads</td><td>PaLM, Falcon</td><td>32-96x smaller</td></tr>
+ <tr><td><strong>Sliding Window</strong></td><td>Attend only to nearby tokens (window)</td><td>Mistral, Mixtral</td><td>Fixed memory regardless of length</td></tr>
+ </table>
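The KV-cache arithmetic in the GQA row can be made concrete with shapes: K and V are stored with fewer heads than Q and broadcast across query-head groups at attention time, which is exactly where the cache savings come from. A minimal NumPy sketch with illustrative sizes (8 query heads, 2 KV heads; not taken from any specific model):

```python
import numpy as np

n_heads, n_kv_heads, seq, d_k = 8, 2, 16, 64
group = n_heads // n_kv_heads              # 4 query heads share each KV head

Q = np.random.randn(n_heads, seq, d_k)
K = np.random.randn(n_kv_heads, seq, d_k)  # KV cache is `group` times smaller than MHA
V = np.random.randn(n_kv_heads, seq, d_k)

# Broadcast each KV head to its group of query heads
K_exp = np.repeat(K, group, axis=0)        # (8, 16, 64)
V_exp = np.repeat(V, group, axis=0)

scores = Q @ K_exp.transpose(0, 2, 1) / np.sqrt(d_k)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)              # softmax over key positions
out = w @ V_exp
print(out.shape)  # (8, 16, 64): full 8-head output from a 2-head KV cache
```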
+
+ <h3>4. Positional Encoding (RoPE)</h3>
+ <p><strong>RoPE (Rotary Position Embedding)</strong> encodes position by rotating Q and K vectors in complex space. Advantages: (1) Relative position naturally emerges from dot products, (2) Enables length extrapolation beyond training length (with techniques like YaRN, NTK-aware scaling), (3) No additional parameters. Used by LLaMA, Mistral, Gemma, Qwen — virtually all modern open LLMs.</p>
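The relative-position property can be checked numerically: after rotating q and k by their positions, their dot product depends only on the offset m−n. A minimal NumPy sketch (the half-split pairing and base 10000 follow common practice but are illustrative, not any specific model's implementation):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate dim pairs of x by position-dependent angles (RoPE sketch)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)      # one frequency per dim pair
    angles = positions[:, None] * freqs[None, :]      # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Dot product depends only on the relative offset m - n
q = np.random.randn(1, 8)
k = np.random.randn(1, 8)
a = rope(q, np.array([5])) @ rope(k, np.array([3])).T       # offset 2
b = rope(q, np.array([105])) @ rope(k, np.array([103])).T   # offset 2, far away
print(np.allclose(a, b))  # True
```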
+
+ <h3>5. Transformer Block Architecture</h3>
+ <p>Each transformer block has: (1) <strong>Multi-Head Attention</strong> → (2) <strong>Residual Connection + LayerNorm</strong> → (3) <strong>Feed-Forward Network (FFN)</strong> with hidden dim 4x model dim → (4) <strong>Residual Connection + LayerNorm</strong>. Modern models use <strong>Pre-LayerNorm</strong> (normalize before attention, not after) and <strong>SwiGLU</strong> activation in FFN instead of ReLU for better performance.</p>
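The Pre-LayerNorm wiring can be sketched with the attention sublayer stubbed out (an identity stand-in) so the normalize-then-residual order is visible. The SwiGLU shapes and weights below are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU: silu(x W_gate) * (x W_up), then project back down
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))   # silu(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down

def pre_ln_block(x, attn, ffn):
    # Pre-LN: normalize BEFORE each sublayer, add the residual after
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

d, h = 8, 32                                  # illustrative sizes (hidden = 4x model dim)
rng = np.random.default_rng(0)
Wg, Wu, Wd = rng.normal(size=(d, h)), rng.normal(size=(d, h)), rng.normal(size=(h, d))
attn_stub = lambda x: x                       # stand-in for multi-head attention
ffn = lambda x: swiglu_ffn(x, Wg, Wu, Wd)
x = rng.normal(size=(4, d))                   # (seq, d_model)
y = pre_ln_block(x, attn_stub, ffn)
print(y.shape)  # (4, 8)
```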
+
+ <h3>6. FlashAttention — Memory-Efficient Attention</h3>
+ <p>Standard attention requires O(n²) memory for the attention matrix. <strong>FlashAttention</strong> (Dao et al., 2022) computes exact attention without materializing the full matrix by using tiling and kernel fusion. Result: 2-4x faster inference, ~20x less memory for long sequences. FlashAttention 2 adds further optimizations. Essential for context windows &gt;8K tokens.</p>
+
+ <h3>7. Mixture-of-Experts (MoE)</h3>
+ <p>Instead of one massive FFN, use <strong>N expert FFNs</strong> and a router that selects top-k experts per token. Only selected experts are activated (sparse computation). Mixtral 8x7B has 8 experts, activates 2 per token — 47B total params but only 13B active per token. Result: 3-4x more efficient than dense models of same quality.</p>
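Top-k routing can be sketched in a few lines. This is a toy dense-loop version, not Mixtral's batched implementation: the expert FFNs use ReLU for brevity and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

W_router = rng.normal(size=(d, n_experts))
# Each expert is a small two-layer FFN (up-project, down-project)
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]

def moe_ffn(x):
    # Router scores -> pick top-k experts per token, mix by softmax over those k
    logits = x @ W_router                          # (seq, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                               # softmax over selected experts only
        for weight, e in zip(w, top[t]):
            W1, W2 = experts[e]
            h = np.maximum(x[t] @ W1, 0)           # expert FFN (ReLU for brevity)
            out[t] += weight * (h @ W2)
    return out

x = rng.normal(size=(5, d))
print(moe_ffn(x).shape)  # (5, 16): only 2 of 8 experts ran per token
```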
+
+ <h3>8. Decoder-Only vs Encoder-Decoder</h3>
+ <table>
+ <tr><th>Architecture</th><th>Attention Type</th><th>Best For</th><th>Examples</th></tr>
+ <tr><td><strong>Decoder-Only</strong></td><td>Causal (left-to-right only)</td><td>Text generation, chat, code</td><td>GPT-4, LLaMA, Gemma, Mistral</td></tr>
+ <tr><td><strong>Encoder-Only</strong></td><td>Bidirectional (sees all tokens)</td><td>Classification, NER, embeddings</td><td>BERT, RoBERTa, DeBERTa</td></tr>
+ <tr><td><strong>Encoder-Decoder</strong></td><td>Encoder bidirectional, decoder causal</td><td>Translation, summarization</td><td>T5, BART, mT5, Flan-T5</td></tr>
  </table>
  </div>`,
  code: `
  <div class="section">
+ <h2>💻 Transformer Architecture — Code Examples</h2>
+
+ <h3>1. Self-Attention from Scratch (NumPy)</h3>
  <div class="code-block"><span class="keyword">import</span> numpy <span class="keyword">as</span> np

  <span class="keyword">def</span> <span class="function">scaled_dot_product_attention</span>(Q, K, V, mask=<span class="keyword">None</span>):
 
  K = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  V = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  output, attn_weights = scaled_dot_product_attention(Q, K, V)
+ <span class="function">print</span>(<span class="string">"Attention weights (each row sums to 1):"</span>)
+ <span class="function">print</span>(attn_weights)</div>
+
+ <h3>2. PyTorch Multi-Head Attention</h3>
+ <div class="code-block"><span class="keyword">import</span> torch
+ <span class="keyword">import</span> torch.nn <span class="keyword">as</span> nn
+
+ <span class="keyword">class</span> <span class="function">MultiHeadAttention</span>(nn.Module):
+     <span class="keyword">def</span> <span class="function">__init__</span>(self, d_model=<span class="number">512</span>, n_heads=<span class="number">8</span>):
+         <span class="keyword">super</span>().__init__()
+         self.n_heads = n_heads
+         self.d_k = d_model // n_heads
+         self.W_q = nn.Linear(d_model, d_model)
+         self.W_k = nn.Linear(d_model, d_model)
+         self.W_v = nn.Linear(d_model, d_model)
+         self.W_o = nn.Linear(d_model, d_model)
+
+     <span class="keyword">def</span> <span class="function">forward</span>(self, x, mask=<span class="keyword">None</span>):
+         B, T, C = x.shape
+         Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
+         K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
+         V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
+
+         scores = (Q @ K.transpose(-<span class="number">2</span>, -<span class="number">1</span>)) / (self.d_k ** <span class="number">0.5</span>)
+         <span class="keyword">if</span> mask <span class="keyword">is not None</span>:
+             scores = scores.masked_fill(mask == <span class="number">0</span>, -<span class="number">1e9</span>)
+         attn = torch.softmax(scores, dim=-<span class="number">1</span>)
+         out = (attn @ V).transpose(<span class="number">1</span>, <span class="number">2</span>).contiguous().view(B, T, C)
+         <span class="keyword">return</span> self.W_o(out)
+
+ <span class="comment"># Usage</span>
+ mha = MultiHeadAttention(d_model=<span class="number">512</span>, n_heads=<span class="number">8</span>)
+ x = torch.randn(<span class="number">2</span>, <span class="number">10</span>, <span class="number">512</span>) <span class="comment"># batch=2, seq=10, dim=512</span>
+ output = mha(x) <span class="comment"># (2, 10, 512)</span></div>
+
+ <h3>3. Inspecting Attention Patterns</h3>
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>, output_attentions=<span class="keyword">True</span>)

  outputs = model(**inputs)

  <span class="comment"># outputs.attentions: tuple of (batch, heads, seq, seq) per layer</span>
+ attn = outputs.attentions[<span class="number">0</span>] <span class="comment"># Layer 0: shape (1, 12, 6, 6)</span>
+ <span class="function">print</span>(<span class="string">f"Layers: {len(outputs.attentions)}, Heads: {attn.shape[1]}"</span>)
+ <span class="function">print</span>(<span class="string">f"Token 'the' attends most to: {attn[0, 0, -1].argmax()}"</span>)</div>
+
+ <h3>4. FlashAttention Usage</h3>
+ <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM
+ <span class="keyword">import</span> torch
+
+ <span class="comment"># Enable FlashAttention 2 (requires compatible GPU)</span>
+ model = AutoModelForCausalLM.from_pretrained(
+     <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
+     torch_dtype=torch.bfloat16,
+     attn_implementation=<span class="string">"flash_attention_2"</span>, <span class="comment"># 2-4x faster!</span>
+     device_map=<span class="string">"auto"</span>
+ )
+
+ <span class="comment"># Or use SDPA (PyTorch native, works everywhere)</span>
+ model = AutoModelForCausalLM.from_pretrained(
+     <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
+     attn_implementation=<span class="string">"sdpa"</span>, <span class="comment"># Scaled Dot Product Attention</span>
+     device_map=<span class="string">"auto"</span>
+ )</div>
  </div>`,
  interview: `
  <div class="section">
+ <h2>🎯 Transformer Architecture — In-Depth Interview Questions</h2>
+ <div class="interview-box"><strong>Q1: Why divide by √d_k in attention?</strong><p><strong>Answer:</strong> For large d_k, dot products grow large in magnitude (variance ≈ d_k), pushing softmax into regions with extremely small gradients (saturated). Dividing by √d_k normalizes variance to 1, keeping gradients healthy. Without it, training becomes unstable — same principle as Xavier/He weight initialization.</p></div>
+ <div class="interview-box"><strong>Q2: What is KV Cache and why is it critical for inference?</strong><p><strong>Answer:</strong> During autoregressive generation, Key and Value matrices for past tokens are <strong>cached in GPU memory</strong> so they don't need recomputation on each new token. Without KV cache: generating token n requires reprocessing all n-1 previous tokens — O(n²) total work. With KV cache: each new token only computes its own Q, K, V and attends to cached K, V — O(n) total. A 7B model with 8K context uses ~4GB just for KV cache. This is why <strong>GPU memory</strong> (not compute) is the real bottleneck for long-context inference.</p></div>
+ <div class="interview-box"><strong>Q3: What's the difference between MHA, GQA, and MQA?</strong><p><strong>Answer:</strong> <strong>MHA</strong> (Multi-Head Attention): Separate K,V per head — maximum expressivity but largest KV cache. <strong>GQA</strong> (Grouped Query Attention): K,V shared across groups of Q heads (e.g., 8 Q heads share 1 KV pair). 4-8x smaller KV cache with minimal quality loss. Used by LLaMA-3, Mistral, Gemma. <strong>MQA</strong> (Multi-Query Attention): ALL Q heads share a SINGLE K,V pair. Maximum KV cache savings (32-96x) but slightly lower quality. Used by PaLM, Falcon. Industry has settled on GQA as the best tradeoff.</p></div>
+ <div class="interview-box"><strong>Q4: What is RoPE and why did it replace sinusoidal encoding?</strong><p><strong>Answer:</strong> RoPE (Rotary Position Embedding) encodes position by <strong>rotating</strong> Q and K vectors in 2D complex planes. Key advantages: (1) Relative position naturally emerges from dot products: Attention(q_m, k_n) depends only on m-n, not absolute positions. (2) No additional learned parameters. (3) Better length generalization — techniques like YaRN and NTK-aware scaling allow extending context beyond training length. Sinusoidal encoding struggled with extrapolation and required absolute position awareness.</p></div>
+ <div class="interview-box"><strong>Q5: What is FlashAttention and how does it achieve speedup?</strong><p><strong>Answer:</strong> Standard attention materializes the full N×N attention matrix in GPU HBM (High Bandwidth Memory). FlashAttention uses <strong>tiling</strong> — it breaks Q, K, V into blocks, computes attention within SRAM (fast on-chip memory), and never writes the full attention matrix to HBM. This reduces memory IO from O(N²) to O(N²/M) where M is SRAM size. Result: exact same output, but 2-4x faster and uses O(N) memory instead of O(N²). It's a pure systems optimization — no approximation.</p></div>
+ <div class="interview-box"><strong>Q6: Explain Mixture-of-Experts (MoE) and its tradeoffs.</strong><p><strong>Answer:</strong> MoE replaces the single FFN in each transformer block with N parallel expert FFNs plus a learned router. For each token, the router selects top-k experts (usually k=2). Only selected experts are activated — rest are skipped. <strong>Benefits:</strong> Train a model with 8x more parameters at ~2x the compute cost of a dense model. <strong>Tradeoffs:</strong> (1) All parameters must fit in memory even though only k are active. (2) Load balancing — if the router always picks the same experts, others waste space. Solved with an auxiliary loss. (3) Harder to fine-tune — expert specialization can be disrupted. Example: Mixtral 8x7B = 47B params but only 13B active per token.</p></div>
  </div>`
  },
  'huggingface': {
  concepts: `
+ <div class="section">
  <h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
  <div class="info-box">
  <div class="box-title">⚡ The GitHub of AI</div>