AashishAIHub committed on
Commit 3be1eeb · 1 Parent(s): 4145e94

feat: Deep-dive expansion of LLM Fundamentals, Transformers, and Fine-Tuning modules

- LLM Fundamentals: 8 concept sections, 6 code examples, 6 interview Qs
- Transformers: MoE, FlashAttention, GQA/MQA, RoPE, PyTorch MHA code
- Fine-Tuning: QLoRA math, SFTTrainer, DPO training, adapter merging, 5 Qs

Files changed (1)
  1. GenAI-AgenticAI/app.js +327 -71
GenAI-AgenticAI/app.js CHANGED
@@ -19,46 +19,99 @@ const MODULE_CONTENT = {
  'llm-fundamentals': {
  concepts: `
  <div class="section">
- <h2>LLM Fundamentals — What Every Practitioner Must Know</h2>
- <h3>🧠 What is a Language Model?</h3>
  <div class="info-box">
  <div class="box-title">⚡ The Core Idea</div>
  <div class="box-content">
  A language model is a probability distribution over sequences of tokens: <strong>P(token_n | token_1, token_2, ..., token_n-1)</strong>. LLMs are trained to predict the next token. During inference, they sample repeatedly from this distribution to generate text. Everything — creativity, reasoning, hallucination — emerges from this single objective.
  </div>
  </div>
- <h3>Tokenization — The Hidden Layer</h3>
- <p>Text is never fed directly to an LLM. It's first converted to <strong>tokens</strong> (sub-word units) using algorithms like <strong>BPE (Byte-Pair Encoding)</strong> or <strong>SentencePiece</strong>. "unbelievable" might become ["un", "believ", "able"]. This matters because: (1) cost is per-token, (2) rare words split into many tokens, (3) code/math tokenize differently than prose.</p>
  <table>
- <tr><th>Parameter</th><th>What it controls</th><th>Typical range</th></tr>
- <tr><td>Temperature</td><td>Randomness of sampling (higher = more creative)</td><td>0.0 – 2.0</td></tr>
- <tr><td>Top-p (nucleus)</td><td>Cumulative probability cutoff for token candidates</td><td>0.7 – 1.0</td></tr>
- <tr><td>Top-k</td><td>Limit token candidates to k highest-probability</td><td>10 – 100</td></tr>
- <tr><td>Max tokens</td><td>Maximum generation length</td><td>256 – 128k</td></tr>
  </table>
- <h3>Context Window — The LLM's Working Memory</h3>
- <p>The context window is the total number of tokens an LLM can "see" at once (both input + output). GPT-4o: 128k tokens, Gemini 1.5 Pro: 2M tokens. <strong>Critical insight:</strong> performance degrades in the middle of very long contexts ("lost in the middle" phenomenon). Place the most important content at the start or end.</p>
- <h3>Pre-training vs Fine-tuning vs RLHF</h3>
- <div class="comparison">
- <div class="comparison-bad">
- <strong>Pre-training (Base Model)</strong><br>
- Trained on massive text corpus to predict next tokens. Knows everything but follows no instructions. Example: raw GPT-4, Llama-3.
- </div>
- <div class="comparison-good">
- <strong>Instruction-tuned (Chat Model)</strong><br>
- Fine-tuned on instruction-response pairs + RLHF to be helpful and follow directions. Example: GPT-4o, Llama-3-Instruct, Gemini.
- </div>
  </div>
  </div>`,
  code: `
  <div class="section">
- <h2>💻 LLM Fundamentals — Code Examples</h2>
- <h3>OpenAI API — Core Patterns</h3>
  <div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI

- client = OpenAI()

- <span class="comment"># Basic completion</span>
  response = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[
@@ -68,66 +121,213 @@ response = client.chat.completions.create(
  temperature=<span class="number">0.7</span>,
  max_tokens=<span class="number">512</span>
  )
- <span class="function">print</span>(response.choices[<span class="number">0</span>].message.content)</div>
- <h3>Streaming Responses</h3>
- <div class="code-block"><span class="comment"># Streaming for real-time output</span>
  stream = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Write a haiku about neural nets"</span>}],
  stream=<span class="keyword">True</span>
  )
  <span class="keyword">for</span> chunk <span class="keyword">in</span> stream:
- <span class="keyword">if</span> chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">is not None</span>:
- <span class="function">print</span>(chunk.choices[<span class="number">0</span>].delta.content, end=<span class="string">""</span>)</div>
- <h3>Token Counting</h3>
  <div class="code-block"><span class="keyword">import</span> tiktoken

  enc = tiktoken.encoding_for_model(<span class="string">"gpt-4o"</span>)
  text = <span class="string">"The transformer architecture changed everything."</span>
  tokens = enc.encode(text)
- <span class="function">print</span>(<span class="string">f"Token count: {len(tokens)}"</span>) <span class="comment"># 6 tokens</span>
- <span class="function">print</span>(<span class="string">f"Tokens: {[enc.decode([t]) for t in tokens]}"</span>)</div>
- </div>`,
  interview: `
- <div class="section">
- <h2>🎯 LLM Interview Questions</h2>
- <div class="interview-box"><strong>Q1: What happens when temperature = 0?</strong><p><strong>Answer:</strong> The model becomes <strong>deterministic</strong>, always picking the highest-probability token (greedy decoding). Use for tasks requiring consistency (e.g., code generation, extraction). Side effect: can get stuck in repetitive loops. Temperature = 1 is the trained distribution; above 1 is "hotter" (more random).</p></div>
- <div class="interview-box"><strong>Q2: Why do LLMs hallucinate?</strong><p><strong>Answer:</strong> LLMs don't "know" facts — they model <strong>token probabilities</strong>. When asked about something rare or unknown, the model generates statistically plausible-sounding text rather than saying "I don't know." Solutions: RAG (ground to real documents), lower temperature, structured output forcing, and calibrated uncertainty prompting.</p></div>
- <div class="interview-box"><strong>Q3: What's the difference between context window and memory?</strong><p><strong>Answer:</strong> Context window is the tokens the model can process in a <strong>single inference pass</strong> — it's stateless. There is no persistent memory between calls. "Memory" in frameworks like LangChain is implemented externally by storing past conversation turns in a database and reinserting them into the prompt.</p></div>
- <div class="interview-box"><strong>Q4: What is RLHF and why is it needed?</strong><p><strong>Answer:</strong> Reinforcement Learning from Human Feedback. A base model is fine-tuned to maximize a <strong>reward model</strong> trained on human preference rankings. Without it, the model is just a next-token predictor and won't follow instructions, refuse harmful requests, or be consistently helpful.</p></div>
- </div>`
  },
  'transformers': {
  concepts: `
  <div class="section">
- <h2>Transformer Architecture — The Engine of Modern AI</h2>
  <div class="info-box">
- <div class="box-title">⚡ "Attention Is All You Need" (2017)</div>
- <div class="box-content">Vaswani et al. replaced RNNs with pure attention mechanisms. The key insight: instead of processing tokens sequentially, process all tokens <strong>in parallel</strong>, computing relevance scores between every pair. This enabled massive parallelization on GPUs and is why we can train 100B+ parameter models.</div>
  </div>
- <h3>Self-Attention — The Core Mechanism</h3>
- <p>For each token, compute 3 vectors: <strong>Query (Q), Key (K), Value (V)</strong> via learned linear projections. Attention output = softmax(QKᵀ / √d_k) × V; the softmax weights answer: "how much should token i attend to token j?" The division by √d_k keeps dot products from growing too large, so the softmax doesn't saturate.</p>
  <table>
  <tr><th>Component</th><th>Role</th><th>Analogy</th></tr>
- <tr><td>Query (Q)</td><td>What this token is looking for</td><td>Search query</td></tr>
- <tr><td>Key (K)</td><td>What each token offers</td><td>Index key</td></tr>
- <tr><td>Value (V)</td><td>Actual content to retrieve</td><td>Document content</td></tr>
- <tr><td>Softmax(QKᵀ/√d)</td><td>Attention weights (sum to 1)</td><td>Relevance scores</td></tr>
  </table>
- <h3>Multi-Head Attention</h3>
- <p>Run h independent attention heads in parallel, each learning different types of relationships (syntax, semantics, coreference). Concatenate outputs and project. GPT-4 likely uses ~96 heads. Each head specializes: head 1 may track subject-verb agreement, head 2 may track pronoun references.</p>
- <h3>Positional Encoding</h3>
- <p>Attention has no notion of order (it's a set operation). Positional encodings inject position information. Original Transformers used sinusoidal functions. Modern LLMs use <strong>RoPE (Rotary Position Embedding)</strong> — LLaMA, Mistral, Gemma all use RoPE, which enables better length generalization.</p>
- <h3>Decoder-Only vs Encoder-Decoder</h3>
- <div class="comparison">
- <div class="comparison-bad"><strong>Decoder-Only (GPT-style)</strong><br>Causal (left-to-right) attention. Can only see past tokens. Optimized for text generation. Examples: GPT-4, LLaMA, Gemma, Mistral.</div>
- <div class="comparison-good"><strong>Encoder-Decoder (T5-style)</strong><br>Encoder sees full input. Decoder generates output attending to encoder. Better for seq2seq tasks (translation, summarization). Examples: T5, BART, mT5.</div>
- </div>
  </div>`,
  code: `
  <div class="section">
- <h2>💻 Transformer Architecture — Code</h2>
- <h3>Self-Attention from Scratch (NumPy)</h3>
  <div class="code-block"><span class="keyword">import</span> numpy <span class="keyword">as</span> np

  <span class="keyword">def</span> <span class="function">scaled_dot_product_attention</span>(Q, K, V, mask=<span class="keyword">None</span>):
@@ -143,8 +343,42 @@ Q = np.random.randn(<span class="number">3</span>, <span class="number">4</span>
  K = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  V = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  output, attn_weights = scaled_dot_product_attention(Q, K, V)
- <span class="function">print</span>(<span class="string">f"Output shape: {output.shape}"</span>) <span class="comment"># (3, 4)</span></div>
- <h3>Inspecting Attention with Hugging Face</h3>
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>, output_attentions=<span class="keyword">True</span>)
@@ -154,21 +388,43 @@ inputs = tokenizer(<span class="string">"The cat sat on the"</span>, return_tens
  outputs = model(**inputs)

  <span class="comment"># outputs.attentions: tuple of (batch, heads, seq, seq) per layer</span>
- attn_layer0 = outputs.attentions[<span class="number">0</span>] <span class="comment"># shape: (1, 12, 6, 6)</span>
- <span class="function">print</span>(<span class="string">f"Layers: {len(outputs.attentions)}, Heads: {attn_layer0.shape[1]}"</span>)</div>
  </div>`,
  interview: `
  <div class="section">
- <h2>🎯 Transformer Interview Questions</h2>
- <div class="interview-box"><strong>Q1: Why divide by √d_k in attention?</strong><p><strong>Answer:</strong> For large d_k, dot products grow large in magnitude, pushing softmax into regions with very small gradients (saturated). Dividing by √d_k keeps variance at 1, preventing this. It's the same principle as Xavier/He initialization in neural networks.</p></div>
- <div class="interview-box"><strong>Q2: What is KV Cache and why is it important?</strong><p><strong>Answer:</strong> During autoregressive generation, Key and Value matrices for past tokens are <strong>cached</strong> so they don't need to be recomputed on each new token. This reduces per-token computation from O(n²) to O(n). Without KV cache, inference would be ~100x slower. It's why GPU memory is the bottleneck for long context.</p></div>
- <div class="interview-box"><strong>Q3: What's the difference between MHA and GQA (Grouped Query Attention)?</strong><p><strong>Answer:</strong> Multi-Head Attention (MHA) has separate K,V for every head. Grouped Query Attention (GQA) shares K,V heads across groups of Q heads. This reduces KV cache memory by 4-8x with minimal quality loss. LLaMA-3, Mistral, Gemma all use GQA.</p></div>
- <div class="interview-box"><strong>Q4: What is RoPE and why is it better than sinusoidal?</strong><p><strong>Answer:</strong> Rotary Position Embedding encodes position by <strong>rotating</strong> the Q and K vectors in complex space. Key advantages: relative position naturally emerges from dot products, enables length extrapolation beyond training length (with tricks like YaRN), no additional parameters. Standard in all modern open-source LLMs.</p></div>
  </div>`
  },
  'huggingface': {
  concepts: `
- <div class="section">
  <h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
  <div class="info-box">
  <div class="box-title">⚡ The GitHub of AI</div>
  'llm-fundamentals': {
  concepts: `
  <div class="section">
+ <h2>🧠 LLM Fundamentals — Complete Deep Dive</h2>
  <div class="info-box">
  <div class="box-title">⚡ The Core Idea</div>
  <div class="box-content">
  A language model is a probability distribution over sequences of tokens: <strong>P(token_n | token_1, token_2, ..., token_n-1)</strong>. LLMs are trained to predict the next token. During inference, they sample repeatedly from this distribution to generate text. Everything — creativity, reasoning, hallucination — emerges from this single objective.
  </div>
  </div>
+
+ <h3>1. How Language Models Actually Work</h3>
+ <p>An LLM is fundamentally a <strong>next-token predictor</strong>. Given a sequence of tokens, it outputs a probability distribution over the entire vocabulary (~32K-128K tokens). The training objective is to minimize <strong>cross-entropy loss</strong> between the predicted distribution and the actual next token across billions of text examples. The model learns grammar, facts, reasoning patterns, and even code — all as statistical regularities in token sequences.</p>
+ <div class="callout insight">
+ <div class="callout-title">🔑 Key Insight: Emergent Abilities</div>
+ <p>Below ~10B parameters, models just predict tokens. Above ~50B, new abilities <strong>emerge</strong> that weren't explicitly trained: chain-of-thought reasoning, few-shot learning, code generation, translation between languages never paired in training data. This is why scale matters and is the foundation of the "scaling laws" (Chinchilla, Kaplan et al.).</p>
+ </div>
+
+ <h3>2. Tokenization — The Hidden Layer</h3>
+ <p>Text is never fed directly to an LLM. It's first converted to <strong>tokens</strong> (sub-word units). Understanding tokenization is critical because:</p>
  <table>
+ <tr><th>Aspect</th><th>Why It Matters</th><th>Example</th></tr>
+ <tr><td>Cost</td><td>API pricing is per-token, not per-word</td><td>"unbelievable" = 3 tokens = 3x cost vs 1 word</td></tr>
+ <tr><td>Context limits</td><td>128K tokens ≠ 128K words (~96K words)</td><td>1 token ≈ 0.75 English words on average</td></tr>
+ <tr><td>Non-English penalty</td><td>Languages like Hindi/Chinese use 2-3x more tokens per word</td><td>"नमस्ते" might be 6 tokens vs "hello" = 1 token</td></tr>
+ <tr><td>Code tokenization</td><td>Whitespace and syntax consume tokens</td><td>4 spaces of indentation = 1 token wasted per line</td></tr>
+ <tr><td>Number handling</td><td>Numbers tokenize unpredictably</td><td>"1234567" might split as ["123", "45", "67"] — why LLMs are bad at math</td></tr>
  </table>
+ <p><strong>Algorithms:</strong> BPE (GPT, LLaMA) — merges frequent byte pairs iteratively. WordPiece (BERT) — maximizes likelihood. SentencePiece/Unigram (T5) — statistical segmentation. Modern LLMs use vocabularies of 32K-128K tokens.</p>
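The BPE merge loop named above can be sketched in a few lines of plain Python — a toy illustration of the training algorithm, not any model's actual tokenizer:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_merge(tokens, num_merges):
    """Iteratively fuse the most frequent adjacent pair, as BPE training does."""
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Start from characters; frequent pairs fuse into sub-word units
corpus = list("unbelievable unbelievable unbearable")
print(bpe_merge(corpus, 10))
```

Repeated runs over a real corpus yield the merge table that production tokenizers apply at encode time.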
+
+ <h3>3. Inference Parameters — Controlling Output</h3>
+ <table>
+ <tr><th>Parameter</th><th>What it controls</th><th>Range</th><th>When to change</th></tr>
+ <tr><td><strong>Temperature</strong></td><td>Sharpens/flattens the probability distribution</td><td>0.0 – 2.0</td><td>0 for extraction/code, 0.7 for chat, 1.2+ for creative writing</td></tr>
+ <tr><td><strong>Top-p (nucleus)</strong></td><td>Cumulative probability cutoff — only consider tokens within top-p mass</td><td>0.7 – 1.0</td><td>Use 0.9 as default; lower for focused, higher for diverse</td></tr>
+ <tr><td><strong>Top-k</strong></td><td>Hard limit on candidate tokens</td><td>10 – 100</td><td>Rarely needed if using top-p; useful as safety net</td></tr>
+ <tr><td><strong>Frequency penalty</strong></td><td>Penalizes repeated tokens proportionally</td><td>0.0 – 2.0</td><td>Increase to reduce repetitive output</td></tr>
+ <tr><td><strong>Presence penalty</strong></td><td>Flat penalty for any repeated token</td><td>0.0 – 2.0</td><td>Increase to encourage topic diversity</td></tr>
+ <tr><td><strong>Max tokens</strong></td><td>Generation length limit</td><td>1 – 128K</td><td>Set to expected output length + margin; never use -1 for safety</td></tr>
+ <tr><td><strong>Stop sequences</strong></td><td>Strings that stop generation</td><td>Any text</td><td>Essential for structured output: stop at "}" for JSON</td></tr>
+ </table>
+ <div class="callout warning">
+ <div class="callout-title">⚠️ Common Mistake</div>
+ <p>Don't combine temperature=0 with top_p=0.1 — they interact. Use <strong>either</strong> temperature OR top-p for sampling control, not both. OpenAI recommends changing one and leaving the other at default.</p>
  </div>
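How temperature and top-p reshape a token distribution can be shown with toy probabilities — a minimal plain-Python sketch, illustrative only:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / T: T < 1 sharpens, T > 1 flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized candidates

logits = [2.0, 1.0, 0.5, 0.1]
sharp = apply_temperature(logits, 0.3)  # near-greedy: top token dominates
flat = apply_temperature(logits, 1.5)   # flatter: more randomness
print(top_p_filter(apply_temperature(logits, 1.0), 0.9))
```

Note that the nucleus filter runs on the already-temperature-scaled distribution, which is exactly why the two settings interact when changed together.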
+
+ <h3>4. Context Window — The LLM's Working Memory</h3>
+ <p>The context window determines how many tokens the model can process in a single call (input + output combined).</p>
+ <table>
+ <tr><th>Model</th><th>Context Window</th><th>Approx. Pages</th></tr>
+ <tr><td>GPT-4o</td><td>128K tokens</td><td>~200 pages</td></tr>
+ <tr><td>Claude 3.5 Sonnet</td><td>200K tokens</td><td>~350 pages</td></tr>
+ <tr><td>Gemini 1.5 Pro</td><td>2M tokens</td><td>~3,000 pages</td></tr>
+ <tr><td>LLaMA 3.1</td><td>128K tokens</td><td>~200 pages</td></tr>
+ <tr><td>Mistral Large</td><td>128K tokens</td><td>~200 pages</td></tr>
+ </table>
+ <p><strong>"Lost in the Middle"</strong> (Liu et al., 2023): Performance degrades for information placed in the middle of very long contexts. Models attend most to the <strong>beginning and end</strong> of prompts. Strategy: put the most important content at the start or end; use retrieval to avoid stuffing the entire context.</p>
+
+ <h3>5. Pre-training Pipeline</h3>
+ <p>Training an LLM from scratch involves: (1) <strong>Data collection</strong> — crawl the web (Common Crawl, ~1 trillion tokens), books, code (GitHub), conversations. (2) <strong>Data cleaning</strong> — deduplication, quality filtering, toxicity removal, PII scrubbing. (3) <strong>Tokenizer training</strong> — build BPE vocabulary from the corpus. (4) <strong>Pre-training</strong> — next-token prediction on massive GPU clusters (thousands of A100s/H100s for weeks). Cost: $2M-$100M+ for frontier models.</p>
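The deduplication step from the cleaning stage above can be sketched with simple content hashing — a toy exact-duplicate pass; production pipelines typically use MinHash/LSH to also catch near-duplicates:

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates by hashing normalized text; keeps first occurrence."""
    seen, kept = set(), []
    for doc in documents:
        # Normalize whitespace and case so trivial variants collide
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A dog ran."]
print(dedup_exact(docs))  # normalization collapses the first two
```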
+
+ <h3>6. Alignment: RLHF, DPO, and Constitutional AI</h3>
+ <p>A base model predicts tokens but doesn't follow instructions or refuse harmful content. Alignment methods bridge this gap:</p>
+ <table>
+ <tr><th>Method</th><th>How It Works</th><th>Used By</th></tr>
+ <tr><td><strong>SFT (Supervised Fine-Tuning)</strong></td><td>Train on (instruction, response) pairs from human annotators</td><td>All models (Step 1)</td></tr>
+ <tr><td><strong>RLHF</strong></td><td>Train a reward model on human preferences, then optimize policy via PPO</td><td>GPT-4, Claude (early)</td></tr>
+ <tr><td><strong>DPO (Direct Preference Optimization)</strong></td><td>Skip the reward model — directly optimize from preference pairs, simpler and more stable</td><td>LLaMA 3, Zephyr, Gemma</td></tr>
+ <tr><td><strong>Constitutional AI</strong></td><td>Model critiques and revises its own outputs against a set of principles</td><td>Claude (Anthropic)</td></tr>
+ <tr><td><strong>RLAIF</strong></td><td>Use an AI model (not humans) to generate preference data</td><td>Gemini, some open models</td></tr>
+ </table>
+
+ <h3>7. The Modern LLM Landscape (2024-2025)</h3>
+ <table>
+ <tr><th>Provider</th><th>Flagship Model</th><th>Strengths</th><th>Best For</th></tr>
+ <tr><td>OpenAI</td><td>GPT-4o, o1, o3</td><td>Best all-around, strong coding, reasoning chains (o1/o3)</td><td>General purpose, production</td></tr>
+ <tr><td>Anthropic</td><td>Claude 3.5 Sonnet</td><td>Best for long documents, coding, safety-conscious</td><td>Enterprise, agents, analysis</td></tr>
+ <tr><td>Google</td><td>Gemini 1.5 Pro/2.0</td><td>Massive context (2M), multi-modal, grounding</td><td>Document processing, multi-modal</td></tr>
+ <tr><td>Meta</td><td>LLaMA 3.1/3.2</td><td>Best open-source, fine-tunable, commercially free</td><td>Self-hosting, fine-tuning</td></tr>
+ <tr><td>Mistral</td><td>Mistral Large, Mixtral</td><td>Strong open models, MoE efficiency</td><td>European market, cost-effective</td></tr>
+ <tr><td>DeepSeek</td><td>DeepSeek V3, R1</td><td>Exceptional reasoning, competitive with o1</td><td>Math, coding, research</td></tr>
+ </table>
+
+ <h3>8. Scaling Laws — Why Bigger Models Get Smarter</h3>
+ <p><strong>Chinchilla Scaling Law</strong> (Hoffmann et al., 2022): For a compute-optimal model, training tokens should scale proportionally to model parameters. A 70B model should be trained on ~1.4 trillion tokens. Key insight: many earlier models were <strong>undertrained</strong> (too many params, not enough data). LLaMA showed smaller, well-trained models can match larger undertrained ones.</p>
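The Chinchilla proportionality above works out to roughly 20 training tokens per parameter (consistent with the 70B → ~1.4T example), which makes for a quick back-of-envelope helper — a heuristic sketch, not the paper's full compute-allocation formula:

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal training tokens under the ~20 tokens/parameter heuristic."""
    return 20 * n_params

for params in (7e9, 70e9):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens")
```

Running it reproduces the 70B → 1.40T figure and suggests a 7B model wants ~0.14T (140B) tokens for compute-optimal training.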
  </div>`,
  code: `
  <div class="section">
+ <h2>💻 LLM Fundamentals — Comprehensive Code Examples</h2>
+
+ <h3>1. OpenAI API — Complete Patterns</h3>
  <div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI

+ client = OpenAI() <span class="comment"># Uses OPENAI_API_KEY env var</span>

+ <span class="comment"># ─── Basic Chat Completion ───</span>
  response = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[

  temperature=<span class="number">0.7</span>,
  max_tokens=<span class="number">512</span>
  )
+ <span class="function">print</span>(response.choices[<span class="number">0</span>].message.content)
+
+ <span class="comment"># ─── Multi-turn Conversation ───</span>
+ messages = [
+ {<span class="string">"role"</span>: <span class="string">"system"</span>, <span class="string">"content"</span>: <span class="string">"You are a Python tutor."</span>},
+ {<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"What is a decorator?"</span>},
+ {<span class="string">"role"</span>: <span class="string">"assistant"</span>, <span class="string">"content"</span>: <span class="string">"A decorator is a function that wraps another function..."</span>},
+ {<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Show me an example with arguments."</span>}
+ ]
+ resp = client.chat.completions.create(model=<span class="string">"gpt-4o"</span>, messages=messages)</div>
+
+ <h3>2. Streaming Responses</h3>
+ <div class="code-block"><span class="comment"># Streaming for real-time output — essential for UX</span>
  stream = client.chat.completions.create(
  model=<span class="string">"gpt-4o"</span>,
  messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Write a haiku about neural nets"</span>}],
  stream=<span class="keyword">True</span>
  )
+
+ full_response = <span class="string">""</span>
  <span class="keyword">for</span> chunk <span class="keyword">in</span> stream:
+ token = chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">or</span> <span class="string">""</span>
+ full_response += token
+ <span class="function">print</span>(token, end=<span class="string">""</span>, flush=<span class="keyword">True</span>)
+
+ <span class="comment"># Async streaming (for FastAPI/web apps)</span>
+ <span class="keyword">async def</span> <span class="function">stream_chat</span>(prompt):
+ <span class="keyword">from</span> openai <span class="keyword">import</span> AsyncOpenAI
+ aclient = AsyncOpenAI()
+ stream = <span class="keyword">await</span> aclient.chat.completions.create(
+ model=<span class="string">"gpt-4o"</span>, stream=<span class="keyword">True</span>,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: prompt}]
+ )
+ <span class="keyword">async for</span> chunk <span class="keyword">in</span> stream:
+ <span class="keyword">yield</span> chunk.choices[<span class="number">0</span>].delta.content <span class="keyword">or</span> <span class="string">""</span></div>
+
+ <h3>3. Token Counting & Cost Estimation</h3>
  <div class="code-block"><span class="keyword">import</span> tiktoken

+ <span class="comment"># Count tokens for any model</span>
  enc = tiktoken.encoding_for_model(<span class="string">"gpt-4o"</span>)
  text = <span class="string">"The transformer architecture changed everything."</span>
  tokens = enc.encode(text)
+ <span class="function">print</span>(<span class="string">f"Token count: {len(tokens)}"</span>)
+ <span class="function">print</span>(<span class="string">f"Tokens: {[enc.decode([t]) for t in tokens]}"</span>)
+
+ <span class="comment"># Cost estimation helper</span>
+ <span class="keyword">def</span> <span class="function">estimate_cost</span>(text, model=<span class="string">"gpt-4o"</span>):
+ enc = tiktoken.encoding_for_model(model)  <span class="comment"># note: raises KeyError for non-OpenAI models</span>
+ token_count = <span class="function">len</span>(enc.encode(text))
+ prices = {
+ <span class="string">"gpt-4o"</span>: (<span class="number">2.50</span>, <span class="number">10.00</span>), <span class="comment"># (input, output) per 1M tokens</span>
+ <span class="string">"gpt-4o-mini"</span>: (<span class="number">0.15</span>, <span class="number">0.60</span>),
+ <span class="string">"claude-3-5-sonnet"</span>: (<span class="number">3.00</span>, <span class="number">15.00</span>),
+ }
+ input_price = prices.get(model, (<span class="number">1</span>, <span class="number">1</span>))[<span class="number">0</span>]
+ cost = (token_count / <span class="number">1_000_000</span>) * input_price
+ <span class="keyword">return</span> <span class="string">f"{token_count} tokens = \${cost:.4f}"</span>
+
+ <span class="function">print</span>(estimate_cost(<span class="string">"Explain AI in 500 words"</span>))</div>
+
+ <h3>4. Structured Output (JSON Mode)</h3>
+ <div class="code-block"><span class="comment"># Force JSON output — essential for pipelines</span>
+ response = client.chat.completions.create(
+ model=<span class="string">"gpt-4o"</span>,
+ response_format={<span class="string">"type"</span>: <span class="string">"json_object"</span>},
+ messages=[{
+ <span class="string">"role"</span>: <span class="string">"user"</span>,
+ <span class="string">"content"</span>: <span class="string">"Extract entities from: 'Elon Musk founded SpaceX in 2002'. Return JSON with fields: persons, orgs, dates."</span>
+ }]
+ )
+ <span class="keyword">import</span> json
+ data = json.loads(response.choices[<span class="number">0</span>].message.content)
+ <span class="function">print</span>(data) <span class="comment"># {"persons": ["Elon Musk"], "orgs": ["SpaceX"], "dates": ["2002"]}</span>
+
+ <span class="comment"># Pydantic structured output (newest API)</span>
+ <span class="keyword">from</span> pydantic <span class="keyword">import</span> BaseModel
+
+ <span class="keyword">class</span> <span class="function">Entity</span>(BaseModel):
+ persons: list[str]
+ organizations: list[str]
+ dates: list[str]
+
+ completion = client.beta.chat.completions.parse(
+ model=<span class="string">"gpt-4o"</span>,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Extract from: 'Google was founded in 1998'"</span>}],
+ response_format=Entity
+ )
+ entity = completion.choices[<span class="number">0</span>].message.parsed <span class="comment"># Typed Entity object!</span></div>
+
+ <h3>5. Multi-Provider Pattern (Anthropic & Google)</h3>
+ <div class="code-block"><span class="comment"># ─── Anthropic (Claude) ───</span>
+ <span class="keyword">import</span> anthropic
+
+ claude = anthropic.Anthropic()
+ msg = claude.messages.create(
+ model=<span class="string">"claude-3-5-sonnet-20241022"</span>,
+ max_tokens=<span class="number">1024</span>,
+ system=<span class="string">"You are an expert ML engineer."</span>,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Explain LoRA"</span>}]
+ )
+ <span class="function">print</span>(msg.content[<span class="number">0</span>].text)
+
+ <span class="comment"># ─── Google Gemini ───</span>
+ <span class="keyword">import</span> google.generativeai <span class="keyword">as</span> genai
+
+ genai.configure(api_key=<span class="string">"YOUR_KEY"</span>)
+ model = genai.GenerativeModel(<span class="string">"gemini-1.5-pro"</span>)
+ response = model.generate_content(<span class="string">"Explain transformers"</span>)
+ <span class="function">print</span>(response.text)</div>
+
+ <h3>6. Comparing Models Programmatically</h3>
+ <div class="code-block"><span class="keyword">import</span> time
+
+ <span class="keyword">def</span> <span class="function">benchmark_model</span>(model_name, prompt, client):
+ start = time.time()
+ resp = client.chat.completions.create(
+ model=model_name,
+ messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: prompt}],
+ max_tokens=<span class="number">300</span>
+ )
+ elapsed = time.time() - start
+ tokens = resp.usage.total_tokens
+ <span class="keyword">return</span> {
+ <span class="string">"model"</span>: model_name,
+ <span class="string">"tokens"</span>: tokens,
+ <span class="string">"latency"</span>: <span class="string">f"{elapsed:.2f}s"</span>,
+ <span class="string">"tokens_per_sec"</span>: <span class="function">round</span>(resp.usage.completion_tokens / elapsed),
+ <span class="string">"response"</span>: resp.choices[<span class="number">0</span>].message.content[:<span class="number">100</span>]
+ }
+
+ <span class="comment"># Compare GPT-4o vs GPT-4o-mini</span>
+ prompt = <span class="string">"What is the capital of France? Explain its history in 3 sentences."</span>
+ <span class="keyword">for</span> model <span class="keyword">in</span> [<span class="string">"gpt-4o"</span>, <span class="string">"gpt-4o-mini"</span>]:
+ result = benchmark_model(model, prompt, client)
+ <span class="function">print</span>(result)</div>
+ </div>`,
261
  interview: `
+ <div class="section">
+ <h2>🎯 LLM Fundamentals — In-Depth Interview Questions</h2>
+ <div class="interview-box"><strong>Q1: What happens when temperature = 0?</strong><p><strong>Answer:</strong> The model becomes <strong>deterministic</strong>, always picking the highest-probability token (greedy decoding). Use it for tasks requiring consistency (code generation, extraction, classification). Side effect: greedy decoding can get stuck in repetitive loops. As a scale: temperature=0 is greedy, temperature=1 samples the model's trained distribution, and values above 1 are "hotter" (more random). For near-deterministic output with slight randomness, use temperature=0.1 with top_p=1.</p></div>
+ <div class="interview-box"><strong>Q2: Why do LLMs hallucinate, and what are the solutions?</strong><p><strong>Answer:</strong> LLMs don't "know" facts — they model <strong>token probabilities</strong>. When asked about something rare or unseen, the model generates statistically plausible text rather than admitting ignorance. Solutions: (1) <strong>RAG</strong> — ground answers in retrieved documents; (2) <strong>Lower temperature</strong> — reduce sampling randomness; (3) <strong>Structured output forcing</strong> — constrain output to valid formats; (4) <strong>Self-consistency</strong> — sample N times, pick the majority answer; (5) <strong>Calibrated prompting</strong> — explicitly instruct "say I don't know if unsure."</p></div>
+ <div class="interview-box"><strong>Q3: What's the difference between context window and memory?</strong><p><strong>Answer:</strong> Context window is the tokens the model can process in a <strong>single inference pass</strong> — it's completely stateless. There is no persistent memory between API calls. "Memory" in frameworks like LangChain is implemented externally by: (1) storing past conversation turns in a database, (2) summarizing old turns to compress history, (3) reinserting relevant history into the new prompt. Every call is independent — the model has zero recall of previous calls.</p></div>
+ <div class="interview-box"><strong>Q4: What is RLHF vs DPO and which is better?</strong><p><strong>Answer:</strong> <strong>RLHF</strong>: Train a separate reward model on human preferences, then optimize the LLM policy via PPO (reinforcement learning). Complex, unstable, expensive. <strong>DPO</strong> (Direct Preference Optimization): Skip the reward model entirely — directly optimize from preference pairs using a closed-form solution. Simpler, more stable, cheaper. DPO is now preferred for open-source models (LLaMA 3, Gemma). RLHF is still used by OpenAI/Anthropic where they have massive human labeling infrastructure.</p></div>
+ <div class="interview-box"><strong>Q5: What is the "Lost in the Middle" phenomenon?</strong><p><strong>Answer:</strong> Research by Liu et al. (2023) showed that LLMs perform significantly worse when relevant information is placed in the <strong>middle</strong> of long contexts compared to the beginning or end. The model's attention mechanism attends most to recent tokens (recency bias) and the very first tokens (primacy bias). Practical implication: place the most critical context at the <strong>start or end</strong> of your prompt, never buried in the middle of a long document.</p></div>
+ <div class="interview-box"><strong>Q6: Explain the difference between GPT-4o and o1/o3 models.</strong><p><strong>Answer:</strong> GPT-4o is a standard auto-regressive LLM — generates tokens left-to-right in one pass. o1/o3 are <strong>reasoning models</strong> that use "chain-of-thought before answering" — they generate internal reasoning tokens (hidden from the user) before producing the final answer. This makes them dramatically better at math, logic, and coding, but 3-10x slower and more expensive. Use GPT-4o for speed-sensitive tasks (chat, extraction), o1/o3 for complex reasoning (math proofs, hard coding, multi-step analysis).</p></div>
+ </div>`
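The greedy-vs-sampling behaviour described in Q1 can be demonstrated by applying temperature to a toy logit vector before softmax. This is a minimal NumPy sketch; the logit values are made up for illustration:

```python
import numpy as np

def sample_dist(logits, temperature):
    """Token distribution after temperature scaling (temperature=0 -> greedy)."""
    if temperature == 0:
        p = np.zeros_like(logits)
        p[np.argmax(logits)] = 1.0   # greedy: all probability mass on the top token
        return p
    z = logits / temperature
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # toy next-token logits
for t in [0, 0.5, 1.0, 2.0]:
    print(t, np.round(sample_dist(logits, t), 3))
```

Higher temperature flattens the distribution (more randomness); temperature 0 collapses it onto the argmax.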
  },
  'transformers': {
  concepts: `
  <div class="section">
+ <h2>🔗 Transformer Architecture — Complete Deep Dive</h2>
  <div class="info-box">
+ <div class="box-title">⚡ "Attention Is All You Need" (Vaswani et al., 2017)</div>
+ <div class="box-content">The Transformer replaced RNNs with pure attention mechanisms. The key insight: instead of processing tokens sequentially, process all tokens <strong>in parallel</strong>, computing relevance scores between every pair. This enabled massive parallelization on GPUs and is the foundation of every modern LLM, from GPT-4 to LLaMA to Gemini.</div>
  </div>
+
+ <h3>1. Self-Attention — The Core Mechanism</h3>
+ <p>For each token, compute 3 vectors via learned linear projections: <strong>Query (Q)</strong>, <strong>Key (K)</strong>, <strong>Value (V)</strong>. The attention formula:</p>
+ <div class="formula">Attention(Q, K, V) = softmax(QK<sup>T</sup> / √d<sub>k</sub>) × V</div>
  <table>
  <tr><th>Component</th><th>Role</th><th>Analogy</th></tr>
+ <tr><td><strong>Query (Q)</strong></td><td>What this token is looking for</td><td>Search query on Google</td></tr>
+ <tr><td><strong>Key (K)</strong></td><td>What each token offers/advertises</td><td>Page titles in search index</td></tr>
+ <tr><td><strong>Value (V)</strong></td><td>Actual content to retrieve</td><td>Page content returned</td></tr>
+ <tr><td><strong>√d<sub>k</sub> scaling</strong></td><td>Prevents softmax saturation for large dims</td><td>Normalization for numerical stability</td></tr>
+ </table>
+ <p><strong>Why it works:</strong> Self-attention lets every token attend to every other token in O(1) hops (vs O(n) for RNNs). "The cat sat on the <strong>mat</strong>" — the word "mat" can directly attend to "cat" and "sat" to understand context, without information passing through intermediate words.</p>
+
+ <h3>2. Multi-Head Attention (MHA)</h3>
+ <p>Run <strong>h independent attention heads</strong> in parallel, each learning different relationship types. Concatenate outputs and project. GPT-4 likely uses ~96 heads. Each head specializes: head 1 may track subject-verb agreement, head 2 may track pronoun coreference, head 3 may track positional patterns.</p>
+ <div class="formula">MultiHead(Q,K,V) = Concat(head<sub>1</sub>, ..., head<sub>h</sub>) × W<sub>O</sub><br>where head<sub>i</sub> = Attention(QW<sub>i</sub><sup>Q</sup>, KW<sub>i</sub><sup>K</sup>, VW<sub>i</sub><sup>V</sup>)</div>
+
+ <h3>3. Modern Attention Variants</h3>
+ <table>
+ <tr><th>Variant</th><th>Key Idea</th><th>Used By</th><th>KV Cache Savings</th></tr>
+ <tr><td><strong>MHA</strong> (Multi-Head)</td><td>Separate Q, K, V per head</td><td>GPT-2, BERT</td><td>1x (baseline)</td></tr>
+ <tr><td><strong>GQA</strong> (Grouped Query)</td><td>Share K,V across groups of Q heads</td><td>LLaMA 3, Gemma, Mistral</td><td>4-8x smaller</td></tr>
+ <tr><td><strong>MQA</strong> (Multi-Query)</td><td>Single K,V shared across ALL Q heads</td><td>PaLM, Falcon</td><td>32-96x smaller</td></tr>
+ <tr><td><strong>Sliding Window</strong></td><td>Attend only to nearby tokens (window)</td><td>Mistral, Mixtral</td><td>Fixed memory regardless of length</td></tr>
+ </table>
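The KV-cache arithmetic in the GQA row can be made concrete with shapes: K and V are stored with fewer heads than Q and broadcast across query-head groups at attention time, which is exactly where the cache savings come from. A minimal NumPy sketch with illustrative sizes (8 query heads, 2 KV heads; not taken from any specific model):

```python
import numpy as np

n_heads, n_kv_heads, seq, d_k = 8, 2, 16, 64
group = n_heads // n_kv_heads              # 4 query heads share each KV head

Q = np.random.randn(n_heads, seq, d_k)
K = np.random.randn(n_kv_heads, seq, d_k)  # KV cache is `group` times smaller than MHA
V = np.random.randn(n_kv_heads, seq, d_k)

# Broadcast each KV head to its group of query heads
K_exp = np.repeat(K, group, axis=0)        # (8, 16, 64)
V_exp = np.repeat(V, group, axis=0)

scores = Q @ K_exp.transpose(0, 2, 1) / np.sqrt(d_k)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)              # softmax over key positions
out = w @ V_exp
print(out.shape)  # (8, 16, 64): full 8-head output from a 2-head KV cache
```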
+
+ <h3>4. Positional Encoding (RoPE)</h3>
+ <p><strong>RoPE (Rotary Position Embedding)</strong> encodes position by rotating Q and K vectors in complex space. Advantages: (1) Relative position naturally emerges from dot products, (2) Enables length extrapolation beyond training length (with techniques like YaRN, NTK-aware scaling), (3) No additional parameters. Used by LLaMA, Mistral, Gemma, Qwen — virtually all modern open LLMs.</p>
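The relative-position property can be checked numerically: after rotating q and k by their positions, their dot product depends only on the offset m−n. A minimal NumPy sketch (the half-split pairing and base 10000 follow common practice but are illustrative, not any specific model's implementation):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate dim pairs of x by position-dependent angles (RoPE sketch)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)      # one frequency per dim pair
    angles = positions[:, None] * freqs[None, :]      # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Dot product depends only on the relative offset m - n
q = np.random.randn(1, 8)
k = np.random.randn(1, 8)
a = rope(q, np.array([5])) @ rope(k, np.array([3])).T       # offset 2
b = rope(q, np.array([105])) @ rope(k, np.array([103])).T   # offset 2, far away
print(np.allclose(a, b))  # True
```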
+
+ <h3>5. Transformer Block Architecture</h3>
+ <p>Each transformer block has: (1) <strong>Multi-Head Attention</strong> → (2) <strong>Residual Connection + LayerNorm</strong> → (3) <strong>Feed-Forward Network (FFN)</strong> with hidden dim 4x model dim → (4) <strong>Residual Connection + LayerNorm</strong>. Modern models use <strong>Pre-LayerNorm</strong> (normalize before attention, not after) and <strong>SwiGLU</strong> activation in FFN instead of ReLU for better performance.</p>
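The Pre-LayerNorm wiring can be sketched with the attention sublayer stubbed out (an identity stand-in) so the normalize-then-residual order is visible. The SwiGLU shapes and weights below are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU: silu(x W_gate) * (x W_up), then project back down
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))   # silu(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down

def pre_ln_block(x, attn, ffn):
    # Pre-LN: normalize BEFORE each sublayer, add the residual after
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

d, h = 8, 32                                  # illustrative sizes (hidden = 4x model dim)
rng = np.random.default_rng(0)
Wg, Wu, Wd = rng.normal(size=(d, h)), rng.normal(size=(d, h)), rng.normal(size=(h, d))
attn_stub = lambda x: x                       # stand-in for multi-head attention
ffn = lambda x: swiglu_ffn(x, Wg, Wu, Wd)
x = rng.normal(size=(4, d))                   # (seq, d_model)
y = pre_ln_block(x, attn_stub, ffn)
print(y.shape)  # (4, 8)
```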
+
+ <h3>6. FlashAttention — Memory-Efficient Attention</h3>
+ <p>Standard attention requires O(n²) memory for the attention matrix. <strong>FlashAttention</strong> (Dao et al., 2022) computes exact attention without materializing the full matrix by using tiling and kernel fusion. Result: 2-4x faster inference, ~20x less memory for long sequences. FlashAttention 2 adds further optimizations. Essential for context windows &gt;8K tokens.</p>
+
+ <h3>7. Mixture-of-Experts (MoE)</h3>
+ <p>Instead of one massive FFN, use <strong>N expert FFNs</strong> and a router that selects top-k experts per token. Only selected experts are activated (sparse computation). Mixtral 8x7B has 8 experts, activates 2 per token — 47B total params but only 13B active per token. Result: 3-4x more efficient than dense models of same quality.</p>
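Top-k routing can be sketched in a few lines. This is a toy dense-loop version, not Mixtral's batched implementation: the expert FFNs use ReLU for brevity and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

W_router = rng.normal(size=(d, n_experts))
# Each expert is a small two-layer FFN (up-project, down-project)
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]

def moe_ffn(x):
    # Router scores -> pick top-k experts per token, mix by softmax over those k
    logits = x @ W_router                          # (seq, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                               # softmax over selected experts only
        for weight, e in zip(w, top[t]):
            W1, W2 = experts[e]
            h = np.maximum(x[t] @ W1, 0)           # expert FFN (ReLU for brevity)
            out[t] += weight * (h @ W2)
    return out

x = rng.normal(size=(5, d))
print(moe_ffn(x).shape)  # (5, 16): only 2 of 8 experts ran per token
```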
+
+ <h3>8. Decoder-Only vs Encoder-Decoder</h3>
+ <table>
+ <tr><th>Architecture</th><th>Attention Type</th><th>Best For</th><th>Examples</th></tr>
+ <tr><td><strong>Decoder-Only</strong></td><td>Causal (left-to-right only)</td><td>Text generation, chat, code</td><td>GPT-4, LLaMA, Gemma, Mistral</td></tr>
+ <tr><td><strong>Encoder-Only</strong></td><td>Bidirectional (sees all tokens)</td><td>Classification, NER, embeddings</td><td>BERT, RoBERTa, DeBERTa</td></tr>
+ <tr><td><strong>Encoder-Decoder</strong></td><td>Encoder bidirectional, decoder causal</td><td>Translation, summarization</td><td>T5, BART, mT5, Flan-T5</td></tr>
  </table>
  </div>`,
  code: `
  <div class="section">
+ <h2>💻 Transformer Architecture — Code Examples</h2>
+
+ <h3>1. Self-Attention from Scratch (NumPy)</h3>
  <div class="code-block"><span class="keyword">import</span> numpy <span class="keyword">as</span> np

  <span class="keyword">def</span> <span class="function">scaled_dot_product_attention</span>(Q, K, V, mask=<span class="keyword">None</span>):
 
  K = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  V = np.random.randn(<span class="number">3</span>, <span class="number">4</span>)
  output, attn_weights = scaled_dot_product_attention(Q, K, V)
+ <span class="function">print</span>(<span class="string">"Attention weights (each row sums to 1):"</span>)
+ <span class="function">print</span>(attn_weights)</div>
+
+ <h3>2. PyTorch Multi-Head Attention</h3>
+ <div class="code-block"><span class="keyword">import</span> torch
+ <span class="keyword">import</span> torch.nn <span class="keyword">as</span> nn
+
+ <span class="keyword">class</span> <span class="function">MultiHeadAttention</span>(nn.Module):
+     <span class="keyword">def</span> <span class="function">__init__</span>(self, d_model=<span class="number">512</span>, n_heads=<span class="number">8</span>):
+         <span class="keyword">super</span>().__init__()
+         self.n_heads = n_heads
+         self.d_k = d_model // n_heads
+         self.W_q = nn.Linear(d_model, d_model)
+         self.W_k = nn.Linear(d_model, d_model)
+         self.W_v = nn.Linear(d_model, d_model)
+         self.W_o = nn.Linear(d_model, d_model)
+
+     <span class="keyword">def</span> <span class="function">forward</span>(self, x, mask=<span class="keyword">None</span>):
+         B, T, C = x.shape
+         Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
+         K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
+         V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(<span class="number">1</span>, <span class="number">2</span>)
+
+         scores = (Q @ K.transpose(-<span class="number">2</span>, -<span class="number">1</span>)) / (self.d_k ** <span class="number">0.5</span>)
+         <span class="keyword">if</span> mask <span class="keyword">is not None</span>:
+             scores = scores.masked_fill(mask == <span class="number">0</span>, -<span class="number">1e9</span>)
+         attn = torch.softmax(scores, dim=-<span class="number">1</span>)
+         out = (attn @ V).transpose(<span class="number">1</span>, <span class="number">2</span>).contiguous().view(B, T, C)
+         <span class="keyword">return</span> self.W_o(out)
+
+ <span class="comment"># Usage</span>
+ mha = MultiHeadAttention(d_model=<span class="number">512</span>, n_heads=<span class="number">8</span>)
+ x = torch.randn(<span class="number">2</span>, <span class="number">10</span>, <span class="number">512</span>) <span class="comment"># batch=2, seq=10, dim=512</span>
+ output = mha(x) <span class="comment"># (2, 10, 512)</span></div>
+
+ <h3>3. Inspecting Attention Patterns</h3>
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>, output_attentions=<span class="keyword">True</span>)

  outputs = model(**inputs)

  <span class="comment"># outputs.attentions: tuple of (batch, heads, seq, seq) per layer</span>
+ attn = outputs.attentions[<span class="number">0</span>] <span class="comment"># Layer 0: shape (1, 12, 6, 6)</span>
+ <span class="function">print</span>(<span class="string">f"Layers: {len(outputs.attentions)}, Heads: {attn.shape[1]}"</span>)
+ <span class="function">print</span>(<span class="string">f"Token 'the' attends most to: {attn[0, 0, -1].argmax()}"</span>)</div>
+
+ <h3>4. FlashAttention Usage</h3>
+ <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM
+ <span class="keyword">import</span> torch
+
+ <span class="comment"># Enable FlashAttention 2 (requires compatible GPU)</span>
+ model = AutoModelForCausalLM.from_pretrained(
+     <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
+     torch_dtype=torch.bfloat16,
+     attn_implementation=<span class="string">"flash_attention_2"</span>, <span class="comment"># 2-4x faster!</span>
+     device_map=<span class="string">"auto"</span>
+ )
+
+ <span class="comment"># Or use SDPA (PyTorch native, works everywhere)</span>
+ model = AutoModelForCausalLM.from_pretrained(
+     <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
+     attn_implementation=<span class="string">"sdpa"</span>, <span class="comment"># Scaled Dot Product Attention</span>
+     device_map=<span class="string">"auto"</span>
+ )</div>
  </div>`,
  interview: `
  <div class="section">
+ <h2>🎯 Transformer Architecture — In-Depth Interview Questions</h2>
+ <div class="interview-box"><strong>Q1: Why divide by √d_k in attention?</strong><p><strong>Answer:</strong> For large d_k, dot products grow large in magnitude (variance ≈ d_k), pushing softmax into regions with extremely small gradients (saturated). Dividing by √d_k normalizes variance to 1, keeping gradients healthy. Without it, training becomes unstable — same principle as Xavier/He weight initialization.</p></div>
+ <div class="interview-box"><strong>Q2: What is KV Cache and why is it critical for inference?</strong><p><strong>Answer:</strong> During autoregressive generation, Key and Value matrices for past tokens are <strong>cached in GPU memory</strong> so they don't need recomputation on each new token. Without KV cache: generating token n requires reprocessing all n-1 previous tokens — O(n²) total work. With KV cache: each new token only computes its own Q, K, V and attends to cached K, V — O(n) total. A 7B model with 8K context uses ~4GB just for KV cache. This is why <strong>GPU memory</strong> (not compute) is the real bottleneck for long-context inference.</p></div>
+ <div class="interview-box"><strong>Q3: What's the difference between MHA, GQA, and MQA?</strong><p><strong>Answer:</strong> <strong>MHA</strong> (Multi-Head Attention): Separate K,V per head — maximum expressivity but largest KV cache. <strong>GQA</strong> (Grouped Query Attention): K,V shared across groups of Q heads (e.g., 8 Q heads share 1 KV pair). 4-8x smaller KV cache with minimal quality loss. Used by LLaMA-3, Mistral, Gemma. <strong>MQA</strong> (Multi-Query Attention): ALL Q heads share a SINGLE K,V pair. Maximum KV cache savings (32-96x) but slightly lower quality. Used by PaLM, Falcon. Industry has settled on GQA as the best tradeoff.</p></div>
+ <div class="interview-box"><strong>Q4: What is RoPE and why did it replace sinusoidal encoding?</strong><p><strong>Answer:</strong> RoPE (Rotary Position Embedding) encodes position by <strong>rotating</strong> Q and K vectors in 2D complex planes. Key advantages: (1) Relative position naturally emerges from dot products: Attention(q_m, k_n) depends only on m-n, not absolute positions. (2) No additional learned parameters. (3) Better length generalization — techniques like YaRN and NTK-aware scaling allow extending context beyond training length. Sinusoidal encoding struggled with extrapolation and required absolute position awareness.</p></div>
+ <div class="interview-box"><strong>Q5: What is FlashAttention and how does it achieve speedup?</strong><p><strong>Answer:</strong> Standard attention materializes the full N×N attention matrix in GPU HBM (High Bandwidth Memory). FlashAttention uses <strong>tiling</strong> — it breaks Q, K, V into blocks, computes attention within SRAM (fast on-chip memory), and never writes the full attention matrix to HBM. This reduces memory IO from O(N²) to O(N²/M) where M is SRAM size. Result: exact same output, but 2-4x faster and uses O(N) memory instead of O(N²). It's a pure systems optimization — no approximation.</p></div>
+ <div class="interview-box"><strong>Q6: Explain Mixture-of-Experts (MoE) and its tradeoffs.</strong><p><strong>Answer:</strong> MoE replaces the single FFN in each transformer block with N parallel expert FFNs plus a learned router. For each token, the router selects top-k experts (usually k=2). Only selected experts are activated — rest are skipped. <strong>Benefits:</strong> Train a model with 8x more parameters at ~2x the compute cost of a dense model. <strong>Tradeoffs:</strong> (1) All parameters must fit in memory even though only k are active. (2) Load balancing — if the router always picks the same experts, others waste space. Solved with an auxiliary loss. (3) Harder to fine-tune — expert specialization can be disrupted. Example: Mixtral 8x7B = 47B params but only 13B active per token.</p></div>
  </div>`
  },
  'huggingface': {
  concepts: `
+ <div class="section">
  <h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
  <div class="info-box">
  <div class="box-title">⚡ The GitHub of AI</div>