Commit 8d69d5b
Parent(s): 16fe56e
feat: Enrich existing modules with details from AI Engineering Guidebook 2025 (min-p, RFT, GRPO, 8 RAG architectures, JSON prompting)

GenAI-AgenticAI/app.js CHANGED (+101 -20)
@@ -51,21 +51,32 @@ const MODULE_CONTENT = {
<p><strong>Algorithms:</strong> BPE (GPT, LLaMA) – merges frequent byte pairs iteratively. WordPiece (BERT) – maximizes likelihood. SentencePiece/Unigram (T5) – statistical segmentation. Modern LLMs use vocabularies of 32K-128K tokens.</p>

<h3>3. Inference Parameters – Controlling Output</h3>
<p>Every generation from an LLM is shaped by parameters under the hood. Knowing how to tune these is critical for production AI engineering.</p>
<table>
<tr><th>Parameter</th><th>What it controls</th><th>Book Insight / Use Case</th></tr>
<tr><td><strong>1. Max tokens</strong></td><td>Hard cap on generation length</td><td>Lower for speed/safety; higher for summaries</td></tr>
<tr><td><strong>2. Temperature</strong></td><td>Randomness/creativity</td><td>~0 for deterministic QA; 0.7-1.0 for brainstorming</td></tr>
<tr><td><strong>3. Top-k</strong></td><td>Limit sampling to the K most likely tokens</td><td>K=5 forces the model to choose among only the 5 most likely tokens</td></tr>
<tr><td><strong>4. Top-p (nucleus)</strong></td><td>Limit to the smallest set covering p of the probability mass</td><td>p=0.9 is adaptive; handles coherence vs diversity better than a fixed K</td></tr>
<tr><td><strong>5. Frequency penalty</strong></td><td>Discourage reusing frequent tokens</td><td>Set > 0 to stop the model from repeating itself in loops</td></tr>
<tr><td><strong>6. Presence penalty</strong></td><td>Encourage new topics/tokens</td><td>Set > 0 to push the model toward broader exploration</td></tr>
<tr><td><strong>7. Stop sequences</strong></td><td>Halt generation at specific tokens</td><td>Critical for structured JSON (stop at "}") or code blocks</td></tr>
<tr><td><strong>Bonus: Min-p</strong></td><td>Dynamic probability threshold</td><td>Keeps only tokens at least X% as likely as the top token; most robust for coherence</td></tr>
</table>
<div class="callout warning">
<div class="callout-title">⚠️ Common Mistake</div>
<p>Don't combine temperature=0 with top_p=0.1 – they interact. Use <strong>either</strong> temperature OR top-p for sampling control, not both. OpenAI recommends changing one and leaving the other at default.</p>
</div>
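<p>Here is how these knobs map onto an OpenAI-style chat completion call; a minimal sketch, with the model id and prompt as placeholders. Per the warning above, temperature is set while top_p stays at its default:</p>
<div class="code-block"><span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model=<span class="string">"gpt-4o-mini"</span>,          <span class="comment"># placeholder model id</span>
    messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"Summarize LoRA in 3 bullets."</span>}],
    max_tokens=<span class="number">200</span>,               <span class="comment"># 1. hard cap on output length</span>
    temperature=<span class="number">0.2</span>,              <span class="comment"># 2. near-deterministic (top_p left at default)</span>
    frequency_penalty=<span class="number">0.5</span>,        <span class="comment"># 5. discourage repetition loops</span>
    presence_penalty=<span class="number">0.3</span>,         <span class="comment"># 6. nudge toward new topics</span>
    stop=[<span class="string">"\n\n\n"</span>],              <span class="comment"># 7. halt at a chosen sequence</span>
)
print(resp.choices[0].message.content)</div>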

<h3>4. Four LLM Text Generation Strategies</h3>
<p>Decoding is the process of picking the next token. How we pick it determines the style of the output. (A minimal sketch of all four follows the list.)</p>
<ul>
<li><strong>Greedy Strategy:</strong> Always pick the single token with the highest probability. <em>Issue:</em> Often leads to repetitive, low-quality loops.</li>
<li><strong>Multinomial Sampling:</strong> Sample from the probability distribution (controlled by temperature). <em>Benefit:</em> Much more creative and human-like.</li>
<li><strong>Beam Search:</strong> Explores multiple parallel paths ("beams") and picks the sequence with the highest total probability. <em>Best for:</em> Translation and code, where sequence-level correctness matters more than creativity.</li>
<li><strong>Nucleus (Top-p) Sampling:</strong> Restricts sampling to a dynamic "nucleus" of tokens whose probabilities sum to p. <em>Best for:</em> General-purpose chat.</li>
</ul>
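<p>All four strategies can be toggled through Hugging Face generate(); a minimal sketch, with GPT-2 standing in for any small causal LM:</p>
<div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained(<span class="string">"gpt2"</span>)
model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>)
inputs = tok(<span class="string">"The capital of France is"</span>, return_tensors=<span class="string">"pt"</span>)

<span class="comment"># 1. Greedy: highest-probability token at every step</span>
model.generate(**inputs, do_sample=<span class="keyword">False</span>, max_new_tokens=<span class="number">20</span>)
<span class="comment"># 2. Multinomial sampling, softened by temperature</span>
model.generate(**inputs, do_sample=<span class="keyword">True</span>, temperature=<span class="number">0.8</span>, max_new_tokens=<span class="number">20</span>)
<span class="comment"># 3. Beam search: track 4 parallel candidate sequences</span>
model.generate(**inputs, num_beams=<span class="number">4</span>, do_sample=<span class="keyword">False</span>, max_new_tokens=<span class="number">20</span>)
<span class="comment"># 4. Nucleus (top-p) sampling; min_p is also supported here</span>
model.generate(**inputs, do_sample=<span class="keyword">True</span>, top_p=<span class="number">0.9</span>, max_new_tokens=<span class="number">20</span>)</div>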

<h3>5. Context Window – The LLM's Working Memory</h3>
<p>The context window determines how many tokens the model can process in a single call (input + output combined).</p>
<table>
@@ -740,6 +751,26 @@ api.create_repo(<span class="string">"your-username/my-demo"</span>, repo_type=<
</table>
<h3>QLoRA – The Game Changer</h3>
<p>QLoRA (Dettmers et al., 2023) combines: (1) <strong>4-bit NF4 quantization</strong> of the base model, (2) <strong>double quantization</strong> to compress quantization constants, (3) <strong>paged optimizers</strong> to handle gradient spikes. Fine-tune a 65B model on a single 48GB GPU – impossible before QLoRA.</p>
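<p>In practice all three ingredients are switched on through a quantization config; a minimal sketch with transformers + bitsandbytes (the model id is a placeholder):</p>
<div class="code-block"><span class="keyword">import</span> torch
<span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=<span class="keyword">True</span>,                      <span class="comment"># (1) 4-bit base model</span>
    bnb_4bit_quant_type=<span class="string">"nf4"</span>,              <span class="comment"># NF4 data type</span>
    bnb_4bit_use_double_quant=<span class="keyword">True</span>,         <span class="comment"># (2) double quantization</span>
    bnb_4bit_compute_dtype=torch.bfloat16,  <span class="comment"># compute in bf16</span>
)
model = AutoModelForCausalLM.from_pretrained(<span class="string">"meta-llama/Llama-2-7b-hf"</span>, quantization_config=bnb)
<span class="comment"># (3) paged optimizers are chosen in the trainer, e.g. optim="paged_adamw_8bit"</span></div>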
<h3>7 LLM Fine-Tuning Techniques</h3>
<table>
<tr><th>Technique</th><th>How It Works</th><th>Use Case</th></tr>
<tr><td><strong>1. SFT</strong></td><td>Supervised fine-tuning on (Q, A) pairs</td><td>Instruction following</td></tr>
<tr><td><strong>2. RLHF</strong></td><td>Reward model + PPO optimization</td><td>Human value alignment</td></tr>
<tr><td><strong>3. DPO</strong></td><td>Directly optimize from preference pairs</td><td>Popular, stable alternative to RLHF</td></tr>
<tr><td><strong>4. LoRA / QLoRA</strong></td><td>Low-rank adaptation</td><td>Efficient training on consumer GPUs</td></tr>
<tr><td><strong>5. RFT</strong></td><td>Rejection fine-tuning</td><td>Self-improvement by filtering the best outputs</td></tr>
<tr><td><strong>6. GRPO</strong></td><td>Group Relative Policy Optimization</td><td>The core of DeepSeek-R1; trains reasoning models without a separate value model</td></tr>
<tr><td><strong>7. IFT</strong></td><td>Instruction fine-tuning</td><td>Bridges the gap between base model and agent</td></tr>
</table>
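<p>For DPO (row 3), TRL packages the whole loop; a minimal sketch, assuming a preference dataset with prompt/chosen/rejected columns (exact arguments vary by TRL version):</p>
<div class="code-block"><span class="keyword">from</span> datasets <span class="keyword">import</span> load_dataset
<span class="keyword">from</span> trl <span class="keyword">import</span> DPOConfig, DPOTrainer

<span class="comment"># Each row: {"prompt": ..., "chosen": ..., "rejected": ...}</span>
ds = load_dataset(<span class="string">"trl-lib/ultrafeedback_binarized"</span>, split=<span class="string">"train"</span>)
trainer = DPOTrainer(
    model=<span class="string">"Qwen/Qwen2-0.5B-Instruct"</span>,  <span class="comment"># placeholder model id</span>
    args=DPOConfig(output_dir=<span class="string">"./dpo-out"</span>, beta=<span class="number">0.1</span>),  <span class="comment"># beta = strength of preference constraint</span>
    train_dataset=ds,
)
trainer.train()</div>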

<div class="callout insight">
<div class="callout-title">📖 Book Insight: SFT vs RFT</div>
<p><strong>SFT (Supervised Fine-Tuning)</strong> trains on fixed human-written data. <strong>RFT (Rejection Fine-Tuning)</strong> uses the model itself to generate multiple responses, filters them for correctness (e.g., using a code compiler or calculator), and then fine-tunes only on those verified correct samples. This allows models to self-improve beyond human capabilities in certain domains.</p>
</div>

<h3>Building a Reasoning LLM (GRPO)</h3>
<p>DeepSeek's <strong>GRPO</strong> (Group Relative Policy Optimization) is the new frontier. Instead of a separate value model (critic), you generate a group of outputs and rank them relative to each other (e.g., based on mathematical correctness or formatting). This forces the model to learn long chains of thought and "thinking" behavior.</p>

<h3>When to Fine-Tune vs RAG</h3>
<div class="comparison">
<div class="comparison-bad"><strong>Use RAG when:</strong> Knowledge changes frequently, facts need to be cited, domain data is large/dynamic. Lower cost, easier updates.</div>

@@ -781,6 +812,35 @@ trainer = SFTTrainer(
    args=SFTConfig(output_dir=<span class="string">"./llama-finetuned"</span>, num_train_epochs=<span class="number">2</span>)
)
trainer.train()</div>

<h3>Implementing LoRA From Scratch (Conceptual)</h3>
<div class="code-block"><span class="keyword">import</span> torch
<span class="keyword">import</span> torch.nn <span class="keyword">as</span> nn
<span class="keyword">from</span> statistics <span class="keyword">import</span> mean, stdev

<span class="keyword">class</span> <span class="function">LoRALayer</span>(nn.Module):
    <span class="keyword">def</span> <span class="function">__init__</span>(self, W, rank=<span class="number">8</span>, alpha=<span class="number">16</span>):
        super().__init__()
        self.W = W  <span class="comment"># Original frozen weights (d x k); caller sets W.requires_grad=False</span>
        d, k = W.shape
        <span class="comment"># Low-rank matrices A and B; B starts at zero so training begins at the base model</span>
        self.A = nn.Parameter(torch.randn(d, rank) / (rank**<span class="number">0.5</span>))
        self.B = nn.Parameter(torch.zeros(rank, k))
        self.scaling = alpha / rank

    <span class="keyword">def</span> <span class="function">forward</span>(self, x):
        <span class="comment"># Output = xW + (xAB) * scaling</span>
        base_out = x @ self.W
        lora_out = (x @ self.A @ self.B) * self.scaling
        <span class="keyword">return</span> base_out + lora_out

<span class="comment"># DeepSeek-R1 style GRPO loop (simplified; compute_reward and compute_ppo_loss are placeholder hooks)</span>
<span class="keyword">def</span> <span class="function">grpo_step</span>(model, query, num_samples=<span class="number">8</span>):
    outputs = model.generate(query, n=num_samples)  <span class="comment"># sample a group of responses</span>
    rewards = [compute_reward(o) <span class="keyword">for</span> o <span class="keyword">in</span> outputs]
    <span class="comment"># Normalize rewards within the group to get relative advantages</span>
    adv = [(r - mean(rewards)) / stdev(rewards) <span class="keyword">for</span> r <span class="keyword">in</span> rewards]
    loss = compute_ppo_loss(outputs, adv)
    loss.backward()</div>
<h3>Merge LoRA Weights for Deployment</h3>
<div class="code-block"><span class="keyword">from</span> peft <span class="keyword">import</span> PeftModel

@@ -840,21 +900,29 @@ Object.assign(MODULE_CONTENT, {
<tr><td>all-MiniLM-L6-v2</td><td>384</td><td>256</td><td>Good</td><td>Free</td></tr>
</table>

<h3>4. Eight Modern RAG Architectures</h3>
<table>
<tr><th>Architecture</th><th>How It Works</th><th>Advantage</th></tr>
<tr><td><strong>Naive RAG</strong></td><td>Retrieve top-K, generate</td><td>Simplest, ~70% accuracy</td></tr>
<tr><td><strong>Advanced RAG</strong></td><td>Pre-retrieval (HyDE) + post-retrieval (re-ranking)</td><td>Better precision, ~85% accuracy</td></tr>
<tr><td><strong>Self-RAG</strong></td><td>Agent decides <em>if</em> retrieval is needed</td><td>Reduces token cost / hallucinations</td></tr>
<tr><td><strong>REFRAG</strong></td><td>Retrieve first, then refine the query and retrieve again</td><td>Critical for multi-hop reasoning</td></tr>
<tr><td><strong>CAG</strong> (Cache-Augmented Generation)</td><td>Pre-process docs into the model's KV cache</td><td>Ultra-low latency for fixed datasets</td></tr>
<tr><td><strong>HyDE</strong></td><td>Embed a hypothetical answer, not the query</td><td>Handles vocabulary mismatch</td></tr>
<tr><td><strong>Agentic RAG</strong></td><td>Agent uses a search-tool loop as needed</td><td>Most flexible but slowest</td></tr>
<tr><td><strong>Knowledge Graph RAG</strong></td><td>Retrieve triple relations (GraphRAG)</td><td>Excellent for complex connections</td></tr>
</table>
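<p>The Self-RAG row reduces to a cheap gating call before retrieval; a minimal sketch, where llm is a hypothetical single-prompt completion helper and retriever is any vector store wrapper:</p>
<div class="code-block"><span class="keyword">def</span> <span class="function">self_rag_answer</span>(question, retriever, llm):
    <span class="comment"># Step 1: cheap gate – does this question need external context?</span>
    gate = llm(<span class="string">f"Answer YES or NO: does this need retrieval?\n{question}"</span>)
    <span class="keyword">if</span> <span class="string">"YES"</span> <span class="keyword">in</span> gate.upper():
        docs = retriever.search(question, top_k=<span class="number">5</span>)
        <span class="keyword">return</span> llm(<span class="string">f"Context:\n{docs}\n\nQuestion: {question}"</span>)
    <span class="comment"># Step 2: skip retrieval entirely for questions the model already knows</span>
    <span class="keyword">return</span> llm(question)</div>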

<h3>5. Traditional RAG vs HyDE</h3>
<div class="comparison">
<div class="comparison-bad"><strong>Naive:</strong> Embed "How is company X doing?" and vector-search for fragments of that query.</div>
<div class="comparison-good"><strong>HyDE:</strong> The LLM writes a <em>hypothetical</em> investor report for company X. We embed THAT report, and vector search finds similar <em>actual</em> reports.</div>
</div>
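<p>A minimal HyDE sketch with sentence-transformers and a Pinecone-style index (as elsewhere in this module); llm is again a hypothetical completion helper:</p>
<div class="code-block"><span class="keyword">from</span> sentence_transformers <span class="keyword">import</span> SentenceTransformer

encoder = SentenceTransformer(<span class="string">"all-MiniLM-L6-v2"</span>)

<span class="keyword">def</span> <span class="function">hyde_search</span>(query, index, llm):
    <span class="comment"># Embed a drafted hypothetical answer instead of the raw query</span>
    fake_doc = llm(<span class="string">f"Write a short passage that answers: {query}"</span>)
    emb = encoder.encode(fake_doc)
    <span class="keyword">return</span> index.query(vector=emb.tolist(), top_k=<span class="number">5</span>)</div>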

<h3>6. RAG vs Agentic RAG and AI Memory</h3>
<p>Standard RAG is a <strong>one-shot</strong> process. <strong>Agentic RAG</strong> lets an agent decide how to search, what to search for, and when to stop. Combined with <strong>AI Memory</strong> (persisting relevant facts across sessions), this creates systems that grow smarter with user interaction.</p>

<h3>7. Evaluating RAG (RAGAS)</h3>
<table>
<tr><th>Metric</th><th>Measures</th><th>Target</th></tr>
<tr><td><strong>Faithfulness</strong></td><td>Are claims supported by context?</td><td>0.9+</td></tr>
@@ -1061,15 +1129,28 @@ res = index.query(vector=query_emb, top_k=<span class="number">10</span>,
<h3>1. ReAct – The Foundation</h3>
<p>ReAct (Yao et al., 2022): <strong>Thought</strong> → <strong>Action</strong> → <strong>Observation</strong> → repeat. The LLM reasons about what to do, calls a tool, sees the result, and continues until it has a final answer.</p>
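<p>The whole pattern fits in one loop; a minimal sketch, where llm, parse_action, and the entries of tools are hypothetical stand-ins:</p>
<div class="code-block"><span class="keyword">def</span> <span class="function">react_agent</span>(question, llm, tools, max_steps=<span class="number">5</span>):
    transcript = <span class="string">f"Question: {question}"</span>
    <span class="keyword">for</span> _ <span class="keyword">in</span> range(max_steps):
        step = llm(transcript + <span class="string">"\nThought:"</span>)  <span class="comment"># model reasons, then names an action</span>
        <span class="keyword">if</span> <span class="string">"Final Answer:"</span> <span class="keyword">in</span> step:
            <span class="keyword">return</span> step.split(<span class="string">"Final Answer:"</span>)[-<span class="number">1</span>].strip()
        tool_name, tool_input = parse_action(step)  <span class="comment"># hypothetical parser</span>
        observation = tools[tool_name](tool_input)  <span class="comment"># act, then observe</span>
        transcript += <span class="string">f"\n{step}\nObservation: {observation}"</span></div>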
<h3>2. Agent Architectures & Patterns</h3>
<table>
<tr><th>Architecture</th><th>How It Works</th><th>Best For</th></tr>
<tr><td><strong>ReAct Loop</strong></td><td>Fixed think-act-observe cycle</td><td>Simple tool-using tasks</td></tr>
<tr><td><strong>Plan-and-Execute</strong></td><td>Full plan first, then execute steps</td><td>Multi-step structured tasks</td></tr>
<tr><td><strong>State Machine (LangGraph)</strong></td><td>Directed graph with conditional edges</td><td>Complex workflows, branching</td></tr>
<tr><td><strong>Reflection</strong></td><td>Agent evaluates own output, retries</td><td>Quality-critical tasks</td></tr>
<tr><td><strong>Self-Correction</strong></td><td>Agent detects syntax/logic errors via tools</td><td>Code generation agents</td></tr>
</table>

<h3>3. Advanced Prompting for Agents</h3>
<ul>
<li><strong>JSON Prompting:</strong> Forcing the model to output <em>only</em> valid JSON. Essential for reliable tool calling and downstream processing. Strategy: provide a schema and a stop sequence at "}". (A sketch follows this list.)</li>
<li><strong>Verbalized Sampling:</strong> Forcing the agent to "think out loud" before choosing a tool. Similar to chain-of-thought, but explicitly for tool selection.</li>
<li><strong>Few-shot Tooling:</strong> Providing 2-3 examples of correct tool usage in the prompt. Far more effective than instructions alone.</li>
</ul>
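<p>A minimal JSON-prompting sketch against the OpenAI API: response_format pins the output to valid JSON and json.loads acts as the final guardrail (model and schema are placeholders):</p>
<div class="code-block"><span class="keyword">import</span> json
<span class="keyword">from</span> openai <span class="keyword">import</span> OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model=<span class="string">"gpt-4o-mini"</span>,
    response_format={<span class="string">"type"</span>: <span class="string">"json_object"</span>},  <span class="comment"># force valid JSON</span>
    messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>:
        <span class="string">'Extract as JSON with keys "tool" and "args": search the web for GRPO papers'</span>}],
)
call = json.loads(resp.choices[0].message.content)  <span class="comment"># fails loudly if malformed</span></div>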

<div class="callout insight">
<div class="callout-title">📖 Book Insight: 30 Must-Know Agentic Terms</div>
<p>Key terms: <strong>Handoff</strong> (passing a task to a sub-agent), <strong>Orchestrator</strong> (supervisor agent), <strong>Part</strong> (typed data in A2A), <strong>Interrupt</strong> (human-in-the-loop wait), <strong>Grounding</strong> (connecting to RAG/tools), <strong>Hallucination guardrails</strong> (output filtering).</p>
</div>

<h3>4. Framework Comparison</h3>
<table>
<tr><th>Framework</th><th>Paradigm</th><th>Strengths</th><th>Best For</th></tr>