Commit · 9110364
1 Parent(s): 1cd23a6
fix: resolve syntax error in app.js caused by escaped backticks
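For context, here is a minimal sketch of the kind of error the commit message describes. It assumes the module content is stored as template literals inside `Object.assign(MODULE_CONTENT, {...})`, as the hunks below suggest. The object contents, the `parses` helper, and the `broken`/`fixed` strings are hypothetical and exist only for illustration. A backtick that delimits a template literal must stay unescaped, while a backtick that is part of the displayed text must be escaped inside the literal.

```javascript
// Hypothetical miniature of the MODULE_CONTENT pattern; names and markup are illustrative.
const MODULE_CONTENT = {};

Object.assign(MODULE_CONTENT, {
  tools: {
    // Plain backticks delimit the template literal.
    concepts: `
<div class="section">
  <h2>Function Calling & Tools</h2>
</div>`,
    // A backtick that is part of the displayed text is escaped inside the literal.
    code: `
<div class="code-block">const prompt = \`Thought before Action\`;</div>`
  }
});

// Returns true when `source` parses as a JavaScript function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Delimiters escaped by mistake: the backslash-backtick pair sits outside any literal.
const broken = "const x = { concepts: \\`<div></div>\\` };";
// Delimiters left unescaped: an ordinary template literal.
const fixed = "const x = { concepts: `<div></div>` };";

console.log(parses(broken), parses(fixed));
```

Run under Node.js, the last line should report that the escaped-delimiter form fails to parse while the unescaped form succeeds.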
GenAI-AgenticAI/app.js +31 -31
GenAI-AgenticAI/app.js
CHANGED
|
@@ -2403,20 +2403,20 @@ Object.assign(MODULE_CONTENT, {
|
|
| 2403 |
<li><strong>Stop Sequences:</strong> For older or open models, using <code>}</code> as a stop sequence cuts off trailing text. It also stops at the first closing brace, so it only suits flat (non-nested) objects.</li>
|
| 2404 |
<li><strong>Pre-filling the Assistant:</strong> Append <code>{</code> to the end of your prompt so the model is forced to start generating JSON keys instantly without saying "Here is the JSON...".</li>
|
| 2405 |
</ul>
|
| 2406 |
-
</div>
|
| 2407 |
-
code:
|
| 2408 |
<div class="section">
|
| 2409 |
<h2>💻 Prompting for Agents – Code Examples</h2>
|
| 2410 |
|
| 2411 |
<h3>1. Verbalized Sampling Prompt Template</h3>
|
| 2412 |
-
<div class="code-block"><span class="keyword">const</span> system_prompt = <span class="string">
|
| 2413 |
When given a task, you MUST use the following format:
|
| 2414 |
|
| 2415 |
Thought: Consider what you need to do, step by step. Which tool is needed?
|
| 2416 |
Action: The name of the tool to use (e.g. "search_web", "calculate")
|
| 2417 |
Action Input: The arguments for the tool in valid JSON.
|
| 2418 |
|
| 2419 |
-
You MUST articulate your Thought before your Action.
|
| 2420 |
|
| 2421 |
<h3>2. Forcing JSON on Open Models</h3>
|
| 2422 |
<div class="code-block"><span class="keyword">import</span> { pipeline } <span class="keyword">from</span> <span class="string">"@huggingface/transformers"</span>;
|
|
@@ -2431,16 +2431,16 @@ You MUST articulate your Thought before your Action.\`</span></div>
|
|
| 2431 |
});
|
| 2432 |
<span class="keyword">const</span> raw = <span class="string">"{"</span> + out[0].generated_text; <span class="comment">// Prepend the '{' that we forced</span>
|
| 2433 |
<span class="keyword">const</span> json = JSON.parse(raw);</div>
|
| 2434 |
-
</div>
|
| 2435 |
-
interview:
|
| 2436 |
<div class="section">
|
| 2437 |
<h2>🎯 Prompt Engineering – Interview Questions</h2>
|
| 2438 |
<div class="interview-box"><strong>Q1: Why does Chain of Thought work?</strong><p><strong>Answer:</strong> It gives the model additional computational steps (tokens) in which to work through the logic. Since an LLM spends a roughly fixed amount of computation per generated token, a 50-token thought process buys about 50 extra forward passes on the problem before the model commits to an answer, and each later token can attend to the reasoning already written.</p></div>
|
| 2439 |
<div class="interview-box"><strong>Q2: How is JSON Prompting different from OpenAI Function Calling?</strong><p><strong>Answer:</strong> JSON prompting is done via the text prompt and relies on the model's instruction following (good for open models). Function/Tool calling is a native API feature: the provider fine-tunes the model to emit arguments that match a declared schema and, in strict or structured-output modes, can also enforce the schema with constrained decoding, which makes it much more reliable.</p></div>
|
| 2440 |
-
</div>
|
| 2441 |
},
|
| 2442 |
'llm-optimization': {
|
| 2443 |
-
concepts:
|
| 2444 |
<div class="section">
|
| 2445 |
<h2>LLM Optimization – Complete Deep Dive</h2>
|
| 2446 |
<div class="info-box">
|
|
@@ -2470,8 +2470,8 @@ You MUST articulate your Thought before your Action.\`</span></div>
|
|
| 2470 |
<div class="callout-title">⚠️ The KV Cache Bottleneck</div>
|
| 2471 |
<p>While KV caching solves the compute problem, it introduces a memory problem. A 100K context window across high batch sizes can cause the KV cache to consume more GPU RAM than the model weights themselves! This is why techniques like <strong>PagedAttention</strong> (vLLM) and <strong>GQA (Grouped Query Attention)</strong> were invented.</p>
|
| 2472 |
</div>
|
| 2473 |
-
</div>
|
| 2474 |
-
code:
|
| 2475 |
<div class="section">
|
| 2476 |
<h2>💻 LLM Optimization – Code Examples</h2>
|
| 2477 |
|
|
@@ -2498,16 +2498,16 @@ tokenizer = AutoTokenizer.from_pretrained(model_path)
|
|
| 2498 |
model.quantize(tokenizer, quant_config=quant_config)
|
| 2499 |
model.save_quantized(quant_path)
|
| 2500 |
tokenizer.save_pretrained(quant_path)</div>
|
| 2501 |
-
</div>
|
| 2502 |
-
interview:
|
| 2503 |
<div class="section">
|
| 2504 |
<h2>🎯 LLM Optimization – Interview Questions</h2>
|
| 2505 |
<div class="interview-box"><strong>Q1: What is the difference between compute-bound and memory-bandwidth bound?</strong><p><strong>Answer:</strong> Compute-bound means the GPU spends all its time doing math (matrix multiplications). Memory-bandwidth bound means the math is easy, but the GPU spends all its time waiting for weights to be copied from High Bandwidth Memory (HBM) to on-chip SRAM. LLM prefill (reading the prompt) is compute-bound, but decoding (generating tokens one by one) is memory-bandwidth bound.</p></div>
|
| 2506 |
<div class="interview-box"><strong>Q2: Assume you use vLLM. What is PagedAttention?</strong><p><strong>Answer:</strong> Normally, KV cache is pre-allocated contiguously in GPU memory. Because output lengths are unknown, frameworks over-allocate memory, wasting up to 60%. PagedAttention divides the KV cache into small blocks (pages) and allocates them dynamically, like virtual memory in an OS. This allows near-zero waste and 2-4x higher concurrency (batching).</p></div>
|
| 2507 |
-
</div>
|
| 2508 |
},
|
| 2509 |
'llm-observability': {
|
| 2510 |
-
concepts:
|
| 2511 |
<div class="section">
|
| 2512 |
<h2>LLM Observability – Complete Deep Dive</h2>
|
| 2513 |
<div class="info-box">
|
|
@@ -2532,8 +2532,8 @@ tokenizer.save_pretrained(quant_path)</div>
|
|
| 2532 |
<tr><td><strong>Helicone</strong></td><td>Proxy-based observability (just change the base URL, no SDK needed).</td></tr>
|
| 2533 |
<tr><td><strong>Opik (by Comet)</strong></td><td>Agent optimization and evaluation natively integrated with traces.</td></tr>
|
| 2534 |
</table>
|
| 2535 |
-
</div>
|
| 2536 |
-
code:
|
| 2537 |
<div class="section">
|
| 2538 |
<h2>💻 Observability – Code Examples</h2>
|
| 2539 |
|
|
@@ -2566,16 +2566,16 @@ client = Client()
|
|
| 2566 |
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">f"Context: {context}\nQuery: {query}"</span>}]
|
| 2567 |
)
|
| 2568 |
<span class="keyword">return</span> resp.choices[<span class="number">0</span>].message.content</div>
|
| 2569 |
-
</div>
|
| 2570 |
-
interview:
|
| 2571 |
<div class="section">
|
| 2572 |
<h2>🎯 Observability – Interview Questions</h2>
|
| 2573 |
<div class="interview-box"><strong>Q1: What is Time-To-First-Token (TTFT) and why does it matter?</strong><p><strong>Answer:</strong> TTFT measures the latency from the moment the user sends the request until the first token streams back to the client. In LLM applications, total end-to-end latency might be 5-10 seconds, which is unacceptable for UX. Streaming combined with low TTFT (<1 second) creates the illusion of speed and keeps users engaged.</p></div>
|
| 2574 |
<div class="interview-box"><strong>Q2: Why use a proxy like Helicone over an SDK like LangSmith?</strong><p><strong>Answer:</strong> A proxy requires almost no code changes: you change the API base URL from <code>api.openai.com</code> to <code>oai.hconeai.com</code> and pass your proxy key in the header. It then logs all prompts, responses, costs, and latencies automatically. However, an SDK (like Langfuse/LangSmith) is required if you want deep, nested trace trees for complex agents (e.g., seeing exactly which step in a 10-step LangGraph flow failed).</p></div>
|
| 2575 |
-
</div>
|
| 2576 |
},
|
| 2577 |
'multiagent': {
|
| 2578 |
-
concepts:
|
| 2579 |
<div class="section">
|
| 2580 |
<h2>🕸️ Multi-Agent Systems (MAS)</h2>
|
| 2581 |
<div class="info-box">
|
|
@@ -2602,8 +2602,8 @@ client = Client()
|
|
| 2602 |
<div class="callout-title">💡 Minimizing Friction</div>
|
| 2603 |
<p>When picking a pattern, prioritize minimizing communication overhead. Ten agents are no better than two if they duplicate work. The system should feel smarter than its individual parts.</p>
|
| 2604 |
</div>
|
| 2605 |
-
</div>
|
| 2606 |
-
code:
|
| 2607 |
<div class="section">
|
| 2608 |
<h2>💻 Multi-Agent – Code Examples</h2>
|
| 2609 |
<h3>Simple Router Pattern with LiteLLM</h3>
|
|
@@ -2622,16 +2622,16 @@ client = Client()
|
|
| 2622 |
<span class="keyword">return</span> finance_specialist(query)
|
| 2623 |
<span class="keyword">else</span>:
|
| 2624 |
<span class="keyword">return</span> general_agent(query)</div>
|
| 2625 |
-
</div>
|
| 2626 |
-
interview:
|
| 2627 |
<div class="section">
|
| 2628 |
<h2>🎯 Multi-Agent – Interview Questions</h2>
|
| 2629 |
<div class="interview-box"><strong>Q1: Parallel vs Sequential orchestration?</strong><p><strong>Answer:</strong> Parallel is for independent tasks (data extraction + web search) to reduce latency. Sequential is for dependent tasks where step B needs output of step A (code writing then code review). Use parallel for scale, sequential for quality-controlled pipelines.</p></div>
|
| 2630 |
<div class="interview-box"><strong>Q2: What is the Hierarchical pattern?</strong><p><strong>Answer:</strong> It mimics a corporate structure: a Manager/Planner agent receives the high-level goal, breaks it into sub-tasks, and delegates them to specialized Worker agents. The Manager tracks state and makes the final quality check. Best for complex, ambiguous projects.</p></div>
|
| 2631 |
-
</div>
|
| 2632 |
},
|
| 2633 |
'tools': {
|
| 2634 |
-
concepts:
|
| 2635 |
<div class="section">
|
| 2636 |
<h2>🔧 Function Calling & Tools</h2>
|
| 2637 |
<div class="info-box">
|
|
@@ -2647,8 +2647,8 @@ client = Client()
|
|
| 2647 |
|
| 2648 |
<h3>3. Verbalized Sampling</h3>
|
| 2649 |
<p>Forcing the agent to generate a "Thought:" block before the "Action:" block. This conditions the tool selection on a logical premise, significantly reducing errors in choosing the wrong tool or arguments.</p>
|
| 2650 |
-
</div>
|
| 2651 |
-
code:
|
| 2652 |
<div class="section">
|
| 2653 |
<h2>💻 Tools – Code Examples</h2>
|
| 2654 |
<h3>OpenAI Native Tool Call</h3>
|
|
@@ -2663,12 +2663,12 @@ client = Client()
|
|
| 2663 |
}
|
| 2664 |
}]
|
| 2665 |
<span class="comment"># Pass this to chat.completions.create(..., tools=tools)</span></div>
|
| 2666 |
-
</div>
|
| 2667 |
-
interview:
|
| 2668 |
<div class="section">
|
| 2669 |
<h2>🎯 Tools – Interview Questions</h2>
|
| 2670 |
<div class="interview-box"><strong>Q1: Why use Verbalized Sampling in tool calling?</strong><p><strong>Answer:</strong> It forces the model to articulate a rationale *before* picking a tool. Since tokens are generated left-to-right, the tool selection becomes conditioned on the reasoning, which increases precision, especially when multiple similar tools exist.</p></div>
|
| 2671 |
-
</div>
|
| 2672 |
}
|
| 2673 |
});
|
| 2674 |
|
|
|
|
| 2403 |
<li><strong>Stop Sequences:</strong> For older or open models, using <code>}</code> as a stop sequence cuts off trailing text. It also stops at the first closing brace, so it only suits flat (non-nested) objects.</li>
|
| 2404 |
<li><strong>Pre-filling the Assistant:</strong> Append <code>{</code> to the end of your prompt so the model is forced to start generating JSON keys instantly without saying "Here is the JSON...".</li>
|
| 2405 |
</ul>
|
| 2406 |
+
</div>`,
|
| 2407 |
+
code: `
|
| 2408 |
<div class="section">
|
| 2409 |
<h2>💻 Prompting for Agents – Code Examples</h2>
|
| 2410 |
|
| 2411 |
<h3>1. Verbalized Sampling Prompt Template</h3>
|
| 2412 |
+
<div class="code-block"><span class="keyword">const</span> system_prompt = <span class="string">`You are a sophisticated AI agent with access to tools.
|
| 2413 |
When given a task, you MUST use the following format:
|
| 2414 |
|
| 2415 |
Thought: Consider what you need to do, step by step. Which tool is needed?
|
| 2416 |
Action: The name of the tool to use (e.g. "search_web", "calculate")
|
| 2417 |
Action Input: The arguments for the tool in valid JSON.
|
| 2418 |
|
| 2419 |
+
You MUST articulate your Thought before your Action.`</span></div>
|
| 2420 |
|
| 2421 |
<h3>2. Forcing JSON on Open Models</h3>
|
| 2422 |
<div class="code-block"><span class="keyword">import</span> { pipeline } <span class="keyword">from</span> <span class="string">"@huggingface/transformers"</span>;
|
|
|
|
| 2431 |
});
|
| 2432 |
<span class="keyword">const</span> raw = <span class="string">"{"</span> + out[0].generated_text; <span class="comment">// Prepend the '{' that we forced</span>
|
| 2433 |
<span class="keyword">const</span> json = JSON.parse(raw);</div>
|
| 2434 |
+
</div>`,
|
| 2435 |
+
interview: `
|
| 2436 |
<div class="section">
|
| 2437 |
<h2>🎯 Prompt Engineering – Interview Questions</h2>
|
| 2438 |
<div class="interview-box"><strong>Q1: Why does Chain of Thought work?</strong><p><strong>Answer:</strong> It gives the model additional computational steps (tokens) in which to work through the logic. Since an LLM spends a roughly fixed amount of computation per generated token, a 50-token thought process buys about 50 extra forward passes on the problem before the model commits to an answer, and each later token can attend to the reasoning already written.</p></div>
|
| 2439 |
<div class="interview-box"><strong>Q2: How is JSON Prompting different from OpenAI Function Calling?</strong><p><strong>Answer:</strong> JSON prompting is done via the text prompt and relies on the model's instruction following (good for open models). Function/Tool calling is a native API feature: the provider fine-tunes the model to emit arguments that match a declared schema and, in strict or structured-output modes, can also enforce the schema with constrained decoding, which makes it much more reliable.</p></div>
|
| 2440 |
+
</div>`
|
| 2441 |
},
|
| 2442 |
'llm-optimization': {
|
| 2443 |
+
concepts: `
|
| 2444 |
<div class="section">
|
| 2445 |
<h2>LLM Optimization – Complete Deep Dive</h2>
|
| 2446 |
<div class="info-box">
|
|
|
|
| 2470 |
<div class="callout-title">⚠️ The KV Cache Bottleneck</div>
|
| 2471 |
<p>While KV caching solves the compute problem, it introduces a memory problem. A 100K context window across high batch sizes can cause the KV cache to consume more GPU RAM than the model weights themselves! This is why techniques like <strong>PagedAttention</strong> (vLLM) and <strong>GQA (Grouped Query Attention)</strong> were invented.</p>
|
| 2472 |
</div>
|
| 2473 |
+
</div>`,
|
| 2474 |
+
code: `
|
| 2475 |
<div class="section">
|
| 2476 |
<h2>💻 LLM Optimization – Code Examples</h2>
|
| 2477 |
|
|
|
|
| 2498 |
model.quantize(tokenizer, quant_config=quant_config)
|
| 2499 |
model.save_quantized(quant_path)
|
| 2500 |
tokenizer.save_pretrained(quant_path)</div>
|
| 2501 |
+
</div>`,
|
| 2502 |
+
interview: `
|
| 2503 |
<div class="section">
|
| 2504 |
<h2>🎯 LLM Optimization – Interview Questions</h2>
|
| 2505 |
<div class="interview-box"><strong>Q1: What is the difference between compute-bound and memory-bandwidth bound?</strong><p><strong>Answer:</strong> Compute-bound means the GPU spends all its time doing math (matrix multiplications). Memory-bandwidth bound means the math is easy, but the GPU spends all its time waiting for weights to be copied from High Bandwidth Memory (HBM) to on-chip SRAM. LLM prefill (reading the prompt) is compute-bound, but decoding (generating tokens one by one) is memory-bandwidth bound.</p></div>
|
| 2506 |
<div class="interview-box"><strong>Q2: Assume you use vLLM. What is PagedAttention?</strong><p><strong>Answer:</strong> Normally, KV cache is pre-allocated contiguously in GPU memory. Because output lengths are unknown, frameworks over-allocate memory, wasting up to 60%. PagedAttention divides the KV cache into small blocks (pages) and allocates them dynamically, like virtual memory in an OS. This allows near-zero waste and 2-4x higher concurrency (batching).</p></div>
|
| 2507 |
+
</div>`
|
| 2508 |
},
|
| 2509 |
'llm-observability': {
|
| 2510 |
+
concepts: `
|
| 2511 |
<div class="section">
|
| 2512 |
<h2>LLM Observability – Complete Deep Dive</h2>
|
| 2513 |
<div class="info-box">
|
|
|
|
| 2532 |
<tr><td><strong>Helicone</strong></td><td>Proxy-based observability (just change the base URL, no SDK needed).</td></tr>
|
| 2533 |
<tr><td><strong>Opik (by Comet)</strong></td><td>Agent optimization and evaluation natively integrated with traces.</td></tr>
|
| 2534 |
</table>
|
| 2535 |
+
</div>`,
|
| 2536 |
+
code: `
|
| 2537 |
<div class="section">
|
| 2538 |
<h2>💻 Observability – Code Examples</h2>
|
| 2539 |
|
|
|
|
| 2566 |
messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">f"Context: {context}\nQuery: {query}"</span>}]
|
| 2567 |
)
|
| 2568 |
<span class="keyword">return</span> resp.choices[<span class="number">0</span>].message.content</div>
|
| 2569 |
+
</div>`,
|
| 2570 |
+
interview: `
|
| 2571 |
<div class="section">
|
| 2572 |
<h2>🎯 Observability – Interview Questions</h2>
|
| 2573 |
<div class="interview-box"><strong>Q1: What is Time-To-First-Token (TTFT) and why does it matter?</strong><p><strong>Answer:</strong> TTFT measures the latency from the moment the user sends the request until the first token streams back to the client. In LLM applications, total end-to-end latency might be 5-10 seconds, which is unacceptable for UX. Streaming combined with low TTFT (<1 second) creates the illusion of speed and keeps users engaged.</p></div>
|
| 2574 |
<div class="interview-box"><strong>Q2: Why use a proxy like Helicone over an SDK like LangSmith?</strong><p><strong>Answer:</strong> A proxy requires almost no code changes: you change the API base URL from <code>api.openai.com</code> to <code>oai.hconeai.com</code> and pass your proxy key in the header. It then logs all prompts, responses, costs, and latencies automatically. However, an SDK (like Langfuse/LangSmith) is required if you want deep, nested trace trees for complex agents (e.g., seeing exactly which step in a 10-step LangGraph flow failed).</p></div>
|
| 2575 |
+
</div>`
|
| 2576 |
},
|
| 2577 |
'multiagent': {
|
| 2578 |
+
concepts: `
|
| 2579 |
<div class="section">
|
| 2580 |
<h2>🕸️ Multi-Agent Systems (MAS)</h2>
|
| 2581 |
<div class="info-box">
|
|
|
|
| 2602 |
<div class="callout-title">💡 Minimizing Friction</div>
|
| 2603 |
<p>When picking a pattern, prioritize minimizing communication overhead. Ten agents are no better than two if they duplicate work. The system should feel smarter than its individual parts.</p>
|
| 2604 |
</div>
|
| 2605 |
+
</div>`,
|
| 2606 |
+
code: `
|
| 2607 |
<div class="section">
|
| 2608 |
<h2>💻 Multi-Agent – Code Examples</h2>
|
| 2609 |
<h3>Simple Router Pattern with LiteLLM</h3>
|
|
|
|
| 2622 |
<span class="keyword">return</span> finance_specialist(query)
|
| 2623 |
<span class="keyword">else</span>:
|
| 2624 |
<span class="keyword">return</span> general_agent(query)</div>
|
| 2625 |
+
</div>`,
|
| 2626 |
+
interview: `
|
| 2627 |
<div class="section">
|
| 2628 |
<h2>🎯 Multi-Agent – Interview Questions</h2>
|
| 2629 |
<div class="interview-box"><strong>Q1: Parallel vs Sequential orchestration?</strong><p><strong>Answer:</strong> Parallel is for independent tasks (data extraction + web search) to reduce latency. Sequential is for dependent tasks where step B needs output of step A (code writing then code review). Use parallel for scale, sequential for quality-controlled pipelines.</p></div>
|
| 2630 |
<div class="interview-box"><strong>Q2: What is the Hierarchical pattern?</strong><p><strong>Answer:</strong> It mimics a corporate structure: a Manager/Planner agent receives the high-level goal, breaks it into sub-tasks, and delegates them to specialized Worker agents. The Manager tracks state and makes the final quality check. Best for complex, ambiguous projects.</p></div>
|
| 2631 |
+
</div>`
|
| 2632 |
},
|
| 2633 |
'tools': {
|
| 2634 |
+
concepts: `
|
| 2635 |
<div class="section">
|
| 2636 |
<h2>🔧 Function Calling & Tools</h2>
|
| 2637 |
<div class="info-box">
|
|
|
|
| 2647 |
|
| 2648 |
<h3>3. Verbalized Sampling</h3>
|
| 2649 |
<p>Forcing the agent to generate a "Thought:" block before the "Action:" block. This conditions the tool selection on a logical premise, significantly reducing errors in choosing the wrong tool or arguments.</p>
|
| 2650 |
+
</div>`,
|
| 2651 |
+
code: `
|
| 2652 |
<div class="section">
|
| 2653 |
<h2>💻 Tools – Code Examples</h2>
|
| 2654 |
<h3>OpenAI Native Tool Call</h3>
|
|
|
|
| 2663 |
}
|
| 2664 |
}]
|
| 2665 |
<span class="comment"># Pass this to chat.completions.create(..., tools=tools)</span></div>
|
| 2666 |
+
</div>`,
|
| 2667 |
+
interview: `
|
| 2668 |
<div class="section">
|
| 2669 |
<h2>🎯 Tools – Interview Questions</h2>
|
| 2670 |
<div class="interview-box"><strong>Q1: Why use Verbalized Sampling in tool calling?</strong><p><strong>Answer:</strong> It forces the model to articulate a rationale *before* picking a tool. Since tokens are generated left-to-right, the tool selection becomes conditioned on the reasoning, which increases precision, especially when multiple similar tools exist.</p></div>
|
| 2671 |
+
</div>`
|
| 2672 |
}
|
| 2673 |
});
|
| 2674 |
|
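The pre-fill snippet in this diff rebuilds the JSON with `"{" + out[0].generated_text` and parses it directly. As a minimal sketch, assuming the model may keep writing after the object closes, a hypothetical helper (`parsePrefilledJson`, not part of app.js) could cut the text at the first balanced closing brace before parsing. Unlike a bare `}` stop sequence, this tolerates nested objects and braces inside string values.

```javascript
// Hypothetical helper, not part of app.js: recover the first complete JSON
// object from a completion whose prompt was pre-filled with "{".
function parsePrefilledJson(generatedText) {
  const raw = "{" + generatedText; // restore the brace supplied in the prompt
  let depth = 0;        // current nesting level of { }
  let inString = false; // true while scanning inside a JSON string
  let escaped = false;  // true right after a backslash inside a string

  for (let i = 0; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      // The outermost object just closed: ignore anything that follows.
      if (depth === 0) return JSON.parse(raw.slice(0, i + 1));
    }
  }
  throw new SyntaxError("no complete JSON object in model output");
}

// Example: nested arguments plus trailing chatter after the object.
const call = parsePrefilledJson(
  '"tool": "search_web", "args": {"q": "a } b"}}\nHope this helps!'
);
console.log(call.tool, call.args.q);
```

The scan tracks string state so that a `}` inside a quoted value does not end the object early. Output that never closes the object raises a `SyntaxError` instead of returning a partial result.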