AashishAIHub committed on
Commit 9110364 · 1 Parent(s): 1cd23a6

fix: resolve syntax error in app.js caused by escaped backticks
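
For context, here is a minimal sketch of why an escaped backtick breaks a template literal. The `parses` helper and the sample strings are hypothetical and not taken from app.js; the sketch only assumes that the module content is stored in template literals, as the hunks in this commit show. The exact error raised in app.js depends on surrounding code that this diff does not display.

```javascript
// Hypothetical helper: returns true when `source` parses as a JavaScript function body.
function parses(source) {
  try {
    new Function(source); // compiles the source without running it
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

const BT = String.fromCharCode(96); // a single backtick character

// Broken variant: the closing delimiter is written as \` so the literal never ends.
const broken = "const m = { code: " + BT + "<div></div>\\" + BT + " };";

// Fixed variant: a plain backtick closes the template literal.
const fixed = "const m = { code: " + BT + "<div></div>" + BT + " };";

const brokenParses = parses(broken);
const fixedParses = parses(fixed);

console.log(brokenParses); // false
console.log(fixedParses);  // true
```

Inside a template literal, `\`` is an escape sequence for a literal backtick, so it cannot close the string. That matches the `+` lines in this commit, which drop the backslash.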

Files changed (1)
  1. GenAI-AgenticAI/app.js +31 -31
GenAI-AgenticAI/app.js CHANGED
@@ -2403,20 +2403,20 @@ Object.assign(MODULE_CONTENT, {
2403
  <li><strong>Stop Sequences:</strong> For older or open models, using <code>}</code> as a stop sequence guarantees no trailing text.</li>
2404
  <li><strong>Pre-filling the Assistant:</strong> Append <code>{</code> to the end of your prompt so the model is forced to start generating JSON keys instantly without saying "Here is the JSON...".</li>
2405
  </ul>
2406
- </div>\`,
2407
- code: \`
2408
  <div class="section">
2409
  <h2>💻 Prompting for Agents — Code Examples</h2>
2410
 
2411
  <h3>1. Verbalized Sampling Prompt Template</h3>
2412
- <div class="code-block"><span class="keyword">const</span> system_prompt = <span class="string">\`You are a sophisticated AI agent with access to tools.
2413
  When given a task, you MUST use the following format:
2414
 
2415
  Thought: Consider what you need to do, step by step. Which tool is needed?
2416
  Action: The name of the tool to use (e.g. "search_web", "calculate")
2417
  Action Input: The arguments for the tool in valid JSON.
2418
 
2419
- You MUST articulate your Thought before your Action.\`</span></div>
2420
 
2421
  <h3>2. Forcing JSON on Open Models</h3>
2422
  <div class="code-block"><span class="keyword">import</span> { pipeline } <span class="keyword">from</span> <span class="string">"@huggingface/transformers"</span>;
@@ -2431,16 +2431,16 @@ You MUST articulate your Thought before your Action.\`</span></div>
2431
  });
2432
  <span class="keyword">const</span> raw = <span class="string">"{"</span> + out[0].generated_text; <span class="comment">// Prepend the '{' that we forced</span>
2433
  <span class="keyword">const</span> json = JSON.parse(raw);</div>
2434
- </div>\`,
2435
- interview: \`
2436
  <div class="section">
2437
  <h2>🎯 Prompt Engineering — Interview Questions</h2>
2438
  <div class="interview-box"><strong>Q1: Why does Chain of Thought work?</strong><p><strong>Answer:</strong> It provides additional computational steps (tokens) for the model to process logic. Since an LLM spends a fixed amount of computation per token, forcing it to generate a 50-token thought process before answering allocates 50x more computation to solving the problem than just answering immediately.</p></div>
2439
  <div class="interview-box"><strong>Q2: How is JSON Prompting different from OpenAI Function Calling?</strong><p><strong>Answer:</strong> JSON prompting is done via the text prompt and relies on the model's instruction following (good for open models). Function/Tool calling is a native API feature where the provider fine-tunes the model explicitly to output arguments matching a schema via constrained decoding, ensuring much higher reliability.</p></div>
2440
- </div>\`
2441
  },
2442
  'llm-optimization': {
2443
- concepts: \`
2444
  <div class="section">
2445
  <h2>🗜️ LLM Optimization — Complete Deep Dive</h2>
2446
  <div class="info-box">
@@ -2470,8 +2470,8 @@ You MUST articulate your Thought before your Action.\`</span></div>
2470
  <div class="callout-title">⚠️ The KV Cache Bottleneck</div>
2471
  <p>While KV caching solves the compute problem, it introduces a memory problem. A 100K context window across high batch sizes can cause the KV cache to consume more GPU RAM than the model weights themselves! This is why techniques like <strong>PagedAttention</strong> (vLLM) and <strong>GQA (Grouped Query Attention)</strong> were invented.</p>
2472
  </div>
2473
- </div>\`,
2474
- code: \`
2475
  <div class="section">
2476
  <h2>💻 LLM Optimization — Code Examples</h2>
2477
 
@@ -2498,16 +2498,16 @@ tokenizer = AutoTokenizer.from_pretrained(model_path)
2498
  model.quantize(tokenizer, quant_config=quant_config)
2499
  model.save_quantized(quant_path)
2500
  tokenizer.save_pretrained(quant_path)</div>
2501
- </div>\`,
2502
- interview: \`
2503
  <div class="section">
2504
  <h2>🎯 LLM Optimization — Interview Questions</h2>
2505
  <div class="interview-box"><strong>Q1: What is the difference between compute-bound and memory-bandwidth bound?</strong><p><strong>Answer:</strong> Compute-bound means the GPU spends all its time doing math (matrix multiplications). Memory-bandwidth bound means the math is easy, but the GPU spends all its time waiting for weights to be copied from High Bandwidth Memory (HBM) to on-chip SRAM. LLM prefill (reading the prompt) is compute-bound, but decoding (generating tokens one by one) is memory-bandwidth bound.</p></div>
2506
  <div class="interview-box"><strong>Q2: Assume you use vLLM. What is PagedAttention?</strong><p><strong>Answer:</strong> Normally, KV cache is pre-allocated continuously in GPU memory. Because output lengths are unknown, frameworks over-allocate memory, wasting up to 60%. PagedAttention divides the KV cache into small blocks (pages) and allocates them dynamically, like virtual memory in an OS. This allows near-zero waste and 2-4x higher concurrency (batching).</p></div>
2507
- </div>\`
2508
  },
2509
  'llm-observability': {
2510
- concepts: \`
2511
  <div class="section">
2512
  <h2>🔭 LLM Observability — Complete Deep Dive</h2>
2513
  <div class="info-box">
@@ -2532,8 +2532,8 @@ tokenizer.save_pretrained(quant_path)</div>
2532
  <tr><td><strong>Helicone</strong></td><td>Proxy-based observability (just change the base URL, no SDK needed).</td></tr>
2533
  <tr><td><strong>Opik (by Comet)</strong></td><td>Agent optimization and evaluation natively integrated with traces.</td></tr>
2534
  </table>
2535
- </div>\`,
2536
- code: \`
2537
  <div class="section">
2538
  <h2>💻 Observability — Code Examples</h2>
2539
 
@@ -2566,16 +2566,16 @@ client = Client()
2566
  messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">f"Context: {context}\nQuery: {query}"</span>}]
2567
  )
2568
  <span class="keyword">return</span> resp.choices[<span class="number">0</span>].message.content</div>
2569
- </div>\`,
2570
- interview: \`
2571
  <div class="section">
2572
  <h2>🎯 Observability — Interview Questions</h2>
2573
  <div class="interview-box"><strong>Q1: What is Time-To-First-Token (TTFT) and why does it matter?</strong><p><strong>Answer:</strong> TTFT measures the latency from the moment the user sends the request until the first token streams back to the client. In LLM applications, total end-to-end latency might be 5-10 seconds, which is unacceptable for UX. Streaming combined with low TTFT (&lt;1 second) creates the illusion of speed and keeps users engaged.</p></div>
2574
  <div class="interview-box"><strong>Q2: Why use a proxy like Helicone over an SDK like LangSmith?</strong><p><strong>Answer:</strong> A proxy requires ZERO code changes — you simply change the API base URL from <code>api.openai.com</code> to <code>oai.hconeai.com</code> and pass your proxy key in the header. It automatically logs all prompts, responses, costs, and latencies. However, an SDK (like Langfuse/LangSmith) is required if you want deep, nested trace trees for complex agents (e.g., seeing exactly which step in a 10-step LangGraph flow failed).</p></div>
2575
- </div>\`
2576
  },
2577
  'multiagent': {
2578
- concepts: \`
2579
  <div class="section">
2580
  <h2>🕸️ Multi-Agent Systems (MAS)</h2>
2581
  <div class="info-box">
@@ -2602,8 +2602,8 @@ client = Client()
2602
  <div class="callout-title">💡 Minimizing Friction</div>
2603
  <p>When picking a pattern, prioritize minimizing communication overhead. 10 agents isn't better than 2 if they duplicate work. The system should feel smarter than its individual parts.</p>
2604
  </div>
2605
- </div>\`,
2606
- code: \`
2607
  <div class="section">
2608
  <h2>💻 Multi-Agent — Code Examples</h2>
2609
  <h3>Simple Router Pattern with LiteLLM</h3>
@@ -2622,16 +2622,16 @@ client = Client()
2622
  <span class="keyword">return</span> finance_specialist(query)
2623
  <span class="keyword">else</span>:
2624
  <span class="keyword">return</span> general_agent(query)</div>
2625
- </div>\`,
2626
- interview: \`
2627
  <div class="section">
2628
  <h2>🎯 Multi-Agent — Interview Questions</h2>
2629
  <div class="interview-box"><strong>Q1: Parallel vs Sequential orchestration?</strong><p><strong>Answer:</strong> Parallel is for independent tasks (data extraction + web search) to reduce latency. Sequential is for dependent tasks where step B needs output of step A (code writing then code review). Use parallel for scale, sequential for quality-controlled pipelines.</p></div>
2630
  <div class="interview-box"><strong>Q2: What is the Hierarchical pattern?</strong><p><strong>Answer:</strong> It mimics a corporate structure: a Manager/Planner agent receives the high-level goal, breaks it into sub-tasks, and delegates them to specialized Worker agents. The Manager tracks state and makes the final quality check. Best for complex, ambiguous projects.</p></div>
2631
- </div>\`
2632
  },
2633
  'tools': {
2634
- concepts: \`
2635
  <div class="section">
2636
  <h2>🔧 Function Calling & Tools</h2>
2637
  <div class="info-box">
@@ -2647,8 +2647,8 @@ client = Client()
2647
 
2648
  <h3>3. Verbalized Sampling</h3>
2649
  <p>Forcing the agent to generate a "Thought:" block before the "Action:" block. This conditions the tool selection on a logical premise, significantly reducing errors in choosing the wrong tool or arguments.</p>
2650
- </div>\`,
2651
- code: \`
2652
  <div class="section">
2653
  <h2>💻 Tools — Code Examples</h2>
2654
  <h3>OpenAI Native Tool Call</h3>
@@ -2663,12 +2663,12 @@ client = Client()
2663
  }
2664
  }]
2665
  <span class="comment"># Pass this to chat.completions.create(..., tools=tools)</span></div>
2666
- </div>\`,
2667
- interview: \`
2668
  <div class="section">
2669
  <h2>🎯 Tools — Interview Questions</h2>
2670
  <div class="interview-box"><strong>Q1: Why use Verbalized Sampling in tool calling?</strong><p><strong>Answer:</strong> It forces the model to articulate a rationale *before* picking a tool. Since tokens are generated left-to-right, the tool selection becomes conditioned on the reasoning, which increases precision, especially when multiple similar tools exist.</p></div>
2671
- </div>\`
2672
  }
2673
  });
2674
 
 
2403
  <li><strong>Stop Sequences:</strong> For older or open models, using <code>}</code> as a stop sequence guarantees no trailing text.</li>
2404
  <li><strong>Pre-filling the Assistant:</strong> Append <code>{</code> to the end of your prompt so the model is forced to start generating JSON keys instantly without saying "Here is the JSON...".</li>
2405
  </ul>
2406
+ </div>`,
2407
+ code: `
2408
  <div class="section">
2409
  <h2>💻 Prompting for Agents — Code Examples</h2>
2410
 
2411
  <h3>1. Verbalized Sampling Prompt Template</h3>
2412
+ <div class="code-block"><span class="keyword">const</span> system_prompt = <span class="string">`You are a sophisticated AI agent with access to tools.
2413
  When given a task, you MUST use the following format:
2414
 
2415
  Thought: Consider what you need to do, step by step. Which tool is needed?
2416
  Action: The name of the tool to use (e.g. "search_web", "calculate")
2417
  Action Input: The arguments for the tool in valid JSON.
2418
 
2419
+ You MUST articulate your Thought before your Action.`</span></div>
2420
 
2421
  <h3>2. Forcing JSON on Open Models</h3>
2422
  <div class="code-block"><span class="keyword">import</span> { pipeline } <span class="keyword">from</span> <span class="string">"@huggingface/transformers"</span>;
 
2431
  });
2432
  <span class="keyword">const</span> raw = <span class="string">"{"</span> + out[0].generated_text; <span class="comment">// Prepend the '{' that we forced</span>
2433
  <span class="keyword">const</span> json = JSON.parse(raw);</div>
2434
+ </div>`,
2435
+ interview: `
2436
  <div class="section">
2437
  <h2>🎯 Prompt Engineering — Interview Questions</h2>
2438
  <div class="interview-box"><strong>Q1: Why does Chain of Thought work?</strong><p><strong>Answer:</strong> It provides additional computational steps (tokens) for the model to process logic. Since an LLM spends a fixed amount of computation per token, forcing it to generate a 50-token thought process before answering allocates 50x more computation to solving the problem than just answering immediately.</p></div>
2439
  <div class="interview-box"><strong>Q2: How is JSON Prompting different from OpenAI Function Calling?</strong><p><strong>Answer:</strong> JSON prompting is done via the text prompt and relies on the model's instruction following (good for open models). Function/Tool calling is a native API feature where the provider fine-tunes the model explicitly to output arguments matching a schema via constrained decoding, ensuring much higher reliability.</p></div>
2440
+ </div>`
2441
  },
2442
  'llm-optimization': {
2443
+ concepts: `
2444
  <div class="section">
2445
  <h2>🗜️ LLM Optimization — Complete Deep Dive</h2>
2446
  <div class="info-box">
 
2470
  <div class="callout-title">⚠️ The KV Cache Bottleneck</div>
2471
  <p>While KV caching solves the compute problem, it introduces a memory problem. A 100K context window across high batch sizes can cause the KV cache to consume more GPU RAM than the model weights themselves! This is why techniques like <strong>PagedAttention</strong> (vLLM) and <strong>GQA (Grouped Query Attention)</strong> were invented.</p>
2472
  </div>
2473
+ </div>`,
2474
+ code: `
2475
  <div class="section">
2476
  <h2>💻 LLM Optimization — Code Examples</h2>
2477
 
 
2498
  model.quantize(tokenizer, quant_config=quant_config)
2499
  model.save_quantized(quant_path)
2500
  tokenizer.save_pretrained(quant_path)</div>
2501
+ </div>`,
2502
+ interview: `
2503
  <div class="section">
2504
  <h2>🎯 LLM Optimization — Interview Questions</h2>
2505
  <div class="interview-box"><strong>Q1: What is the difference between compute-bound and memory-bandwidth bound?</strong><p><strong>Answer:</strong> Compute-bound means the GPU spends all its time doing math (matrix multiplications). Memory-bandwidth bound means the math is easy, but the GPU spends all its time waiting for weights to be copied from High Bandwidth Memory (HBM) to on-chip SRAM. LLM prefill (reading the prompt) is compute-bound, but decoding (generating tokens one by one) is memory-bandwidth bound.</p></div>
2506
  <div class="interview-box"><strong>Q2: Assume you use vLLM. What is PagedAttention?</strong><p><strong>Answer:</strong> Normally, KV cache is pre-allocated continuously in GPU memory. Because output lengths are unknown, frameworks over-allocate memory, wasting up to 60%. PagedAttention divides the KV cache into small blocks (pages) and allocates them dynamically, like virtual memory in an OS. This allows near-zero waste and 2-4x higher concurrency (batching).</p></div>
2507
+ </div>`
2508
  },
2509
  'llm-observability': {
2510
+ concepts: `
2511
  <div class="section">
2512
  <h2>🔭 LLM Observability — Complete Deep Dive</h2>
2513
  <div class="info-box">
 
2532
  <tr><td><strong>Helicone</strong></td><td>Proxy-based observability (just change the base URL, no SDK needed).</td></tr>
2533
  <tr><td><strong>Opik (by Comet)</strong></td><td>Agent optimization and evaluation natively integrated with traces.</td></tr>
2534
  </table>
2535
+ </div>`,
2536
+ code: `
2537
  <div class="section">
2538
  <h2>💻 Observability — Code Examples</h2>
2539
 
 
2566
  messages=[{<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">f"Context: {context}\nQuery: {query}"</span>}]
2567
  )
2568
  <span class="keyword">return</span> resp.choices[<span class="number">0</span>].message.content</div>
2569
+ </div>`,
2570
+ interview: `
2571
  <div class="section">
2572
  <h2>🎯 Observability — Interview Questions</h2>
2573
  <div class="interview-box"><strong>Q1: What is Time-To-First-Token (TTFT) and why does it matter?</strong><p><strong>Answer:</strong> TTFT measures the latency from the moment the user sends the request until the first token streams back to the client. In LLM applications, total end-to-end latency might be 5-10 seconds, which is unacceptable for UX. Streaming combined with low TTFT (&lt;1 second) creates the illusion of speed and keeps users engaged.</p></div>
2574
  <div class="interview-box"><strong>Q2: Why use a proxy like Helicone over an SDK like LangSmith?</strong><p><strong>Answer:</strong> A proxy requires ZERO code changes — you simply change the API base URL from <code>api.openai.com</code> to <code>oai.hconeai.com</code> and pass your proxy key in the header. It automatically logs all prompts, responses, costs, and latencies. However, an SDK (like Langfuse/LangSmith) is required if you want deep, nested trace trees for complex agents (e.g., seeing exactly which step in a 10-step LangGraph flow failed).</p></div>
2575
+ </div>`
2576
  },
2577
  'multiagent': {
2578
+ concepts: `
2579
  <div class="section">
2580
  <h2>🕸️ Multi-Agent Systems (MAS)</h2>
2581
  <div class="info-box">
 
2602
  <div class="callout-title">💡 Minimizing Friction</div>
2603
  <p>When picking a pattern, prioritize minimizing communication overhead. 10 agents isn't better than 2 if they duplicate work. The system should feel smarter than its individual parts.</p>
2604
  </div>
2605
+ </div>`,
2606
+ code: `
2607
  <div class="section">
2608
  <h2>💻 Multi-Agent — Code Examples</h2>
2609
  <h3>Simple Router Pattern with LiteLLM</h3>
 
2622
  <span class="keyword">return</span> finance_specialist(query)
2623
  <span class="keyword">else</span>:
2624
  <span class="keyword">return</span> general_agent(query)</div>
2625
+ </div>`,
2626
+ interview: `
2627
  <div class="section">
2628
  <h2>🎯 Multi-Agent — Interview Questions</h2>
2629
  <div class="interview-box"><strong>Q1: Parallel vs Sequential orchestration?</strong><p><strong>Answer:</strong> Parallel is for independent tasks (data extraction + web search) to reduce latency. Sequential is for dependent tasks where step B needs output of step A (code writing then code review). Use parallel for scale, sequential for quality-controlled pipelines.</p></div>
2630
  <div class="interview-box"><strong>Q2: What is the Hierarchical pattern?</strong><p><strong>Answer:</strong> It mimics a corporate structure: a Manager/Planner agent receives the high-level goal, breaks it into sub-tasks, and delegates them to specialized Worker agents. The Manager tracks state and makes the final quality check. Best for complex, ambiguous projects.</p></div>
2631
+ </div>`
2632
  },
2633
  'tools': {
2634
+ concepts: `
2635
  <div class="section">
2636
  <h2>🔧 Function Calling & Tools</h2>
2637
  <div class="info-box">
 
2647
 
2648
  <h3>3. Verbalized Sampling</h3>
2649
  <p>Forcing the agent to generate a "Thought:" block before the "Action:" block. This conditions the tool selection on a logical premise, significantly reducing errors in choosing the wrong tool or arguments.</p>
2650
+ </div>`,
2651
+ code: `
2652
  <div class="section">
2653
  <h2>💻 Tools — Code Examples</h2>
2654
  <h3>OpenAI Native Tool Call</h3>
 
2663
  }
2664
  }]
2665
  <span class="comment"># Pass this to chat.completions.create(..., tools=tools)</span></div>
2666
+ </div>`,
2667
+ interview: `
2668
  <div class="section">
2669
  <h2>🎯 Tools — Interview Questions</h2>
2670
  <div class="interview-box"><strong>Q1: Why use Verbalized Sampling in tool calling?</strong><p><strong>Answer:</strong> It forces the model to articulate a rationale *before* picking a tool. Since tokens are generated left-to-right, the tool selection becomes conditioned on the reasoning, which increases precision, especially when multiple similar tools exist.</p></div>
2671
+ </div>`
2672
  }
2673
  });
2674