AashishAIHub committed on
Commit
04751a7
·
1 Parent(s): 5ad7fc4

style: final educator-grade polish for AI-Engineer and CME 295

Files changed (2)
  1. AI-Engineer/app.js +27 -16
  2. CME295-Transformers/index.html +20 -2
AI-Engineer/app.js CHANGED
@@ -927,15 +927,15 @@ Object.assign(MODULE_CONTENT, {
927
  <p><strong>Indexing phase (offline):</strong> (1) Load documents (PDF, HTML, Markdown, DB), (2) Chunk into segments (~500 tokens), (3) Embed each chunk with an embedding model, (4) Store vectors + metadata in a vector database.</p>
928
  <p><strong>Query phase (online):</strong> (1) Embed user query with same model, (2) Retrieve top-k similar chunks via ANN search, (3) Inject chunks into LLM prompt as context, (4) LLM generates a grounded response with citations.</p>
929
 
930
- <h3>2. Chunking Strategies — The Most Critical Design Decision</h3>
931
- <table>
932
- <tr><th>Strategy</th><th>How It Works</th><th>Best For</th><th>Tradeoff</th></tr>
933
- <tr><td><strong>Fixed-size</strong></td><td>Split every N tokens</td><td>Generic text</td><td>May cut mid-sentence</td></tr>
934
- <tr><td><strong>Recursive character</strong></td><td>Split by paragraphs, sentences, words</td><td>Most documents</td><td>LangChain default; good balance</td></tr>
935
- <tr><td><strong>Semantic chunking</strong></td><td>Split where embedding similarity drops</td><td>Long documents</td><td>Groups related content; 10x slower</td></tr>
936
- <tr><td><strong>Document structure</strong></td><td>Parse by headings, sections, tables</td><td>PDFs, HTML, Markdown</td><td>Preserves context hierarchy</td></tr>
937
- <tr><td><strong>Agentic chunking</strong></td><td>LLM decides chunk boundaries</td><td>Highest quality</td><td>Expensive but best recall</td></tr>
938
- </table>
939
  <div class="callout tip">
940
  <div class="callout-title">πŸ’‘ Chunk Size Sweet Spot</div>
941
  <p>256-512 tokens for OpenAI embeddings, 128-256 for smaller models. Use 50-100 token overlap. Test with your actual queries β€” measure retrieval recall, not just generation quality.</p>
@@ -1187,8 +1187,15 @@ res = index.query(vector=query_emb, top_k=<span class="number">10</span>,
1187
  <div class="box-content">An agent is an LLM + a <strong>reasoning loop</strong> + <strong>tools</strong>. It doesn't just respond β€” it plans, calls tools, observes results, and iterates. The agent paradigm turns LLMs from answer machines into action machines.</div>
1188
  </div>
1189
 
1190
- <h3>1. ReAct — The Foundation</h3>
1191
- <p>ReAct (Yao 2022): <strong>Thought</strong> > <strong>Action</strong> > <strong>Observation</strong> > repeat. The LLM reasons about what to do, calls a tool, sees the result, and continues until it has a final answer.</p>
1192
 
1193
  <h3>2. Agent Architectures & Patterns</h3>
1194
  <table>
@@ -1799,14 +1806,18 @@ output = model.generate(inputs, max_new_tokens=<span class="number">100</span>)
1799
  </table>
1800
 
1801
- <h3>3. vLLM — The Production Standard</h3>
1802
- <p><strong>PagedAttention</strong> (inspired by OS virtual memory): stores KV cache in non-contiguous memory pages. Traditional serving pre-allocates max KV cache — wasting 60-80% of GPU memory. PagedAttention allocates on demand: 3-24x higher throughput. <a href="https://arxiv.org/abs/2309.06180" target="_blank" style="color:var(--accent)">[Read the PagedAttention Paper]</a></p>
1803
 
1804
  <div class="info-box">
1805
- <div class="box-title">πŸ“˜ Recommended Resources</div>
1806
  <ul style="margin-top:10px; color:var(--text-muted)">
1807
- <li><a href="https://github.com/vllm-project/vllm" target="_blank" style="color:var(--accent)">vLLM Project Page</a> β€” High-throughput serving engine.</li>
1808
- <li><a href="https://github.com/huggingface/text-generation-inference" target="_blank" style="color:var(--accent)">HF TGI Repository</a> β€” Production-ready LLM serving.</li>
1809
- <li><a href="https://ollama.com" target="_blank" style="color:var(--accent)">Ollama Official Site</a> β€” Best for local development.</li>
1810
  </ul>
1811
  </div>
1812
 
 
927
  <p><strong>Indexing phase (offline):</strong> (1) Load documents (PDF, HTML, Markdown, DB), (2) Chunk into segments (~500 tokens), (3) Embed each chunk with an embedding model, (4) Store vectors + metadata in a vector database.</p>
928
  <p><strong>Query phase (online):</strong> (1) Embed user query with same model, (2) Retrieve top-k similar chunks via ANN search, (3) Inject chunks into LLM prompt as context, (4) LLM generates a grounded response with citations.</p>
929
 
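Both phases in one minimal sketch, assuming the sentence-transformers package and an in-memory dot-product search standing in for a real vector database (the corpus and model name are illustrative):

```python
# Minimal RAG sketch: indexing + query in one file. Assumes the
# sentence-transformers package; corpus and model name are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Standard shipping takes 3-5 business days.",
]

# Indexing phase (offline): embed every chunk once, store the vectors.
index = model.encode(chunks, normalize_embeddings=True)

# Query phase (online): embed the query with the SAME model, take top-k
# by cosine similarity (a dot product, since vectors are normalized).
query = "How long do I have to return an item?"
q = model.encode([query], normalize_embeddings=True)[0]
top_k = np.argsort(index @ q)[::-1][:2]
context = "\n".join(chunks[i] for i in top_k)

# The retrieved context is injected into the LLM prompt for grounding.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```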
930
+ <h3>2. The Evolution of RAG</h3>
931
+ <div class="comparison">
932
+ <div class="box-bad"><strong>Stage 1: Naive RAG</strong> (Retreive > Stuff > Gen). Works for simple docs. Fails when query and doc use different words (vocabulary mismatch).</div>
933
+ <div class="box-good"><strong>Stage 2: Advanced RAG</strong> (Rewrite > Retrieve > Re-rank). Uses HyDE to rewrite queries and a Re-ranker model to sort chunks by actual relevance before feeding LLM.</div>
934
+ </div>
935
+ <div class="callout insight">
936
+ <div class="callout-title">πŸ’‘ The "Library Librarian" Analogy</div>
937
+ <p>Naive RAG is like a librarian who only looks at the <strong>index</strong> of a book. Advanced RAG is a librarian who <strong>skims the chapters</strong> to make sure they actually answer your question before handing you the book.</p>
938
+ </div>
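A hedged sketch of the Stage 2 re-ranking step, assuming the sentence-transformers CrossEncoder with a public MS MARCO checkpoint; llm_rewrite() and retrieve() are hypothetical stand-ins for the HyDE rewrite and the vector search:

```python
# Stage 2 sketch: rewrite -> retrieve -> re-rank. The cross-encoder scores
# (query, chunk) pairs jointly, which is slower than embedding similarity
# but far better at judging actual relevance.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, keep=3):
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# pseudo_answer = llm_rewrite(query)           # HyDE: draft a fake answer
# candidates = retrieve(pseudo_answer, k=20)   # cast a wide net with it
# context = rerank(query, candidates)          # keep only what truly answers
```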
939
  <div class="callout tip">
940
  <div class="callout-title">πŸ’‘ Chunk Size Sweet Spot</div>
941
  <p>256-512 tokens for OpenAI embeddings, 128-256 for smaller models. Use 50-100 token overlap. Test with your actual queries β€” measure retrieval recall, not just generation quality.</p>
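A minimal fixed-size chunker matching those numbers, using a whitespace split as a stand-in for the embedding model's tokenizer:

```python
# Fixed-size chunking with overlap. Whitespace "tokens" keep the sketch
# short; swap in the embedding model's tokenizer for real token counts.
def chunk(text, size=512, overlap=64):
    tokens = text.split()
    step = size - overlap  # each chunk re-reads the previous chunk's tail
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```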
 
1187
  <div class="box-content">An agent is an LLM + a <strong>reasoning loop</strong> + <strong>tools</strong>. It doesn't just respond β€” it plans, calls tools, observes results, and iterates. The agent paradigm turns LLMs from answer machines into action machines.</div>
1188
  </div>
1189
 
1190
+ <h3>1. ReAct: The Brain's Operating System</h3>
1191
+ <p>ReAct (Yao 2022) is the fundamental "Thinking Loop". It forces the LLM to externalize its reasoning before acting: <strong>Thought</strong> (Reasoning) > <strong>Action</strong> (Tool Call) > <strong>Observation</strong> (Result). This prevents "reflexive" hallucinations, where the model answers before it has any data.</p>
1192
+
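A minimal sketch of that loop; call_llm() and the tools registry are hypothetical stand-ins, and a production agent would parse structured tool calls instead of a regex:

```python
# Minimal ReAct loop. call_llm() is a hypothetical stand-in for any chat
# completion call; the model is prompted to emit Thought/Action/Final Answer.
import re

tools = {"search": lambda q: f"(search results for {q!r})"}

def react(question, call_llm, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)               # Thought + Action text
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        m = re.search(r"Action: (\w+)\[(.*)\]", step)
        if m and m.group(1) in tools:             # run tool, feed result back
            observation = tools[m.group(1)](m.group(2))
            transcript += f"Observation: {observation}\n"
    return "(stopped: max_steps reached)"
```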
1193
+ <div class="info-box">
1194
+ <div class="box-title">πŸ”Œ MCP: The USB Port for AI</div>
1195
+ <div class="box-content">
1196
+ <strong>Model Context Protocol (MCP)</strong> is the industry's attempt to standardize how agents talk to tools. Instead of rewriting tool-calling code for every model, MCP creates a universal "plug-and-play" interface where any model can use any database, API, or local file system seamlessly.
1197
+ </div>
1198
+ </div>
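Under the hood, MCP speaks JSON-RPC 2.0. A rough sketch of a tool invocation follows; the "tools/call" method name matches the published spec, but the tool itself is hypothetical and the exact fields should be verified against modelcontextprotocol.io:

```python
# Rough shape of an MCP tool invocation (MCP rides on JSON-RPC 2.0).
# The tool name and arguments here are hypothetical examples.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",                     # hypothetical tool
        "arguments": {"sql": "SELECT COUNT(*) FROM users"},
    },
}
print(json.dumps(request, indent=2))
```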
1199
 
1200
  <h3>2. Agent Architectures & Patterns</h3>
1201
  <table>
 
1806
  </table>
1807
 
1808
  <h3>3. vLLM — The Production Standard</h3>
1809
+ <p><strong>PagedAttention</strong>: Inspired by OS virtual memory. Traditional serving pre-allocates max VRAM for every request — an "empty seat" problem. PagedAttention allocates memory on-demand in blocks, allowing you to serve <strong>5-10x more users</strong> on the same hardware.</p>
1810
 
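A toy allocator illustrating the idea (this is not vLLM's actual code): blocks leave a shared pool only when a sequence crosses a block boundary, so no request reserves its worst-case length up front:

```python
# Toy block allocator illustrating the PagedAttention idea; not vLLM code.
BLOCK_TOKENS = 16  # KV entries per block

class PagedKVCache:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # pool shared by all requests
        self.tables = {}                       # seq_id -> list of block ids
        self.lengths = {}                      # seq_id -> tokens so far

    def append_token(self, seq_id):
        n = self.lengths[seq_id] = self.lengths.get(seq_id, 0) + 1
        if (n - 1) % BLOCK_TOKENS == 0:        # crossed a block boundary
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):                 # finished requests return
        self.free.extend(self.tables.pop(seq_id, []))  # blocks to the pool
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(total_blocks=64)
for _ in range(20):
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))  # 2 blocks for 20 tokens, not a max-length reservation
```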
1811
+ <div class="callout tip">
1812
+ <div class="callout-title">πŸš€ Scaling Insight</div>
1813
+ <p>In production, the bottleneck is often not raw compute but <strong>VRAM fragmentation</strong>. vLLM solves this at the architectural level, making it the industry go-to for high-throughput API endpoints.</p>
1814
+ </div>
1815
+
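Getting started takes a few lines with vLLM's offline generation API; the model tag below is illustrative and any Hugging Face causal LM can be substituted:

```python
# Minimal vLLM offline-generation sketch; the model tag is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # PagedAttention on by default
params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```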
1816
  <div class="info-box">
1817
+ <div class="box-title">πŸ“˜ Professional Resources</div>
1818
  <ul style="margin-top:10px; color:var(--text-muted)">
1819
+ <li><a href="https://arxiv.org/abs/2309.06180" target="_blank" style="color:var(--accent)">PagedAttention Paper</a> β€” The math behind the efficiency.</li>
1820
+ <li><a href="https://github.com/vllm-project/vllm" target="_blank" style="color:var(--accent)">vLLM Project</a> β€” The current state-of-the-art serving engine.</li>
1821
  </ul>
1822
  </div>
1823
 
CME295-Transformers/index.html CHANGED
@@ -73,6 +73,25 @@
73
  display: block;
74
  }
75
 
76
  .grid {
77
  display: grid;
78
  grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
@@ -494,8 +513,7 @@
494
  "lecture-1": {
495
  overview: `
496
  <p>Before Large Language Models, computers needed a way to process human language reliably. This module covers the historic journey from basic NLP processing to the revolutionary architecture that powers modern AI.</p>
497
- <div class="callout insight">
498
- <div class="callout-title">The Paradigm Shift</div>
499
  Historically, RNNs were the deep learning standard for text, but they processed data sequentially, leading to information loss over long sentences (vanishing gradients). The Transformer paper ("Attention Is All You Need", 2017) completely eliminated recurrent layers in favor of the <strong>Self-Attention Mechanism</strong>, allowing the model to look at the entire context simultaneously.
500
  </div>
501
  `,
 
73
  display: block;
74
  }
75
 
76
+ .stanford-note {
77
+ background: rgba(140, 21, 21, 0.15);
78
+ border-left: 4px solid #8c1515;
79
+ padding: 20px;
80
+ border-radius: 8px;
81
+ margin: 25px 0;
82
+ position: relative;
83
+ }
84
+
85
+ .stanford-note::before {
86
+ content: '🔍 STANFORD LECTURE NOTE';
87
+ font-size: 0.75em;
88
+ font-weight: 800;
89
+ color: #8c1515;
90
+ display: block;
91
+ margin-bottom: 10px;
92
+ letter-spacing: 1px;
93
+ }
94
+
95
  .grid {
96
  display: grid;
97
  grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
 
513
  "lecture-1": {
514
  overview: `
515
  <p>Before Large Language Models, computers needed a way to process human language reliably. This module covers the historic journey from basic NLP processing to the revolutionary architecture that powers modern AI.</p>
516
+ <div class="stanford-note">
517
  Historically, RNNs were the deep learning standard for text, but they processed data sequentially, leading to information loss over long sentences (vanishing gradients). The Transformer paper ("Attention Is All You Need", 2017) completely eliminated recurrent layers in favor of the <strong>Self-Attention Mechanism</strong>, allowing the model to look at the entire context simultaneously.
518
  </div>
519
  `,
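For the CME 295 overview above, a NumPy sketch of the scaled dot-product self-attention that paragraph describes: every token attends over the whole sequence in one matrix product, with no recurrence.

```python
# Scaled dot-product self-attention in NumPy: every position attends to
# every other position in a single matrix product, with no recurrence.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over positions
    return weights @ V                                   # context-mixed values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                              # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)               # (5, 8)
```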