Commit · 04751a7
Parent(s): 5ad7fc4
style: final educator-grade polish for AI-Engineer and CME 295
- AI-Engineer/app.js +27 -16
- CME295-Transformers/index.html +20 -2
AI-Engineer/app.js
CHANGED
@@ -927,15 +927,15 @@ Object.assign(MODULE_CONTENT, {
 <p><strong>Indexing phase (offline):</strong> (1) Load documents (PDF, HTML, Markdown, DB), (2) Chunk into segments (~500 tokens), (3) Embed each chunk with an embedding model, (4) Store vectors + metadata in a vector database.</p>
 <p><strong>Query phase (online):</strong> (1) Embed user query with same model, (2) Retrieve top-k similar chunks via ANN search, (3) Inject chunks into LLM prompt as context, (4) LLM generates a grounded response with citations.</p>

-<h3>2.
-<
-<
-<
-
-
-<
-<
-</
+<h3>2. The Evolution of RAG</h3>
+<div class="comparison">
+<div class="box-bad"><strong>Stage 1: Naive RAG</strong> (Retrieve > Stuff > Gen). Works for simple docs. Fails when query and doc use different words (vocabulary mismatch).</div>
+<div class="box-good"><strong>Stage 2: Advanced RAG</strong> (Rewrite > Retrieve > Re-rank). Uses HyDE to rewrite queries and a Re-ranker model to sort chunks by actual relevance before feeding the LLM.</div>
+</div>
+<div class="callout insight">
+<div class="callout-title">💡 The "Library Librarian" Analogy</div>
+<p>Naive RAG is like a librarian who only looks at the <strong>index</strong> of a book. Advanced RAG is a librarian who <strong>skims the chapters</strong> to make sure they actually answer your question before handing you the book.</p>
+</div>
 <div class="callout tip">
 <div class="callout-title">💡 Chunk Size Sweet Spot</div>
 <p>256-512 tokens for OpenAI embeddings, 128-256 for smaller models. Use 50-100 token overlap. Test with your actual queries — measure retrieval recall, not just generation quality.</p>

@@ -1187,8 +1187,15 @@ res = index.query(vector=query_emb, top_k=<span class="number">10</span>,
 <div class="box-content">An agent is an LLM + a <strong>reasoning loop</strong> + <strong>tools</strong>. It doesn't just respond — it plans, calls tools, observes results, and iterates. The agent paradigm turns LLMs from answer machines into action machines.</div>
 </div>

-<h3>1. ReAct
-<p>ReAct (Yao 2022): <strong>Thought</strong> > <strong>Action</strong> > <strong>Observation</strong>
+<h3>1. ReAct: The Brain's Operating System</h3>
+<p>ReAct (Yao 2022) is the fundamental "Thinking Loop". It forces the LLM to output its internal state before acting: <strong>Thought</strong> (Reasoning) > <strong>Action</strong> (Tool Call) > <strong>Observation</strong> (Result). This prevents "reflexive" hallucinations, where the model answers before it has any data.</p>
+
+<div class="info-box">
+<div class="box-title">🔌 MCP: The USB Port for AI</div>
+<div class="box-content">
+<strong>Model Context Protocol (MCP)</strong> is the industry's attempt to standardize how agents talk to tools. Instead of rewriting tool-calling code for every model, MCP creates a universal "plug-and-play" interface where any model can use any database, API, or local file system seamlessly.
+</div>
+</div>

 <h3>2. Agent Architectures & Patterns</h3>
 <table>

@@ -1799,14 +1806,18 @@ output = model.generate(inputs, max_new_tokens=<span class="number">100</span>)
 </table>

 <h3>3. vLLM — The Production Standard</h3>
-<p><strong>PagedAttention</strong>
+<p><strong>PagedAttention</strong>: Inspired by OS virtual memory. Traditional serving pre-allocates max VRAM for every request — an "empty seat" problem. PagedAttention allocates memory on-demand in blocks, allowing you to serve <strong>5-10x more users</strong> on the same hardware.</p>

+<div class="callout tip">
+<div class="callout-title">🚀 Scaling Insight</div>
+<p>In production, your biggest cost is not compute, but <strong>VRAM fragmentation</strong>. vLLM solves this at the architectural level, making it the industry go-to for high-throughput API endpoints.</p>
+</div>
+
 <div class="info-box">
-<div class="box-title">📚
+<div class="box-title">📚 Professional Resources</div>
 <ul style="margin-top:10px; color:var(--text-muted)">
-<li><a href="https://
-<li><a href="https://github.com/
-<li><a href="https://ollama.com" target="_blank" style="color:var(--accent)">Ollama Official Site</a> — Best for local development.</li>
+<li><a href="https://arxiv.org/abs/2309.06180" target="_blank" style="color:var(--accent)">PagedAttention Paper</a> — The math behind the efficiency.</li>
+<li><a href="https://github.com/vllm-project/vllm" target="_blank" style="color:var(--accent)">vLLM Project</a> — The current state-of-the-art serving engine.</li>
 </ul>
 </div>
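The two RAG phases added in the first hunk can be sketched end to end. This is a minimal sketch: the embedding model is faked with a bag-of-words vector and the vector database with an in-memory list, so it runs with no external services.

```python
# Minimal RAG sketch: index offline, retrieve online.
# The "embedding model" here is a toy bag-of-words stand-in.
import math
from collections import Counter

def embed(text):
    # Toy embedding: L2-normalized word counts (stand-in for a real model).
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def cosine(a, b):
    return sum(v * b.get(w, 0.0) for w, v in a.items())

# Indexing phase (offline): chunk -> embed -> store.
docs = [
    "vLLM uses PagedAttention to manage KV cache memory",
    "chunk documents into segments of roughly 500 tokens",
    "re-rankers sort retrieved chunks by relevance",
]
index = [(doc, embed(doc)) for doc in docs]

def retrieve(query, k=2):
    # Query phase (online): embed query with the SAME model, then top-k search.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

context = retrieve("how does vLLM manage memory")
print(context[0])
```

The retrieved chunks would then be injected into the LLM prompt as context, exactly as step (3) of the query phase describes.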
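The overlap rule from the "Chunk Size Sweet Spot" tip can be sketched as a plain function. Tokens are approximated by whitespace words here; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
# Fixed-size chunking with overlap, as in the chunk-size tip above.
def chunk(tokens, size=512, overlap=64):
    # Each chunk starts `size - overlap` tokens after the previous one,
    # so consecutive chunks share `overlap` tokens of context.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = [f"w{i}" for i in range(1200)]
pieces = chunk(words, size=512, overlap=64)
assert pieces[0][-64:] == pieces[1][:64]  # overlap region is shared
```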
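The Advanced RAG re-rank step can be sketched as a second-pass sort over first-pass candidates. The relevance scorer below is a deliberate stand-in (token overlap), not a real cross-encoder model.

```python
# Re-rank sketch: reorder retrieved candidates by a stronger relevance
# score before building the prompt. The scorer is a toy stand-in.
def rerank(query, candidates, top_n=2):
    q = set(query.lower().split())
    def score(chunk):
        c = set(chunk.lower().split())
        return len(q & c) / max(len(c), 1)  # fraction of chunk words shared with query
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    "unrelated boilerplate about shipping policy",
    "HyDE rewrites the query into a hypothetical answer",
    "the query is rewritten before retrieval",
]
print(rerank("how is the query rewritten", candidates))
```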
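The ReAct loop (Thought > Action > Observation) can be sketched with a scripted model stub so the control flow is visible. `SCRIPT` and `calculator` are hypothetical stand-ins for real model output and real tools.

```python
# ReAct loop sketch: alternate Thought -> Action -> Observation until
# the (stubbed) model emits a final answer instead of a tool call.
def calculator(expr):
    return str(eval(expr, {"__builtins__": {}}))  # toy tool

TOOLS = {"calculator": calculator}

# Scripted model turns: (thought, action, argument); action=None means answer.
SCRIPT = [
    ("I need to compute the product first.", "calculator", "6 * 7"),
    ("I have the result, so I can answer.", None, "The answer is 42."),
]

def react(script):
    transcript = []
    for thought, action, arg in script:
        transcript.append(f"Thought: {thought}")
        if action is None:
            transcript.append(f"Answer: {arg}")
            return arg, transcript
        obs = TOOLS[action](arg)  # Action -> Observation
        transcript.append(f"Action: {action}({arg}) -> Observation: {obs}")

answer, log = react(SCRIPT)
print(answer)
```

Forcing the Observation step between tool call and answer is exactly what blocks the "reflexive" hallucination the paragraph warns about.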
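The "universal plug" idea behind MCP can be illustrated with a toy tool registry. This is not the real MCP wire protocol, just a sketch of the plug-and-play contract: tools are discovered by name and called through one uniform interface.

```python
# Toy registry illustrating the MCP idea: any tool registered once is
# discoverable and callable by any agent through the same interface.
# (Hypothetical sketch, NOT the actual MCP specification.)
class ToolServer:
    def __init__(self):
        self.tools = {}

    def register(self, name, fn, description):
        self.tools[name] = {"fn": fn, "description": description}

    def list_tools(self):
        # Agents discover capabilities instead of hard-coding them.
        return {name: t["description"] for name, t in self.tools.items()}

    def call(self, name, **kwargs):
        return self.tools[name]["fn"](**kwargs)

server = ToolServer()
server.register("read_file", lambda path: f"<contents of {path}>",
                "Read a local file")
print(server.list_tools())
print(server.call("read_file", path="notes.txt"))
```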
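The PagedAttention "empty seat" arithmetic can be sketched with illustrative numbers. Block size and sequence lengths below are made up for the example; they are not vLLM's real internals.

```python
# Sketch of on-demand block allocation vs. static pre-allocation for
# the KV cache. Numbers are illustrative only.
BLOCK = 16  # KV-cache slots per block (made-up value)

def blocks_needed(seq_len, block=BLOCK):
    return -(-seq_len // block)  # ceiling division

# Static serving: every request reserves blocks for max_len tokens up front.
max_len = 2048
static_per_request = blocks_needed(max_len)

# Paged serving: each request only holds blocks for tokens generated so far.
active_lens = [100, 300, 50, 700]
paged_total = sum(blocks_needed(n) for n in active_lens)

print(static_per_request * len(active_lens), "blocks static vs", paged_total, "paged")
```

The unused "seats" in the static case are exactly the VRAM fragmentation the Scaling Insight callout identifies as the dominant serving cost.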
CME295-Transformers/index.html
CHANGED
@@ -73,6 +73,25 @@
 display: block;
 }

+.stanford-note {
+background: rgba(140, 21, 21, 0.15);
+border-left: 4px solid #8c1515;
+padding: 20px;
+border-radius: 8px;
+margin: 25px 0;
+position: relative;
+}
+
+.stanford-note::before {
+content: '🎓 STANFORD LECTURE NOTE';
+font-size: 0.75em;
+font-weight: 800;
+color: #8c1515;
+display: block;
+margin-bottom: 10px;
+letter-spacing: 1px;
+}
+
 .grid {
 display: grid;
 grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));

@@ -494,8 +513,7 @@
 "lecture-1": {
 overview: `
 <p>Before Large Language Models, computers needed a way to process human language reliably. This module covers the historic journey from basic NLP processing to the revolutionary architecture that powers modern AI.</p>
-<div class="
-<div class="callout-title">The Paradigm Shift</div>
+<div class="stanford-note">
 Historically, RNNs were the deep learning standard for text, but they processed data sequentially, leading to information loss over long sentences (vanishing gradients). The Transformer paper ("Attention Is All You Need", 2017) completely eliminated recurrent layers in favor of the <strong>Self-Attention Mechanism</strong>, allowing the model to look at the entire context simultaneously.
 </div>
 `,