# Concept: Streaming & Response Control

## Overview

This example demonstrates **streaming responses** and **token limits**, two essential techniques for building responsive AI agents with controlled output.

## The Streaming Problem

### Traditional (Non-Streaming) Approach

```
User sends prompt
        ↓
[Wait 10 seconds...]
        ↓
Complete response appears all at once
```

**Problems:**

- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive

### Streaming Approach (This Example)

```
User sends prompt
        ↓
"Hoisting"   (0.1s)  ← User sees first word!
        ↓
"is a"       (0.2s)  ← More text appears
        ↓
"JavaScript" (0.3s)  ← Continuous feedback
        ↓
[Continues token by token...]
```

**Benefits:**

- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive

## How Streaming Works

### Token-by-Token Generation

LLMs generate one token at a time internally. Streaming exposes this:

```
Internal LLM Process:
┌──────────────────────────┐
│ Token 1: "Hoisting"      │
│ Token 2: "is"            │
│ Token 3: "a"             │
│ Token 4: "JavaScript"    │
│ Token 5: "mechanism"     │
│ ...                      │
└──────────────────────────┘

Without Streaming:          With Streaming:
Wait for all tokens         Emit each token immediately
└─ Buffer → Return          └─ Callback → Display
```

### The onTextChunk Callback

```
┌──────────────────────────┐
│     Model Generation     │
└────────────┬─────────────┘
             │
    ┌────────┴─────────┐
    │  Each new token  │
    └────────┬─────────┘
             │
   ┌─────────┴──────────┐
   │ onTextChunk(text)  │  ← Your callback
   └─────────┬──────────┘
             │
  Your code processes it:
    • Display to user
    • Send over network
    • Log to file
    • Analyze content
```

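In code, the callback is just an option passed alongside the prompt. A minimal sketch using the `session.prompt(query, { maxTokens, onTextChunk })` shape that appears in the Best Practices section below (how `session` is created is outside this snippet):

```js
// Minimal streaming sketch: each chunk is handed to onTextChunk as soon as
// the model emits it; the caller accumulates the full text on the side.
let fullText = '';

await session.prompt('Explain hoisting', {
  maxTokens: 2000,                        // cap the response length
  onTextChunk: (chunk) => {
    fullText += chunk;                    // keep the complete response
    process.stdout.write(chunk);          // show progress immediately
  },
});

console.log('\n--- done,', fullText.length, 'characters ---');
```
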
## Token Limits: maxTokens

### Why Limit Output?

Without limits, models might generate:

```
User: "Explain hoisting"

Model: [Generates thousands of words including:
        - Complete JavaScript history
        - Every edge case
        - Unrelated examples
        - May not stop until the context runs out...]
```

With limits:

```
User: "Explain hoisting"

Model: [Generates ~1500 words
        - Core concept
        - Key examples
        - Stops at 2000 tokens]
```

### Token Budgeting

```
Context Window: 4096 tokens
├─ System Prompt:          200 tokens
├─ User Message:           100 tokens
├─ Response (maxTokens):  2000 tokens
└─ Remaining for history: 1796 tokens

Total used: 2300 tokens
Available:  1796 tokens for future conversation
```

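The same arithmetic can be checked in code before each call. A small sketch (the token counts are illustrative, matching the figures above):

```js
// Rough token budget for a 4096-token context window.
const CONTEXT_SIZE = 4096;

function planBudget({ systemTokens, userTokens, maxTokens }) {
  const totalUsed = systemTokens + userTokens + maxTokens;
  return { totalUsed, remainingForHistory: CONTEXT_SIZE - totalUsed };
}

// 200 + 100 + 2000 = 2300 used, leaving 1796 for future conversation.
console.log(planBudget({ systemTokens: 200, userTokens: 100, maxTokens: 2000 }));
```
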
### Cost vs Quality

```
Token Limit      Output Quality          Use Case
───────────      ──────────────          ──────────────────────
100              Brief, may be cut off   Quick answers
500              Concise but complete    Short explanations
2000 (example)   Detailed                Full explanations
No limit         Risk of rambling        When length is unknown
```

## Real-Time Applications

### Pattern 1: Interactive CLI

```
User: "Explain closures"
        ↓
Terminal: "A closure is a function..."
          (Appears word by word, like typing)
        ↓
User sees progress, knows it's working
```

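A terminal version of this pattern only needs to write each chunk without a trailing newline. A sketch using Node's readline module and the `session.prompt` shape from this example (creating `session` is omitted):

```js
import { createInterface } from 'node:readline/promises';

// Interactive CLI: stream the answer into the terminal as it is generated.
const rl = createInterface({ input: process.stdin, output: process.stdout });

const question = await rl.question('You: ');
process.stdout.write('Agent: ');

await session.prompt(question, {
  maxTokens: 2000,
  onTextChunk: (chunk) => process.stdout.write(chunk), // appears like typing
});

process.stdout.write('\n');
rl.close();
```
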
### Pattern 2: Web Application

```
Browser                         Server
   │                              │
   ├──── Send prompt ────────────►│
   │                              │
   │◄─── Chunk 1: "Closures" ─────┤
   │     (Display immediately)    │
   │                              │
   │◄─── Chunk 2: "are" ──────────┤
   │     (Append to display)      │
   │                              │
   │◄─── Chunk 3: "functions" ────┤
   │     (Keep appending...)      │
```

Implementation options (see the SSE sketch below):

- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming

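A minimal Server-Sent Events sketch using Node's built-in `http` module; it assumes a `session` like the one in this example lives on the server:

```js
import http from 'node:http';

// Relay each generated chunk to the browser as an SSE "data:" event.
http.createServer(async (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });

  await session.prompt('Explain closures', {
    maxTokens: 2000,
    onTextChunk: (chunk) => {
      res.write(`data: ${JSON.stringify(chunk)}\n\n`); // one event per chunk
    },
  });

  res.write('data: [DONE]\n\n');
  res.end();
}).listen(3000);
```

On the browser side, `new EventSource('http://localhost:3000')` receives each event and appends the parsed `event.data` to the page.
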
### Pattern 3: Multi-Consumer

```
         onTextChunk(text)
                 │
        ┌────────┼────────┐
        │        │        │
    Console  WebSocket  Log File
    Display  → Client   → Storage
```

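Because `onTextChunk` is just a function, fanning one stream out to several consumers is a matter of composing callbacks. A sketch (the WebSocket target is left as a comment since it depends on your transport; `session` is assumed from earlier):

```js
import fs from 'node:fs';

// Send every chunk to multiple consumers from a single callback.
const logStream = fs.createWriteStream('response.log', { flags: 'a' });

function fanOut(...consumers) {
  return (chunk) => consumers.forEach((consume) => consume(chunk));
}

await session.prompt('Explain closures', {
  maxTokens: 2000,
  onTextChunk: fanOut(
    (chunk) => process.stdout.write(chunk), // console display
    (chunk) => logStream.write(chunk),      // log file storage
    // (chunk) => webSocket.send(chunk),    // push to a connected client
  ),
});
```
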
## Performance Characteristics

### Latency vs Throughput

```
Time to First Token (TTFT):
├─ Small model (1.7B):  ~100ms
├─ Medium model (8B):   ~200ms
└─ Large model (20B):   ~500ms

Tokens Per Second:
├─ Small model:   50-80 tok/s
├─ Medium model:  20-35 tok/s
└─ Large model:   10-15 tok/s

User Experience:
  TTFT < 500ms  → Feels instant
  Tok/s > 20    → Reads naturally
```

### Resource Trade-offs

```
Model Size   Memory   Speed    Quality
──────────   ──────   ──────   ───────
1.7B         ~2GB     Fast     Good
8B           ~6GB     Medium   Better
20B          ~12GB    Slower   Best
```

## Advanced Concepts

### Buffering Strategies

**No Buffer (Immediate)**

```
Every token → callback → display
└─ Smoothest UX but more overhead
```

**Line Buffer**

```
Accumulate until newline → flush
└─ Better for paragraph-based output
```

**Time Buffer**

```
Accumulate for 50ms → flush batch
└─ Reduces callback frequency
```

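A time buffer can be layered on top of any `onTextChunk` callback. A sketch that batches chunks and flushes them at most every 50 ms (`session` and `query` are assumed from earlier):

```js
// Wrap a display function so chunks are flushed in batches, not one by one.
function timeBuffered(flush, intervalMs = 50) {
  let buffer = '';
  let timer = null;

  return (chunk) => {
    buffer += chunk;
    if (timer) return;               // a flush is already scheduled
    timer = setTimeout(() => {
      flush(buffer);                 // emit everything accumulated so far
      buffer = '';
      timer = null;
    }, intervalMs);
  };
}

await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: timeBuffered((batch) => process.stdout.write(batch)),
});
```
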
### Early Stopping

```
Generation in progress:
"The answer is clearly... wait, actually..."
        ↓
onTextChunk detects issue
        ↓
Stop generation
        ↓
"Let me reconsider"
```

Useful for (see the sketch after this list):

- Detecting off-topic responses
- Safety filters
- Relevance checking

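How the stop is expressed depends on the library; a common pattern is to pass an `AbortSignal` and abort it from inside the callback. A hedged sketch, assuming the example's `session.prompt` accepts a `signal` option (check your library's docs for the exact name) and using a hypothetical `isOffTopic` check:

```js
// Stop generation as soon as the partial response trips a simple check.
const controller = new AbortController();
let partial = '';

try {
  await session.prompt('Explain hoisting', {
    maxTokens: 2000,
    signal: controller.signal,               // assumed option name
    onTextChunk: (chunk) => {
      partial += chunk;
      if (isOffTopic(partial)) {             // hypothetical relevance check
        controller.abort();                  // stop generation early
      }
    },
  });
} catch (err) {
  if (controller.signal.aborted) {
    console.log('Stopped early, reconsidering the prompt...');
  } else {
    throw err;
  }
}
```
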
### Progressive Enhancement

```
Partial Response Analysis:
┌──────────────────────────────────┐
│ "To implement this feature..."   │
│              ↓                   │
│   ← Already useful information   │
│              ↓                   │
│ "...you'll need: 1) Node.js"     │
│              ↓                   │
│   ← Can start acting on this     │
│              ↓                   │
│ "2) Express framework"           │
└──────────────────────────────────┘

Agent can begin working before response completes!
```

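One way to act on partial output is to watch the accumulated text for completed units, such as numbered items, and hand each one off as soon as the next begins. A sketch with a hypothetical `startPreparingStep` handler (`session` assumed from earlier):

```js
// Hand off each completed numbered item ("1) ...", "2) ...") as soon as the
// following item starts, so downstream work begins before generation ends.
let accumulated = '';
let handledItems = 0;

await session.prompt('How do I implement this feature?', {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    accumulated += chunk;
    const items = accumulated
      .split(/(?=\d\) )/)                   // split in front of "1) ", "2) ", ...
      .filter((part) => /^\d\) /.test(part));

    // Every item except the last is complete (the last may still be growing).
    while (handledItems < items.length - 1) {
      startPreparingStep(items[handledItems].trim()); // hypothetical handler
      handledItems += 1;
    }
  },
});
```
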
## Context Size Awareness

### Why It Matters

```
┌─────────────────────────────────────┐
│        Context Window (4096)        │
├─────────────────────────────────────┤
│ System Prompt            200 tokens │
│ Conversation History    1000 tokens │
│ Current Prompt           100 tokens │
│ Response Space          2796 tokens │
└─────────────────────────────────────┘

If maxTokens > 2796:
└─ Error or truncation!
```

### Dynamic Adjustment

```
available = contextSize - (promptTokens + historyTokens);

if (maxTokens > available) {
  maxTokens = available;  // shrink the response budget
  // ...or clear old history to free up space
}
```

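In real code the prompt and history sizes come from the tokenizer. A hedged sketch, with `countTokens` standing in for whatever token counting your setup exposes, and `systemPrompt`, `history`, and `userQuery` as placeholder inputs:

```js
// Clamp maxTokens to whatever actually fits in the context window.
const CONTEXT_SIZE = 4096;

function clampMaxTokens({ systemPrompt, history, prompt, requested }) {
  const used =
    countTokens(systemPrompt) + countTokens(history) + countTokens(prompt);
  const available = CONTEXT_SIZE - used;
  return Math.max(0, Math.min(requested, available));
}

const maxTokens = clampMaxTokens({
  systemPrompt,
  history,
  prompt: userQuery,
  requested: 2000,
});

await session.prompt(userQuery, { maxTokens });
```
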
## Streaming in Agent Architectures

### Simple Agent

```
User → LLM (streaming) → Display
              └─ onTextChunk shows progress
```

### Multi-Step Agent

```
Step 1: Plan   (stream) → Show thinking
Step 2: Act    (stream) → Show action
Step 3: Result (stream) → Show outcome
        └─ User sees agent's process
```

### Collaborative Agents

```
Agent A (streaming) ──┐
                      ├─► Coordinator → User
Agent B (streaming) ──┘
        └─ Both stream simultaneously
```

## Best Practices

### 1. Always Set maxTokens

```
✓ Good:
session.prompt(query, { maxTokens: 2000 })

✗ Risky:
session.prompt(query)
└─ May use entire context!
```

### 2. Handle Partial Updates

```
let fullResponse = '';

await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    fullResponse += chunk;  // accumulate the complete text
    display(chunk);         // show immediately
  },
});

// Only after completion:
saveToDatabase(fullResponse);
```

### 3. Provide Feedback

```
let firstChunk = true;

await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    if (firstChunk) {
      showLoadingDone();  // clear the loading indicator once text arrives
      firstChunk = false;
    }
    appendToDisplay(chunk);
  },
});
```

### 4. Monitor Performance

```
const startTime = Date.now();
let tokenCount = 0;

await session.prompt(query, {
  maxTokens: 2000,
  onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk);              // rough token estimate
    const elapsed = (Date.now() - startTime) / 1000;  // seconds since start
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
  },
});
```

## Key Takeaways

1. **Streaming improves UX**: Users see progress immediately
2. **maxTokens controls cost**: Prevents runaway generation
3. **Token-by-token generation**: LLMs produce one token at a time
4. **onTextChunk callback**: Your hook into the generation process
5. **Context awareness matters**: Monitor available space
6. **Essential for production**: Real-time systems need streaming

## Comparison

```
Feature            intro.js   coding.js (this)
────────────────   ────────   ────────────────
Streaming          ✗          ✓
Token limit        ✗          ✓ (2000)
Real-time output   ✗          ✓
Progress visible   ✗          ✓
User control       ✗          ✓
```

This pattern is foundational for building responsive, user-friendly AI agent interfaces.