Concept: Streaming & Response Control
Overview
This example demonstrates streaming responses and token limits, two essential techniques for building responsive AI agents with controlled output.
The Streaming Problem
Traditional (Non-Streaming) Approach
User sends prompt
        ↓
[Wait 10 seconds...]
        ↓
Complete response appears all at once
Problems:
- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive
Streaming Approach (This Example)
User sends prompt
        ↓
"Hoisting"    (0.1s)  ← User sees first word!
        ↓
"is a"        (0.2s)  ← More text appears
        ↓
"JavaScript"  (0.3s)  ← Continuous feedback
        ↓
[Continues token by token...]
Benefits:
- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive
How Streaming Works
Token-by-Token Generation
LLMs generate one token at a time internally. Streaming exposes this:
Internal LLM Process:
┌───────────────────────────────┐
│ Token 1: "Hoisting"           │
│ Token 2: "is"                 │
│ Token 3: "a"                  │
│ Token 4: "JavaScript"         │
│ Token 5: "mechanism"          │
│ ...                           │
└───────────────────────────────┘

Without Streaming:           With Streaming:
Wait for all tokens          Emit each token immediately
└─ Buffer → Return           └─ Callback → Display
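The two approaches can be sketched with an async generator as a toy token source; the token list and delay are invented for illustration, standing in for a real model:

```javascript
// Toy token source standing in for a model's token-by-token output.
async function* generateTokens() {
    const tokens = ["Hoisting", " is", " a", " JavaScript", " mechanism"];
    for (const token of tokens) {
        await new Promise((resolve) => setTimeout(resolve, 10)); // simulated latency
        yield token;
    }
}

// Non-streaming: buffer every token, return once at the end.
async function promptBuffered() {
    let full = "";
    for await (const token of generateTokens()) full += token;
    return full; // caller waits for the whole string
}

// Streaming: hand each token to a callback the moment it arrives.
async function promptStreaming(onTextChunk) {
    let full = "";
    for await (const token of generateTokens()) {
        onTextChunk(token); // caller sees progress immediately
        full += token;
    }
    return full;
}
```

Both return the same final string; only the caller's experience differs.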
The onTextChunk Callback
┌──────────────────────────────────┐
│         Model Generation         │
└────────────────┬─────────────────┘
                 ↓
        ┌────────┴─────────┐
        │  Each new token  │
        └────────┬─────────┘
                 ↓
     ┌──────────────────────┐
     │  onTextChunk(text)   │  ← Your callback
     └──────────┬───────────┘
                ↓
Your code processes it:
  • Display to user
  • Send over network
  • Log to file
  • Analyze content
Token Limits: maxTokens
Why Limit Output?
Without limits, models might generate:
User: "Explain hoisting"
Model: [Generates 10,000 words including:
- Complete JavaScript history
- Every edge case
- Unrelated examples
- Never stops...]
With limits:
User: "Explain hoisting"
Model: [Generates ~1500 words
- Core concept
- Key examples
- Stops at 2000 tokens]
Token Budgeting
Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens

Total used: 2300 tokens
Available: 1796 tokens for future conversation
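This arithmetic is worth wrapping in a small helper so a bad budget fails loudly instead of silently truncating; a minimal sketch with illustrative names:

```javascript
// Computes how much of the context window remains after reserving
// space for the fixed prompt parts and the response budget.
function planTokenBudget({ contextSize, systemTokens, userTokens, maxTokens }) {
    const used = systemTokens + userTokens + maxTokens;
    if (used > contextSize) {
        throw new Error(`Budget exceeds context window by ${used - contextSize} tokens`);
    }
    return {
        totalUsed: used,
        remainingForHistory: contextSize - used
    };
}
```

With the numbers above (4096 window, 200 system, 100 user, 2000 response) this reports 2300 used and 1796 remaining.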
Cost vs Quality
Token Limit      Output Quality         Use Case
───────────      ──────────────         ────────
100              Brief, may be cut      Quick answers
500              Concise but complete   Short explanations
2000 (example)   Detailed               Full explanations
No limit         Risk of rambling       When length unknown
Real-Time Applications
Pattern 1: Interactive CLI
User: "Explain closures"
        ↓
Terminal: "A closure is a function..."
(Appears word by word, like typing)
        ↓
User sees progress, knows it's working
Pattern 2: Web Application
Browser                          Server
   │                                │
   ├──── Send prompt ──────────────→│
   │                                │
   │←─── Chunk 1: "Closures" ───────┤
   │   (Display immediately)        │
   │                                │
   │←─── Chunk 2: "are" ────────────┤
   │   (Append to display)          │
   │                                │
   │←─── Chunk 3: "functions" ──────┤
   │   (Keep appending...)          │
Implementation:
- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming
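For the SSE route, each chunk must be framed as a `text/event-stream` message before it goes over the wire. A minimal framing helper; the handler sketch in the comments is illustrative, not a fixed API:

```javascript
// Formats one text chunk as a Server-Sent Events frame.
// SSE frames are "data: <payload>\n\n"; JSON-encoding the chunk
// keeps embedded newlines from breaking the framing.
function toSSEFrame(chunk) {
    return `data: ${JSON.stringify({ text: chunk })}\n\n`;
}

// Server-side usage sketch (Express-style handler, illustrative):
//   res.setHeader("Content-Type", "text/event-stream");
//   ...pass to the prompt options:
//   onTextChunk: (chunk) => res.write(toSSEFrame(chunk));
```

On the browser side, an `EventSource` (or a `fetch` reader) parses these frames back into chunks and appends them to the display.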
Pattern 3: Multi-Consumer
          onTextChunk(text)
                 │
        ┌────────┼────────┐
        ↓        ↓        ↓
    Console  WebSocket  Log File
    Display  → Client   → Storage
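A fan-out like this is just one callback forwarding to a list of consumers. A minimal sketch, where the consumers are stand-ins for a display, a socket, and a log:

```javascript
// Builds a single onTextChunk-style callback that forwards
// each chunk to every registered consumer.
function createFanOut(consumers) {
    return (chunk) => {
        for (const consume of consumers) consume(chunk);
    };
}

// Usage: each consumer gets every chunk, in order.
const shown = [];
const logged = [];
const onTextChunk = createFanOut([
    (c) => shown.push(c),  // stand-in for console display
    (c) => logged.push(c)  // stand-in for WebSocket / log file
]);
onTextChunk("Hello");
```

If one consumer is slow (e.g. a network write), consider buffering per consumer so it doesn't stall the others.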
Performance Characteristics
Latency vs Throughput
Time to First Token (TTFT):
├─ Small model (1.7B):  ~100ms
├─ Medium model (8B):   ~200ms
└─ Large model (20B):   ~500ms

Tokens Per Second:
├─ Small model:  50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model:  10-15 tok/s

User Experience:
TTFT < 500ms  → Feels instant
Tok/s > 20    → Reads naturally
Resource Trade-offs
Model Size   Memory   Speed    Quality
──────────   ──────   ─────    ───────
1.7B         ~2GB     Fast     Good
8B           ~6GB     Medium   Better
20B          ~12GB    Slower   Best
Advanced Concepts
Buffering Strategies
No Buffer (Immediate)
Every token → callback → display
└─ Smoothest UX but more overhead

Line Buffer
Accumulate until newline → flush
└─ Better for paragraph-based output

Time Buffer
Accumulate for 50ms → flush batch
└─ Reduces callback frequency
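A line buffer, for instance, can be sketched as a small wrapper around the flush callback (names are illustrative):

```javascript
// Line buffer: accumulate streamed chunks, flush only on newline boundaries.
function createLineBuffer(flush) {
    let buffer = "";
    return {
        write(chunk) {
            buffer += chunk;
            let idx;
            // A single chunk may contain zero or several newlines.
            while ((idx = buffer.indexOf("\n")) !== -1) {
                flush(buffer.slice(0, idx + 1));
                buffer = buffer.slice(idx + 1);
            }
        },
        end() {
            if (buffer) flush(buffer); // flush any trailing partial line
            buffer = "";
        }
    };
}
```

Wire its `write` method in as the chunk callback and call `end()` after the prompt completes. A time buffer follows the same shape with a `setTimeout`-driven flush instead of a newline check.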
Early Stopping
Generation in progress:
"The answer is clearly... wait, actually..."
        ↓
onTextChunk detects issue
        ↓
Stop generation
        ↓
"Let me reconsider"
Useful for:
- Detecting off-topic responses
- Safety filters
- Relevance checking
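One way to sketch early stopping is a watcher that trips an `AbortController` once a banned phrase appears in the accumulated text. Whether the signal actually halts generation mid-stream depends on your library's API; the detection logic itself is library-agnostic:

```javascript
// Watches streamed text for banned phrases and aborts when one appears.
// The AbortController pattern mirrors what many streaming APIs accept;
// check your library's docs for how (or whether) it honors the signal.
function createEarlyStop(bannedPhrases) {
    const controller = new AbortController();
    let seen = "";
    return {
        signal: controller.signal,
        onTextChunk(chunk) {
            seen += chunk;
            // Matching on the full accumulated text catches phrases
            // that are split across chunk boundaries.
            if (bannedPhrases.some((p) => seen.includes(p))) {
                controller.abort();
            }
        }
    };
}
```

Pass `signal` to the generation call and `onTextChunk` as the chunk callback; a relevance or safety check can replace the simple `includes` test.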
Progressive Enhancement
Partial Response Analysis:
βββββββββββββββββββββββββββββββββββ
β "To implement this feature..." β
β β
β β Already useful information β
β β
β "...you'll need: 1) Node.js" β
β β
β β Can start acting on this β
β β
β "2) Express framework" β
βββββββββββββββββββββββββββββββββββ
Agent can begin working before response completes!
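A sketch of acting on partial output: a hypothetical helper that pulls completed numbered items (like "1) Node.js") out of a growing buffer, treating the last item as possibly still in flight:

```javascript
// Extracts numbered items ("1) Node.js") from a partial response buffer.
// Unless the response is final, the last item is held back because the
// model may still be generating the rest of it.
function extractCompletedItems(partial, final = false) {
    const matches = [...partial.matchAll(/\d+\)\s*([^\n]*?)(?=\s*\d+\)|\s*$)/g)];
    const items = matches.map((m) => m[1].trim());
    return final ? items : items.slice(0, -1);
}
```

An agent could call this from the chunk callback and start installing "Node.js" while "2) Express framework" is still being generated.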
Context Size Awareness
Why It Matters
┌─────────────────────────────────┐
│      Context Window (4096)      │
├─────────────────────────────────┤
│ System Prompt           200     │
│ Conversation History   1000     │
│ Current Prompt          100     │
│ Response Space         2796     │
└─────────────────────────────────┘

If maxTokens > 2796:
└─ Error or truncation!
Dynamic Adjustment
const available = contextSize - (promptTokens + historyTokens);

if (maxTokens > available) {
    maxTokens = available;
    // ...or trim old history to free up space
}
Streaming in Agent Architectures
Simple Agent
User → LLM (streaming) → Display
        └─ onTextChunk shows progress
Multi-Step Agent
Step 1: Plan (stream)   → Show thinking
Step 2: Act (stream)    → Show action
Step 3: Result (stream) → Show outcome
└─ User sees agent's process
Collaborative Agents
Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
└─ Both stream simultaneously
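The multi-step pattern can be sketched as a loop that labels each step's stream; `runStep` is a stand-in for one streamed model call:

```javascript
// Multi-step agent: each step streams through the same display callback,
// prefixed with its label so the user can follow the agent's process.
async function runAgent(steps, runStep, display) {
    for (const step of steps) {
        display(`[${step}] `);                     // announce the step
        await runStep(step, (chunk) => display(chunk)); // stream its output
        display("\n");                             // separate from next step
    }
}
```

In a real agent, `runStep` would be a `session.prompt(...)` call with its own chunk callback; here the important point is that every step reuses one display sink, so progress is visible end to end.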
Best Practices
1. Always Set maxTokens
✓ Good:
session.prompt(query, { maxTokens: 2000 })

✗ Risky:
session.prompt(query)
└─ May use entire context!
2. Handle Partial Updates
let fullResponse = '';
let complete = false; // stays false until the prompt resolves

onTextChunk: (chunk) => {
    fullResponse += chunk;
    display(chunk); // Show immediately
}

// After completion:
complete = true;
saveToDatabase(fullResponse);
3. Provide Feedback
let firstChunk = true;

onTextChunk: (chunk) => {
    if (firstChunk) {
        showLoadingDone(); // first token arrived: clear the spinner
        firstChunk = false;
    }
    appendToDisplay(chunk);
}
4. Monitor Performance
const startTime = Date.now();
let tokenCount = 0;

onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk); // rough heuristic, e.g. ~4 chars per token
    const elapsed = (Date.now() - startTime) / 1000;
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
}
Key Takeaways
- Streaming improves UX: Users see progress immediately
- maxTokens controls cost: Prevents runaway generation
- Token-by-token generation: LLMs produce one token at a time
- onTextChunk callback: Your hook into the generation process
- Context awareness matters: Monitor available space
- Essential for production: Real-time systems need streaming
Comparison
Feature            intro.js   coding.js (this)
───────            ────────   ────────────────
Streaming          ✗          ✓
Token limit        ✗          ✓ (2000)
Real-time output   ✗          ✓
Progress visible   ✗          ✓
User control       ✗          ✓
This pattern is foundational for building responsive, user-friendly AI agent interfaces.