# Concept: Streaming & Response Control
## Overview
This example demonstrates **streaming responses** and **token limits**, two essential techniques for building responsive AI agents with controlled output.
## The Streaming Problem
### Traditional (Non-Streaming) Approach
```
User sends prompt
↓
[Wait 10 seconds...]
↓
Complete response appears all at once
```
**Problems:**
- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive
### Streaming Approach (This Example)
```
User sends prompt
↓
"Hoisting" (0.1s) β†’ User sees first word!
↓
"is a" (0.2s) β†’ More text appears
↓
"JavaScript" (0.3s) β†’ Continuous feedback
↓
[Continues token by token...]
```
**Benefits:**
- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive
## How Streaming Works
### Token-by-Token Generation
LLMs generate one token at a time internally. Streaming exposes this:
```
Internal LLM Process:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Token 1: "Hoisting" β”‚
β”‚ Token 2: "is" β”‚
β”‚ Token 3: "a" β”‚
β”‚ Token 4: "JavaScript" β”‚
β”‚ Token 5: "mechanism" β”‚
β”‚ ... β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Without Streaming: With Streaming:
Wait for all tokens Emit each token immediately
└─→ Buffer β†’ Return └─→ Callback β†’ Display
```
### The onTextChunk Callback
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Model Generation β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Each new token β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ onTextChunk(text) β”‚ ← Your callback
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
Your code processes it:
β€’ Display to user
β€’ Send over network
β€’ Log to file
β€’ Analyze content
```
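The flow above can be sketched without any LLM library at all: a stubbed generator stands in for the model and fires the callback once per token. The function names here are illustrative, not from a specific API.

```javascript
// Minimal sketch of the onTextChunk flow, with a stubbed token
// source in place of a real model.
function generate(tokens, onTextChunk) {
  let full = '';
  for (const token of tokens) {
    onTextChunk(token); // fires as soon as each token exists
    full += token;
  }
  return full; // the complete response, once generation ends
}

const seen = [];
const result = generate(
  ['Hoisting ', 'is ', 'a ', 'JavaScript ', 'mechanism'],
  (chunk) => seen.push(chunk),
);
// `seen` fills incrementally; `result` holds the full text at the end
```

With a real model the tokens arrive over time instead of from an array, but the callback contract is the same: each chunk is handed to your code the moment it exists.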
## Token Limits: maxTokens
### Why Limit Output?
Without limits, models might generate:
```
User: "Explain hoisting"
Model: [Generates 10,000 words including:
- Complete JavaScript history
- Every edge case
- Unrelated examples
- Never stops...]
```
With limits:
```
User: "Explain hoisting"
Model: [Generates ~1500 words
- Core concept
- Key examples
- Stops at 2000 tokens]
```
### Token Budgeting
```
Context Window: 4096 tokens
β”œβ”€ System Prompt: 200 tokens
β”œβ”€ User Message: 100 tokens
β”œβ”€ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens
Total used: 2300 tokens
Available: 1796 tokens for future conversation
```
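The budget arithmetic above is simple enough to encode as a helper. This is a sketch with illustrative field names, not part of any library's API:

```javascript
// Compute how much of the context window a request consumes and
// how much is left for conversation history.
function responseBudget({ contextSize, systemTokens, userTokens, maxTokens }) {
  const used = systemTokens + userTokens + maxTokens;
  return {
    used,
    remainingForHistory: contextSize - used,
  };
}

const budget = responseBudget({
  contextSize: 4096,
  systemTokens: 200,
  userTokens: 100,
  maxTokens: 2000,
});
// budget.used is 2300; budget.remainingForHistory is 1796
```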
### Cost vs Quality
```
Token Limit Output Quality Use Case
─────────── ─────────────── ─────────────────
100 Brief, may be cut Quick answers
500 Concise but complete Short explanations
2000 (example) Detailed Full explanations
No limit Risk of rambling When length unknown
```
## Real-Time Applications
### Pattern 1: Interactive CLI
```
User: "Explain closures"
↓
Terminal: "A closure is a function..."
(Appears word by word, like typing)
↓
User sees progress, knows it's working
```
### Pattern 2: Web Application
```
Browser Server
β”‚ β”‚
β”œβ”€β”€β”€ Send prompt ────────→│
β”‚ β”‚
│←── Chunk 1: "Closures"───
β”‚ (Display immediately) β”‚
β”‚ β”‚
│←── Chunk 2: "are"────────
β”‚ (Append to display) β”‚
β”‚ β”‚
│←── Chunk 3: "functions"──
β”‚ (Keep appending...) β”‚
```
Implementation:
- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming
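With SSE, each onTextChunk call can be relayed to the browser as one event-stream frame. The frame format below follows the SSE spec; the handler wiring in the comment is a sketch, not a complete server.

```javascript
// Format one chunk as a Server-Sent Events frame. SSE frames are
// newline-delimited, so multi-line chunks need a "data:" prefix on
// every line, followed by a blank line to end the frame.
function toSseFrame(chunk) {
  return chunk
    .split('\n')
    .map((line) => `data: ${line}`)
    .join('\n') + '\n\n';
}

// In a Node HTTP handler you might then write:
//   res.writeHead(200, { 'Content-Type': 'text/event-stream' });
//   ...onTextChunk: (chunk) => res.write(toSseFrame(chunk))
```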
### Pattern 3: Multi-Consumer
```
onTextChunk(text)
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”
↓ ↓ ↓
Console WebSocket Log File
Display β†’ Client β†’ Storage
```
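The fan-out above is a small combinator: merge several consumers into one callback. Names are illustrative.

```javascript
// Build a single onTextChunk callback that forwards every chunk
// to all registered consumers.
function fanOut(...consumers) {
  return (chunk) => consumers.forEach((consume) => consume(chunk));
}

const consoleLog = [];
const wsLog = [];
const onTextChunk = fanOut(
  (chunk) => consoleLog.push(chunk), // stand-in for console display
  (chunk) => wsLog.push(chunk),      // stand-in for a WebSocket client
);
onTextChunk('Hello');
// both consumers receive every chunk
```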
## Performance Characteristics
### Latency vs Throughput
```
Time to First Token (TTFT):
β”œβ”€ Small model (1.7B): ~100ms
β”œβ”€ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms
Tokens Per Second:
β”œβ”€ Small model: 50-80 tok/s
β”œβ”€ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s
User Experience:
TTFT < 500ms β†’ Feels instant
Tok/s > 20 β†’ Reads naturally
```
### Resource Trade-offs
```
Model Size Memory Speed Quality
────────── ──────── ───── ───────
1.7B ~2GB Fast Good
8B ~6GB Medium Better
20B ~12GB Slower Best
```
## Advanced Concepts
### Buffering Strategies
**No Buffer (Immediate)**
```
Every token β†’ callback β†’ display
└─ Smoothest UX but more overhead
```
**Line Buffer**
```
Accumulate until newline β†’ flush
└─ Better for paragraph-based output
```
**Time Buffer**
```
Accumulate for 50ms β†’ flush batch
└─ Reduces callback frequency
```
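The line-buffer strategy is the easiest of the three to show deterministically; this sketch wraps a consumer so it only sees complete lines:

```javascript
// Line buffer: accumulate chunks until a newline arrives, then
// flush all complete lines and keep the partial tail.
function lineBuffered(flush) {
  let buffer = '';
  return (chunk) => {
    buffer += chunk;
    const lastNewline = buffer.lastIndexOf('\n');
    if (lastNewline !== -1) {
      flush(buffer.slice(0, lastNewline + 1)); // complete lines only
      buffer = buffer.slice(lastNewline + 1);  // partial line stays buffered
    }
  };
}

const flushed = [];
const onTextChunk = lineBuffered((lines) => flushed.push(lines));
onTextChunk('first li');
onTextChunk('ne\nsecond'); // flushes 'first line\n', buffers 'second'
```

A time buffer works the same way except the flush is triggered by a timer (e.g. every 50ms) instead of a newline.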
### Early Stopping
```
Generation in progress:
"The answer is clearly... wait, actually..."
↑
onTextChunk detects issue
↓
Stop generation
↓
"Let me reconsider"
```
Useful for:
- Detecting off-topic responses
- Safety filters
- Relevance checking
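A minimal detector for this pattern watches the accumulating text for a trigger phrase and signals the caller. How you actually abort generation depends on the library (for example an AbortSignal or a stop flag), so that part is left as a callback here:

```javascript
// Watch the growing response for trigger phrases; call onStop once
// when one appears, then ignore further chunks.
function makeStopDetector(triggers, onStop) {
  let text = '';
  let stopped = false;
  return (chunk) => {
    if (stopped) return;
    text += chunk;
    if (triggers.some((t) => text.includes(t))) {
      stopped = true;
      onStop(text); // e.g. abortController.abort()
    }
  };
}

let aborted = false;
const onTextChunk = makeStopDetector(['wait, actually'], () => { aborted = true; });
onTextChunk('The answer is clearly... ');
onTextChunk('wait, actually...');
// aborted is now true
```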
### Progressive Enhancement
```
Partial Response Analysis:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "To implement this feature..." β”‚
β”‚ β”‚
β”‚ ← Already useful information β”‚
β”‚ β”‚
β”‚ "...you'll need: 1) Node.js" β”‚
β”‚ β”‚
β”‚ ← Can start acting on this β”‚
β”‚ β”‚
β”‚ "2) Express framework" β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Agent can begin working before response completes!
```
## Context Size Awareness
### Why It Matters
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Context Window (4096) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ System Prompt 200 tokens β”‚
β”‚ Conversation History 1000 β”‚
β”‚ Current Prompt 100 β”‚
β”‚ Response Space 2796 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
If maxTokens > 2796:
└─→ Error or truncation!
```
### Dynamic Adjustment
```
Available = contextSize - (prompt + history)
if (maxTokens > available) {
maxTokens = available;
// or clear old history
}
```
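The adjustment above as a runnable helper (parameter names are illustrative):

```javascript
// Clamp the requested maxTokens to what the context window can
// still hold after the prompt and history are accounted for.
function clampMaxTokens(contextSize, promptTokens, historyTokens, maxTokens) {
  const available = contextSize - (promptTokens + historyTokens);
  return Math.min(maxTokens, Math.max(0, available));
}

const fits = clampMaxTokens(4096, 100, 1000, 2000);    // fits: 2000
const clamped = clampMaxTokens(4096, 100, 3500, 2000); // clamped to 496
```

The alternative mentioned in the comment, clearing old history instead of shrinking the response, trades conversational memory for answer length; which is right depends on the application.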
## Streaming in Agent Architectures
### Simple Agent
```
User β†’ LLM (streaming) β†’ Display
└─ onTextChunk shows progress
```
### Multi-Step Agent
```
Step 1: Plan (stream) β†’ Show thinking
Step 2: Act (stream) β†’ Show action
Step 3: Result (stream) β†’ Show outcome
└─ User sees agent's process
```
### Collaborative Agents
```
Agent A (streaming) ──┐
β”œβ”€β†’ Coordinator β†’ User
Agent B (streaming) β”€β”€β”˜
└─ Both stream simultaneously
```
## Best Practices
### 1. Always Set maxTokens
```
βœ“ Good:
session.prompt(query, { maxTokens: 2000 })
βœ— Risky:
session.prompt(query)
└─ May use entire context!
```
### 2. Handle Partial Updates
```
let fullResponse = '';
let complete = false;
onTextChunk: (chunk) => {
  fullResponse += chunk;
  display(chunk); // Show immediately
}
// After generation finishes:
complete = true;
saveToDatabase(fullResponse);
### 3. Provide Feedback
```
onTextChunk: (chunk) => {
if (firstChunk) {
showLoadingDone();
firstChunk = false;
}
appendToDisplay(chunk);
}
```
### 4. Monitor Performance
```
const startTime = Date.now();
let tokenCount = 0;
onTextChunk: (chunk) => {
tokenCount += estimateTokens(chunk);
const elapsed = (Date.now() - startTime) / 1000;
const tokensPerSecond = tokenCount / elapsed;
updateMetrics(tokensPerSecond);
}
```
## Key Takeaways
1. **Streaming improves UX**: Users see progress immediately
2. **maxTokens controls cost**: Prevents runaway generation
3. **Token-by-token generation**: LLMs produce one token at a time
4. **onTextChunk callback**: Your hook into the generation process
5. **Context awareness matters**: Monitor available space
6. **Essential for production**: Real-time systems need streaming
## Comparison
```
Feature intro.js coding.js (this)
──────────────── ───────── ─────────────────
Streaming βœ— βœ“
Token limit βœ— βœ“ (2000)
Real-time output βœ— βœ“
Progress visible βœ— βœ“
User control βœ— βœ“
```
This pattern is foundational for building responsive, user-friendly AI agent interfaces.