# Concept: Streaming & Response Control
## Overview
This example demonstrates **streaming responses** and **token limits**, two essential techniques for building responsive AI agents with controlled output.
## The Streaming Problem
### Traditional (Non-Streaming) Approach
```
User sends prompt
↓
[Wait 10 seconds...]
↓
Complete response appears all at once
```
**Problems:**
- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive
### Streaming Approach (This Example)
```
User sends prompt
↓
"Hoisting" (0.1s) β†’ User sees first word!
↓
"is a" (0.2s) β†’ More text appears
↓
"JavaScript" (0.3s) β†’ Continuous feedback
↓
[Continues token by token...]
```
**Benefits:**
- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive
## How Streaming Works
### Token-by-Token Generation
LLMs generate one token at a time internally. Streaming exposes this:
```
Internal LLM Process:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Token 1: "Hoisting" β”‚
β”‚ Token 2: "is" β”‚
β”‚ Token 3: "a" β”‚
β”‚ Token 4: "JavaScript" β”‚
β”‚ Token 5: "mechanism" β”‚
β”‚ ... β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Without Streaming: With Streaming:
Wait for all tokens Emit each token immediately
└─→ Buffer β†’ Return └─→ Callback β†’ Display
```
### The onTextChunk Callback
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Model Generation β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Each new token β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ onTextChunk(text) β”‚ ← Your callback
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↓
Your code processes it:
β€’ Display to user
β€’ Send over network
β€’ Log to file
β€’ Analyze content
```
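The flow above can be sketched without any LLM library at all: a stubbed generator stands in for the model and fires the callback once per token. The function names here are illustrative, not from a specific API.

```javascript
// Minimal sketch of the onTextChunk flow, with a stubbed token
// source in place of a real model.
function generate(tokens, onTextChunk) {
  let full = '';
  for (const token of tokens) {
    onTextChunk(token); // fires as soon as each token exists
    full += token;
  }
  return full; // the complete response, once generation ends
}

const seen = [];
const result = generate(
  ['Hoisting ', 'is ', 'a ', 'JavaScript ', 'mechanism'],
  (chunk) => seen.push(chunk),
);
// `seen` fills incrementally; `result` holds the full text at the end
```

With a real model the tokens arrive over time instead of from an array, but the callback contract is the same: each chunk is handed to your code the moment it exists.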
## Token Limits: maxTokens
### Why Limit Output?
Without limits, models might generate:
```
User: "Explain hoisting"
Model: [Generates 10,000 words including:
- Complete JavaScript history
- Every edge case
- Unrelated examples
- Never stops...]
```
With limits:
```
User: "Explain hoisting"
Model: [Generates ~1500 words
- Core concept
- Key examples
- Stops at 2000 tokens]
```
### Token Budgeting
```
Context Window: 4096 tokens
β”œβ”€ System Prompt: 200 tokens
β”œβ”€ User Message: 100 tokens
β”œβ”€ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens
Total used: 2300 tokens
Available: 1796 tokens for future conversation
```
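The budget arithmetic above is simple enough to encode as a helper. This is a sketch with illustrative field names, not part of any library's API:

```javascript
// Compute how much of the context window a request consumes and
// how much is left for conversation history.
function responseBudget({ contextSize, systemTokens, userTokens, maxTokens }) {
  const used = systemTokens + userTokens + maxTokens;
  return {
    used,
    remainingForHistory: contextSize - used,
  };
}

const budget = responseBudget({
  contextSize: 4096,
  systemTokens: 200,
  userTokens: 100,
  maxTokens: 2000,
});
// budget.used is 2300; budget.remainingForHistory is 1796
```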
### Cost vs Quality
```
Token Limit Output Quality Use Case
─────────── ─────────────── ─────────────────
100 Brief, may be cut Quick answers
500 Concise but complete Short explanations
2000 (example) Detailed Full explanations
No limit Risk of rambling When length unknown
```
## Real-Time Applications
### Pattern 1: Interactive CLI
```
User: "Explain closures"
↓
Terminal: "A closure is a function..."
(Appears word by word, like typing)
↓
User sees progress, knows it's working
```
### Pattern 2: Web Application
```
Browser Server
β”‚ β”‚
β”œβ”€β”€β”€ Send prompt ────────→│
β”‚ β”‚
│←── Chunk 1: "Closures"───
β”‚ (Display immediately) β”‚
β”‚ β”‚
│←── Chunk 2: "are"────────
β”‚ (Append to display) β”‚
β”‚ β”‚
│←── Chunk 3: "functions"──
β”‚ (Keep appending...) β”‚
```
Implementation:
- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming
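With SSE, each onTextChunk call can be relayed to the browser as one event-stream frame. The frame format below follows the SSE spec; the handler wiring in the comment is a sketch, not a complete server.

```javascript
// Format one chunk as a Server-Sent Events frame. SSE frames are
// newline-delimited, so multi-line chunks need a "data:" prefix on
// every line, followed by a blank line to end the frame.
function toSseFrame(chunk) {
  return chunk
    .split('\n')
    .map((line) => `data: ${line}`)
    .join('\n') + '\n\n';
}

// In a Node HTTP handler you might then write:
//   res.writeHead(200, { 'Content-Type': 'text/event-stream' });
//   ...onTextChunk: (chunk) => res.write(toSseFrame(chunk))
```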
### Pattern 3: Multi-Consumer
```
onTextChunk(text)
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”
↓ ↓ ↓
Console WebSocket Log File
Display β†’ Client β†’ Storage
```
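The fan-out above is a small combinator: merge several consumers into one callback. Names are illustrative.

```javascript
// Build a single onTextChunk callback that forwards every chunk
// to all registered consumers.
function fanOut(...consumers) {
  return (chunk) => consumers.forEach((consume) => consume(chunk));
}

const consoleLog = [];
const wsLog = [];
const onTextChunk = fanOut(
  (chunk) => consoleLog.push(chunk), // stand-in for console display
  (chunk) => wsLog.push(chunk),      // stand-in for a WebSocket client
);
onTextChunk('Hello');
// both consumers receive every chunk
```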
## Performance Characteristics
### Latency vs Throughput
```
Time to First Token (TTFT):
β”œβ”€ Small model (1.7B): ~100ms
β”œβ”€ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms
Tokens Per Second:
β”œβ”€ Small model: 50-80 tok/s
β”œβ”€ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s
User Experience:
TTFT < 500ms β†’ Feels instant
Tok/s > 20 β†’ Reads naturally
```
### Resource Trade-offs
```
Model Size Memory Speed Quality
────────── ──────── ───── ───────
1.7B ~2GB Fast Good
8B ~6GB Medium Better
20B ~12GB Slower Best
```
## Advanced Concepts
### Buffering Strategies
**No Buffer (Immediate)**
```
Every token β†’ callback β†’ display
└─ Smoothest UX but more overhead
```
**Line Buffer**
```
Accumulate until newline β†’ flush
└─ Better for paragraph-based output
```
**Time Buffer**
```
Accumulate for 50ms β†’ flush batch
└─ Reduces callback frequency
```
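The line-buffer strategy is the easiest of the three to show deterministically; this sketch wraps a consumer so it only sees complete lines:

```javascript
// Line buffer: accumulate chunks until a newline arrives, then
// flush all complete lines and keep the partial tail.
function lineBuffered(flush) {
  let buffer = '';
  return (chunk) => {
    buffer += chunk;
    const lastNewline = buffer.lastIndexOf('\n');
    if (lastNewline !== -1) {
      flush(buffer.slice(0, lastNewline + 1)); // complete lines only
      buffer = buffer.slice(lastNewline + 1);  // partial line stays buffered
    }
  };
}

const flushed = [];
const onTextChunk = lineBuffered((lines) => flushed.push(lines));
onTextChunk('first li');
onTextChunk('ne\nsecond'); // flushes 'first line\n', buffers 'second'
```

A time buffer works the same way except the flush is triggered by a timer (e.g. every 50ms) instead of a newline.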
### Early Stopping
```
Generation in progress:
"The answer is clearly... wait, actually..."
↑
onTextChunk detects issue
↓
Stop generation
↓
"Let me reconsider"
```
Useful for:
- Detecting off-topic responses
- Safety filters
- Relevance checking
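A minimal detector for this pattern watches the accumulating text for a trigger phrase and signals the caller. How you actually abort generation depends on the library (for example an AbortSignal or a stop flag), so that part is left as a callback here:

```javascript
// Watch the growing response for trigger phrases; call onStop once
// when one appears, then ignore further chunks.
function makeStopDetector(triggers, onStop) {
  let text = '';
  let stopped = false;
  return (chunk) => {
    if (stopped) return;
    text += chunk;
    if (triggers.some((t) => text.includes(t))) {
      stopped = true;
      onStop(text); // e.g. abortController.abort()
    }
  };
}

let aborted = false;
const onTextChunk = makeStopDetector(['wait, actually'], () => { aborted = true; });
onTextChunk('The answer is clearly... ');
onTextChunk('wait, actually...');
// aborted is now true
```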
### Progressive Enhancement
```
Partial Response Analysis:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "To implement this feature..." β”‚
β”‚ β”‚
β”‚ ← Already useful information β”‚
β”‚ β”‚
β”‚ "...you'll need: 1) Node.js" β”‚
β”‚ β”‚
β”‚ ← Can start acting on this β”‚
β”‚ β”‚
β”‚ "2) Express framework" β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Agent can begin working before response completes!
```
## Context Size Awareness
### Why It Matters
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Context Window (4096) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ System Prompt 200 tokens β”‚
β”‚ Conversation History 1000 β”‚
β”‚ Current Prompt 100 β”‚
β”‚ Response Space 2796 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
If maxTokens > 2796:
└─→ Error or truncation!
```
### Dynamic Adjustment
```
Available = contextSize - (prompt + history)
if (maxTokens > available) {
maxTokens = available;
// or clear old history
}
```
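The adjustment above as a runnable helper (parameter names are illustrative):

```javascript
// Clamp the requested maxTokens to what the context window can
// still hold after the prompt and history are accounted for.
function clampMaxTokens(contextSize, promptTokens, historyTokens, maxTokens) {
  const available = contextSize - (promptTokens + historyTokens);
  return Math.min(maxTokens, Math.max(0, available));
}

const fits = clampMaxTokens(4096, 100, 1000, 2000);    // fits: 2000
const clamped = clampMaxTokens(4096, 100, 3500, 2000); // clamped to 496
```

The alternative mentioned in the comment, clearing old history instead of shrinking the response, trades conversational memory for answer length; which is right depends on the application.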
## Streaming in Agent Architectures
### Simple Agent
```
User β†’ LLM (streaming) β†’ Display
└─ onTextChunk shows progress
```
### Multi-Step Agent
```
Step 1: Plan (stream) β†’ Show thinking
Step 2: Act (stream) β†’ Show action
Step 3: Result (stream) β†’ Show outcome
└─ User sees agent's process
```
### Collaborative Agents
```
Agent A (streaming) ──┐
β”œβ”€β†’ Coordinator β†’ User
Agent B (streaming) β”€β”€β”˜
└─ Both stream simultaneously
```
## Best Practices
### 1. Always Set maxTokens
```
βœ“ Good:
session.prompt(query, { maxTokens: 2000 })
βœ— Risky:
session.prompt(query)
└─ May use entire context!
```
### 2. Handle Partial Updates
```
let fullResponse = '';
let complete = false;
onTextChunk: (chunk) => {
  fullResponse += chunk;
  display(chunk); // Show immediately
}
// After generation finishes:
complete = true;
saveToDatabase(fullResponse);
### 3. Provide Feedback
```
onTextChunk: (chunk) => {
if (firstChunk) {
showLoadingDone();
firstChunk = false;
}
appendToDisplay(chunk);
}
```
### 4. Monitor Performance
```
const startTime = Date.now();
let tokenCount = 0;
onTextChunk: (chunk) => {
tokenCount += estimateTokens(chunk);
const elapsed = (Date.now() - startTime) / 1000;
const tokensPerSecond = tokenCount / elapsed;
updateMetrics(tokensPerSecond);
}
```
## Key Takeaways
1. **Streaming improves UX**: Users see progress immediately
2. **maxTokens controls cost**: Prevents runaway generation
3. **Token-by-token generation**: LLMs produce one token at a time
4. **onTextChunk callback**: Your hook into the generation process
5. **Context awareness matters**: Monitor available space
6. **Essential for production**: Real-time systems need streaming
## Comparison
```
Feature intro.js coding.js (this)
──────────────── ───────── ─────────────────
Streaming βœ— βœ“
Token limit βœ— βœ“ (2000)
Real-time output βœ— βœ“
Progress visible βœ— βœ“
User control βœ— βœ“
```
This pattern is foundational for building responsive, user-friendly AI agent interfaces.