# Concept: Streaming & Response Control

## Overview

This example demonstrates **streaming responses** and **token limits**, two essential techniques for building responsive AI agents with controlled output.

## The Streaming Problem

### Traditional (Non-Streaming) Approach

```
User sends prompt
       ↓
   [Wait 10 seconds...]
       ↓
Complete response appears all at once
```

**Problems:**
- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive

### Streaming Approach (This Example)

```
User sends prompt
       ↓
"Hoisting" (0.1s) → User sees first word!
       ↓
"is a" (0.2s) → More text appears
       ↓
"JavaScript" (0.3s) → Continuous feedback
       ↓
[Continues token by token...]
```

**Benefits:**
- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive

## How Streaming Works

### Token-by-Token Generation

LLMs generate one token at a time internally. Streaming exposes this:

```
Internal LLM Process:
┌─────────────────────────────────────┐
│  Token 1: "Hoisting"                │
│  Token 2: "is"                      │
│  Token 3: "a"                       │
│  Token 4: "JavaScript"              │
│  Token 5: "mechanism"               │
│  ...                                │
└─────────────────────────────────────┘

Without Streaming:        With Streaming:
Wait for all tokens       Emit each token immediately
└─→ Buffer → Return      └─→ Callback → Display
```

### The onTextChunk Callback

```
┌────────────────────────────────────┐
│        Model Generation            │
└────────────┬───────────────────────┘
             │
    ┌────────┴─────────┐
    │  Each new token  │
    └────────┬─────────┘
             ↓
    ┌────────────────────┐
    │ onTextChunk(text)  │  ← Your callback
    └────────┬───────────┘
             ↓
    Your code processes it:
    • Display to user
    • Send over network
    • Log to file
    • Analyze content
```

## Token Limits: maxTokens

### Why Limit Output?

Without limits, models might generate:
```
User: "Explain hoisting"
Model: [Generates 10,000 words including:
        - Complete JavaScript history
        - Every edge case
        - Unrelated examples
        - Never stops...]
```

With limits:
```
User: "Explain hoisting"
Model: [Generates ~1500 words
        - Core concept
        - Key examples
        - Stops at 2000 tokens]
```

### Token Budgeting

```
Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens

Total used: 2300 tokens
Available: 1796 tokens for future conversation
```

### Cost vs Quality

```
Token Limit        Output Quality      Use Case
───────────       ───────────────     ─────────────────
100               Brief, may be cut   Quick answers
500               Concise but complete Short explanations
2000 (example)    Detailed            Full explanations
No limit          Risk of rambling    When length unknown
```

## Real-Time Applications

### Pattern 1: Interactive CLI

```
User: "Explain closures"
       ↓
Terminal: "A closure is a function..."
         (Appears word by word, like typing)
       ↓
User sees progress, knows it's working
```

### Pattern 2: Web Application

```
Browser                    Server
   │                         │
   ├─── Send prompt ────────→│
   │                         │
   │←── Chunk 1: "Closures"──┤
   │    (Display immediately) │
   │                         │
   │←── Chunk 2: "are"───────┤
   │    (Append to display)  │
   │                         │
   │←── Chunk 3: "functions"─┤
   │    (Keep appending...)  │
```

Implementation:
- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming

### Pattern 3: Multi-Consumer

```
         onTextChunk(text)
                │
        ┌───────┼───────┐
        ↓       ↓       ↓
    Console  WebSocket  Log File
    Display  → Client   → Storage
```

## Performance Characteristics

### Latency vs Throughput

```
Time to First Token (TTFT):
├─ Small model (1.7B): ~100ms
├─ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms

Tokens Per Second:
├─ Small model: 50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s

User Experience:
TTFT < 500ms → Feels instant
Tok/s > 20 → Reads naturally
```

### Resource Trade-offs

```
Model Size      Memory    Speed     Quality
──────────     ────────   ─────     ───────
1.7B           ~2GB       Fast      Good
8B             ~6GB       Medium    Better
20B            ~12GB      Slower    Best
```

## Advanced Concepts

### Buffering Strategies

**No Buffer (Immediate)**
```
Every token → callback → display
└─ Smoothest UX but more overhead
```

**Line Buffer**
```
Accumulate until newline → flush
└─ Better for paragraph-based output
```

**Time Buffer**
```
Accumulate for 50ms → flush batch
└─ Reduces callback frequency
```

### Early Stopping

```
Generation in progress:
"The answer is clearly... wait, actually..."
                         ↑
                  onTextChunk detects issue
                         ↓
                   Stop generation
                         ↓
              "Let me reconsider"
```

Useful for:
- Detecting off-topic responses
- Safety filters
- Relevance checking

### Progressive Enhancement

```
Partial Response Analysis:
┌─────────────────────────────────┐
│ "To implement this feature..."  │
│                                 │
│ ← Already useful information   │
│                                 │
│ "...you'll need: 1) Node.js"    │
│                                 │
│ ← Can start acting on this     │
│                                 │
│ "2) Express framework"          │
└─────────────────────────────────┘

Agent can begin working before response completes!
```

## Context Size Awareness

### Why It Matters

```
┌────────────────────────────────┐
│    Context Window (4096)       │
├────────────────────────────────┤
│ System Prompt       200 tokens │
│ Conversation History 1000      │
│ Current Prompt      100        │
│ Response Space      2796       │
└────────────────────────────────┘

If maxTokens > 2796:
└─→ Error or truncation!
```

### Dynamic Adjustment

```
Available = contextSize - (prompt + history)

if (maxTokens > available) {
    maxTokens = available;
    // or clear old history
}
```

## Streaming in Agent Architectures

### Simple Agent

```
User → LLM (streaming) → Display
       └─ onTextChunk shows progress
```

### Multi-Step Agent

```
Step 1: Plan (stream) → Show thinking
Step 2: Act (stream) → Show action
Step 3: Result (stream) → Show outcome
       └─ User sees agent's process
```

### Collaborative Agents

```
Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
       └─ Both stream simultaneously
```

## Best Practices

### 1. Always Set maxTokens

```
✓ Good:
session.prompt(query, { maxTokens: 2000 })

✗ Risky:
session.prompt(query)
└─ May use entire context!
```

### 2. Handle Partial Updates

```
let fullResponse = '';
onTextChunk: (chunk) => {
    fullResponse += chunk;
    display(chunk);        // Show immediately
    logComplete = false;   // Mark incomplete
}
// After completion:
saveToDatabase(fullResponse);
```

### 3. Provide Feedback

```
onTextChunk: (chunk) => {
    if (firstChunk) {
        showLoadingDone();
        firstChunk = false;
    }
    appendToDisplay(chunk);
}
```

### 4. Monitor Performance

```
const startTime = Date.now();
let tokenCount = 0;

onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk);
    const elapsed = (Date.now() - startTime) / 1000;
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
}
```

## Key Takeaways

1. **Streaming improves UX**: Users see progress immediately
2. **maxTokens controls cost**: Prevents runaway generation
3. **Token-by-token generation**: LLMs produce one token at a time
4. **onTextChunk callback**: Your hook into the generation process
5. **Context awareness matters**: Monitor available space
6. **Essential for production**: Real-time systems need streaming

## Comparison

```
Feature           intro.js    coding.js (this)
────────────────  ─────────   ─────────────────
Streaming         ✗           ✓
Token limit       ✗           ✓ (2000)
Real-time output  ✗           ✓
Progress visible  ✗           ✓
User control      ✗           ✓
```

This pattern is foundational for building responsive, user-friendly AI agent interfaces.