Concept: Streaming & Response Control

Overview

This example demonstrates streaming responses and token limits, two essential techniques for building responsive AI agents with controlled output.

The Streaming Problem

Traditional (Non-Streaming) Approach

User sends prompt
       ↓
   [Wait 10 seconds...]
       ↓
Complete response appears all at once

Problems:

  • Poor user experience (long wait)
  • No progress indication
  • Can't interrupt bad responses
  • Feels unresponsive

Streaming Approach (This Example)

User sends prompt
       ↓
"Hoisting" (0.1s) β†’ User sees first word!
       ↓
"is a" (0.2s) β†’ More text appears
       ↓
"JavaScript" (0.3s) β†’ Continuous feedback
       ↓
[Continues token by token...]

Benefits:

  • Immediate feedback
  • Progress visible
  • Can interrupt early
  • Feels interactive

How Streaming Works

Token-by-Token Generation

LLMs generate one token at a time internally. Streaming exposes this:

Internal LLM Process:
┌─────────────────────────────────────┐
│  Token 1: "Hoisting"                │
│  Token 2: "is"                      │
│  Token 3: "a"                       │
│  Token 4: "JavaScript"              │
│  Token 5: "mechanism"               │
│  ...                                │
└─────────────────────────────────────┘

Without Streaming:        With Streaming:
Wait for all tokens       Emit each token immediately
└─→ Buffer → Return       └─→ Callback → Display
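The contrast can be sketched with a toy generator standing in for the model (the five tokens are hard-coded here; a real model produces them incrementally):

```javascript
// Toy stand-in for a model that produces one token at a time.
function* generateTokens() {
  yield* ["Hoisting", " is", " a", " JavaScript", " mechanism"];
}

// Non-streaming: buffer every token, return only the finished text.
function completeResponse() {
  let buffered = "";
  for (const token of generateTokens()) buffered += token;
  return buffered; // caller sees nothing until this returns
}

// Streaming: hand each token to a callback the moment it exists.
function streamResponse(onTextChunk) {
  for (const token of generateTokens()) onTextChunk(token);
}
```

With `streamResponse((t) => process.stdout.write(t))`, text appears as it is generated instead of all at once.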

The onTextChunk Callback

┌────────────────────────────────────┐
│        Model Generation            │
└────────────┬───────────────────────┘
             │
    ┌────────┴─────────┐
    │  Each new token  │
    └────────┬─────────┘
             ↓
    ┌────────────────────┐
    │ onTextChunk(text)  │  ← Your callback
    └────────┬───────────┘
             ↓
    Your code processes it:
    • Display to user
    • Send over network
    • Log to file
    • Analyze content

Token Limits: maxTokens

Why Limit Output?

Without limits, models might generate:

User: "Explain hoisting"
Model: [Generates 10,000 words including:
        - Complete JavaScript history
        - Every edge case
        - Unrelated examples
        - Never stops...]

With limits:

User: "Explain hoisting"
Model: [Generates ~1500 words
        - Core concept
        - Key examples
        - Stops at 2000 tokens]

Token Budgeting

Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens

Total used: 2300 tokens
Available: 1796 tokens for future conversation
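A budget like this can be checked up front. The helper below is a sketch; in practice the token counts come from the model's tokenizer rather than fixed numbers:

```javascript
// Reserve space for each fixed piece of a request and report what's left.
function tokenBudget({ contextSize, systemPrompt, userMessage, maxTokens }) {
  const used = systemPrompt + userMessage + maxTokens;
  return { used, remaining: contextSize - used };
}
```

With the figures above, `tokenBudget({ contextSize: 4096, systemPrompt: 200, userMessage: 100, maxTokens: 2000 })` reports 2300 tokens used and 1796 remaining for history.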

Cost vs Quality

Token Limit     Output Quality        Use Case
───────────     ──────────────        ───────────────────
100             Brief, may be cut     Quick answers
500             Concise but complete  Short explanations
2000 (example)  Detailed              Full explanations
No limit        Risk of rambling      When length unknown

Real-Time Applications

Pattern 1: Interactive CLI

User: "Explain closures"
       ↓
Terminal: "A closure is a function..."
         (Appears word by word, like typing)
       ↓
User sees progress, knows it's working

Pattern 2: Web Application

Browser                    Server
   │                         │
   ├─── Send prompt ────────→│
   │                         │
   │←── Chunk 1: "Closures"──┤
   │    (Display immediately)│
   │                         │
   │←── Chunk 2: "are"───────┤
   │    (Append to display)  │
   │                         │
   │←── Chunk 3: "functions"─┤
   │    (Keep appending...)  │

Implementation:

  • Server-Sent Events (SSE)
  • WebSockets
  • HTTP streaming

Pattern 3: Multi-Consumer

         onTextChunk(text)
                │
        ┌───────┼───────┐
        ↓       ↓       ↓
    Console  WebSocket  Log File
    Display  → Client   → Storage
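Fan-out needs no special machinery: one callback that forwards each chunk to every consumer. The consumer names here are placeholders:

```javascript
// Build a single onTextChunk-style callback that forwards each
// chunk to every registered consumer in order.
function fanOut(consumers) {
  return (chunk) => {
    for (const consume of consumers) consume(chunk);
  };
}
```

Usage might look like `const onTextChunk = fanOut([displayChunk, sendToClient, appendToLog])`, where the three consumers are whatever functions your application defines.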

Performance Characteristics

Latency vs Throughput

Time to First Token (TTFT):
├─ Small model (1.7B): ~100ms
├─ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms

Tokens Per Second:
├─ Small model: 50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s

User Experience:
TTFT < 500ms → Feels instant
Tok/s > 20   → Reads naturally

Resource Trade-offs

Model Size      Memory    Speed     Quality
──────────     ────────   ─────     ───────
1.7B           ~2GB       Fast      Good
8B             ~6GB       Medium    Better
20B            ~12GB      Slower    Best

Advanced Concepts

Buffering Strategies

No Buffer (Immediate)

Every token → callback → display
└─ Smoothest UX but more overhead

Line Buffer

Accumulate until newline → flush
└─ Better for paragraph-based output

Time Buffer

Accumulate for 50ms → flush batch
└─ Reduces callback frequency
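As a concrete instance, a line buffer is a thin wrapper around the downstream callback. This is a sketch not tied to any particular API; note that `flush()` must run after generation ends so trailing text without a newline is not lost:

```javascript
// Accumulate incoming chunks and only pass complete lines downstream.
function lineBuffer(onLine) {
  let pending = "";
  return {
    push(chunk) {
      pending += chunk;
      let idx;
      // Flush every complete line currently in the buffer.
      while ((idx = pending.indexOf("\n")) !== -1) {
        onLine(pending.slice(0, idx + 1));
        pending = pending.slice(idx + 1);
      }
    },
    flush() {
      // Emit any trailing text once generation is finished.
      if (pending) onLine(pending);
      pending = "";
    },
  };
}
```

A time buffer follows the same shape, replacing the newline check with a timer that flushes the pending text every ~50ms.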

Early Stopping

Generation in progress:
"The answer is clearly... wait, actually..."
                         ↑
                  onTextChunk detects issue
                         ↓
                   Stop generation
                         ↓
              "Let me reconsider"

Useful for:

  • Detecting off-topic responses
  • Safety filters
  • Relevance checking

Progressive Enhancement

Partial Response Analysis:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "To implement this feature..."  β”‚
β”‚                                 β”‚
β”‚ ← Already useful information   β”‚
β”‚                                 β”‚
β”‚ "...you'll need: 1) Node.js"    β”‚
β”‚                                 β”‚
β”‚ ← Can start acting on this     β”‚
β”‚                                 β”‚
β”‚ "2) Express framework"          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Agent can begin working before response completes!
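Acting on partial output means parsing it incrementally. The step-extractor below is a toy that matches the "n) item" pattern from the box above; a real agent would use a parser suited to its output format:

```javascript
// Pull out numbered steps ("1) Node.js", "2) Express framework", ...)
// from whatever text has streamed in so far.
function extractSteps(partialText) {
  const steps = [];
  const re = /\d+\)\s*([A-Za-z. ]+)/g;
  let match;
  while ((match = re.exec(partialText)) !== null) {
    steps.push(match[1].trim());
  }
  return steps;
}
```

Calling this from `onTextChunk` on the accumulated text lets the agent start installing `Node.js` before the list is even finished.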

Context Size Awareness

Why It Matters

┌────────────────────────────────┐
│    Context Window (4096)       │
├────────────────────────────────┤
│ System Prompt        200 tokens│
│ Conversation History 1000      │
│ Current Prompt       100       │
│ Response Space       2796      │
└────────────────────────────────┘

If maxTokens > 2796:
└─→ Error or truncation!

Dynamic Adjustment

const available = contextSize - (promptTokens + historyTokens);

if (maxTokens > available) {
    maxTokens = available;
    // or trim old conversation history to free space
}
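Wrapped as a function (a sketch; `usedTokens` would come from counting the system prompt, history, and current prompt with the tokenizer):

```javascript
// Never request more response tokens than the context window can hold.
function clampMaxTokens(requested, contextSize, usedTokens) {
  const available = contextSize - usedTokens;
  return Math.min(requested, Math.max(available, 0));
}
```

With the 4096-token window above and 1300 tokens already used, a request for 4000 tokens is clamped to the 2796 actually available.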

Streaming in Agent Architectures

Simple Agent

User → LLM (streaming) → Display
       └─ onTextChunk shows progress

Multi-Step Agent

Step 1: Plan (stream) → Show thinking
Step 2: Act (stream) → Show action
Step 3: Result (stream) → Show outcome
       └─ User sees agent's process

Collaborative Agents

Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
       └─ Both stream simultaneously
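One way to keep simultaneous streams readable is to tag each agent's chunks before they reach the shared output. The names and tag format here are illustrative:

```javascript
// Build per-agent callbacks that label chunks before forwarding them
// to one shared output, so interleaved streams stay attributable.
function makeCoordinator(onTextChunk) {
  return (agentName) => (chunk) => onTextChunk(`[${agentName}] ${chunk}`);
}
```

Each agent then streams through its own labeled callback: `const agentA = forAgent("A")`, `const agentB = forAgent("B")`.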

Best Practices

1. Always Set maxTokens

✓ Good:
session.prompt(query, { maxTokens: 2000 })

✗ Risky:
session.prompt(query)
└─ May use entire context!

2. Handle Partial Updates

let fullResponse = '';

onTextChunk: (chunk) => {
    fullResponse += chunk;
    display(chunk);        // Show the chunk immediately
}

// After generation completes, the accumulated text is ready to persist:
saveToDatabase(fullResponse);

3. Provide Feedback

onTextChunk: (chunk) => {
    if (firstChunk) {
        showLoadingDone();
        firstChunk = false;
    }
    appendToDisplay(chunk);
}

4. Monitor Performance

const startTime = Date.now();
let tokenCount = 0;

onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk);
    const elapsed = (Date.now() - startTime) / 1000;
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
}
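The metric snippet above can be packaged with an injectable clock so it is testable. The chars-per-token estimate (~4 characters per token) is a rough heuristic, not an exact count:

```javascript
// Track time-to-first-token and throughput across a stream of chunks.
// `now` is injectable so the tracker can be driven by a fake clock.
function makeStreamMetrics(now = Date.now) {
  const start = now();
  let firstTokenAt = null;
  let tokenCount = 0;
  return {
    record(chunk) {
      if (firstTokenAt === null) firstTokenAt = now();
      // Rough heuristic: ~4 characters per token.
      tokenCount += Math.max(1, Math.round(chunk.length / 4));
    },
    snapshot() {
      const elapsed = (now() - start) / 1000;
      return {
        ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
        tokensPerSecond: elapsed > 0 ? tokenCount / elapsed : 0,
      };
    },
  };
}
```

Calling `record(chunk)` from `onTextChunk` and `snapshot()` after completion yields the TTFT and tok/s figures discussed under Performance Characteristics.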

Key Takeaways

  1. Streaming improves UX: Users see progress immediately
  2. maxTokens controls cost: Prevents runaway generation
  3. Token-by-token generation: LLMs produce one token at a time
  4. onTextChunk callback: Your hook into the generation process
  5. Context awareness matters: Monitor available space
  6. Essential for production: Real-time systems need streaming

Comparison

Feature           intro.js    coding.js (this)
────────────────  ──────────  ────────────────
Streaming         ✗           ✓
Token limit       ✗           ✓ (2000)
Real-time output  ✗           ✓
Progress visible  ✗           ✓
User control      ✗           ✓

This pattern is foundational for building responsive, user-friendly AI agent interfaces.