Concept: Streaming & Response Control

Overview

This example demonstrates streaming responses and token limits, two essential techniques for building responsive AI agents with controlled output.

The Streaming Problem

Traditional (Non-Streaming) Approach

User sends prompt
       ↓
   [Wait 10 seconds...]
       ↓
Complete response appears all at once

Problems:

  • Poor user experience (long wait)
  • No progress indication
  • Can't interrupt bad responses
  • Feels unresponsive

Streaming Approach (This Example)

User sends prompt
       ↓
"Hoisting" (0.1s) β†’ User sees first word!
       ↓
"is a" (0.2s) β†’ More text appears
       ↓
"JavaScript" (0.3s) β†’ Continuous feedback
       ↓
[Continues token by token...]

Benefits:

  • Immediate feedback
  • Progress visible
  • Can interrupt early
  • Feels interactive

How Streaming Works

Token-by-Token Generation

LLMs generate one token at a time internally. Streaming exposes this:

Internal LLM Process:
┌─────────────────────────────────────┐
│  Token 1: "Hoisting"                │
│  Token 2: "is"                      │
│  Token 3: "a"                       │
│  Token 4: "JavaScript"              │
│  Token 5: "mechanism"               │
│  ...                                │
└─────────────────────────────────────┘

Without Streaming:        With Streaming:
Wait for all tokens       Emit each token immediately
└─→ Buffer → Return       └─→ Callback → Display
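The contrast can be sketched with a toy generator standing in for the model (the five tokens are hard-coded here; a real model produces them incrementally):

```javascript
// Toy stand-in for a model that produces one token at a time.
function* generateTokens() {
  yield* ["Hoisting", " is", " a", " JavaScript", " mechanism"];
}

// Non-streaming: buffer every token, return only the finished text.
function completeResponse() {
  let buffered = "";
  for (const token of generateTokens()) buffered += token;
  return buffered; // caller sees nothing until this returns
}

// Streaming: hand each token to a callback the moment it exists.
function streamResponse(onTextChunk) {
  for (const token of generateTokens()) onTextChunk(token);
}
```

With `streamResponse((t) => process.stdout.write(t))`, text appears as it is generated instead of all at once.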

The onTextChunk Callback

┌────────────────────────────────────┐
│        Model Generation            │
└────────────┬───────────────────────┘
             │
    ┌────────┴─────────┐
    │  Each new token  │
    └────────┬─────────┘
             ↓
    ┌────────────────────┐
    │ onTextChunk(text)  │  ← Your callback
    └────────┬───────────┘
             ↓
    Your code processes it:
    • Display to user
    • Send over network
    • Log to file
    • Analyze content

Token Limits: maxTokens

Why Limit Output?

Without limits, models might generate:

User: "Explain hoisting"
Model: [Generates 10,000 words including:
        - Complete JavaScript history
        - Every edge case
        - Unrelated examples
        - Never stops...]

With limits:

User: "Explain hoisting"
Model: [Generates ~1500 words
        - Core concept
        - Key examples
        - Stops at 2000 tokens]

Token Budgeting

Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens

Total used: 2300 tokens
Available: 1796 tokens for future conversation
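A budget like this can be checked up front. The helper below is a sketch; in practice the token counts come from the model's tokenizer rather than fixed numbers:

```javascript
// Reserve space for each fixed piece of a request and report what's left.
function tokenBudget({ contextSize, systemPrompt, userMessage, maxTokens }) {
  const used = systemPrompt + userMessage + maxTokens;
  return { used, remaining: contextSize - used };
}
```

With the figures above, `tokenBudget({ contextSize: 4096, systemPrompt: 200, userMessage: 100, maxTokens: 2000 })` reports 2300 tokens used and 1796 remaining for history.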

Cost vs Quality

Token Limit     Output Quality        Use Case
───────────     ──────────────        ───────────────────
100             Brief, may be cut     Quick answers
500             Concise but complete  Short explanations
2000 (example)  Detailed              Full explanations
No limit        Risk of rambling      When length unknown

Real-Time Applications

Pattern 1: Interactive CLI

User: "Explain closures"
       ↓
Terminal: "A closure is a function..."
         (Appears word by word, like typing)
       ↓
User sees progress, knows it's working

Pattern 2: Web Application

Browser                    Server
   │                         │
   ├─── Send prompt ────────→│
   │                         │
   │←── Chunk 1: "Closures"──┤
   │    (Display immediately)│
   │                         │
   │←── Chunk 2: "are"───────┤
   │    (Append to display)  │
   │                         │
   │←── Chunk 3: "functions"─┤
   │    (Keep appending...)  │

Implementation:

  • Server-Sent Events (SSE)
  • WebSockets
  • HTTP streaming

Pattern 3: Multi-Consumer

         onTextChunk(text)
                │
        ┌───────┼───────┐
        ↓       ↓       ↓
    Console  WebSocket  Log File
    Display  → Client   → Storage
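Fan-out needs no special machinery: one callback that forwards each chunk to every consumer. The consumer names here are placeholders:

```javascript
// Build a single onTextChunk-style callback that forwards each
// chunk to every registered consumer in order.
function fanOut(consumers) {
  return (chunk) => {
    for (const consume of consumers) consume(chunk);
  };
}
```

Usage might look like `const onTextChunk = fanOut([displayChunk, sendToClient, appendToLog])`, where the three consumers are whatever functions your application defines.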

Performance Characteristics

Latency vs Throughput

Time to First Token (TTFT):
├─ Small model (1.7B): ~100ms
├─ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms

Tokens Per Second:
├─ Small model: 50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s

User Experience:
TTFT < 500ms → Feels instant
Tok/s > 20   → Reads naturally

Resource Trade-offs

Model Size      Memory    Speed     Quality
──────────     ────────   ─────     ───────
1.7B           ~2GB       Fast      Good
8B             ~6GB       Medium    Better
20B            ~12GB      Slower    Best

Advanced Concepts

Buffering Strategies

No Buffer (Immediate)

Every token → callback → display
└─ Smoothest UX but more overhead

Line Buffer

Accumulate until newline → flush
└─ Better for paragraph-based output

Time Buffer

Accumulate for 50ms → flush batch
└─ Reduces callback frequency
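As a concrete instance, a line buffer is a thin wrapper around the downstream callback. This is a sketch not tied to any particular API; note that `flush()` must run after generation ends so trailing text without a newline is not lost:

```javascript
// Accumulate incoming chunks and only pass complete lines downstream.
function lineBuffer(onLine) {
  let pending = "";
  return {
    push(chunk) {
      pending += chunk;
      let idx;
      // Flush every complete line currently in the buffer.
      while ((idx = pending.indexOf("\n")) !== -1) {
        onLine(pending.slice(0, idx + 1));
        pending = pending.slice(idx + 1);
      }
    },
    flush() {
      // Emit any trailing text once generation is finished.
      if (pending) onLine(pending);
      pending = "";
    },
  };
}
```

A time buffer follows the same shape, replacing the newline check with a timer that flushes the pending text every ~50ms.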

Early Stopping

Generation in progress:
"The answer is clearly... wait, actually..."
                         ↑
                  onTextChunk detects issue
                         ↓
                   Stop generation
                         ↓
              "Let me reconsider"

Useful for:

  • Detecting off-topic responses
  • Safety filters
  • Relevance checking

Progressive Enhancement

Partial Response Analysis:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "To implement this feature..."  β”‚
β”‚                                 β”‚
β”‚ ← Already useful information   β”‚
β”‚                                 β”‚
β”‚ "...you'll need: 1) Node.js"    β”‚
β”‚                                 β”‚
β”‚ ← Can start acting on this     β”‚
β”‚                                 β”‚
β”‚ "2) Express framework"          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Agent can begin working before response completes!
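Acting on partial output means parsing it incrementally. The step-extractor below is a toy that matches the "n) item" pattern from the box above; a real agent would use a parser suited to its output format:

```javascript
// Pull out numbered steps ("1) Node.js", "2) Express framework", ...)
// from whatever text has streamed in so far.
function extractSteps(partialText) {
  const steps = [];
  const re = /\d+\)\s*([A-Za-z. ]+)/g;
  let match;
  while ((match = re.exec(partialText)) !== null) {
    steps.push(match[1].trim());
  }
  return steps;
}
```

Calling this from `onTextChunk` on the accumulated text lets the agent start installing `Node.js` before the list is even finished.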

Context Size Awareness

Why It Matters

┌────────────────────────────────┐
│    Context Window (4096)       │
├────────────────────────────────┤
│ System Prompt        200 tokens│
│ Conversation History 1000      │
│ Current Prompt       100       │
│ Response Space       2796      │
└────────────────────────────────┘

If maxTokens > 2796:
└─→ Error or truncation!

Dynamic Adjustment

const available = contextSize - (promptTokens + historyTokens);

if (maxTokens > available) {
    maxTokens = available;
    // or trim old conversation history to free space
}
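Wrapped as a function (a sketch; `usedTokens` would come from counting the system prompt, history, and current prompt with the tokenizer):

```javascript
// Never request more response tokens than the context window can hold.
function clampMaxTokens(requested, contextSize, usedTokens) {
  const available = contextSize - usedTokens;
  return Math.min(requested, Math.max(available, 0));
}
```

With the 4096-token window above and 1300 tokens already used, a request for 4000 tokens is clamped to the 2796 actually available.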

Streaming in Agent Architectures

Simple Agent

User → LLM (streaming) → Display
       └─ onTextChunk shows progress

Multi-Step Agent

Step 1: Plan (stream) → Show thinking
Step 2: Act (stream) → Show action
Step 3: Result (stream) → Show outcome
       └─ User sees agent's process

Collaborative Agents

Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
       └─ Both stream simultaneously
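One way to keep simultaneous streams readable is to tag each agent's chunks before they reach the shared output. The names and tag format here are illustrative:

```javascript
// Build per-agent callbacks that label chunks before forwarding them
// to one shared output, so interleaved streams stay attributable.
function makeCoordinator(onTextChunk) {
  return (agentName) => (chunk) => onTextChunk(`[${agentName}] ${chunk}`);
}
```

Each agent then streams through its own labeled callback: `const agentA = forAgent("A")`, `const agentB = forAgent("B")`.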

Best Practices

1. Always Set maxTokens

✓ Good:
session.prompt(query, { maxTokens: 2000 })

✗ Risky:
session.prompt(query)
└─ May use entire context!

2. Handle Partial Updates

let fullResponse = '';

onTextChunk: (chunk) => {
    fullResponse += chunk;
    display(chunk);        // Show the chunk immediately
}

// After generation completes, the accumulated text is ready to persist:
saveToDatabase(fullResponse);

3. Provide Feedback

onTextChunk: (chunk) => {
    if (firstChunk) {
        showLoadingDone();
        firstChunk = false;
    }
    appendToDisplay(chunk);
}

4. Monitor Performance

const startTime = Date.now();
let tokenCount = 0;

onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk);
    const elapsed = (Date.now() - startTime) / 1000;
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
}
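The metric snippet above can be packaged with an injectable clock so it is testable. The chars-per-token estimate (~4 characters per token) is a rough heuristic, not an exact count:

```javascript
// Track time-to-first-token and throughput across a stream of chunks.
// `now` is injectable so the tracker can be driven by a fake clock.
function makeStreamMetrics(now = Date.now) {
  const start = now();
  let firstTokenAt = null;
  let tokenCount = 0;
  return {
    record(chunk) {
      if (firstTokenAt === null) firstTokenAt = now();
      // Rough heuristic: ~4 characters per token.
      tokenCount += Math.max(1, Math.round(chunk.length / 4));
    },
    snapshot() {
      const elapsed = (now() - start) / 1000;
      return {
        ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
        tokensPerSecond: elapsed > 0 ? tokenCount / elapsed : 0,
      };
    },
  };
}
```

Calling `record(chunk)` from `onTextChunk` and `snapshot()` after completion yields the TTFT and tok/s figures discussed under Performance Characteristics.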

Key Takeaways

  1. Streaming improves UX: Users see progress immediately
  2. maxTokens controls cost: Prevents runaway generation
  3. Token-by-token generation: LLMs produce one token at a time
  4. onTextChunk callback: Your hook into the generation process
  5. Context awareness matters: Monitor available space
  6. Essential for production: Real-time systems need streaming

Comparison

Feature           intro.js    coding.js (this)
────────────────  ──────────  ────────────────
Streaming         ✗           ✓
Token limit       ✗           ✓ (2000)
Real-time output  ✗           ✓
Progress visible  ✗           ✓
User control      ✗           ✓

This pattern is foundational for building responsive, user-friendly AI agent interfaces.