# Concept: Streaming & Response Control

## Overview

This example demonstrates **streaming responses** and **token limits**, two essential techniques for building responsive AI agents with controlled output.

## The Streaming Problem

### Traditional (Non-Streaming) Approach

```
User sends prompt
       ↓
   [Wait 10 seconds...]
       ↓
Complete response appears all at once
```

**Problems:**
- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive

### Streaming Approach (This Example)

```
User sends prompt
       ↓
"Hoisting" (0.1s) → User sees first word!
       ↓
"is a" (0.2s) → More text appears
       ↓
"JavaScript" (0.3s) → Continuous feedback
       ↓
[Continues token by token...]
```

**Benefits:**
- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive

## How Streaming Works

### Token-by-Token Generation

LLMs generate one token at a time internally. Streaming exposes this:

```
Internal LLM Process:
┌─────────────────────────────────────┐
│  Token 1: "Hoisting"                │
│  Token 2: "is"                      │
│  Token 3: "a"                       │
│  Token 4: "JavaScript"              │
│  Token 5: "mechanism"               │
│  ...                                │
└─────────────────────────────────────┘

Without Streaming:        With Streaming:
Wait for all tokens       Emit each token immediately
└─→ Buffer → Return       └─→ Callback → Display
```

### The onTextChunk Callback

```
┌────────────────────────────────────┐
│         Model Generation           │
└────────────┬───────────────────────┘
             │
    ┌────────┴─────────┐
    │  Each new token  │
    └────────┬─────────┘
             ↓
    ┌────────────────────┐
    │ onTextChunk(text)  │  ← Your callback
    └────────┬───────────┘
             ↓
    Your code processes it:
    • Display to user
    • Send over network
    • Log to file
    • Analyze content
```
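
In code, the callback is just an option passed alongside the prompt. A minimal sketch, assuming a `session` object with the same `prompt(query, options)` shape used throughout this example:

```
let fullResponse = "";

const answer = await session.prompt("Explain hoisting in JavaScript", {
    maxTokens: 2000,
    onTextChunk(chunk) {
        fullResponse += chunk;        // accumulate for post-processing
        process.stdout.write(chunk);  // surface progress immediately
    }
});

// Once the promise resolves, the response is complete.
```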

## Token Limits: maxTokens

### Why Limit Output?

Without limits, models might generate:
```
User: "Explain hoisting"
Model: [Generates 10,000 words including:
        - Complete JavaScript history
        - Every edge case
        - Unrelated examples
        - Never stops...]
```

With limits:
```
User: "Explain hoisting"
Model: [Generates ~1500 words
        - Core concept
        - Key examples
        - Stops at 2000 tokens]
```

### Token Budgeting

```
Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens

Total used: 2300 tokens
Available: 1796 tokens for future conversation
```
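
The budgeting arithmetic is simple enough to encode directly. A sketch (the helper and its token counts are illustrative; real counts come from your tokenizer):

```
// Hypothetical helper: how many tokens remain for the response?
function responseBudget({ contextSize, systemTokens, historyTokens, promptTokens }) {
    return contextSize - systemTokens - historyTokens - promptTokens;
}

const available = responseBudget({
    contextSize: 4096,
    systemTokens: 200,
    historyTokens: 0,
    promptTokens: 100
});

const maxTokens = Math.min(2000, available); // never request more than fits
```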

### Cost vs Quality

```
Token Limit       Output Quality         Use Case
───────────       ──────────────         ─────────────────
100               Brief, may be cut      Quick answers
500               Concise but complete   Short explanations
2000 (example)    Detailed               Full explanations
No limit          Risk of rambling       When length unknown
```

## Real-Time Applications

### Pattern 1: Interactive CLI

```
User: "Explain closures"
       ↓
Terminal: "A closure is a function..."
          (Appears word by word, like typing)
       ↓
User sees progress, knows it's working
```
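
The terminal version is nearly a one-liner: write each chunk without a trailing newline so the text appears to type itself. A sketch using the same assumed `session`:

```
await session.prompt("Explain closures", {
    maxTokens: 2000,
    onTextChunk: (chunk) => process.stdout.write(chunk) // no newline between chunks
});
process.stdout.write("\n"); // end the line once generation finishes
```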

### Pattern 2: Web Application

```
Browser                     Server
   │                          │
   ├─── Send prompt ─────────→│
   │                          │
   │←── Chunk 1: "Closures" ──┤
   │    (Display immediately) │
   │                          │
   │←── Chunk 2: "are" ───────┤
   │    (Append to display)   │
   │                          │
   │←── Chunk 3: "functions" ─┤
   │    (Keep appending...)   │
```

Implementation:
- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming
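
As one concrete (and hedged) example, here is what the SSE variant might look like with Express; the framework choice and the in-scope `session` object are assumptions, but any server that can flush partial responses follows the same shape:

```
import express from "express";

const app = express();

app.get("/ask", async (req, res) => {
    // SSE headers: keep the connection open and push events as they come.
    res.setHeader("Content-Type", "text/event-stream");
    res.setHeader("Cache-Control", "no-cache");

    await session.prompt(String(req.query.q ?? "Explain closures"), {
        maxTokens: 2000,
        onTextChunk(chunk) {
            // One SSE message per chunk; the browser's EventSource
            // fires a "message" event for each "data: ...\n\n" frame.
            res.write(`data: ${JSON.stringify(chunk)}\n\n`);
        }
    });

    res.end(); // close the stream when generation completes
});

app.listen(3000);
```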

### Pattern 3: Multi-Consumer

```
         onTextChunk(text)
                │
        ┌───────┼───────┐
        ↓       ↓       ↓
    Console  WebSocket  Log File
    Display  → Client   → Storage
```
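
Because the callback is ordinary code, fanning out is just invoking every consumer per chunk. A sketch where `socket` and `logStream` are hypothetical stand-ins for a WebSocket client and a file stream:

```
const consumers = [
    (chunk) => process.stdout.write(chunk), // console display
    (chunk) => socket.send(chunk),          // WebSocket client (assumed)
    (chunk) => logStream.write(chunk)       // log file stream (assumed)
];

await session.prompt(query, {
    maxTokens: 2000,
    onTextChunk: (chunk) => consumers.forEach((consume) => consume(chunk))
});
```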

## Performance Characteristics

### Latency vs Throughput

```
Time to First Token (TTFT):
├─ Small model (1.7B): ~100ms
├─ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms

Tokens Per Second:
├─ Small model: 50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s

User Experience:
TTFT < 500ms → Feels instant
Tok/s > 20 → Reads naturally
```
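
TTFT is easy to measure from the callback itself: record the time when the first chunk arrives. A sketch:

```
const start = Date.now();
let ttft = null;

await session.prompt(query, {
    maxTokens: 2000,
    onTextChunk(chunk) {
        if (ttft === null) {
            ttft = Date.now() - start; // latency to the first visible token
            console.log(`TTFT: ${ttft}ms`);
        }
        process.stdout.write(chunk);
    }
});
```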

### Resource Trade-offs

```
Model Size     Memory    Speed     Quality
──────────     ──────    ─────     ───────
1.7B           ~2GB      Fast      Good
8B             ~6GB      Medium    Better
20B            ~12GB     Slower    Best
```

## Advanced Concepts

### Buffering Strategies

**No Buffer (Immediate)**
```
Every token → callback → display
└─ Smoothest UX but more overhead
```

**Line Buffer**
```
Accumulate until newline → flush
└─ Better for paragraph-based output
```

**Time Buffer**
```
Accumulate for 50ms → flush batch
└─ Reduces callback frequency
```
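
The time-buffer strategy might look like this sketch, where `display` is a hypothetical render function; chunks accumulate and flush at most once every 50ms:

```
let buffer = "";

const timer = setInterval(() => {
    if (buffer.length > 0) {
        display(buffer); // flush the batch (render function assumed)
        buffer = "";
    }
}, 50);

await session.prompt(query, {
    maxTokens: 2000,
    onTextChunk: (chunk) => { buffer += chunk; } // just accumulate
});

clearInterval(timer);
if (buffer.length > 0) display(buffer); // flush whatever is left
```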

### Early Stopping

```
Generation in progress:
"The answer is clearly... wait, actually..."
                         ↑
                  onTextChunk detects issue
                         ↓
                   Stop generation
                         ↓
              "Let me reconsider"
```

Useful for:
- Detecting off-topic responses
- Safety filters
- Relevance checking
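
A sketch of the stop mechanism, assuming the prompt options accept an `AbortSignal` (many streaming APIs do; verify against your library before relying on it). `isOffTopic` is a hypothetical check:

```
const controller = new AbortController();
let seen = "";

try {
    await session.prompt(query, {
        maxTokens: 2000,
        signal: controller.signal, // assumed option
        onTextChunk(chunk) {
            seen += chunk;
            if (isOffTopic(seen)) controller.abort(); // stop mid-stream
        }
    });
} catch (err) {
    // An abort typically rejects the promise; treat it as a
    // deliberate stop, not a failure.
}
```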

### Progressive Enhancement

```
Partial Response Analysis:

┌─────────────────────────────────┐
│ "To implement this feature..."  │
│  ← Already useful information   │
│                                 │
│ "...you'll need: 1) Node.js"    │
│  ← Can start acting on this     │
│                                 │
│ "2) Express framework"          │
└─────────────────────────────────┘

Agent can begin working before response completes!
```
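
One way to act early is to parse the accumulated text for completed items as chunks arrive. A sketch where `startWorkOn` is a hypothetical downstream step:

```
let seen = "";
let handled = 0;

await session.prompt("What do I need to build this feature?", {
    maxTokens: 2000,
    onTextChunk(chunk) {
        seen += chunk;
        // Only items terminated by a newline are complete, e.g. "1) Node.js\n"
        const items = seen.match(/\d+\)\s[^\n]+\n/g) ?? [];
        while (handled < items.length) {
            startWorkOn(items[handled++]); // begin acting on each item early
        }
    }
});
```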

## Context Size Awareness

### Why It Matters

```
┌────────────────────────────────┐
│     Context Window (4096)      │
├────────────────────────────────┤
│ System Prompt        200 tokens│
│ Conversation History 1000      │
│ Current Prompt       100       │
│ Response Space       2796      │
└────────────────────────────────┘

If maxTokens > 2796:
└─→ Error or truncation!
```

### Dynamic Adjustment

```
const available = contextSize - (promptTokens + historyTokens);

if (maxTokens > available) {
    maxTokens = available; // shrink the request to fit
    // ...or trim old history to reclaim space instead
}
```

## Streaming in Agent Architectures

### Simple Agent

```
User → LLM (streaming) → Display
       └─ onTextChunk shows progress
```

### Multi-Step Agent

```
Step 1: Plan (stream) → Show thinking
Step 2: Act (stream) → Show action
Step 3: Result (stream) → Show outcome
       └─ User sees agent's process
```
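
A sketch of the multi-step shape: run each phase as its own streamed prompt through one display, so the user watches the whole process unfold. `task` is a hypothetical input:

```
const phases = [
    ["Plan",   `Outline a plan for: ${task}`],
    ["Act",    `Carry out the plan for: ${task}`],
    ["Result", `Summarize the outcome for: ${task}`]
];

for (const [label, phasePrompt] of phases) {
    process.stdout.write(`\n[${label}] `);
    await session.prompt(phasePrompt, {
        maxTokens: 1000,
        onTextChunk: (chunk) => process.stdout.write(chunk)
    });
}
```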

### Collaborative Agents

```
Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
       └─ Both stream simultaneously
```

## Best Practices

### 1. Always Set maxTokens

```
✓ Good:
session.prompt(query, { maxTokens: 2000 })

✗ Risky:
session.prompt(query)
└─ May use entire context!
```

### 2. Handle Partial Updates

```
let fullResponse = '';
let complete = false;

onTextChunk: (chunk) => {
    fullResponse += chunk; // accumulate the full text
    display(chunk);        // show each piece immediately
    complete = false;      // response is still partial
}

// After generation finishes:
complete = true;
saveToDatabase(fullResponse); // persist only the complete text
```

### 3. Provide Feedback

```
let firstChunk = true;

onTextChunk: (chunk) => {
    if (firstChunk) {
        showLoadingDone(); // first token arrived; hide the spinner
        firstChunk = false;
    }
    appendToDisplay(chunk);
}
```

### 4. Monitor Performance

```
const startTime = Date.now();
let tokenCount = 0;

onTextChunk: (chunk) => {
    // estimateTokens can be as rough as Math.ceil(chunk.length / 4)
    tokenCount += estimateTokens(chunk);
    const elapsed = (Date.now() - startTime) / 1000;
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
}
```

## Key Takeaways

1. **Streaming improves UX**: Users see progress immediately
2. **maxTokens controls cost**: Prevents runaway generation
3. **Token-by-token generation**: LLMs produce one token at a time
4. **onTextChunk callback**: Your hook into the generation process
5. **Context awareness matters**: Monitor available space
6. **Essential for production**: Real-time systems need streaming

## Comparison

```
Feature           intro.js    coding.js (this)
────────────────  ─────────   ────────────────
Streaming         ✗           ✓
Token limit       ✗           ✓ (2000)
Real-time output  ✗           ✓
Progress visible  ✗           ✓
User control      ✗           ✓
```

This pattern is foundational for building responsive, user-friendly AI agent interfaces.