# Code Explanation: coding.js
This file demonstrates **streaming responses** with token limits and real-time output, showing how to get immediate feedback from the LLM as it generates text.
## Step-by-Step Code Breakdown
### 1. Import and Setup (Lines 1-8)
```javascript
import {
getLlama,
HarmonyChatWrapper,
LlamaChatSession,
} from "node-llama-cpp";
import {fileURLToPath} from "url";
import path from "path";
const __dirname = path.dirname(fileURLToPath(import.meta.url));
```
- Standard setup for LLM interaction
- **HarmonyChatWrapper**: A chat format wrapper for models that use the Harmony format (more on this below)
### 2. Understanding the Harmony Chat Format
#### What is Harmony?
Harmony is a structured message format used for multi-role chat interactions designed by OpenAI for their gpt-oss models. It's not just a prompt format - it's a complete rethinking of how models should structure their outputs, especially for complex reasoning and tool use.
#### Harmony Format Structure
The format uses special tokens and syntax to define roles such as `system`, `developer`, `user`, `assistant`, and `tool`, as well as output "channels" (`analysis`, `commentary`, `final`) that let the model reason internally, call tools, and produce clean user-facing responses.
**Basic message structure:**
```
<|start|>ROLE<|message|>CONTENT<|end|>
<|start|>assistant<|channel|>CHANNEL<|message|>CONTENT<|end|>
```
**The five roles in hierarchy order** (system > developer > user > assistant > tool):
1. **system**: Global identity, guardrails, and model configuration
2. **developer**: Product policy and style instructions (what you typically think of as "system prompt")
3. **user**: User messages and queries
4. **assistant**: Model responses
5. **tool**: Tool execution results
**The three output channels:**
1. **analysis**: Private chain-of-thought reasoning not shown to users
2. **commentary**: Tool calling preambles and process updates
3. **final**: Clean user-facing responses
**Example of Harmony in action:**
```
<|start|>system<|message|>You are a helpful assistant.<|end|>
<|start|>developer<|message|>Always be concise.<|end|>
<|start|>user<|message|>What time is it?<|end|>
<|start|>assistant<|channel|>commentary<|message|>{"tool_use": {"name": "get_current_time", "arguments": {}}}<|end|>
<|start|>tool<|message|>{"time": "2025-10-25T13:47:00Z"}<|end|>
<|start|>assistant<|channel|>final<|message|>The current time is 1:47 PM UTC.<|end|>
```
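For illustration, the message structure above can be sketched as a tiny formatter (a hypothetical helper — in practice, `HarmonyChatWrapper` produces this formatting for you):

```javascript
// Hypothetical helper that renders one Harmony message using the
// structure shown above. For illustration only — do not hand-build
// prompts like this in real code; the chat wrapper does it for you.
function renderHarmonyMessage(role, content, channel) {
    // Channels only apply to assistant messages
    const channelPart = channel ? `<|channel|>${channel}` : "";
    return `<|start|>${role}${channelPart}<|message|>${content}<|end|>`;
}

console.log(renderHarmonyMessage("user", "What time is it?"));
// <|start|>user<|message|>What time is it?<|end|>
console.log(renderHarmonyMessage("assistant", "It is 1:47 PM UTC.", "final"));
// <|start|>assistant<|channel|>final<|message|>It is 1:47 PM UTC.<|end|>
```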
#### Why Use Harmony?
Harmony separates how the model thinks, what actions it takes, and what finally goes to the user, resulting in cleaner tool use, safer defaults for UI, and better observability. For this code-explanation example:
- The `final` channel ensures we only get the answer itself, not the model's private reasoning
- The structured format helps the model follow instructions more reliably
- The role hierarchy prevents instruction conflicts
**Important Note**: Models need to be specifically trained or fine-tuned to produce Harmony output correctly; you can't just apply this format to any model. Models that weren't trained on Harmony (Apertus, for example) may be confused by the structure. For models that do speak Harmony, like gpt-oss, the HarmonyChatWrapper in node-llama-cpp handles the necessary formatting automatically.
### 3. Load Model (Lines 10-18)
```javascript
const llama = await getLlama();
const model = await llama.loadModel({
modelPath: path.join(
__dirname,
"../",
"models",
"hf_giladgd_gpt-oss-20b.MXFP4.gguf"
)
});
```
- Uses **gpt-oss-20b**: A 20 billion parameter model
- **MXFP4**: Mixed precision 4-bit quantization for smaller size
- Larger model = better code explanations
### 4. Create Context and Session (Lines 19-22)
```javascript
const context = await model.createContext();
const session = new LlamaChatSession({
chatWrapper: new HarmonyChatWrapper(),
contextSequence: context.getSequence(),
});
```
Basic session setup with no system prompt.
### 5. Define the Question (Line 24)
```javascript
const q1 = `What is hoisting in JavaScript? Explain with examples.`;
```
A technical programming question that requires detailed explanation.
### 6. Display Context Size (Line 26)
```javascript
console.log('context.contextSize', context.contextSize)
```
- Shows the maximum context window size
- Helps understand memory limitations
- Useful for debugging
### 7. Streaming Prompt Execution (Lines 28-36)
```javascript
const a1 = await session.prompt(q1, {
// Tip: let the lib choose or cap reasonably; using the whole context size can be wasteful
maxTokens: 2000,
// Fires as soon as the first characters arrive
onTextChunk: (text) => {
process.stdout.write(text); // optional: live print
},
});
```
**Key parameters:**
**maxTokens: 2000**
- Limits response length to 2000 tokens (~1500 words)
- Prevents runaway generation
- Saves time and compute
- Without a limit, generation can continue until the context window is full
**onTextChunk callback**
- Fires **as each token is generated**
- Receives text as it's produced
- `process.stdout.write()`: Prints without newlines
- Creates real-time "typing" effect
### How Streaming Works
```
Without streaming:
User → [Wait 10 seconds...] → Complete response appears

With streaming:
User → [Token 1] → [Token 2] → [Token 3] → ... → Complete
        "What"      "is"        "hoisting"

(Immediate feedback!)
```
### 8. Display Final Answer (Line 38)
```javascript
console.log("\n\nFinal answer:\n", a1);
```
- Prints the complete response again
- Useful for logging or verification
- Shows full text after streaming
### 9. Cleanup (Lines 41-44)
```javascript
session.dispose()
context.dispose()
model.dispose()
llama.dispose()
```
Standard resource cleanup.
## Key Concepts Demonstrated
### 1. Streaming Responses
**Why streaming matters:**
- **Better UX**: Users see progress immediately
- **Early termination**: Can stop if response is off-track
- **Perceived speed**: Feels faster than waiting
- **Debugging**: See generation in real-time
**Comparison:**
```
Non-streaming:            Streaming:
──────────────            ──────────────
Request sent              Request sent
[10s wait...]             "What"      (0.1s)
Complete response         "is"        (0.2s)
                          "hoisting"  (0.3s)
                          ... continues

(Same total time, better experience!)
```
### 2. Token Limits
**maxTokens controls generation length:**
```
No limit:                  With limit (2000):
─────────                  ─────────────────
May generate forever       Stops at 2000 tokens
Uses entire context        Saves computation
Unpredictable cost         Predictable cost
```
**Token approximation:**
- 1 token ≈ 0.75 words (English)
- 2000 tokens ≈ 1500 words
- 4-5 paragraphs of detailed explanation
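The rule of thumb above is easy to encode (a rough heuristic, not a tokenizer — actual counts vary by model and language):

```javascript
// Rough word-count estimate from a token budget, using the
// ~0.75 words-per-token rule of thumb for English text.
function estimateWords(tokens, wordsPerToken = 0.75) {
    return Math.round(tokens * wordsPerToken);
}

console.log(estimateWords(2000)); // 1500
```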
### 3. Real-Time Feedback Pattern
The `onTextChunk` callback enables:
```javascript
onTextChunk: (text) => {
// Do anything with each chunk:
process.stdout.write(text); // Console output
// socket.emit('chunk', text); // WebSocket to client
// buffer += text; // Accumulate for processing
// analyzePartial(text); // Real-time analysis
}
```
### 4. Context Size Awareness
```javascript
console.log('context.contextSize', context.contextSize)
```
Shows model's memory capacity:
- Small models: 2048-4096 tokens
- Medium models: 8192-16384 tokens
- Large models: 32768+ tokens
**Why it matters:**
```
Context Size: 4096 tokens
Prompt: 100 tokens
Max response: 2000 tokens
History: Up to 1996 tokens
```
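The budget arithmetic above can be sketched as a helper. The token counts here are illustrative; for real numbers, use the model's tokenizer (e.g. `model.tokenize(text).length` in node-llama-cpp):

```javascript
// Sketch of a context budget check: given a context size, prompt
// length, and maxTokens, how much room is left for chat history?
function historyBudget(contextSize, promptTokens, maxResponseTokens) {
    return contextSize - promptTokens - maxResponseTokens;
}

console.log(historyBudget(4096, 100, 2000)); // 1996
```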
## Use Cases
### 1. Code Explanations (This Example)
```javascript
prompt: "Explain hoisting in JavaScript"
// → Streams detailed explanation with examples
```
### 2. Long-Form Content Generation
```javascript
prompt: "Write a blog post about AI agents"
maxTokens: 3000
// → Streams article as it's written
```
### 3. Interactive Tutoring
```javascript
// User sees explanation being built
prompt: "Teach me about closures"
onTextChunk: (text) => displayToUser(text)
```
### 4. Web Applications
```javascript
// Server-Sent Events or WebSocket
onTextChunk: (text) => {
websocket.send(text); // Send to browser
}
```
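For a browser client, each chunk can also be framed as a Server-Sent Event. A minimal framing helper (the handler wiring is a sketch; field names follow the SSE specification):

```javascript
// Format one text chunk as a Server-Sent Events frame.
// Newlines inside the chunk must be split across multiple
// "data:" lines, per the SSE specification.
function toSSEFrame(text) {
    const dataLines = text.split("\n").map((line) => `data: ${line}`);
    return dataLines.join("\n") + "\n\n";
}

// Inside an HTTP handler with "Content-Type: text/event-stream":
// onTextChunk: (text) => res.write(toSSEFrame(text))
console.log(JSON.stringify(toSSEFrame("Hello")));
// "data: Hello\n\n"
```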
## Performance Considerations
### Token Generation Speed
Depends on:
- **Model size**: Larger = slower per token
- **Hardware**: GPU > CPU
- **Quantization**: Lower bits = faster
- **Context length**: Longer context = slower
**Typical speeds:**
```
Model Size   GPU (RTX 4090)   CPU (M2 Max)
──────────   ──────────────   ────────────
1.7B         50-80 tok/s      15-25 tok/s
8B           20-35 tok/s      5-10 tok/s
20B          10-15 tok/s      2-4 tok/s
```
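You can measure throughput yourself by counting tokens during streaming. A sketch, assuming node-llama-cpp's `onToken` callback (which receives an array of generated tokens — verify the option against your library version):

```javascript
// Compute tokens/second from a token count and elapsed time.
function tokensPerSecond(tokenCount, elapsedMs) {
    return tokenCount / (elapsedMs / 1000);
}

// Usage sketch with the prompt call from the example above:
// const start = Date.now();
// let tokens = 0;
// await session.prompt(q1, {
//     maxTokens: 2000,
//     onToken: (t) => { tokens += t.length; },
// });
// console.log(tokensPerSecond(tokens, Date.now() - start).toFixed(1), "tok/s");

console.log(tokensPerSecond(300, 10000)); // 30 tokens/second
```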
### When to Use maxTokens
```
✅ Use maxTokens when:
  • Response length is predictable
  • You want to save computation
  • Testing/debugging
  • API rate limiting

❌ Don't limit when:
  • You need the complete answer
  • Length varies greatly
  • You're using stop sequences instead
```
## Advanced Streaming Patterns
### Pattern 1: Progressive Enhancement
```javascript
let buffer = '';
onTextChunk: (text) => {
buffer += text;
if (buffer.includes('\n\n')) {
// Complete paragraph ready
processParagraph(buffer);
buffer = '';
}
}
```
### Pattern 2: Early Stopping
```javascript
let isRelevant = true;
onTextChunk: (text) => {
if (text.includes('irrelevant_keyword')) {
isRelevant = false;
// Stop generation (would need additional API)
}
}
```
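Pattern 2 can be made concrete with an `AbortController`: node-llama-cpp's `prompt()` accepts a `signal` option (and `stopOnAbortSignal: true` to keep the partial text instead of throwing) — check the library docs for your version. The keyword-watching callback itself is easy to isolate and test:

```javascript
// Build an onTextChunk callback that aborts generation once a
// keyword appears in the accumulated output. The controller's
// signal would be passed to session.prompt(...) alongside it.
function makeKeywordStopper(controller, keyword) {
    let seen = "";
    return (text) => {
        seen += text;
        if (seen.includes(keyword)) {
            controller.abort(); // request early termination
        }
    };
}

// Usage sketch (option names assumed from node-llama-cpp v3):
// const controller = new AbortController();
// await session.prompt(q1, {
//     signal: controller.signal,
//     stopOnAbortSignal: true, // keep partial output instead of throwing
//     onTextChunk: makeKeywordStopper(controller, "irrelevant_keyword"),
// });
```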
### Pattern 3: Multi-Consumer
```javascript
onTextChunk: (text) => {
console.log(text); // Console
logFile.write(text); // File
websocket.send(text); // Client
analyzer.process(text); // Analysis
}
```
## Expected Output
When run, you'll see:
1. Context size logged (e.g., "context.contextSize 32768")
2. Streaming response appearing token-by-token
3. Complete final answer printed again
Example output flow:
```
context.contextSize 32768
Hoisting is a JavaScript mechanism where variables and function
declarations are moved to the top of their scope before code
execution. For example:
console.log(x); // undefined (not an error!)
var x = 5;
This works because...
[continues streaming...]
Final answer:
[Complete response printed again]
```
## Why This Matters for AI Agents
### User Experience
- Real-time agents feel more responsive
- Users can interrupt if going wrong direction
- Better for conversational interfaces
### Resource Management
- Token limits prevent runaway generation
- Predictable costs and timing
- Can cancel expensive operations early
### Integration Patterns
- Web UIs show "typing" effect
- CLIs display progressive output
- APIs stream to clients efficiently
This pattern is essential for production agent systems where user experience and resource control matter.