# Code Explanation: coding.js

This file demonstrates **streaming responses** with token limits and real-time output, showing how to get immediate feedback from the LLM as it generates text.

## Step-by-Step Code Breakdown

### 1. Import and Setup (Lines 1-8)
```javascript
import {
    getLlama,
    HarmonyChatWrapper,
    LlamaChatSession,
} from "node-llama-cpp";
import {fileURLToPath} from "url";
import path from "path";

const __dirname = path.dirname(fileURLToPath(import.meta.url));
```
- Standard setup for LLM interaction
- **HarmonyChatWrapper**: A chat format wrapper for models that use the Harmony format (more on this below)

### 2. Understanding the Harmony Chat Format

#### What is Harmony?
Harmony is a structured message format used for multi-role chat interactions designed by OpenAI for their gpt-oss models. It's not just a prompt format - it's a complete rethinking of how models should structure their outputs, especially for complex reasoning and tool use.

#### Harmony Format Structure

The format uses special tokens and syntax to define roles such as `system`, `developer`, `user`, `assistant`, and `tool`, as well as output "channels" (`analysis`, `commentary`, `final`) that let the model reason internally, call tools, and produce clean user-facing responses.

**Basic message structure:**
```
<|start|>ROLE<|message|>CONTENT<|end|>
<|start|>assistant<|channel|>CHANNEL<|message|>CONTENT<|end|>
```

**The five roles in hierarchy order** (system > developer > user > assistant > tool):

1. **system**: Global identity, guardrails, and model configuration
2. **developer**: Product policy and style instructions (what you typically think of as "system prompt")
3. **user**: User messages and queries
4. **assistant**: Model responses
5. **tool**: Tool execution results

**The three output channels:**

1. **analysis**: Private chain-of-thought reasoning not shown to users
2. **commentary**: Tool calling preambles and process updates
3. **final**: Clean user-facing responses

**Example of Harmony in action** (simplified for readability; the full format adds details such as tool-call recipients and dedicated end tokens):
```
<|start|>system<|message|>You are a helpful assistant.<|end|>
<|start|>developer<|message|>Always be concise.<|end|>
<|start|>user<|message|>What time is it?<|end|>
<|start|>assistant<|channel|>commentary<|message|>{"tool_use": {"name": "get_current_time", "arguments": {}}}<|end|>
<|start|>tool<|message|>{"time": "2025-10-25T13:47:00Z"}<|end|>
<|start|>assistant<|channel|>final<|message|>The current time is 1:47 PM UTC.<|end|>
```
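
To make the layout above concrete, here is a minimal sketch of a renderer that turns message objects into the simplified `<|start|>...<|end|>` form used in this section. `renderHarmonyMessage` and the message object shape are hypothetical names chosen for illustration. In practice `HarmonyChatWrapper` does this work for you, and the real format has more detail than this sketch covers.

```javascript
// Hypothetical helper: renders one message in the simplified layout shown above.
// Not the official Harmony renderer; HarmonyChatWrapper handles this in practice.
function renderHarmonyMessage({role, channel, content}) {
    // Only assistant messages carry a channel in this simplified layout
    const channelPart = channel ? `<|channel|>${channel}` : "";
    return `<|start|>${role}${channelPart}<|message|>${content}<|end|>`;
}

const conversation = [
    {role: "system", content: "You are a helpful assistant."},
    {role: "user", content: "What is hoisting?"},
    {role: "assistant", channel: "final", content: "Hoisting moves declarations up."},
];

// One rendered message per line
const rendered = conversation.map((message) => renderHarmonyMessage(message)).join("\n");
console.log(rendered);
```

Keeping the channel optional mirrors the two message shapes in the basic structure: plain role messages and assistant messages tagged with a channel.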

#### Why Use Harmony?

Harmony separates how the model thinks, what actions it takes, and what finally goes to the user, resulting in cleaner tool use, safer defaults for UI, and better observability. For this code-explanation example:

- The `final` channel ensures we only get the answer, not the model's internal reasoning
- The structured format helps the model follow instructions more reliably
- The role hierarchy prevents instruction conflicts
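
The channel separation can be illustrated with a small parser. This is only a sketch that assumes the simplified message layout shown earlier; `extractChannel` is a hypothetical helper, not part of node-llama-cpp, which performs this separation internally.

```javascript
// Hypothetical helper: collects the content of every assistant message on one channel.
// Assumes the simplified <|start|>assistant<|channel|>NAME<|message|>TEXT<|end|> layout.
function extractChannel(raw, channel) {
    const pattern = /<\|start\|>assistant<\|channel\|>(\w+)<\|message\|>([\s\S]*?)<\|end\|>/g;
    const parts = [];
    for (const match of raw.matchAll(pattern)) {
        if (match[1] === channel) parts.push(match[2]);
    }
    return parts.join("\n");
}

const rawOutput =
    "<|start|>assistant<|channel|>analysis<|message|>User wants a definition.<|end|>" +
    "<|start|>assistant<|channel|>final<|message|>Hoisting moves declarations to the top of their scope.<|end|>";

console.log(extractChannel(rawOutput, "final"));
// prints: Hoisting moves declarations to the top of their scope.
```

Only the `final` text would be shown to a user; the `analysis` text stays available for logging or debugging.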

**Important Note**: Models need to be specifically trained or fine-tuned to produce Harmony output correctly. You can't just apply this format to any model: Apertus and other models not explicitly trained on Harmony may be confused by this structure. For models that are trained on it, such as gpt-oss, the `HarmonyChatWrapper` in node-llama-cpp handles the necessary formatting automatically.


### 3. Load Model (Lines 10-18)
```javascript
const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(
        __dirname,
        "../",
        "models",
        "hf_giladgd_gpt-oss-20b.MXFP4.gguf"
    )
});
```
- Uses **gpt-oss-20b**: A 20 billion parameter model
- **MXFP4**: Microscaling 4-bit floating-point quantization, which shrinks the model file and memory footprint
- A larger model generally produces better code explanations

### 4. Create Context and Session (Lines 19-22)
```javascript
const context = await model.createContext();
const session = new LlamaChatSession({
    chatWrapper: new HarmonyChatWrapper(),
    contextSequence: context.getSequence(),
});
```
Basic session setup with no system prompt.

### 5. Define the Question (Line 24)
```javascript
const q1 = `What is hoisting in JavaScript? Explain with examples.`;
```
A technical programming question that requires detailed explanation.

### 6. Display Context Size (Line 26)
```javascript
console.log('context.contextSize', context.contextSize)
```
- Shows the maximum context window size
- Helps understand memory limitations
- Useful for debugging

### 7. Streaming Prompt Execution (Lines 28-36)
```javascript
const a1 = await session.prompt(q1, {
    // Tip: let the lib choose or cap reasonably; using the whole context size can be wasteful
    maxTokens: 2000,

    // Fires as soon as the first characters arrive
    onTextChunk: (text) => {
        process.stdout.write(text); // optional: live print
    },
});
```

**Key parameters:**

**maxTokens: 2000**
- Limits response length to 2000 tokens (~1500 words)
- Prevents runaway generation
- Saves time and compute
- Without a limit, generation continues until the model decides to stop, which can consume the entire context

**onTextChunk callback**
- Fires **as each token is generated**
- Receives text as it's produced
- `process.stdout.write()`: Prints without newlines
- Creates real-time "typing" effect

### How Streaming Works

```
Without streaming:
User → [Wait 10 seconds...] → Complete response appears

With streaming:
User → [Token 1] → [Token 2] → [Token 3] → ... → Complete
       "What"      "is"        "hoisting"
       (Immediate feedback!)
```
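
The callback contract can be tried without loading a model. The sketch below fakes a model with a fixed token list. `fakePrompt` and `tokens` are stand-ins invented for this illustration (a real `session.prompt` call is asynchronous and draws tokens from the model), but the options mirror the call above: a `maxTokens` cap and an `onTextChunk` callback.

```javascript
// Stand-in for a model: emits pre-baked tokens one at a time.
// Hypothetical helper for illustration only; real generation is asynchronous.
function fakePrompt(tokens, {maxTokens = Infinity, onTextChunk} = {}) {
    let response = "";
    let produced = 0;
    for (const token of tokens) {
        if (produced >= maxTokens) break;    // token cap reached: stop early
        if (onTextChunk) onTextChunk(token); // stream the chunk immediately
        response += token;                   // also accumulate the full answer
        produced++;
    }
    return response;
}

const tokens = ["Hoisting", " moves", " declarations", " to", " the", " top", "."];
const chunks = [];
const answer = fakePrompt(tokens, {
    maxTokens: 3,
    onTextChunk: (text) => chunks.push(text),
});

console.log(chunks); // [ 'Hoisting', ' moves', ' declarations' ]
console.log(answer); // Hoisting moves declarations
```

The streamed chunks joined together equal the returned answer, which is why the final `console.log` in the example file repeats text that was already printed live.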

### 8. Display Final Answer (Line 38)
```javascript
console.log("\n\nFinal answer:\n", a1);
```
- Prints the complete response again
- Useful for logging or verification
- Shows full text after streaming

### 9. Cleanup (Lines 41-44)
```javascript
session.dispose()
context.dispose()
model.dispose()
llama.dispose()
```
Standard resource cleanup.

## Key Concepts Demonstrated

### 1. Streaming Responses

**Why streaming matters:**
- **Better UX**: Users see progress immediately
- **Early termination**: Can stop if response is off-track
- **Perceived speed**: Feels faster than waiting
- **Debugging**: See generation in real-time

**Comparison:**
```
Non-streaming:          Streaming:
═══════════════         ═══════════════
Request sent            Request sent
[10s wait...]           "What" (0.1s)
Complete response       "is" (0.2s)
                        "hoisting" (0.3s)
                        ... continues
                        (Same total time, better experience!)
```

### 2. Token Limits

**maxTokens controls generation length:**

```
No limit:               With limit (2000):
─────────               ──────────────────
May generate forever    Stops at 2000 tokens
Uses entire context     Saves computation
Unpredictable cost      Predictable cost
```

**Token approximation:**
- 1 token ≈ 0.75 words (English)
- 2000 tokens ≈ 1500 words
- 4-5 paragraphs of detailed explanation
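
The rule of thumb above can be turned into two small conversion helpers. This is a rough sketch: `tokensToWords` and `wordsToTokens` are hypothetical names, and the 0.75 ratio is only the English approximation quoted here, so actual counts depend on the tokenizer and the text.

```javascript
// Rule-of-thumb ratio from the text above; real ratios vary by tokenizer and language.
const WORDS_PER_TOKEN = 0.75;

// Approximate word count for a token budget
function tokensToWords(tokens) {
    return Math.round(tokens * WORDS_PER_TOKEN);
}

// Approximate token budget needed for a target word count (rounded up to be safe)
function wordsToTokens(words) {
    return Math.ceil(words / WORDS_PER_TOKEN);
}

console.log(tokensToWords(2000)); // 1500
console.log(wordsToTokens(600));  // 800
```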

### 3. Real-Time Feedback Pattern

The `onTextChunk` callback enables:
```javascript
onTextChunk: (text) => {
    // Do anything with each chunk:
    process.stdout.write(text);      // Console output
    // socket.emit('chunk', text);   // WebSocket to client
    // buffer += text;               // Accumulate for processing
    // analyzePartial(text);         // Real-time analysis
}
```

### 4. Context Size Awareness

```javascript
console.log('context.contextSize', context.contextSize)
```

This shows the context window, which acts as the model's working memory. Typical ranges:
- Small models: 2048-4096 tokens
- Medium models: 8192-16384 tokens
- Large models: 32768+ tokens

**Why it matters:**
```
Context Size: 4096 tokens
Prompt: 100 tokens
Max response: 2000 tokens
History: Up to 1996 tokens
```
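
The arithmetic behind that breakdown is a simple subtraction, sketched below. `historyBudget` is a hypothetical helper written for this explanation; it is not part of node-llama-cpp, and the token counts are the illustrative numbers from the breakdown above.

```javascript
// Hypothetical helper: how many tokens remain for chat history after reserving
// room for the current prompt and the capped response.
function historyBudget({contextSize, promptTokens, maxTokens}) {
    const remaining = contextSize - promptTokens - maxTokens;
    if (remaining < 0) {
        throw new RangeError("Prompt and maxTokens exceed the context size");
    }
    return remaining;
}

console.log(historyBudget({contextSize: 4096, promptTokens: 100, maxTokens: 2000})); // 1996
```

Raising `maxTokens` directly shrinks the room left for earlier conversation turns, which is one reason to cap it deliberately.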

## Use Cases

### 1. Code Explanations (This Example)
```javascript
prompt: "Explain hoisting in JavaScript"
// → Streams detailed explanation with examples
```

### 2. Long-Form Content Generation
```javascript
prompt: "Write a blog post about AI agents"
maxTokens: 3000
// → Streams article as it's written
```

### 3. Interactive Tutoring
```javascript
// User sees explanation being built
prompt: "Teach me about closures"
onTextChunk: (text) => displayToUser(text)
```

### 4. Web Applications
```javascript
// Server-Sent Events or WebSocket
onTextChunk: (text) => {
    websocket.send(text);  // Send to browser
}
```
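
For the Server-Sent Events variant, each chunk has to be framed as one or more `data:` lines followed by a blank line. The helper below is a minimal sketch of that framing; `toSseFrame` is a hypothetical name, and `res` in the comment stands for an HTTP response opened with `Content-Type: text/event-stream`.

```javascript
// Hypothetical helper: frames one text chunk as a Server-Sent Events message.
// A chunk containing newlines must be split into one "data:" line per line.
function toSseFrame(text) {
    const dataLines = text.split("\n").map((line) => `data: ${line}`);
    return dataLines.join("\n") + "\n\n"; // the blank line terminates the event
}

// In a real server you might use: onTextChunk: (text) => res.write(toSseFrame(text))
console.log(JSON.stringify(toSseFrame("Hoisting")));       // "data: Hoisting\n\n"
console.log(JSON.stringify(toSseFrame("line 1\nline 2"))); // "data: line 1\ndata: line 2\n\n"
```

Splitting on newlines matters because model output often contains line breaks, and an unframed newline would end the event early.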

## Performance Considerations

### Token Generation Speed

Depends on:
- **Model size**: Larger = slower per token
- **Hardware**: GPU > CPU
- **Quantization**: Lower bits = faster
- **Context length**: Longer context = slower

**Typical speeds:**
```
Model Size    GPU (RTX 4090)    CPU (M2 Max)
──────────    ──────────────    ────────────
1.7B          50-80 tok/s       15-25 tok/s
8B            20-35 tok/s       5-10 tok/s
20B           10-15 tok/s       2-4 tok/s
```
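
These speeds translate directly into waiting time. The sketch below estimates how long the full 2000-token budget from this example would take at the 20B GPU speeds listed in the table; `estimateSeconds` is a hypothetical helper, and the result is only as accurate as the table's rough figures.

```javascript
// Rough estimate: seconds needed to generate `tokens` at a given speed.
function estimateSeconds(tokens, tokensPerSecond) {
    return tokens / tokensPerSecond;
}

// 20B model on the GPU row of the table (10-15 tok/s), full 2000-token budget
const slow = estimateSeconds(2000, 10); // 200
const fast = estimateSeconds(2000, 15); // about 133.3
console.log(`${Math.round(fast)}-${Math.round(slow)} seconds`); // 133-200 seconds
```

A wait of two to three minutes with no output is a strong argument for streaming the response instead of blocking until it completes.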

### When to Use maxTokens

```
✓ Use maxTokens when:
  • Response length is predictable
  • You want to save computation
  • Testing/debugging
  • API rate limiting

✗ Don't limit when:
  • Need complete answer
  • Length varies greatly
  • Using stop sequences instead
```

## Advanced Streaming Patterns

### Pattern 1: Progressive Enhancement
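```javascript
let buffer = '';
await session.prompt(q1, {
    onTextChunk: (text) => {
        buffer += text;
        // Hand off every complete paragraph; keep the unfinished tail buffered
        const paragraphs = buffer.split('\n\n');
        buffer = paragraphs.pop();
        for (const paragraph of paragraphs) processParagraph(paragraph);
    },
});
if (buffer) processParagraph(buffer); // flush the last paragraph
```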

### Pattern 2: Early Stopping
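```javascript
// Assumes node-llama-cpp's `signal` and `stopOnAbortSignal` prompt options
const controller = new AbortController();
let seen = '';
await session.prompt(q1, {
    signal: controller.signal,
    stopOnAbortSignal: true, // return the partial text instead of throwing
    onTextChunk: (text) => {
        seen += text; // the keyword may span several chunks
        if (seen.includes('irrelevant_keyword')) {
            controller.abort(); // stop generation
        }
    },
});
```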

### Pattern 3: Multi-Consumer
```javascript
onTextChunk: (text) => {
    process.stdout.write(text);  // Console (no newline after each chunk)
    logFile.write(text);         // File
    websocket.send(text);        // Client
    analyzer.process(text);      // Analysis
}
```
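
One weakness of calling every consumer inline is that a single failure (for example a closed socket) aborts the rest. A possible way around this, sketched below, is a small fan-out helper that isolates each consumer. `fanOut` is a hypothetical helper, and the consumers here are simple stand-ins for the console, file, and socket targets above.

```javascript
// Hypothetical helper: combines several consumers into one onTextChunk callback.
// A failing consumer is reported but does not stop the others.
function fanOut(...consumers) {
    return (text) => {
        for (const consume of consumers) {
            try {
                consume(text);
            } catch (error) {
                console.error("Consumer failed:", error.message);
            }
        }
    };
}

let transcript = "";
let chunkCount = 0;
const onTextChunk = fanOut(
    (text) => { transcript += text; },           // accumulate the text
    () => { throw new Error("socket closed"); }, // simulated broken consumer
    () => { chunkCount++; }                      // still runs after the failure
);

["Hoisting", " is", " fun"].forEach((chunk) => onTextChunk(chunk));
console.log(transcript, chunkCount); // Hoisting is fun 3
```

The returned function can be passed directly as the `onTextChunk` option.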

## Expected Output

When run, you'll see:
1. Context size logged (e.g., "context.contextSize 32768")
2. Streaming response appearing token-by-token
3. Complete final answer printed again

Example output flow:
```
context.contextSize 32768
Hoisting is a JavaScript mechanism where variables and function
declarations are moved to the top of their scope before code
execution. For example:

console.log(x); // undefined (not an error!)
var x = 5;

This works because...
[continues streaming...]

Final answer:
[Complete response printed again]
```

## Why This Matters for AI Agents

### User Experience
- Real-time agents feel more responsive
- Users can interrupt if the response is heading in the wrong direction
- Better for conversational interfaces

### Resource Management
- Token limits prevent runaway generation
- Predictable costs and timing
- Can cancel expensive operations early

### Integration Patterns
- Web UIs show "typing" effect
- CLIs display progressive output
- APIs stream to clients efficiently

This pattern is essential for production agent systems where user experience and resource control matter.