# Code Explanation: batch.js

This file demonstrates **parallel execution** of multiple LLM prompts using separate context sequences, enabling concurrent processing for better performance.

## Step-by-Step Code Breakdown

### 1. Import and Setup (Lines 1-10)
```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";
import path from "path";
import {fileURLToPath} from "url";

/**
 * Asynchronous execution improves performance in GAIA benchmarks,
 * multi-agent applications, and other high-throughput scenarios.
 */

const __dirname = path.dirname(fileURLToPath(import.meta.url));
```
- Standard imports for LLM interaction
- Comment explains the performance benefit
- **GAIA benchmark**: A standard for testing AI agent performance
- Useful for multi-agent systems that need to handle many requests

### 2. Model Path Configuration (Lines 11-16)
```javascript
const modelPath = path.join(
    __dirname,
    "../",
    "models",
    "DeepSeek-R1-0528-Qwen3-8B-Q6_K.gguf"
);
```
- Uses **DeepSeek-R1-0528-Qwen3-8B**: an 8B-parameter distillation of DeepSeek-R1 onto Qwen3, optimized for reasoning
- **Q6_K quantization**: a balance between quality and size

- Model is loaded once and shared between sequences



### 3. Initialize Llama and Load Model (Lines 18-19)

```javascript
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
```

- Standard initialization

- Model is loaded into memory once

- Will be used by multiple sequences simultaneously



### 4. Create Context with Multiple Sequences (Lines 20-23)

```javascript
const context = await model.createContext({
    sequences: 2,
    batchSize: 1024 // The number of tokens that can be processed at once by the GPU.
});
```



**Key parameters:**



- **sequences: 2**: Creates 2 independent conversation sequences

  - Each sequence has its own conversation history

  - Both share the same model and context memory pool

  - Can be processed in parallel



- **batchSize: 1024**: Maximum tokens processed per GPU batch

  - Larger = better GPU utilization

  - Smaller = lower memory usage

  - 1024 is a good balance for most GPUs



### Why Multiple Sequences?



```
Single Sequence (Sequential)     Multiple Sequences (Parallel)
─────────────────────────       ──────────────────────────────
Process Prompt 1 β†’ Response 1    Process Prompt 1 ──┐
Wait...                                             β”œβ†’ Both responses
Process Prompt 2 β†’ Response 2    Process Prompt 2 β”€β”€β”˜   in parallel!

Total Time: T1 + T2              Total Time: max(T1, T2)
```



### 5. Get Individual Sequences (Lines 25-26)

```javascript
const sequence1 = context.getSequence();
const sequence2 = context.getSequence();
```

- Retrieves two separate sequence objects from the context

- Each sequence maintains its own state

- They can be used independently for different conversations



### 6. Create Separate Sessions (Lines 28-33)

```javascript
const session1 = new LlamaChatSession({
    contextSequence: sequence1
});
const session2 = new LlamaChatSession({
    contextSequence: sequence2
});
```

- Creates a chat session for each sequence

- Each session has its own conversation history

- Sessions are completely independent

- No system prompts in this example (could be added)



### 7. Define Questions (Lines 35-36)

```javascript
const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";
```

- Two completely different questions

- Will be processed simultaneously

- Different types: conversational vs. computational



### 8. Parallel Execution with Promise.all (Lines 38-44)

```javascript
const [
    a1,
    a2
] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```



**How this works:**



1. `session1.prompt(q1)` starts asynchronously

2. `session2.prompt(q2)` starts asynchronously (doesn't wait for #1)

3. `Promise.all()` waits for BOTH to complete

4. Returns results in array: [response1, response2]

5. Destructures into `a1` and `a2`



**Key benefit**: Both prompts are processed at the same time, not one after another!
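The timing behavior can be seen without a model at all. The sketch below is plain JavaScript with no node-llama-cpp: `mockPrompt` is a hypothetical stand-in for `session.prompt` that "answers" after a delay, showing that two overlapping 100 ms calls finish in roughly the time of one.

```javascript
// mockPrompt stands in for session.prompt: it resolves with an answer after ms milliseconds.
const mockPrompt = (answer, ms) =>
    new Promise((resolve) => setTimeout(() => resolve(answer), ms));

async function runParallel() {
    const start = Date.now();
    // Both mock prompts start immediately; Promise.all waits for both.
    const [a1, a2] = await Promise.all([
        mockPrompt("Hello!", 100),
        mockPrompt("12", 100)
    ]);
    // The two 100 ms delays overlap, so elapsed is ~100 ms, not ~200 ms.
    return {a1, a2, elapsed: Date.now() - start};
}

runParallel().then(({a1, a2, elapsed}) => {
    console.log(a1, a2, elapsed + "ms");
});
```

The same shape applies with real sessions: swap `mockPrompt(...)` for `session1.prompt(q1)` and `session2.prompt(q2)`.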



### 9. Display Results (Lines 46-50)

```javascript
console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);
```

- Outputs both question-answer pairs

- Results appear in order despite parallel processing



## Key Concepts Demonstrated



### 1. Parallel Processing

Instead of:

```javascript
// Sequential (slow)
const a1 = await session1.prompt(q1);  // Wait
const a2 = await session2.prompt(q2);  // Wait again
```



We use:

```javascript
// Parallel (fast)
const [a1, a2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```



### 2. Context Sequences

A context can hold multiple independent sequences:



```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          Context (Shared)           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Model Weights (8B params)    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Sequence 1  β”‚  β”‚ Sequence 2  β”‚  β”‚
β”‚  β”‚ "Hi there"  β”‚  β”‚ "6+6?"      β”‚  β”‚
β”‚  β”‚ History...  β”‚  β”‚ History...  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```



## Performance Comparison



### Sequential Execution

```
Request 1: 2 seconds
Request 2: 2 seconds
Total: 4 seconds
```



### Parallel Execution (This Example)

```
Request 1: 2 seconds ──┐
Request 2: 2 seconds β”€β”€β”€ Both running
Total: ~2 seconds      └─ simultaneously
```



**Speedup**: up to ~2x with 2 sequences; it scales with more sequences until memory or compute limits are reached



## Use Cases



### 1. Multi-User Applications

```javascript
// Handle multiple users simultaneously
const [user1Response, user2Response, user3Response] = await Promise.all([
    session1.prompt(user1Query),
    session2.prompt(user2Query),
    session3.prompt(user3Query)
]);
```



### 2. Multi-Agent Systems

```javascript
// Multiple agents working on different tasks
const [
    plannerResponse,
    analyzerResponse,
    executorResponse
] = await Promise.all([
    plannerSession.prompt("Plan the task"),
    analyzerSession.prompt("Analyze the data"),
    executorSession.prompt("Execute step 1")
]);
```



### 3. Benchmarking

```javascript
// Test multiple prompts for evaluation.
// Each prompt needs its own session (backed by its own sequence);
// prompts on a single session share history and run one at a time.
const results = await Promise.all(
    testPrompts.map((prompt, i) => sessions[i].prompt(prompt))
);
```



### 4. A/B Testing

```javascript
// Test different system prompts
const [responseA, responseB] = await Promise.all([
    sessionWithPromptA.prompt(query),
    sessionWithPromptB.prompt(query)
]);
```



## Resource Considerations



### Memory Usage

Each sequence needs memory for:

- Conversation history

- Intermediate computations

- KV cache (key-value cache for transformer attention)



**Rule of thumb**: More sequences = more memory needed



### GPU Utilization

- **Single sequence**: May underutilize GPU

- **Multiple sequences**: Better GPU utilization

- **Too many sequences**: May exceed VRAM, causing slowdown



### Optimal Number of Sequences

Depends on:

- Available VRAM

- Model size

- Context length

- Batch size



**Typical**: 2-8 sequences for consumer GPUs
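As a rough guide, the per-sequence memory cost can be estimated from the model's architecture, since the KV cache dominates it. The sketch below is a back-of-the-envelope estimator; the layer, head, and dimension values are illustrative assumptions for an 8B-class model, not measured values for DeepSeek-R1, and real usage also includes activations and allocator overhead.

```javascript
// KV cache size β‰ˆ 2 (K and V) Γ— layers Γ— tokens Γ— KV heads Γ— head dim Γ— bytes/element.
function kvCacheBytes({layers, kvHeads, headDim, tokens, bytesPerElem = 2}) {
    return 2 * layers * tokens * kvHeads * headDim * bytesPerElem;
}

// Hypothetical 8B-class config: 36 layers, 8 KV heads, head dim 128, fp16 cache.
const perSequence = kvCacheBytes({layers: 36, kvHeads: 8, headDim: 128, tokens: 4096});
console.log((perSequence / 1024 ** 2).toFixed(0) + " MiB per 4096-token sequence"); // 576 MiB
```

Each additional sequence of the same length adds roughly this much again, which is why the practical sequence count is bounded by VRAM.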



## Limitations & Considerations



### 1. Shared Context Limit

All sequences share the same context memory pool:

```
Total context size: 8192 tokens
Sequence 1: up to 4096 tokens
Sequence 2: up to 4096 tokens
(an even split of the shared pool)
```



### 2. Not True Parallelism for CPU

On CPU-only systems, sequences are interleaved, not truly parallel. Still provides better overall throughput.



### 3. Model Loading Overhead

The model is loaded once and shared, which is efficient. But initial loading still takes time.



## Why This Matters for AI Agents



### Efficiency in Production

Real-world agent systems need to:

- Handle multiple requests concurrently

- Respond quickly to users

- Make efficient use of hardware



### Multi-Agent Architectures

Complex agent systems often have:

- **Planner agent**: Thinks about strategy

- **Executor agent**: Takes actions

- **Critic agent**: Evaluates results



These can run in parallel using separate sequences.



### Scalability

This pattern is the foundation for:

- Web services with multiple users

- Batch processing of data

- Distributed agent systems



## Best Practices



1. **Match sequences to workload**: Don't create more than you need

2. **Monitor memory usage**: Each sequence consumes VRAM

3. **Use appropriate batch size**: Balance speed vs. memory

4. **Clean up resources**: Always dispose when done

5. **Handle errors**: Wrap Promise.all in try-catch
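Practices 4 and 5 can be sketched together. The sessions below are mocks (plain objects standing in for `LlamaChatSession` instances, so no model is needed), and `Promise.allSettled` is used instead of a bare try-catch around `Promise.all` so that one failed prompt doesn't discard the other answers; with the real library, you would also dispose of the context and model when finished.

```javascript
// makeMockSession builds a stand-in for a LlamaChatSession.
const makeMockSession = (reply, shouldFail = false) => ({
    prompt: async () => {
        if (shouldFail) throw new Error("generation failed");
        return reply;
    }
});

// Promise.allSettled keeps the successful answers even if one prompt fails,
// unlike Promise.all, which rejects as soon as any prompt rejects.
async function promptAll(sessions, questions) {
    const settled = await Promise.allSettled(
        sessions.map((session, i) => session.prompt(questions[i]))
    );
    return settled.map((result) =>
        result.status === "fulfilled"
            ? result.value
            : `[error: ${result.reason.message}]`
    );
}

const sessions = [makeMockSession("Hello!"), makeMockSession("", true)];
promptAll(sessions, ["Hi there", "6+6?"]).then((answers) => {
    // answers: "Hello!" and "[error: generation failed]"
    console.log(answers);
});
```

Swapping the mocks for real sessions leaves `promptAll` unchanged, since it only relies on `prompt()` returning a promise.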



## Expected Output



Running this script should output something like:

```
User: Hi there, how are you?
AI: Hello! I'm doing well, thank you for asking...

User: How much is 6+6?
AI: 12
```



Both responses appear quickly because they were processed simultaneously!