lenzcom's picture
Upload folder using huggingface_hub
e706de2 verified
# Code Explanation: batch.js
This file demonstrates **parallel execution** of multiple LLM prompts using separate context sequences, enabling concurrent processing for better performance.
## Step-by-Step Code Breakdown
### 1. Import and Setup (Lines 1-10)
```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";
import path from "path";
import {fileURLToPath} from "url";
/**
* Asynchronous execution improves performance in GAIA benchmarks,
* multi-agent applications, and other high-throughput scenarios.
*/
const __dirname = path.dirname(fileURLToPath(import.meta.url));
```
- Standard imports for LLM interaction
- Comment explains the performance benefit
- **GAIA benchmark**: A standard for testing AI agent performance
- Useful for multi-agent systems that need to handle many requests
### 2. Model Path Configuration (Lines 11-16)
```javascript
const modelPath = path.join(
__dirname,
"../",
"models",
"DeepSeek-R1-0528-Qwen3-8B-Q6_K.gguf"
)
```
- Uses **DeepSeek-R1**: An 8B parameter model optimized for reasoning
- **Q6_K quantization**: Balance between quality and size
- Model is loaded once and shared between sequences
### 3. Initialize Llama and Load Model (Lines 18-19)
```javascript
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
```
- Standard initialization
- Model is loaded into memory once
- Will be used by multiple sequences simultaneously
### 4. Create Context with Multiple Sequences (Lines 20-23)
```javascript
const context = await model.createContext({
sequences: 2,
batchSize: 1024 // The number of tokens that can be processed at once by the GPU.
});
```
**Key parameters:**
- **sequences: 2**: Creates 2 independent conversation sequences
- Each sequence has its own conversation history
- Both share the same model and context memory pool
- Can be processed in parallel
- **batchSize: 1024**: Maximum tokens processed per GPU batch
- Larger = better GPU utilization
- Smaller = lower memory usage
- 1024 is a good balance for most GPUs
### Why Multiple Sequences?
```
Single Sequence (Sequential) Multiple Sequences (Parallel)
───────────────────────── ──────────────────────────────
Process Prompt 1 β†’ Response 1 Process Prompt 1 ──┐
Wait... β”œβ†’ Both responses
Process Prompt 2 β†’ Response 2 Process Prompt 2 β”€β”€β”˜ in parallel!
Total Time: T1 + T2 Total Time: max(T1, T2)
```
### 5. Get Individual Sequences (Lines 25-26)
```javascript
const sequence1 = context.getSequence();
const sequence2 = context.getSequence();
```
- Retrieves two separate sequence objects from the context
- Each sequence maintains its own state
- They can be used independently for different conversations
### 6. Create Separate Sessions (Lines 28-33)
```javascript
const session1 = new LlamaChatSession({
contextSequence: sequence1
});
const session2 = new LlamaChatSession({
contextSequence: sequence2
});
```
- Creates a chat session for each sequence
- Each session has its own conversation history
- Sessions are completely independent
- No system prompts in this example (could be added)
### 7. Define Questions (Lines 35-36)
```javascript
const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";
```
- Two completely different questions
- Will be processed simultaneously
- Different types: conversational vs. computational
### 8. Parallel Execution with Promise.all (Lines 38-44)
```javascript
const [
a1,
a2
] = await Promise.all([
session1.prompt(q1),
session2.prompt(q2)
]);
```
**How this works:**
1. `session1.prompt(q1)` starts asynchronously
2. `session2.prompt(q2)` starts asynchronously (doesn't wait for #1)
3. `Promise.all()` waits for BOTH to complete
4. Returns results in array: [response1, response2]
5. Destructures into `a1` and `a2`
**Key benefit**: Both prompts are processed at the same time, not one after another!
### 9. Display Results (Lines 46-50)
```javascript
console.log("User: " + q1);
console.log("AI: " + a1);
console.log("User: " + q2);
console.log("AI: " + a2);
```
- Outputs both question-answer pairs
- Results appear in order despite parallel processing
## Key Concepts Demonstrated
### 1. Parallel Processing
Instead of:
```javascript
// Sequential (slow)
const a1 = await session1.prompt(q1); // Wait
const a2 = await session2.prompt(q2); // Wait again
```
We use:
```javascript
// Parallel (fast)
const [a1, a2] = await Promise.all([
session1.prompt(q1),
session2.prompt(q2)
]);
```
### 2. Context Sequences
A context can hold multiple independent sequences:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Context (Shared) β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Model Weights (8B params) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Sequence 1 β”‚ β”‚ Sequence 2 β”‚ β”‚
β”‚ β”‚ "Hi there" β”‚ β”‚ "6+6?" β”‚ β”‚
β”‚ β”‚ History... β”‚ β”‚ History... β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Performance Comparison
### Sequential Execution
```
Request 1: 2 seconds
Request 2: 2 seconds
Total: 4 seconds
```
### Parallel Execution (This Example)
```
Request 1: 2 seconds ──┐
Request 2: 2 seconds ─── Both running
Total: ~2 seconds └─ simultaneously
```
**Speedup**: ~2x for 2 sequences, scales with more sequences
## Use Cases
### 1. Multi-User Applications
```javascript
// Handle multiple users simultaneously
const [user1Response, user2Response, user3Response] = await Promise.all([
session1.prompt(user1Query),
session2.prompt(user2Query),
session3.prompt(user3Query)
]);
```
### 2. Multi-Agent Systems
```javascript
// Multiple agents working on different tasks
const [
plannerResponse,
analyzerResponse,
executorResponse
] = await Promise.all([
plannerSession.prompt("Plan the task"),
analyzerSession.prompt("Analyze the data"),
executorSession.prompt("Execute step 1")
]);
```
### 3. Benchmarking
```javascript
// Test multiple prompts for evaluation
const results = await Promise.all(
testPrompts.map(prompt => session.prompt(prompt))
);
```
### 4. A/B Testing
```javascript
// Test different system prompts
const [responseA, responseB] = await Promise.all([
sessionWithPromptA.prompt(query),
sessionWithPromptB.prompt(query)
]);
```
## Resource Considerations
### Memory Usage
Each sequence needs memory for:
- Conversation history
- Intermediate computations
- KV cache (key-value cache for transformer attention)
**Rule of thumb**: More sequences = more memory needed
### GPU Utilization
- **Single sequence**: May underutilize GPU
- **Multiple sequences**: Better GPU utilization
- **Too many sequences**: May exceed VRAM, causing slowdown
### Optimal Number of Sequences
Depends on:
- Available VRAM
- Model size
- Context length
- Batch size
**Typical**: 2-8 sequences for consumer GPUs
## Limitations & Considerations
### 1. Shared Context Limit
All sequences share the same context memory pool:
```
Total context size: 8192 tokens
Sequence 1: 4096 tokens
Sequence 2: 4096 tokens
Maximum distribution!
```
### 2. Not True Parallelism for CPU
On CPU-only systems, sequences are interleaved, not truly parallel. Still provides better overall throughput.
### 3. Model Loading Overhead
The model is loaded once and shared, which is efficient. But initial loading still takes time.
## Why This Matters for AI Agents
### Efficiency in Production
Real-world agent systems need to:
- Handle multiple requests concurrently
- Respond quickly to users
- Make efficient use of hardware
### Multi-Agent Architectures
Complex agent systems often have:
- **Planner agent**: Thinks about strategy
- **Executor agent**: Takes actions
- **Critic agent**: Evaluates results
These can run in parallel using separate sequences.
### Scalability
This pattern is the foundation for:
- Web services with multiple users
- Batch processing of data
- Distributed agent systems
## Best Practices
1. **Match sequences to workload**: Don't create more than you need
2. **Monitor memory usage**: Each sequence consumes VRAM
3. **Use appropriate batch size**: Balance speed vs. memory
4. **Clean up resources**: Always dispose when done
5. **Handle errors**: Wrap Promise.all in try-catch
## Expected Output
Running this script should output something like:
```
User: Hi there, how are you?
AI: Hello! I'm doing well, thank you for asking...
User: How much is 6+6?
AI: 12
```
Both responses appear quickly because they were processed simultaneously!