# Code Explanation: batch.js

This file demonstrates **parallel execution** of multiple LLM prompts using separate context sequences, enabling concurrent processing for better performance.

## Step-by-Step Code Breakdown

### 1. Import and Setup (Lines 1-10)
```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";
import path from "path";
import {fileURLToPath} from "url";

/**
 * Asynchronous execution improves performance in GAIA benchmarks,
 * multi-agent applications, and other high-throughput scenarios.
 */

const __dirname = path.dirname(fileURLToPath(import.meta.url));
```
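- `getLlama` and `LlamaChatSession` come from `node-llama-cpp`; `path` and `url` are used to rebuild `__dirname`, which ES modules do not provide
- The comment explains the motivation: running prompts concurrently raises throughput
- **GAIA benchmark**: A benchmark for evaluating general-purpose AI assistants on multi-step tasks
- The same approach helps multi-agent systems that need to handle many requests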

### 2. Model Path Configuration (Lines 11-16)
```javascript
const modelPath = path.join(
    __dirname,
    "../",
    "models",
    "DeepSeek-R1-0528-Qwen3-8B-Q6_K.gguf"
)
```
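- Uses **DeepSeek-R1-0528-Qwen3-8B**: An 8B-parameter Qwen3 model distilled from DeepSeek-R1-0528 and tuned for reasoning
- **Q6_K quantization**: A roughly 6-bit format that trades a small quality loss for a much smaller file
- The model is loaded once and shared between sequences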
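To get a feel for what "balance between quality and size" means in memory terms, here is a back-of-the-envelope estimate. The parameter count (about 8.19 billion) and the Q6_K storage cost (about 6.5625 bits per weight) are assumptions for illustration, not values taken from batch.js; the real file size also depends on which tensors are stored in other formats.

```javascript
// Rough estimate of the memory needed for quantized model weights.
// Assumptions (not from batch.js): ~8.19e9 parameters, ~6.5625 bits per weight for Q6_K.
function estimateWeightBytes(parameterCount, bitsPerWeight) {
    return parameterCount * bitsPerWeight / 8;
}

const weightBytes = estimateWeightBytes(8.19e9, 6.5625);
console.log(`Approximate weights: ${(weightBytes / 1e9).toFixed(1)} GB`); // Approximate weights: 6.7 GB
```

Whatever the exact figure, this memory is paid once, no matter how many sequences the context later creates.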

### 3. Initialize Llama and Load Model (Lines 18-19)
```javascript
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
```
- Standard initialization
- Model is loaded into memory once
- Will be used by multiple sequences simultaneously

### 4. Create Context with Multiple Sequences (Lines 20-23)
```javascript
const context = await model.createContext({
    sequences: 2,
    batchSize: 1024 // The number of tokens that can be processed at once by the GPU.
});
```

**Key parameters:**

- **sequences: 2**: Creates 2 independent conversation sequences
  - Each sequence has its own conversation history
  - Both share the same model and context memory pool
  - Can be processed in parallel

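- **batchSize: 1024**: Maximum tokens processed per GPU batch
  - Larger values let more tokens be evaluated per step, which tends to improve GPU utilization
  - Smaller values lower memory usage
  - 1024 is a reasonable starting point; tune it against the VRAM you have available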

### Why Multiple Sequences?

```
Single Sequence (Sequential)     Multiple Sequences (Parallel)
─────────────────────────       ──────────────────────────────
Process Prompt 1 → Response 1    Process Prompt 1 ──┐
Wait...                                              ├→ Both responses
Process Prompt 2 → Response 2    Process Prompt 2 ──┘   in parallel!

Total Time: T1 + T2              Total Time: ≈ max(T1, T2) at best
```

### 5. Get Individual Sequences (Lines 25-26)
```javascript
const sequence1 = context.getSequence();
const sequence2 = context.getSequence();
```
- Retrieves two separate sequence objects from the context
- Each sequence maintains its own state
- They can be used independently for different conversations

### 6. Create Separate Sessions (Lines 28-33)
```javascript
const session1 = new LlamaChatSession({
    contextSequence: sequence1
});
const session2 = new LlamaChatSession({
    contextSequence: sequence2
});
```
- Creates a chat session for each sequence
- Each session has its own conversation history
- Sessions do not share conversation state, although they share the model and the context's hardware resources
- No system prompts are set in this example (they could be added)
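If more than two sessions are needed, or each one should get its own system prompt, the setup can be wrapped in a small helper. The sketch below is one possible shape, not part of batch.js: `createSessions`, `fakeContext`, and the prompt texts are hypothetical, and the session factory is injected so the helper runs without a model. With the real library, the factory would presumably be `(sequence, i) => new LlamaChatSession({contextSequence: sequence, systemPrompt: systemPrompts[i]})`.

```javascript
// Build one chat session per context sequence.
// `createSession` is injected so the helper can be exercised without a model.
function createSessions(context, count, createSession) {
    const sessions = [];
    for (let i = 0; i < count; i++) {
        const sequence = context.getSequence(); // each call hands out a separate sequence
        sessions.push(createSession(sequence, i));
    }
    return sessions;
}

// Stand-in context for illustration only; the real one comes from model.createContext().
const fakeContext = {
    nextId: 0,
    getSequence() {
        return {id: this.nextId++};
    }
};

// Hypothetical system prompts, one per session.
const systemPrompts = ["You are a friendly assistant.", "You are a careful calculator."];

const sessions = createSessions(fakeContext, 2, (sequence, i) => ({
    sequence,
    systemPrompt: systemPrompts[i]
}));

console.log(sessions.map((s) => `${s.sequence.id}: ${s.systemPrompt}`));
```

The `count` passed to the helper should not exceed the `sequences` value used when the context was created.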

### 7. Define Questions (Lines 35-36)
```javascript
const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";
```
- Two completely different questions
- Will be processed simultaneously
- Different types: conversational vs. computational

### 8. Parallel Execution with Promise.all (Lines 38-44)
```javascript
const [
    a1,
    a2
] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```

**How this works:**

1. `session1.prompt(q1)` is called and immediately returns a promise
2. `session2.prompt(q2)` is called right after, without waiting for #1 to finish
3. `Promise.all()` waits for BOTH promises to resolve
4. It resolves to an array: [response1, response2]
5. The array is destructured into `a1` and `a2`

**Key benefit**: Both prompts are in flight at the same time, so their tokens can be evaluated together instead of one prompt after another.

### 9. Display Results (Lines 46-50)
```javascript
console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);
```
- Outputs both question-answer pairs
- `Promise.all` returns results in the order of its input array, so `a1` always answers `q1` and `a2` answers `q2`, whichever prompt finished first
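The ordering behavior can be checked without a model. In this sketch, `fakePrompt` is a hypothetical stand-in for `session.prompt()` that resolves after a fixed delay; the first entry is deliberately the slower one.

```javascript
// Stand-in for session.prompt(): resolves with `answer` after `ms` milliseconds.
function fakePrompt(answer, ms) {
    return new Promise((resolve) => setTimeout(() => resolve(answer), ms));
}

async function demoOrder() {
    const finished = [];
    const track = (promise) => promise.then((value) => {
        finished.push(value); // record the order in which prompts complete
        return value;
    });

    const results = await Promise.all([
        track(fakePrompt("slow answer", 60)),
        track(fakePrompt("fast answer", 10))
    ]);
    return {results, finished};
}

demoOrder().then(({results, finished}) => {
    console.log("Promise.all order:", results);  // [ 'slow answer', 'fast answer' ]
    console.log("Completion order:", finished);  // [ 'fast answer', 'slow answer' ]
});
```

The result array follows the input order, not the completion order, which is why the logging code can pair `q1` with `a1` safely.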

## Key Concepts Demonstrated

### 1. Parallel Processing
Instead of:
```javascript
// Sequential (slow)
const a1 = await session1.prompt(q1);  // Wait
const a2 = await session2.prompt(q2);  // Wait again
```

We use:
```javascript
// Parallel (fast)
const [a1, a2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```
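To see the difference between the two patterns without loading a model, the following sketch times both against a hypothetical `fakePrompt` helper that only waits on a timer. Timers overlap perfectly, whereas real prompts compete for the same GPU or CPU, so treat the measured gap as an upper bound on what the pattern can deliver.

```javascript
// Stand-in for session.prompt(): resolves with `answer` after `ms` milliseconds.
function fakePrompt(answer, ms) {
    return new Promise((resolve) => setTimeout(() => resolve(answer), ms));
}

// Measure how long an async function takes, in milliseconds.
async function timeIt(run) {
    const start = Date.now();
    await run();
    return Date.now() - start;
}

async function compare() {
    const sequentialMs = await timeIt(async () => {
        await fakePrompt("a1", 100); // second prompt starts only after this resolves
        await fakePrompt("a2", 100);
    });
    const parallelMs = await timeIt(() => Promise.all([
        fakePrompt("a1", 100), // both timers start immediately
        fakePrompt("a2", 100)
    ]));
    return {sequentialMs, parallelMs};
}

compare().then(({sequentialMs, parallelMs}) => {
    console.log(`sequential: ~${sequentialMs} ms, parallel: ~${parallelMs} ms`);
});
```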

### 2. Context Sequences
A context can hold multiple independent sequences:

```
┌─────────────────────────────────────┐
│          Context (Shared)           │
│  ┌───────────────────────────────┐  │
│  │  Model Weights (8B params)    │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌─────────────┐  ┌─────────────┐  │
│  │ Sequence 1  │  │ Sequence 2  │  │
│  │ "Hi there"  │  │ "6+6?"      │  │
│  │ History...  │  │ History...  │  │
│  └─────────────┘  └─────────────┘  │
└─────────────────────────────────────┘
```

## Performance Comparison

### Sequential Execution
```
Request 1: 2 seconds
Request 2: 2 seconds
Total: 4 seconds
```

### Parallel Execution (This Example)
```
Request 1: 2 seconds ──┐
Request 2: 2 seconds ──┤ Both running
Total: ~2 seconds      └─ simultaneously
```

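**Speedup**: Up to ~2x for 2 sequences in the ideal case. The timings above are illustrative; the real gain depends on how much spare hardware capacity a single sequence leaves, and it grows less than linearly as sequences are added.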

## Use Cases

### 1. Multi-User Applications
```javascript
// Handle multiple users simultaneously
// (assumes a context created with sequences: 3 and one session per sequence)
const [user1Response, user2Response, user3Response] = await Promise.all([
    session1.prompt(user1Query),
    session2.prompt(user2Query),
    session3.prompt(user3Query)
]);
```

### 2. Multi-Agent Systems
```javascript
// Multiple agents working on different tasks
const [
    plannerResponse,
    analyzerResponse,
    executorResponse
] = await Promise.all([
    plannerSession.prompt("Plan the task"),
    analyzerSession.prompt("Analyze the data"),
    executorSession.prompt("Execute step 1")
]);
```

### 3. Benchmarking
```javascript
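// Test multiple prompts for evaluation.
// A session handles one prompt at a time and keeps chat history, so each prompt
// gets its own session here. `sessions` is a hypothetical array with one
// LlamaChatSession per context sequence, at least as long as `testPrompts`.
const results = await Promise.all(
    testPrompts.map((prompt, i) => sessions[i].prompt(prompt))
);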
```
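When there are more prompts than sequences, one option is a small worker pool in which each session picks up the next prompt as soon as it is free. This is a sketch under stated assumptions: `runPrompts` and `makeFakeSession` are hypothetical names, and the fake sessions only simulate latency. With real sessions, chat history accumulates across prompts, so independent benchmark items would also need a fresh or reset session per item.

```javascript
// Run many prompts over a fixed set of sessions.
// Each session takes the next unprocessed prompt as soon as it is free,
// so no session ever handles two prompts at the same time.
async function runPrompts(sessions, prompts) {
    const results = new Array(prompts.length);
    let next = 0;

    async function worker(session) {
        while (next < prompts.length) {
            const index = next++; // no race: this runs synchronously between awaits
            results[index] = await session.prompt(prompts[index]);
        }
    }

    await Promise.all(sessions.map((session) => worker(session)));
    return results; // same order as `prompts`
}

// Stand-in sessions for illustration; real ones would be LlamaChatSession instances.
function makeFakeSession(name) {
    return {
        prompt: (text) => new Promise((resolve) =>
            setTimeout(() => resolve(`${name} answered: ${text}`), 10))
    };
}

const fakeSessions = [makeFakeSession("s1"), makeFakeSession("s2")];
runPrompts(fakeSessions, ["p1", "p2", "p3", "p4", "p5"]).then((results) => {
    console.log(results);
});
```

The number of workers equals the number of sessions, which keeps concurrency within what the context was created for.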

### 4. A/B Testing
```javascript
// Test different system prompts
const [responseA, responseB] = await Promise.all([
    sessionWithPromptA.prompt(query),
    sessionWithPromptB.prompt(query)
]);
```

## Resource Considerations

### Memory Usage
Each sequence needs memory for:
- Conversation history
- Intermediate computations
- KV cache (key-value cache for transformer attention)

**Rule of thumb**: More sequences = more memory needed
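The KV cache usually dominates the per-sequence cost, and its size can be approximated from the model architecture. The architecture values below (36 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions for illustration; the real values should be read from the model's metadata.

```javascript
// Approximate KV-cache size: keys and values (the factor 2) for every layer, KV head and token.
function estimateKvCacheBytes({layers, kvHeads, headDim, bytesPerValue, tokens}) {
    return 2 * layers * kvHeads * headDim * bytesPerValue * tokens;
}

// Assumed architecture values, for illustration only.
const perSequence = estimateKvCacheBytes({
    layers: 36,
    kvHeads: 8,
    headDim: 128,
    bytesPerValue: 2, // 16-bit cache entries
    tokens: 4096
});

console.log(`~${(perSequence / 2 ** 20).toFixed(0)} MiB per 4096-token sequence`); // ~576 MiB per 4096-token sequence
```

Under these assumptions, every additional 4096-token sequence costs roughly another 576 MiB on top of the model weights.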

### GPU Utilization
- **Single sequence**: May underutilize the GPU
- **Multiple sequences**: Usually better GPU utilization
- **Too many sequences**: May exceed VRAM, causing context creation to fail or performance to degrade

### Optimal Number of Sequences
Depends on:
- Available VRAM
- Model size
- Context length
- Batch size

**Typical**: As a rough guide, 2-8 sequences on consumer GPUs; measure on your own hardware before settling on a number
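These factors can be combined into a simple capacity check. The function and all numbers below are hypothetical inputs chosen for illustration, not measurements; real memory use should be confirmed by monitoring the GPU.

```javascript
// Estimate how many sequences fit in a memory budget.
// All numbers are hypothetical inputs, not measurements.
function maxSequences({vramBytes, weightBytes, overheadBytes, bytesPerSequence}) {
    const available = vramBytes - weightBytes - overheadBytes;
    return Math.max(0, Math.floor(available / bytesPerSequence));
}

const GiB = 2 ** 30;
const fit = maxSequences({
    vramBytes: 12 * GiB,
    weightBytes: 6.5 * GiB,
    overheadBytes: 1 * GiB,       // compute buffers, other processes
    bytesPerSequence: 0.75 * GiB  // KV cache and buffers for one sequence
});

console.log(`Room for about ${fit} sequences`); // Room for about 6 sequences
```

Doubling the context length roughly doubles `bytesPerSequence`, which halves the number of sequences that fit.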

## Limitations & Considerations

### 1. Shared Context Limit
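Sequences draw on the same context memory, so the token budget is finite. For example, if a total budget of 8192 tokens is divided evenly between two sequences:
```
Total context size: 8192 tokens
Sequence 1: up to 4096 tokens
Sequence 2: up to 4096 tokens
```
How the budget is divided depends on the library version and its context size settings, so check the documentation for the version you use.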

### 2. Not True Parallelism for CPU
On CPU-only systems, sequences are interleaved rather than executed truly in parallel. Batching can still improve overall throughput, but expect a smaller gain than on a GPU.

### 3. Model Loading Overhead
The model is loaded once and shared, which is efficient, but the initial load still takes time.

## Why This Matters for AI Agents

### Efficiency in Production
Real-world agent systems need to:
- Handle multiple requests concurrently
- Respond quickly to users
- Make efficient use of hardware

### Multi-Agent Architectures
Complex agent systems often have:
- **Planner agent**: Thinks about strategy
- **Executor agent**: Takes actions
- **Critic agent**: Evaluates results

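Agents whose inputs do not depend on each other can run in parallel using separate sequences; steps that need another agent's output still have to wait for it.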
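One way to combine parallel and dependent steps is a staged pipeline. The sketch below uses hypothetical stand-in agents (`makeFakeAgent`) that simply echo their input with a role tag; in a real system each agent would wrap its own chat session on its own sequence.

```javascript
// Stand-in for a chat session; a real agent would wrap a LlamaChatSession.
function makeFakeAgent(role) {
    return {
        prompt: async (text) => `[${role}] ${text}`
    };
}

async function runPipeline(planner, executor, critic, task) {
    // Stage 1: both prompts depend only on the task, so they can overlap.
    const [plan, draft] = await Promise.all([
        planner.prompt(`Plan: ${task}`),
        executor.prompt(`First attempt: ${task}`)
    ]);
    // Stage 2: the critic needs both outputs, so it waits for stage 1.
    return critic.prompt(`Review plan "${plan}" and draft "${draft}"`);
}

runPipeline(
    makeFakeAgent("planner"),
    makeFakeAgent("executor"),
    makeFakeAgent("critic"),
    "sort the inbox"
).then((review) => console.log(review));
```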

### Scalability
This pattern is the foundation for:
- Web services with multiple users
- Batch processing of data
- Distributed agent systems

## Best Practices

1. **Match sequences to workload**: Don't create more than you need
2. **Monitor memory usage**: Each sequence consumes VRAM
3. **Use appropriate batch size**: Balance speed vs. memory
4. **Clean up resources**: Always dispose when done
5. **Handle errors**: Wrap Promise.all in try-catch
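Practices 4 and 5 can be sketched together. Note that this example swaps `Promise.all` for `Promise.allSettled`, so that one failed prompt does not discard the other answers; wrapping `Promise.all` in try-catch, as the list suggests, is the simpler alternative when any failure should abort the whole batch. The sessions and the context here are hypothetical stand-ins, and the sketch assumes the real context exposes an async `dispose()` method.

```javascript
// Stand-in sessions: one succeeds, one fails.
const okSession = {prompt: async (text) => `answer to: ${text}`};
const failingSession = {prompt: async () => { throw new Error("generation failed"); }};

// Stand-in for a disposable resource such as the context.
const fakeContext = {disposed: false, async dispose() { this.disposed = true; }};

async function promptAll(jobs, context) {
    try {
        // allSettled waits for every prompt, so one failure does not hide the other answers.
        const settled = await Promise.allSettled(
            jobs.map(({session, text}) => session.prompt(text))
        );
        return settled.map((result) =>
            result.status === "fulfilled"
                ? {ok: true, answer: result.value}
                : {ok: false, error: result.reason.message}
        );
    } finally {
        await context.dispose(); // release resources whether or not the prompts succeeded
    }
}

promptAll([
    {session: okSession, text: "Hi there, how are you?"},
    {session: failingSession, text: "How much is 6+6?"}
], fakeContext).then((results) => console.log(results, fakeContext.disposed));
```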

## Expected Output

Running this script should output something like:
```
User: Hi there, how are you?
AI: Hello! I'm doing well, thank you for asking...

User: How much is 6+6?
AI: 12
```

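Both answers are printed together once both prompts have finished, because the script awaits `Promise.all` before logging anything. The exact wording of the answers will vary between runs.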