# Code Explanation: batch.js
This file demonstrates **parallel execution** of multiple LLM prompts using separate context sequences, enabling concurrent processing for better performance.
## Step-by-Step Code Breakdown
### 1. Import and Setup (Lines 1-10)
```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";
import path from "path";
import {fileURLToPath} from "url";
/**
* Asynchronous execution improves performance in GAIA benchmarks,
* multi-agent applications, and other high-throughput scenarios.
*/
const __dirname = path.dirname(fileURLToPath(import.meta.url));
```
- Standard imports for LLM interaction
- Comment explains the performance benefit
- **GAIA benchmark**: a benchmark for evaluating general-purpose AI assistants on real-world tasks
- Useful for multi-agent systems that need to handle many requests
### 2. Model Path Configuration (Lines 11-16)
```javascript
const modelPath = path.join(
__dirname,
"../",
"models",
"DeepSeek-R1-0528-Qwen3-8B-Q6_K.gguf"
)
```
- Uses **DeepSeek-R1-0528-Qwen3-8B**: an 8B-parameter reasoning model (DeepSeek-R1 distilled onto Qwen3-8B)
- **Q6_K quantization**: a good trade-off between output quality and file size
- Model is loaded once and shared between sequences
### 3. Initialize Llama and Load Model (Lines 18-19)
```javascript
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
```
- Standard initialization
- Model is loaded into memory once
- Will be used by multiple sequences simultaneously
### 4. Create Context with Multiple Sequences (Lines 20-23)
```javascript
const context = await model.createContext({
sequences: 2,
batchSize: 1024 // The number of tokens that can be processed at once by the GPU.
});
```
**Key parameters:**
- **sequences: 2**: Creates 2 independent conversation sequences
- Each sequence has its own conversation history
- Both share the same model and context memory pool
- Can be processed in parallel
- **batchSize: 1024**: Maximum tokens processed per GPU batch
- Larger = better GPU utilization
- Smaller = lower memory usage
- 1024 is a good balance for most GPUs
### Why Multiple Sequences?
```
Single Sequence (Sequential)      Multiple Sequences (Parallel)
────────────────────────────      ─────────────────────────────
Process Prompt 1 → Response 1     Process Prompt 1 ──┐
Wait...                                              ├─ Both responses
Process Prompt 2 → Response 2     Process Prompt 2 ──┘  in parallel!
Total Time: T1 + T2               Total Time: max(T1, T2)
```
### 5. Get Individual Sequences (Lines 25-26)
```javascript
const sequence1 = context.getSequence();
const sequence2 = context.getSequence();
```
- Retrieves two separate sequence objects from the context
- Each sequence maintains its own state
- They can be used independently for different conversations
### 6. Create Separate Sessions (Lines 28-33)
```javascript
const session1 = new LlamaChatSession({
contextSequence: sequence1
});
const session2 = new LlamaChatSession({
contextSequence: sequence2
});
```
- Creates a chat session for each sequence
- Each session has its own conversation history
- Sessions are completely independent
- No system prompts in this example (one could be added per session, as sketched below)
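If the two sessions needed different behavior, a system prompt could be set per session. A minimal sketch, assuming node-llama-cpp's `systemPrompt` session option (the prompt text here is illustrative):
```javascript
// Sketch: give the math-focused session its own persona.
// Assumes LlamaChatSession accepts a systemPrompt option.
const mathSession = new LlamaChatSession({
    contextSequence: sequence2,
    systemPrompt: "You are a concise math assistant. Reply with just the result."
});
```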
### 7. Define Questions (Lines 35-36)
```javascript
const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";
```
- Two completely different questions
- Will be processed simultaneously
- Different types: conversational vs. computational
### 8. Parallel Execution with Promise.all (Lines 38-44)
```javascript
const [
a1,
a2
] = await Promise.all([
session1.prompt(q1),
session2.prompt(q2)
]);
```
**How this works:**
1. `session1.prompt(q1)` starts asynchronously
2. `session2.prompt(q2)` starts asynchronously (doesn't wait for #1)
3. `Promise.all()` waits for BOTH to complete
4. Returns results in array: [response1, response2]
5. Destructures into `a1` and `a2`
**Key benefit**: Both prompts are processed at the same time, not one after another!
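One caveat: `Promise.all` rejects as soon as either prompt fails and discards the other result. When partial results are acceptable, standard JavaScript `Promise.allSettled` keeps both outcomes (a sketch reusing the sessions above):
```javascript
// Each entry is {status: "fulfilled", value} or {status: "rejected", reason}
const settled = await Promise.allSettled([
    session1.prompt(q1),
    session2.prompt(q2)
]);
for (const result of settled) {
    if (result.status === "fulfilled")
        console.log("AI: " + result.value);
    else
        console.error("Prompt failed:", result.reason);
}
```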
### 9. Display Results (Lines 46-50)
```javascript
console.log("User: " + q1);
console.log("AI: " + a1);
console.log("User: " + q2);
console.log("AI: " + a2);
```
- Outputs both question-answer pairs
- Results appear in input order because `Promise.all` preserves the order of its input array, regardless of which prompt finishes first
## Key Concepts Demonstrated
### 1. Parallel Processing
Instead of:
```javascript
// Sequential (slow)
const a1 = await session1.prompt(q1); // Wait
const a2 = await session2.prompt(q2); // Wait again
```
We use:
```javascript
// Parallel (fast)
const [a1, a2] = await Promise.all([
session1.prompt(q1),
session2.prompt(q2)
]);
```
### 2. Context Sequences
A context can hold multiple independent sequences:
```
┌─────────────────────────────────────┐
│          Context (Shared)           │
│  ┌───────────────────────────────┐  │
│  │   Model Weights (8B params)   │  │
│  └───────────────────────────────┘  │
│                                     │
│  ┌─────────────┐   ┌─────────────┐  │
│  │ Sequence 1  │   │ Sequence 2  │  │
│  │ "Hi there"  │   │ "6+6?"      │  │
│  │ History...  │   │ History...  │  │
│  └─────────────┘   └─────────────┘  │
└─────────────────────────────────────┘
```
## Performance Comparison
### Sequential Execution
```
Request 1: 2 seconds
Request 2: 2 seconds
Total: 4 seconds
```
### Parallel Execution (This Example)
```
Request 1: 2 seconds ──┐
Request 2: 2 seconds ──┤  Both running
Total:    ~2 seconds ──┘  simultaneously
```
**Speedup**: up to ~2x for 2 sequences; adding sequences continues to help until GPU compute or memory becomes the bottleneck
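These numbers are illustrative; to verify the speedup on your own hardware, time both strategies with the standard `performance.now()` (a sketch; use fresh sessions for each run, since prompting extends a session's history):
```javascript
// Time the parallel path; compare against awaiting the prompts one by one
const start = performance.now();
await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
const seconds = (performance.now() - start) / 1000;
console.log(`Parallel total: ${seconds.toFixed(1)}s`);
```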
## Use Cases
### 1. Multi-User Applications
```javascript
// Handle multiple users simultaneously
const [user1Response, user2Response, user3Response] = await Promise.all([
session1.prompt(user1Query),
session2.prompt(user2Query),
session3.prompt(user3Query)
]);
```
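Building those sessions by hand gets tedious beyond a few users; a loop over `context.getSequence()` scales to N users. A sketch using only the APIs shown earlier (it assumes the context was created with at least `userQueries.length` sequences):
```javascript
// One sequence + session per user query
const sessions = userQueries.map(() =>
    new LlamaChatSession({contextSequence: context.getSequence()})
);
const responses = await Promise.all(
    sessions.map((session, i) => session.prompt(userQueries[i]))
);
```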
### 2. Multi-Agent Systems
```javascript
// Multiple agents working on different tasks
const [
plannerResponse,
analyzerResponse,
executorResponse
] = await Promise.all([
plannerSession.prompt("Plan the task"),
analyzerSession.prompt("Analyze the data"),
executorSession.prompt("Execute step 1")
]);
```
### 3. Benchmarking
```javascript
// Evaluate multiple test prompts in parallel — one session per prompt,
// since a single session runs its prompts sequentially on one sequence
// (`sessions` as built in the multi-user sketch above)
const results = await Promise.all(
    testPrompts.map((prompt, i) => sessions[i].prompt(prompt))
);
```
### 4. A/B Testing
```javascript
// Test different system prompts
const [responseA, responseB] = await Promise.all([
sessionWithPromptA.prompt(query),
sessionWithPromptB.prompt(query)
]);
```
## Resource Considerations
### Memory Usage
Each sequence needs memory for:
- Conversation history
- Intermediate computations
- KV cache (key-value cache for transformer attention)
**Rule of thumb**: More sequences = more memory needed
### GPU Utilization
- **Single sequence**: May underutilize GPU
- **Multiple sequences**: Better GPU utilization
- **Too many sequences**: May exceed VRAM, causing slowdown
### Optimal Number of Sequences
Depends on:
- Available VRAM
- Model size
- Context length
- Batch size
**Typical**: 2-8 sequences for consumer GPUs
## Limitations & Considerations
### 1. Shared Context Limit
All sequences share the same context memory pool:
```
Total context size: 8192 tokens
Sequence 1: up to 4096 tokens
Sequence 2: up to 4096 tokens
(an even split of the shared pool)
```
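The size of that pool is set when the context is created. A sketch, assuming node-llama-cpp's `contextSize` option on `createContext` and an even split across sequences (matching the numbers above):
```javascript
// Sketch: an 8192-token pool shared by two sequences,
// roughly 4096 tokens per sequence if split evenly
const sharedContext = await model.createContext({
    sequences: 2,
    contextSize: 8192,
    batchSize: 1024
});
```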
### 2. Not True Parallelism for CPU
On CPU-only systems, sequences are interleaved rather than truly parallel, but batching them can still improve overall throughput.
### 3. Model Loading Overhead
The model is loaded once and shared, which is efficient. But initial loading still takes time.
## Why This Matters for AI Agents
### Efficiency in Production
Real-world agent systems need to:
- Handle multiple requests concurrently
- Respond quickly to users
- Make efficient use of hardware
### Multi-Agent Architectures
Complex agent systems often have:
- **Planner agent**: Thinks about strategy
- **Executor agent**: Takes actions
- **Critic agent**: Evaluates results
These can run in parallel using separate sequences.
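Wiring that up mirrors the main example, just with three sequences. A sketch (the system prompts and session names are illustrative, and the `systemPrompt` option is an assumption noted earlier):
```javascript
// Sketch: three agent sessions sharing one model and context
const agentContext = await model.createContext({sequences: 3});
const [plannerSession, analyzerSession, executorSession] = [
    "You plan tasks step by step.",
    "You analyze data and report findings.",
    "You execute one step at a time."
].map((systemPrompt) => new LlamaChatSession({
    contextSequence: agentContext.getSequence(),
    systemPrompt
}));
```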
### Scalability
This pattern is the foundation for:
- Web services with multiple users
- Batch processing of data
- Distributed agent systems
## Best Practices
1. **Match sequences to workload**: Don't create more than you need
2. **Monitor memory usage**: Each sequence consumes VRAM
3. **Use appropriate batch size**: Balance speed vs. memory
4. **Clean up resources**: Always dispose of the context and model when done
5. **Handle errors**: Wrap `Promise.all` in try-catch (see the sketch after this list)
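Practices 4 and 5 combine naturally into a try/catch/finally around the parallel work. A sketch, assuming `dispose()` methods on the context and model objects (present in node-llama-cpp, though exact cleanup needs may vary by version):
```javascript
try {
    const [a1, a2] = await Promise.all([
        session1.prompt(q1),
        session2.prompt(q2)
    ]);
    console.log(a1, a2);
} catch (error) {
    console.error("A prompt failed:", error);
} finally {
    // Free native resources once all sessions are finished
    await context.dispose();
    await model.dispose();
}
```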
## Expected Output
Running this script should output something like:
```
User: Hi there, how are you?
AI: Hello! I'm doing well, thank you for asking...
User: How much is 6+6?
AI: 12
```
Both responses appear quickly because they were processed simultaneously!
|