# Code Explanation: batch.js

This file demonstrates **parallel execution** of multiple LLM prompts using separate context sequences, enabling concurrent processing for better performance.

## Step-by-Step Code Breakdown

### 1. Import and Setup (Lines 1-10)

```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";
import path from "path";
import {fileURLToPath} from "url";

/**
 * Asynchronous execution improves performance in GAIA benchmarks,
 * multi-agent applications, and other high-throughput scenarios.
 */
const __dirname = path.dirname(fileURLToPath(import.meta.url));
```

- Standard imports for LLM interaction
- The comment explains the performance benefit
- **GAIA benchmark**: a benchmark for evaluating general-purpose AI assistants on real-world tasks
- Useful for multi-agent systems that need to handle many requests
### 2. Model Path Configuration (Lines 11-16)

```javascript
const modelPath = path.join(
    __dirname,
    "../",
    "models",
    "DeepSeek-R1-0528-Qwen3-8B-Q6_K.gguf"
);
```

- Uses **DeepSeek-R1-0528-Qwen3-8B**: an 8B-parameter reasoning model (DeepSeek-R1-0528 distilled into Qwen3-8B)
- **Q6_K quantization**: a balance between quality and size
- The model is loaded once and shared between sequences
### 3. Initialize Llama and Load Model (Lines 18-19)

```javascript
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
```

- Standard initialization
- The model is loaded into memory once
- It will be used by multiple sequences simultaneously
### 4. Create Context with Multiple Sequences (Lines 20-23)

```javascript
const context = await model.createContext({
    sequences: 2,
    batchSize: 1024 // The number of tokens that can be processed at once by the GPU.
});
```

**Key parameters:**

- **sequences: 2**: Creates 2 independent conversation sequences
  - Each sequence has its own conversation history
  - Both share the same model and context memory pool
  - They can be processed in parallel
- **batchSize: 1024**: Maximum tokens processed per GPU batch
  - Larger = better GPU utilization
  - Smaller = lower memory usage
  - 1024 is a good balance for most GPUs (see the sketch after this list for a lower-memory variant)
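As a rough illustration of that trade-off, a memory-constrained setup might shrink the batch (the value below is a hypothetical starting point, not a tuned recommendation):

```javascript
// Hypothetical low-memory configuration: fewer tokens per GPU batch
// trades some throughput for a smaller peak memory footprint.
const lowMemoryContext = await model.createContext({
    sequences: 2,
    batchSize: 256
});
```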
### Why Multiple Sequences?

```
Single Sequence (Sequential)       Multiple Sequences (Parallel)
────────────────────────────       ─────────────────────────────
Process Prompt 1 → Response 1      Process Prompt 1 ──┐
Wait...                                               ├─ Both responses
Process Prompt 2 → Response 2      Process Prompt 2 ──┘  in parallel!
Total Time: T1 + T2                Total Time: max(T1, T2)
```
### 5. Get Individual Sequences (Lines 25-26)

```javascript
const sequence1 = context.getSequence();
const sequence2 = context.getSequence();
```

- Retrieves two separate sequence objects from the context
- Each sequence maintains its own state
- They can be used independently for different conversations

### 6. Create Separate Sessions (Lines 28-33)

```javascript
const session1 = new LlamaChatSession({
    contextSequence: sequence1
});

const session2 = new LlamaChatSession({
    contextSequence: sequence2
});
```

- Creates a chat session for each sequence
- Each session has its own conversation history
- Sessions are completely independent
- No system prompts are set in this example (one could be added, as sketched below)
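For example, `session1` could instead be created with a persona; a minimal sketch using the `systemPrompt` option of `LlamaChatSession`:

```javascript
// The same sequence as before, but the session now carries a persona,
// letting the two sequences behave differently while sharing one model.
const session1 = new LlamaChatSession({
    contextSequence: sequence1,
    systemPrompt: "You are a concise, friendly assistant."
});
```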
### 7. Define Questions (Lines 35-36)

```javascript
const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";
```

- Two completely different questions
- They will be processed simultaneously
- Different types: conversational vs. computational

### 8. Parallel Execution with Promise.all (Lines 38-44)

```javascript
const [
    a1,
    a2
] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```

**How this works:**

1. `session1.prompt(q1)` starts asynchronously
2. `session2.prompt(q2)` starts asynchronously (it doesn't wait for #1)
3. `Promise.all()` waits for BOTH to complete
4. Results are returned in an array: `[response1, response2]`
5. The array is destructured into `a1` and `a2`

**Key benefit**: Both prompts are processed at the same time, not one after another!

### 9. Display Results (Lines 46-50)

```javascript
console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);
```

- Outputs both question-answer pairs
- Results appear in order despite parallel processing
## Key Concepts Demonstrated

### 1. Parallel Processing

Instead of:

```javascript
// Sequential (slow)
const a1 = await session1.prompt(q1); // Wait
const a2 = await session2.prompt(q2); // Wait again
```

We use:

```javascript
// Parallel (fast)
const [a1, a2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
```
### 2. Context Sequences

A context can hold multiple independent sequences:

```
┌───────────────────────────────────┐
│          Context (Shared)         │
│  ┌─────────────────────────────┐  │
│  │  Model Weights (8B params)  │  │
│  └─────────────────────────────┘  │
│                                   │
│  ┌─────────────┐  ┌────────────┐  │
│  │ Sequence 1  │  │ Sequence 2 │  │
│  │ "Hi there"  │  │ "6+6?"     │  │
│  │ History...  │  │ History... │  │
│  └─────────────┘  └────────────┘  │
└───────────────────────────────────┘
```
## Performance Comparison

### Sequential Execution

```
Request 1: 2 seconds
Request 2: 2 seconds
Total:     4 seconds
```

### Parallel Execution (This Example)

```
Request 1:  2 seconds ──┐
Request 2:  2 seconds ──┤  Both running
Total:     ~2 seconds ──┘  simultaneously
```

**Speedup**: up to ~2x for 2 sequences; it scales with more sequences until the GPU's compute or memory becomes the bottleneck
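To verify the numbers on your own hardware, you can time the parallel path with the standard `performance.now()` timer (the figures above are illustrative; real timings depend on the model and GPU):

```javascript
// Measure the wall-clock time of the parallel round trip.
const start = performance.now();
const [r1, r2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);
const elapsedSeconds = (performance.now() - start) / 1000;
console.log(`Parallel round trip took ~${elapsedSeconds.toFixed(1)}s`);
```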
## Use Cases

### 1. Multi-User Applications

```javascript
// Handle multiple users simultaneously
const [user1Response, user2Response, user3Response] = await Promise.all([
    session1.prompt(user1Query),
    session2.prompt(user2Query),
    session3.prompt(user3Query)
]);
```

### 2. Multi-Agent Systems

```javascript
// Multiple agents working on different tasks
const [
    plannerResponse,
    analyzerResponse,
    executorResponse
] = await Promise.all([
    plannerSession.prompt("Plan the task"),
    analyzerSession.prompt("Analyze the data"),
    executorSession.prompt("Execute step 1")
]);
```

### 3. Benchmarking

```javascript
// Test multiple prompts for evaluation.
// Each prompt needs its own session (and sequence) to actually run in
// parallel; reusing a single session would queue the prompts serially
// and mix their conversation histories.
const results = await Promise.all(
    testPrompts.map((prompt, i) => sessions[i].prompt(prompt))
);
```
### 4. A/B Testing

```javascript
// Test different system prompts
const [responseA, responseB] = await Promise.all([
    sessionWithPromptA.prompt(query),
    sessionWithPromptB.prompt(query)
]);
```

## Resource Considerations

### Memory Usage

Each sequence needs memory for:

- Conversation history
- Intermediate computations
- KV cache (key-value cache for transformer attention)

**Rule of thumb**: More sequences = more memory needed
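For intuition, the KV cache usually dominates per-sequence memory. A back-of-the-envelope sketch (the layer/head/dimension numbers below are illustrative for an 8B-class model, not the exact DeepSeek-R1-0528-Qwen3-8B configuration):

```javascript
// Rough KV-cache size: 2 (K and V) × layers × kvHeads × headDim × bytes × tokens.
function estimateKvCacheBytes({layers, kvHeads, headDim, bytesPerValue, tokens}) {
    return 2 * layers * kvHeads * headDim * bytesPerValue * tokens;
}

// Illustrative numbers only, assuming 16-bit cache entries.
const bytes = estimateKvCacheBytes({
    layers: 36, kvHeads: 8, headDim: 128, bytesPerValue: 2, tokens: 4096
});
console.log(`~${(bytes / 1024 ** 2).toFixed(0)} MiB of KV cache per sequence`);
```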
### GPU Utilization

- **Single sequence**: May underutilize the GPU
- **Multiple sequences**: Better GPU utilization
- **Too many sequences**: May exceed VRAM, causing slowdown

### Optimal Number of Sequences

Depends on:

- Available VRAM
- Model size
- Context length
- Batch size

**Typical**: 2-8 sequences for consumer GPUs
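One way to choose is to check free VRAM before creating the context. A sketch, assuming the `getVramState()` method exposed by recent node-llama-cpp versions (the 8 GiB threshold is an arbitrary example):

```javascript
// Inspect VRAM and scale the sequence count down on low-memory GPUs.
const vram = await llama.getVramState();
console.log(`Free VRAM: ${(vram.free / 1024 ** 3).toFixed(1)} GiB`);

const sequences = vram.free > 8 * 1024 ** 3 ? 4 : 2;
const tunedContext = await model.createContext({sequences});
```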
## Limitations & Considerations

### 1. Shared Context Limit

All sequences share the same context memory pool:

```
Total context size: 8192 tokens
Sequence 1: up to 4096 tokens
Sequence 2: up to 4096 tokens
```

With an even split, each sequence can use at most half of the pool.
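The pool size is set when the context is created. A sketch using the `contextSize` option of `createContext` to make the split above explicit:

```javascript
// Request an 8192-token pool shared by two sequences,
// leaving up to 4096 tokens for each one.
const sizedContext = await model.createContext({
    contextSize: 8192,
    sequences: 2
});
```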
### 2. Not True Parallelism on CPU

On CPU-only systems, sequences are interleaved rather than truly parallel, but batching them still improves overall throughput.
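To observe this directly, GPU support can be disabled at initialization; a sketch assuming the `gpu` option of `getLlama`:

```javascript
// Force CPU-only inference to compare against the GPU timings.
const cpuLlama = await getLlama({gpu: false});
```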
### 3. Model Loading Overhead

The model is loaded once and shared, which is efficient, but the initial load still takes time.
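Because loading dominates startup cost, the same loaded model can back any number of contexts; a minimal sketch:

```javascript
// Pay the load cost once, then create as many contexts as needed.
const chatContext = await model.createContext({sequences: 2});
const scratchContext = await model.createContext({sequences: 1});
```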
## Why This Matters for AI Agents

### Efficiency in Production

Real-world agent systems need to:

- Handle multiple requests concurrently
- Respond quickly to users
- Make efficient use of hardware

### Multi-Agent Architectures

Complex agent systems often have:

- **Planner agent**: Thinks about strategy
- **Executor agent**: Takes actions
- **Critic agent**: Evaluates results

These can run in parallel using separate sequences, as sketched below.
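Wiring this up follows the same pattern as the main example: one sequence and one session per agent (the role names and system prompts here are illustrative):

```javascript
// Three sequences from one context back three independent agent sessions.
const agentContext = await model.createContext({sequences: 3});

const [planner, executor, critic] = ["planner", "executor", "critic"].map(
    (role) => new LlamaChatSession({
        contextSequence: agentContext.getSequence(),
        systemPrompt: `You are the ${role} agent.`
    })
);
```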
### Scalability

This pattern is the foundation for:

- Web services with multiple users
- Batch processing of data
- Distributed agent systems

## Best Practices

1. **Match sequences to workload**: Don't create more than you need
2. **Monitor memory usage**: Each sequence consumes VRAM
3. **Use appropriate batch size**: Balance speed vs. memory
4. **Clean up resources**: Always dispose when done
5. **Handle errors**: Wrap `Promise.all` in try-catch (see the sketch after this list)
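A sketch combining the last two practices, assuming the `dispose()` methods on the context and model:

```javascript
try {
    const [a1, a2] = await Promise.all([
        session1.prompt(q1),
        session2.prompt(q2)
    ]);
    console.log(a1, a2);
} catch (error) {
    // A single failed prompt rejects the whole Promise.all.
    console.error("Prompting failed:", error);
} finally {
    // Release the sequences, context memory, and model weights.
    await context.dispose();
    await model.dispose();
}
```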
## Expected Output

Running this script should output something like:

```
User: Hi there, how are you?
AI: Hello! I'm doing well, thank you for asking...

User: How much is 6+6?
AI: 12
```

Both responses appear quickly because they were processed simultaneously!