# Concept: Parallel Processing & Performance Optimization

## Overview

This example demonstrates **concurrent execution** of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.

## The Performance Problem

### Sequential Processing (Slow)

The traditional approach processes one request at a time:

```
Request 1 ────────→ Response 1 (2s)
    ↓
Request 2 ────────→ Response 2 (2s)

Total: 4 seconds
```

### Parallel Processing (Fast)

This example processes multiple requests simultaneously:

```
Request 1 ────────→ Response 1 (2s) ──┐
                                      ├→ Total: 2 seconds
Request 2 ────────→ Response 2 (2s) ──┘
      (Both running at the same time)
```

**Performance gain: 2x speedup!**

## Core Concept: Context Sequences

### Single vs. Multiple Sequences

```
┌────────────────────────────────────────────────┐
│              Model (Loaded Once)               │
├────────────────────────────────────────────────┤
│                    Context                     │
│   ┌──────────────┐       ┌──────────────┐     │
│   │  Sequence 1  │       │  Sequence 2  │     │
│   │              │       │              │     │
│   │ Conversation │       │ Conversation │     │
│   │  History A   │       │  History B   │     │
│   └──────────────┘       └──────────────┘     │
└────────────────────────────────────────────────┘
```

**Key insights:**

- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model

## How Parallel Processing Works

### Promise.all Pattern

JavaScript's `Promise.all()` enables concurrent execution:

```
Sequential:
────────────────────────────────────
await fn1();   // Wait 2s
await fn2();   // Wait 2s more
Total: 4s

Parallel:
────────────────────────────────────
await Promise.all([
  fn1(),   // Start immediately
  fn2()    // Start immediately (don't wait!)
]);
Total: 2s (whichever finishes last)
```

### Execution Timeline

```
Time →   0s        1s        2s        3s        4s
         │         │         │         │         │
Seq 1:   ├───────Processing───────┤
                                  └─ Response 1
Seq 2:   ├───────Processing───────┤
                                  └─ Response 2

Both complete at ~2s instead of 4s!
```

## GPU Batch Processing

### Why Batching Matters

Modern GPUs process multiple operations efficiently:

```
Without Batching (Inefficient)
──────────────────────────────
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     └─ GPU underutilized

With Batching (Efficient)
─────────────────────────
GPU: [Tokens 1-1024]  ← Full batch
     └─ GPU fully utilized!
```

**batchSize parameter**: Controls how many tokens are processed together.

### Trade-offs

```
Small Batch (e.g., 128)     Large Batch (e.g., 2048)
───────────────────────     ────────────────────────
✓ Lower memory              ✓ Better GPU utilization
✓ More flexible             ✓ Faster throughput
✗ Slower throughput         ✗ Higher memory usage
✗ GPU underutilized         ✗ May exceed VRAM
```

**Sweet spot**: Usually 512-1024 for consumer GPUs.
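The sketch below ties these pieces together: one model, one context with two sequences, a tuned `batchSize`, and `Promise.all` to run both prompts concurrently. It assumes a node-llama-cpp-style API (`getLlama`, `loadModel`, `createContext`, `LlamaChatSession`); the model path and prompts are placeholders, so check your library's documentation for the exact option names.

```javascript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "./models/model.gguf"}); // hypothetical path

// One context, two independent sequences sharing the same model weights.
// batchSize controls how many tokens the GPU processes per step.
const context = await model.createContext({sequences: 2, batchSize: 512});

// Each getSequence() call hands out an unused sequence from the context.
const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

// Promise.all starts both prompts immediately;
// total time ≈ whichever response finishes last.
const [answerA, answerB] = await Promise.all([
    sessionA.prompt("Summarize the history of GPUs."),
    sessionB.prompt("Explain KV caches in one paragraph.")
]);

console.log(answerA);
console.log(answerB);
```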
## Architecture Patterns

### Pattern 1: Multi-User Service

```
┌─────────┐  ┌─────────┐  ┌─────────┐
│ User A  │  │ User B  │  │ User C  │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │ Load Balancer  │
         └────────────────┘
                  ↓
     ┌────────────┼────────────┐
     ↓            ↓            ↓
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Seq 1  │  │  Seq 2  │  │  Seq 3  │
└─────────┘  └─────────┘  └─────────┘
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │  Shared Model  │
         └────────────────┘
```

### Pattern 2: Multi-Agent System

```
        ┌──────────────┐
        │     Task     │
        └──────┬───────┘
               │
      ┌────────┼────────┐
      ↓        ↓        ↓
┌────────┐ ┌──────┐ ┌──────────┐
│Planner │ │Critic│ │ Executor │
│ Agent  │ │Agent │ │  Agent   │
└───┬────┘ └──┬───┘ └────┬─────┘
    │         │          │
    └─────────┼──────────┘
              ↓
    (All run in parallel)
```

### Pattern 3: Pipeline Processing

```
Input Queue: [Task1, Task2, Task3, ...]
                  ↓
          ┌───────────────┐
          │  Dispatcher   │
          └───────────────┘
                  ↓
      ┌───────────┼───────────┐
      ↓           ↓           ↓
  Sequence 1  Sequence 2  Sequence 3
      ↓           ↓           ↓
      └───────────┼───────────┘
                  ↓
        Output: [R1, R2, R3]
```
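Pattern 3 can be implemented in plain JavaScript as a fixed pool of sequences, each pulling the next task from a shared queue. This is a minimal sketch: `runTask` (which would prompt the model on a given sequence) and the task/sequence shapes are hypothetical placeholders.

```javascript
// Dispatcher sketch: `sequences` is a pool of context sequences,
// `runTask(sequence, task)` is a hypothetical async function that
// prompts the model on that sequence and resolves with the result.
async function dispatch(tasks, sequences, runTask) {
    const results = new Array(tasks.length);
    let next = 0;

    // Each sequence acts as a worker, pulling tasks until the queue is empty.
    // JavaScript is single-threaded and there is no await between the check
    // and the increment, so `next++` needs no locking.
    await Promise.all(sequences.map(async (sequence) => {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await runTask(sequence, tasks[index]);
        }
    }));

    return results; // Same order as the input queue
}
```

With 3 sequences and 9 independent tasks, the pool keeps 3 tasks in flight at all times, for roughly a 3x throughput improvement over a sequential loop.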
## Resource Management

### Memory Allocation

Each sequence consumes memory:

```
┌──────────────────────────────────┐
│ Total VRAM: 8GB                  │
├──────────────────────────────────┤
│ Model Weights:         4.0 GB    │
│ Context Base:          1.0 GB    │
│ Sequence 1 (KV Cache): 0.8 GB    │
│ Sequence 2 (KV Cache): 0.8 GB    │
│ Sequence 3 (KV Cache): 0.8 GB    │
│ Overhead:              0.6 GB    │
├──────────────────────────────────┤
│ Total Used:            8.0 GB    │
│ Remaining:             0.0 GB    │
└──────────────────────────────────┘
Maximum capacity!
```

**Formula**:

```
Required VRAM = Model + Context + (NumSequences × KVCache)
```

### Finding Optimal Sequence Count

```
Too Few (1-2)        Optimal (4-8)        Too Many (16+)
─────────────        ─────────────        ──────────────
GPU underutilized    Balanced use         Memory overflow
       ↓                   ↓                    ↓
Slow throughput      Best performance     Thrashing/crashes
```

**Test your system**:

1. Start with 2 sequences
2. Monitor VRAM usage
3. Increase until performance plateaus
4. Back off if memory issues occur

## Real-World Scenarios

### Scenario 1: Chatbot Service

```
Challenge: 100 users, each waiting 2s per response

Sequential:        100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s

10x speedup!
```

### Scenario 2: Batch Analysis

```
Task: Analyze 1000 documents

Sequential:       1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes

8x speedup!
```

### Scenario 3: Multi-Agent Collaboration

```
Agents: Planner, Analyzer, Executor (all needed)

Sequential: Wait for each → Slow pipeline
Parallel:   All work together → Fast decision-making
```

## Limitations & Considerations

### 1. Context Capacity Sharing

```
Problem: Sequences share total context space
───────────────────────────────────────────
Total context: 4096 tokens

2 sequences: Each gets ~2048 tokens max
4 sequences: Each gets ~1024 tokens max

More sequences = Less history per sequence!
```

### 2. CPU vs GPU Parallelism

```
With GPU:                    CPU Only:
True parallel processing     Interleaved processing
Multiple CUDA streams        Single-thread context-switching
                             (Still helps throughput!)
```

### 3. Not Always Faster

```
When parallel helps:       When it doesn't:
• Independent requests     • Dependent requests (must wait)
• I/O-bound operations     • Very short prompts (overhead)
• Multiple users           • Single sequential conversation
```

## Best Practices

### 1. Design for Independence

```
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad:  Sequential reasoning steps (use ReAct instead)
```

### 2. Monitor Resources

```
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
```

### 3. Implement Graceful Degradation

```
if (vramExceeded) {
  reduceSequenceCount();
  // or queue requests instead
}
```

### 4. Handle Errors Properly

```javascript
// Promise.all rejects as soon as any request fails, discarding the rest;
// Promise.allSettled lets every sequence finish and reports each outcome.
const results = await Promise.allSettled(requests);
for (const result of results) {
    if (result.status === "fulfilled")
        handleResponse(result.value);   // placeholder handlers
    else
        handleFailure(result.reason);   // one failure doesn't lose the others
}
```

## Comparison: Evolution of Performance

```
Stage               Requests/Min    Pattern
─────────────────   ─────────────   ───────────────
1. Basic (intro)    30              Sequential
2. Batch (this)     120             4 sequences
3. Load balanced    240             8 sequences + queue
4. Distributed      1000+           Multiple machines
```

## Key Takeaways

1. **Parallelism is essential** for production AI agent systems
2. **Sequences share the model** but maintain independent state
3. **Promise.all** enables concurrent JavaScript execution
4. **Batch size** affects GPU utilization and throughput
5. **Memory is the limit** - more sequences need more VRAM
6. **Not magic** - it only helps with independent tasks

## Practical Formula

```
Speedup = min(
    Number_of_Sequences,
    Available_VRAM / Memory_Per_Sequence,
    GPU_Compute_Limit
)
```

Typically: 2-10x speedup for well-designed systems.

This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.
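To close, here is the formula above made concrete. The numbers are hypothetical (an imagined 8GB consumer GPU), not measurements from this example; substitute your own.

```javascript
// Hypothetical figures - substitute measurements from your own system.
const numSequences = 8;          // sequences you plan to create
const availableVram = 3.0;       // GB left after model weights + context base
const memoryPerSequence = 0.8;   // GB of KV cache per sequence
const gpuComputeLimit = 6;       // parallel streams before the GPU saturates

const speedup = Math.min(
    numSequences,
    Math.floor(availableVram / memoryPerSequence),
    gpuComputeLimit
);

console.log(`Expected speedup: ~${speedup}x`); // ~3x: here, VRAM is the bottleneck
```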