# Concept: Parallel Processing & Performance Optimization

## Overview

This example demonstrates **concurrent execution** of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.
## The Performance Problem

### Sequential Processing (Slow)

The traditional approach processes one request at a time:

```
Request 1 ────────▶ Response 1 (2s)
                        │
Request 2 ────────▶ Response 2 (2s)

Total: 4 seconds
```

### Parallel Processing (Fast)

This example processes multiple requests simultaneously:

```
Request 1 ────────▶ Response 1 (2s) ──┐
                                      ├─ Total: 2 seconds
Request 2 ────────▶ Response 2 (2s) ──┘

(Both running at the same time)
```

**Performance gain: 2x speedup!**
## Core Concept: Context Sequences

### Single vs. Multiple Sequences

```
┌─────────────────────────────────────────┐
│           Model (Loaded Once)           │
├─────────────────────────────────────────┤
│                 Context                 │
│  ┌──────────────┐    ┌──────────────┐   │
│  │  Sequence 1  │    │  Sequence 2  │   │
│  │              │    │              │   │
│  │ Conversation │    │ Conversation │   │
│  │  History A   │    │  History B   │   │
│  └──────────────┘    └──────────────┘   │
└─────────────────────────────────────────┘
```

**Key insights:**

- Model weights are shared (memory efficient)
- Each sequence has an independent history
- Sequences can process in parallel
- Both use the same underlying model
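The shared-model, independent-state idea can be sketched with plain objects (illustrative only; the names here are hypothetical stand-ins, not a specific inference library's API):

```javascript
// One "model" object loaded once; each sequence keeps its own history
// but points at the same underlying model.
const model = { name: "shared-weights" }; // loaded once, shared by all sequences

function createSequence(sharedModel) {
  return { model: sharedModel, history: [] }; // independent conversation history
}

const seqA = createSequence(model);
const seqB = createSequence(model);

seqA.history.push({ role: "user", text: "Hello from A" });
seqB.history.push({ role: "user", text: "Hello from B" });

console.log(seqA.model === seqB.model);                // true: same model object
console.log(seqA.history.length, seqB.history.length); // 1 1: separate histories
```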
## How Parallel Processing Works

### Promise.all Pattern

JavaScript's `Promise.all()` enables concurrent execution:

```javascript
// Sequential:
await fn1(); // Wait 2s
await fn2(); // Wait 2s more
// Total: 4s

// Parallel:
await Promise.all([
  fn1(), // Start immediately
  fn2()  // Start immediately (don't wait!)
]);
// Total: 2s (whichever finishes last)
```
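A runnable miniature of the pattern, with `setTimeout` standing in for model inference:

```javascript
// Two 100 ms "requests" started together finish in ~100 ms total, not ~200 ms.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

async function demo() {
  const start = Date.now();
  const [r1, r2] = await Promise.all([
    delay(100, "Response 1"), // starts immediately
    delay(100, "Response 2"), // starts immediately, runs concurrently
  ]);
  const elapsed = Date.now() - start;
  console.log(r1, r2, `took ~${elapsed} ms`); // elapsed is near 100, not 200
  return elapsed;
}

const demoDone = demo();
```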
### Execution Timeline

```
Time ▶   0s        1s        2s        3s        4s
         │         │         │         │         │
Seq 1:   ├───────Processing──────────┤
         │                           └─▶ Response 1
         │
Seq 2:   ├───────Processing──────────┤
                                     └─▶ Response 2

Both complete at ~2s instead of 4s!
```
## GPU Batch Processing

### Why Batching Matters

Modern GPUs process many operations at once, so feeding them one token at a time leaves most of the hardware idle:

```
Without Batching (Inefficient)
──────────────────────────────
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     └─▶ GPU underutilized

With Batching (Efficient)
─────────────────────────
GPU: [Tokens 1-1024]  ◀─ Full batch
     └─▶ GPU fully utilized!
```

**batchSize parameter**: Controls how many tokens are processed together.

### Trade-offs

```
Small Batch (e.g., 128)      Large Batch (e.g., 2048)
───────────────────────      ────────────────────────
✓ Lower memory               ✓ Better GPU utilization
✓ More flexible              ✓ Faster throughput
✗ Slower throughput          ✗ Higher memory usage
✗ GPU underutilized          ✗ May exceed VRAM
```

**Sweet spot**: Usually 512-1024 for consumer GPUs.
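As rough arithmetic (illustrative only; real pass counts and timings depend on the runtime and hardware), `batchSize` determines how many GPU passes it takes to process a prompt:

```javascript
// A 3000-token prompt split into prefill passes of batchSize tokens each.
function prefillPasses(promptTokens, batchSize) {
  return Math.ceil(promptTokens / batchSize);
}

console.log(prefillPasses(3000, 128));  // 24 small passes: many launches, underutilized GPU
console.log(prefillPasses(3000, 1024)); // 3 full passes: fewer launches, fuller batches
```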
## Architecture Patterns

### Pattern 1: Multi-User Service

```
┌─────────┐   ┌─────────┐   ┌─────────┐
│ User A  │   │ User B  │   │ User C  │
└────┬────┘   └────┬────┘   └────┬────┘
     │             │             │
     └─────────────┼─────────────┘
                   ▼
          ┌────────────────┐
          │ Load Balancer  │
          └───────┬────────┘
                  │
     ┌────────────┼────────────┐
     ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Seq 1  │  │  Seq 2  │  │  Seq 3  │
└────┬────┘  └────┬────┘  └────┬────┘
     └────────────┼────────────┘
                  ▼
         ┌────────────────┐
         │  Shared Model  │
         └────────────────┘
```
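A minimal sketch of the load-balancer box, assuming a fixed pool of sequences and round-robin assignment (`handleRequest` is a hypothetical stand-in for real inference):

```javascript
// Round-robin dispatcher: each incoming request goes to the next sequence slot.
class SequencePool {
  constructor(size, handleRequest) {
    this.size = size;
    this.handleRequest = handleRequest;
    this.next = 0;
  }
  dispatch(request) {
    const seqId = this.next;
    this.next = (this.next + 1) % this.size; // wrap around the pool
    return this.handleRequest(seqId, request);
  }
}

const pool = new SequencePool(3, async (seqId, req) => `seq ${seqId} handled ${req}`);

const resultsP = Promise.all(
  ["A", "B", "C", "D"].map((user) => pool.dispatch(user))
);
resultsP.then((results) => console.log(results));
// ["seq 0 handled A", "seq 1 handled B", "seq 2 handled C", "seq 0 handled D"]
```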
### Pattern 2: Multi-Agent System

```
          ┌──────────────┐
          │     Task     │
          └──────┬───────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│ Planner │ │ Critic  │ │ Executor │
│  Agent  │ │  Agent  │ │  Agent   │
└────┬────┘ └────┬────┘ └────┬─────┘
     │           │           │
     └───────────┼───────────┘
                 ▼
      (All run in parallel)
```
### Pattern 3: Pipeline Processing

```
Input Queue: [Task1, Task2, Task3, ...]
                 │
                 ▼
         ┌───────────────┐
         │  Dispatcher   │
         └───────┬───────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
Sequence 1  Sequence 2  Sequence 3
     │           │           │
     └───────────┼───────────┘
                 ▼
        Output: [R1, R2, R3]
```
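The dispatcher can be sketched as a fixed number of workers draining a shared queue (`process` is a hypothetical stand-in for inference on one sequence):

```javascript
// `width` concurrent workers pull tasks until the queue is empty.
async function runPipeline(tasks, width, process) {
  const queue = [...tasks];
  const results = [];
  async function worker(seqId) {
    while (queue.length > 0) {
      const task = queue.shift(); // take the next task for this sequence
      results.push(await process(seqId, task));
    }
  }
  await Promise.all(Array.from({ length: width }, (_, i) => worker(i)));
  return results; // completion order, which may differ from input order
}

const pipelineP = runPipeline(["Task1", "Task2", "Task3"], 3, async (seq, t) => `R-${t}`);
pipelineP.then((out) => console.log(out)); // three results: R-Task1, R-Task2, R-Task3
```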
## Resource Management

### Memory Allocation

Each sequence consumes memory:

```
┌──────────────────────────────────┐
│ Total VRAM: 8 GB                 │
├──────────────────────────────────┤
│ Model Weights:          4.0 GB   │
│ Context Base:           1.0 GB   │
│ Sequence 1 (KV Cache):  0.8 GB   │
│ Sequence 2 (KV Cache):  0.8 GB   │
│ Sequence 3 (KV Cache):  0.8 GB   │
│ Overhead:               0.6 GB   │
├──────────────────────────────────┤
│ Total Used:             8.0 GB   │
│ Remaining:              0.0 GB   │
└──────────────────────────────────┘

Maximum capacity!
```

**Formula**:

```
Required VRAM = Model + Context + (NumSequences × KVCache) + Overhead
```
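The formula as a helper, using the example numbers above (the GB figures are illustrative; actual KV-cache size depends on the model and context length):

```javascript
// How many sequences fit in a VRAM budget after the fixed costs are paid.
function maxSequences(vramGB, modelGB, contextBaseGB, kvCachePerSeqGB, overheadGB) {
  const free = vramGB - modelGB - contextBaseGB - overheadGB;
  // The small epsilon guards against floating-point rounding (e.g. 2.4 / 0.8).
  return Math.max(0, Math.floor(free / kvCachePerSeqGB + 1e-9));
}

console.log(maxSequences(8, 4.0, 1.0, 0.8, 0.6)); // 3, matching the diagram above
```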
### Finding Optimal Sequence Count

```
Too Few (1-2)        Optimal (4-8)        Too Many (16+)
─────────────        ─────────────        ──────────────
GPU underutilized    Balanced use         Memory overflow
      │                    │                    │
      ▼                    ▼                    ▼
Slow throughput      Best performance     Thrashing/crashes
```

**Test your system**:

1. Start with 2 sequences
2. Monitor VRAM usage
3. Increase until performance plateaus
4. Back off if memory issues occur
## Real-World Scenarios

### Scenario 1: Chatbot Service

```
Challenge: 100 users, each waiting 2s per response

Sequential:        100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s

10x speedup!
```

### Scenario 2: Batch Analysis

```
Task: Analyze 1000 documents

Sequential:       1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes

8x speedup!
```

### Scenario 3: Multi-Agent Collaboration

```
Agents: Planner, Analyzer, Executor (all needed)

Sequential: Wait for each → Slow pipeline
Parallel:   All work together → Fast decision-making
```
## Limitations & Considerations

### 1. Context Capacity Sharing

```
Problem: Sequences share total context space
───────────────────────────────────────────
Total context: 4096 tokens

2 sequences: Each gets ~2048 tokens max
4 sequences: Each gets ~1024 tokens max

More sequences = Less history per sequence!
```
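The capacity split above in code (assuming an even split; some runtimes allocate the shared window more dynamically):

```javascript
// Each sequence's share of a context window divided evenly across sequences.
const perSequenceTokens = (contextSize, numSequences) =>
  Math.floor(contextSize / numSequences);

console.log(perSequenceTokens(4096, 2)); // 2048
console.log(perSequenceTokens(4096, 4)); // 1024
```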
### 2. CPU vs. GPU Parallelism

```
With GPU:                    CPU Only:
True parallel processing     Interleaved processing
Multiple CUDA streams        Single thread context-switching
                             (Still helps throughput!)
```

### 3. Not Always Faster

```
When parallel helps:        When it doesn't:
• Independent requests      • Dependent requests (must wait)
• I/O-bound operations      • Very short prompts (overhead)
• Multiple users            • Single sequential conversation
```
## Best Practices

### 1. Design for Independence

```
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad:  Sequential reasoning steps (use ReAct instead)
```
### 2. Monitor Resources

```
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
```
### 3. Implement Graceful Degradation

```javascript
if (vramExceeded) {
  reduceSequenceCount();
  // ...or queue requests instead of rejecting them
}
```
### 4. Handle Errors Properly

`Promise.all()` rejects as soon as any request fails, discarding the others. Use `Promise.allSettled()` when one failure shouldn't lose the remaining sequences' results:

```javascript
// requests: an array of in-flight inference promises
const outcomes = await Promise.allSettled(requests);
const results = outcomes
  .filter((o) => o.status === "fulfilled")
  .map((o) => o.value);
const errors = outcomes.filter((o) => o.status === "rejected");
// One failure no longer discards the other sequences' results.
```
## Comparison: Evolution of Performance

```
Stage               Requests/Min   Pattern
─────────────────   ────────────   ───────────────────
1. Basic (intro)    30             Sequential
2. Batch (this)     120            4 sequences
3. Load balanced    240            8 sequences + queue
4. Distributed      1000+          Multiple machines
```
## Key Takeaways

1. **Parallelism is essential** for production AI agent systems
2. **Sequences share the model** but maintain independent state
3. **Promise.all** enables concurrent JavaScript execution
4. **Batch size** affects GPU utilization and throughput
5. **Memory is the limit**: more sequences need more VRAM
6. **Not magic**: only helps with independent tasks

## Practical Formula

```
Speedup = min(
  Number_of_Sequences,
  Available_VRAM / Memory_Per_Sequence,
  GPU_Compute_Limit
)
```
Typically: 2-10x speedup for well-designed systems.

This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.