Concept: Parallel Processing & Performance Optimization
Overview
This example demonstrates concurrent execution of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.
The Performance Problem
Sequential Processing (Slow)
Traditional approach processes one request at a time:
Request 1 --------> Response 1 (2s)
        |
        v
Request 2 --------> Response 2 (2s)

Total: 4 seconds
Parallel Processing (Fast)
This example processes multiple requests simultaneously:
Request 1 --------> Response 1 (2s) --+
                                      +--> Total: 2 seconds
Request 2 --------> Response 2 (2s) --+
(Both running at the same time)
Performance gain: 2x speedup!
Core Concept: Context Sequences
Single vs. Multiple Sequences
+------------------------------------------------+
|               Model (Loaded Once)              |
+------------------------------------------------+
|                    Context                     |
|   +--------------+        +--------------+     |
|   |  Sequence 1  |        |  Sequence 2  |     |
|   |              |        |              |     |
|   | Conversation |        | Conversation |     |
|   |  History A   |        |  History B   |     |
|   +--------------+        +--------------+     |
+------------------------------------------------+
Key insights:
- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model
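The insights above can be sketched as a small simulation: one shared "model" object referenced by every sequence, while each sequence keeps its own history. The `SharedModel` and `Sequence` names are illustrative stand-ins, not a real library API.

```javascript
// Illustrative sketch: shared weights, independent per-sequence state.
class SharedModel {
  constructor(name) {
    this.name = name; // stands in for the (large) weights, loaded once
  }
}

class Sequence {
  constructor(model) {
    this.model = model; // a reference, not a copy: weights are shared
    this.history = [];  // independent conversation history (KV-cache analog)
  }
  prompt(text) {
    this.history.push({ role: "user", content: text });
    const reply = `[${this.model.name}] echo: ${text}`; // fake generation
    this.history.push({ role: "assistant", content: reply });
    return reply;
  }
}

const model = new SharedModel("llama");
const seqA = new Sequence(model);
const seqB = new Sequence(model);

seqA.prompt("Hello from A");
seqB.prompt("Hello from B");

console.log(seqA.model === seqB.model);                 // true: same weights object
console.log(seqA.history.length, seqB.history.length);  // 2 2: separate histories
```

Because both sequences hold a reference to the same `model`, memory for the weights is paid once; only the per-sequence history grows.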
How Parallel Processing Works
Promise.all Pattern
JavaScript's Promise.all() enables concurrent execution:
Sequential:
------------------------------------
await fn1(); // Wait 2s
await fn2(); // Wait 2s more
// Total: 4s

Parallel:
------------------------------------
await Promise.all([
  fn1(), // Start immediately
  fn2()  // Start immediately (don't wait!)
]);
// Total: 2s (whichever finishes last)
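Here is a runnable version of the comparison, with `setTimeout` standing in for model inference (`askModel` is a simulated stand-in, not a real inference call):

```javascript
// Simulated "inference": resolves after `ms` milliseconds.
const askModel = (prompt, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`answer to ${prompt}`), ms));

async function main() {
  // Sequential: second call starts only after the first finishes (~400ms total).
  let start = Date.now();
  await askModel("q1", 200);
  await askModel("q2", 200);
  console.log(`sequential: ~${Date.now() - start}ms`);

  // Parallel: both timers run at once (~200ms total, whichever finishes last).
  start = Date.now();
  const [a1, a2] = await Promise.all([
    askModel("q1", 200),
    askModel("q2", 200),
  ]);
  console.log(`parallel: ~${Date.now() - start}ms`, a1, a2);
}

main();
```

The key detail is that both promises are created (and thus started) before either is awaited; `Promise.all` merely waits for all of them.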
Execution Timeline
Time ->   0s        1s        2s        3s        4s
          |         |         |         |         |
Seq 1:    [=======Processing=======]
                                   \-- Response 1
Seq 2:    [=======Processing=======]
                                   \-- Response 2
Both complete at ~2s instead of 4s!
GPU Batch Processing
Why Batching Matters
Modern GPUs process multiple operations efficiently:
Without Batching (Inefficient)
------------------------------
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     -> GPU underutilized

With Batching (Efficient)
-------------------------
GPU: [Tokens 1-1024]  <- full batch
     -> GPU fully utilized!
batchSize parameter: Controls how many tokens process together.
Trade-offs
Small Batch (e.g., 128)       Large Batch (e.g., 2048)
-----------------------       ------------------------
✓ Lower memory                ✓ Better GPU utilization
✓ More flexible               ✓ Faster throughput
✗ Slower throughput           ✗ Higher memory usage
✗ GPU underutilized           ✗ May exceed VRAM
Sweet spot: Usually 512-1024 for consumer GPUs.
Architecture Patterns
Pattern 1: Multi-User Service
+----------+   +----------+   +----------+
|  User A  |   |  User B  |   |  User C  |
+----+-----+   +----+-----+   +----+-----+
     |              |              |
     +--------------+--------------+
                    |
           +---------------+
           | Load Balancer |
           +---------------+
                    |
     +--------------+--------------+
     |              |              |
+----+-----+   +----+-----+   +----+-----+
|  Seq 1   |   |  Seq 2   |   |  Seq 3   |
+----------+   +----------+   +----------+
     |              |              |
     +--------------+--------------+
                    |
           +---------------+
           | Shared Model  |
           +---------------+
Pattern 2: Multi-Agent System
          +----------+
          |   Task   |
          +----+-----+
               |
     +---------+----------+
     |         |          |
+---------+ +--------+ +----------+
| Planner | | Critic | | Executor |
|  Agent  | | Agent  | |  Agent   |
+----+----+ +---+----+ +----+-----+
     |          |           |
     +----------+-----------+
                |
      (All run in parallel)
Pattern 3: Pipeline Processing
Input Queue: [Task1, Task2, Task3, ...]
        |
        v
  +------------+
  | Dispatcher |
  +------------+
        |
  +-----+------+-----------+
  |            |           |
Sequence 1  Sequence 2  Sequence 3
  |            |           |
  +-----+------+-----------+
        |
        v
Output: [R1, R2, R3]
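The dispatcher pattern above can be sketched as a small worker pool: N "sequences" pull tasks from a shared queue until it is drained. All names here (`processQueue`, `analyze`) are illustrative.

```javascript
// Worker-pool sketch: `numSequences` workers drain a shared task queue.
// Because JavaScript is single-threaded, queue.shift() between awaits is safe.
async function processQueue(tasks, numSequences, handler) {
  const queue = [...tasks];
  const results = [];
  async function worker() {
    while (queue.length > 0) {
      const task = queue.shift();        // dispatcher: take the next pending task
      results.push(await handler(task)); // one in-flight task per sequence
    }
  }
  // Start all workers at once; Promise.all resolves when the queue is empty.
  await Promise.all(Array.from({ length: numSequences }, worker));
  return results;
}

// Usage: three simulated sequences analyzing six documents.
const analyze = (doc) =>
  new Promise((res) => setTimeout(() => res(`summary of ${doc}`), 50));

processQueue(["d1", "d2", "d3", "d4", "d5", "d6"], 3, analyze)
  .then((results) => console.log(results.length, "results"));
```

Note that results arrive in completion order, not input order; if ordering matters, tag each task with its index.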
Resource Management
Memory Allocation
Each sequence consumes memory:
+-----------------------------------+
|          Total VRAM: 8 GB         |
+-----------------------------------+
|  Model Weights:           4.0 GB  |
|  Context Base:            1.0 GB  |
|  Sequence 1 (KV Cache):   0.8 GB  |
|  Sequence 2 (KV Cache):   0.8 GB  |
|  Sequence 3 (KV Cache):   0.8 GB  |
|  Overhead:                0.6 GB  |
+-----------------------------------+
|  Total Used:              8.0 GB  |
|  Remaining:               0.0 GB  |
+-----------------------------------+
At maximum capacity!
Formula:
Required VRAM = Model + Context + (NumSequences × KVCache)
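The formula in code, using the illustrative numbers from the table above (values in MB to keep the arithmetic exact):

```javascript
// Required VRAM = Model + Context + (NumSequences × KVCache) + Overhead, in MB.
function requiredVramMb(modelMb, contextMb, numSequences, kvCacheMb, overheadMb = 0) {
  return modelMb + contextMb + numSequences * kvCacheMb + overheadMb;
}

// Inverse question: how many sequences fit in a given VRAM budget?
function maxSequences(totalMb, modelMb, contextMb, kvCacheMb, overheadMb = 0) {
  return Math.floor((totalMb - modelMb - contextMb - overheadMb) / kvCacheMb);
}

console.log(requiredVramMb(4000, 1000, 3, 800, 600)); // 8000 -- the 8 GB example
console.log(maxSequences(8000, 4000, 1000, 800, 600)); // 3
```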
Finding Optimal Sequence Count
Too Few (1-2)        Optimal (4-8)        Too Many (16+)
-------------        -------------        --------------
GPU underutilized    Balanced use         Memory overflow
Slow throughput      Best performance     Thrashing/crashes
Test your system:
- Start with 2 sequences
- Monitor VRAM usage
- Increase until performance plateaus
- Back off if memory issues occur
Real-World Scenarios
Scenario 1: Chatbot Service
Challenge: 100 users, each waiting 2s per response
Sequential: 100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s
10x speedup!
Scenario 2: Batch Analysis
Task: Analyze 1000 documents
Sequential: 1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes
8x speedup!
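The batch arithmetic in both scenarios follows one rule: with N sequences, T independent tasks run in ceil(T / N) waves. A small helper makes this concrete (names are illustrative):

```javascript
// Wall-clock estimate for `tasks` independent requests on `sequences`
// parallel slots, assuming each request takes `secondsPerTask` and
// waves run back to back.
function parallelSeconds(tasks, sequences, secondsPerTask) {
  return Math.ceil(tasks / sequences) * secondsPerTask;
}

console.log(parallelSeconds(100, 1, 2));  // 200 -- sequential chatbot
console.log(parallelSeconds(100, 10, 2)); // 20  -- 10 sequences
console.log(parallelSeconds(1000, 8, 3)); // 375 seconds = 6.25 minutes
```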
Scenario 3: Multi-Agent Collaboration
Agents: Planner, Analyzer, Executor (all needed)
Sequential: Wait for each -> Slow pipeline
Parallel: All work together -> Fast decision-making
Limitations & Considerations
1. Context Capacity Sharing
Problem: Sequences share total context space.

Total context: 4096 tokens
2 sequences: each gets ~2048 tokens max
4 sequences: each gets ~1024 tokens max

More sequences = less history per sequence!
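The trade-off fits in a one-liner (a sketch of the even-split case; real libraries may also reserve some tokens for overhead):

```javascript
// Tokens of history each sequence can hold when the context is split evenly.
const tokensPerSequence = (contextSize, numSequences) =>
  Math.floor(contextSize / numSequences);

console.log(tokensPerSequence(4096, 2)); // 2048
console.log(tokensPerSequence(4096, 4)); // 1024
```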
2. CPU vs GPU Parallelism
With GPU:                     CPU Only:
True parallel processing      Interleaved processing
Multiple CUDA streams         Single-thread context switching
                              (Still helps throughput!)
3. Not Always Faster
When parallel helps:          When it doesn't:
• Independent requests        • Dependent requests (must wait)
• I/O-bound operations        • Very short prompts (overhead)
• Multiple users              • Single sequential conversation
Best Practices
1. Design for Independence
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad: Sequential reasoning steps (use ReAct instead)
2. Monitor Resources
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
3. Implement Graceful Degradation
// Pseudocode: vramExceeded and reduceSequenceCount() are placeholders
// for your own monitoring and pool-management logic.
if (vramExceeded) {
  reduceSequenceCount();
  // ...or queue incoming requests instead
}
4. Handle Errors Properly
// Promise.all rejects as soon as any input rejects, discarding the other
// results; use Promise.allSettled to keep what the other sequences produced.
const results = await Promise.allSettled([...]);
for (const result of results) {
  if (result.status === "rejected") {
    handlePartialResults(result.reason); // one failure doesn't discard the rest
  }
}
Comparison: Evolution of Performance
Stage Requests/Min Pattern
-----------------     ------------    -------------------
1. Basic (intro) 30 Sequential
2. Batch (this) 120 4 sequences
3. Load balanced 240 8 sequences + queue
4. Distributed 1000+ Multiple machines
Key Takeaways
- Parallelism is essential for production AI agent systems
- Sequences share model but maintain independent state
- Promise.all enables concurrent JavaScript execution
- Batch size affects GPU utilization and throughput
- Memory is the limit - more sequences need more VRAM
- Not magic - only helps with independent tasks
Practical Formula
Speedup = min(
Number_of_Sequences,
Available_VRAM / Memory_Per_Sequence,
GPU_Compute_Limit
)
Typically: 2-10x speedup for well-designed systems.
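The formula as code, with illustrative numbers (the VRAM budget and per-sequence cost are assumptions, not measurements):

```javascript
// Effective speedup is capped by the tightest of three limits:
// requested sequence count, how many KV caches fit in VRAM, and GPU compute.
function speedup(numSequences, availableVramMb, memPerSequenceMb, gpuComputeLimit) {
  return Math.min(
    numSequences,
    Math.floor(availableVramMb / memPerSequenceMb),
    gpuComputeLimit
  );
}

// 8 sequences requested, but only 4 concurrent KV caches fit in VRAM:
console.log(speedup(8, 3200, 800, 16)); // 4 -- memory is the bottleneck
```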
This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.