# Concept: Parallel Processing & Performance Optimization
## Overview
This example demonstrates **concurrent execution** of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.
## The Performance Problem
### Sequential Processing (Slow)
Traditional approach processes one request at a time:
```
Request 1 ────────→ Response 1 (2s)
                        ↓
Request 2 ────────→ Response 2 (2s)

Total: 4 seconds
```
### Parallel Processing (Fast)
This example processes multiple requests simultaneously:
```
Request 1 ────────→ Response 1 (2s) ──┐
                                      ├→ Total: 2 seconds
Request 2 ────────→ Response 2 (2s) ──┘

(Both running at the same time)
```
**Performance gain: 2x speedup!**
## Core Concept: Context Sequences
### Single vs. Multiple Sequences
```
┌─────────────────────────────────────────┐
│           Model (Loaded Once)           │
├─────────────────────────────────────────┤
│ Context                                 │
│  ┌──────────────┐  ┌──────────────┐     │
│  │ Sequence 1   │  │ Sequence 2   │     │
│  │              │  │              │     │
│  │ Conversation │  │ Conversation │     │
│  │ History A    │  │ History B    │     │
│  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────┘
```
**Key insights:**
- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model
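These insights can be sketched in plain JavaScript with a stub model. The names here (`Sequence`, `generate`) are illustrative, not a real library API: the point is that both sequences hold a reference to the *same* model object while keeping separate histories.

```javascript
// Illustrative sketch: one shared model object, independent per-sequence state.
class Sequence {
  constructor(model, id) {
    this.model = model; // shared reference to the same model (no copy)
    this.id = id;
    this.history = [];  // independent conversation history
  }
  async prompt(text) {
    this.history.push({ role: "user", text });
    const reply = await this.model.generate(this.history);
    this.history.push({ role: "assistant", text: reply });
    return reply;
  }
}

// Stub standing in for real inference against shared weights.
const model = {
  async generate(history) {
    return `echo: ${history[history.length - 1].text}`;
  },
};

const seqA = new Sequence(model, 1);
const seqB = new Sequence(model, 2);

// Both sequences share `model`, but their histories never mix.
Promise.all([seqA.prompt("hello from A"), seqB.prompt("hello from B")])
  .then(() => {
    console.log(seqA.history.length); // 2 entries: A's user + assistant turns
    console.log(seqB.history.length); // 2 entries, fully independent of A
  });
```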
## How Parallel Processing Works
### Promise.all Pattern
JavaScript's `Promise.all()` enables concurrent execution:
```
Sequential:
────────────────────────────────────
await fn1();   // Wait 2s
await fn2();   // Wait 2s more
Total: 4s

Parallel:
────────────────────────────────────
await Promise.all([
  fn1(),  // Start immediately
  fn2()   // Start immediately (don't wait!)
]);
Total: 2s (whichever finishes last)
```
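The same pattern in runnable form, with `setTimeout` standing in for a slow LLM request (delays shortened to 200ms for illustration):

```javascript
// Simulate a request that takes `ms` milliseconds to produce a response.
const fakeRequest = (id, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`response ${id}`), ms));

async function runSequential() {
  const start = Date.now();
  await fakeRequest(1, 200); // waits 200ms
  await fakeRequest(2, 200); // waits another 200ms
  return Date.now() - start; // ~400ms
}

async function runParallel() {
  const start = Date.now();
  // Both timers start immediately; we only wait for the slower one.
  await Promise.all([fakeRequest(1, 200), fakeRequest(2, 200)]);
  return Date.now() - start; // ~200ms
}

(async () => {
  console.log(`sequential: ${await runSequential()}ms`);
  console.log(`parallel:   ${await runParallel()}ms`);
})();
```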
### Execution Timeline
```
Time →   0s      1s      2s      3s      4s
         │       │       │       │       │
Seq 1:   ├──Processing───┤
                         └─ Response 1

Seq 2:   ├──Processing───┤
                         └─ Response 2

Both complete at ~2s instead of 4s!
```
## GPU Batch Processing
### Why Batching Matters
Modern GPUs process multiple operations efficiently:
```
Without Batching (Inefficient)
──────────────────────────────
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     └─ GPU underutilized

With Batching (Efficient)
─────────────────────────
GPU: [Tokens 1-1024]  ← Full batch
     └─ GPU fully utilized!
```
**batchSize parameter**: Controls how many tokens process together.
### Trade-offs
```
Small Batch (e.g., 128)     Large Batch (e.g., 2048)
───────────────────────     ────────────────────────
✓ Lower memory              ✓ Better GPU utilization
✓ More flexible             ✓ Faster throughput
✗ Slower throughput         ✗ Higher memory usage
✗ GPU underutilized         ✗ May exceed VRAM
```
**Sweet spot**: Usually 512-1024 for consumer GPUs.
## Architecture Patterns
### Pattern 1: Multi-User Service
```
┌─────────┐  ┌─────────┐  ┌─────────┐
│ User A  │  │ User B  │  │ User C  │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │  Load Balancer │
         └────────────────┘
                  ↓
     ┌────────────┼────────────┐
     ↓            ↓            ↓
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Seq 1  │  │  Seq 2  │  │  Seq 3  │
└────┬────┘  └────┬────┘  └────┬────┘
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │  Shared Model  │
         └────────────────┘
```
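The dispatch layer above can be sketched as a round-robin pool over stub sequences. All names here (`SequencePool`, `process`) are illustrative placeholders, not a real API:

```javascript
// Round-robin load balancer over a fixed set of sequences.
class SequencePool {
  constructor(sequences) {
    this.sequences = sequences;
    this.next = 0;
  }
  acquire() {
    const seq = this.sequences[this.next];
    this.next = (this.next + 1) % this.sequences.length;
    return seq;
  }
  handle(request) {
    return this.acquire().process(request); // returns a Promise
  }
}

// Stubs standing in for real context sequences on a shared model.
const makeSequence = (id) => ({
  async process(request) {
    return `seq ${id} handled request from ${request}`;
  },
});

const pool = new SequencePool([makeSequence(1), makeSequence(2), makeSequence(3)]);

(async () => {
  // Users A, B, C are fanned out across sequences 1-3 concurrently.
  const results = await Promise.all(
    ["User A", "User B", "User C"].map((u) => pool.handle(u))
  );
  console.log(results);
})();
```

Round-robin is the simplest policy; a production dispatcher would also track which sequences are busy and queue overflow requests.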
### Pattern 2: Multi-Agent System
```
        ┌──────────────┐
        │     Task     │
        └──────┬───────┘
               │
    ┌──────────┼───────────┐
    ↓          ↓           ↓
┌────────┐  ┌──────┐  ┌──────────┐
│Planner │  │Critic│  │ Executor │
│ Agent  │  │Agent │  │  Agent   │
└───┬────┘  └──┬───┘  └────┬─────┘
    │          │           │
    └──────────┼───────────┘
               ↓
     (All run in parallel)
```
### Pattern 3: Pipeline Processing
```
Input Queue: [Task1, Task2, Task3, ...]
                  ↓
          ┌───────────────┐
          │  Dispatcher   │
          └───────────────┘
                  ↓
      ┌───────────┼───────────┐
      ↓           ↓           ↓
 Sequence 1  Sequence 2  Sequence 3
      ↓           ↓           ↓
      └───────────┼───────────┘
                  ↓
        Output: [R1, R2, R3]
```
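The dispatcher can be sketched as N concurrent workers draining one shared queue. `runPipeline` and `processTask` are hypothetical names standing in for real sequence evaluation:

```javascript
// Run `tasks` through `numWorkers` concurrent workers sharing one queue.
async function runPipeline(tasks, numWorkers, processTask) {
  const queue = [...tasks];
  const results = [];
  const worker = async () => {
    // JavaScript is single-threaded, so shift() here is race-free.
    while (queue.length > 0) {
      const task = queue.shift();
      results.push(await processTask(task));
    }
  };
  // Promise.all resolves once every worker has drained the queue.
  await Promise.all(Array.from({ length: numWorkers }, worker));
  return results;
}

(async () => {
  const process = async (task) => `R(${task})`;
  const out = await runPipeline(["Task1", "Task2", "Task3"], 3, process);
  console.log(out);
})();
```

Note that completion order depends on how long each task takes, so downstream code should not assume results arrive in input order.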
## Resource Management
### Memory Allocation
Each sequence consumes memory:
```
┌────────────────────────────────────┐
│ Total VRAM: 8 GB                   │
├────────────────────────────────────┤
│ Model Weights:              4.0 GB │
│ Context Base:               1.0 GB │
│ Sequence 1 (KV Cache):      0.8 GB │
│ Sequence 2 (KV Cache):      0.8 GB │
│ Sequence 3 (KV Cache):      0.8 GB │
│ Overhead:                   0.6 GB │
├────────────────────────────────────┤
│ Total Used:                 8.0 GB │
│ Remaining:                  0.0 GB │
└────────────────────────────────────┘

Maximum capacity!
```
**Formula**:
```
Required VRAM = Model + Context + (NumSequences × KVCache)
```
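As a quick calculator, the formula looks like this (values in MB to avoid floating-point surprises; the numbers roughly match the 8 GB breakdown above and are illustrative, not measured):

```javascript
// Estimate how many sequences fit, per the VRAM formula above (values in MB).
function maxSequences({ totalVramMB, modelMB, contextMB, overheadMB, kvCachePerSeqMB }) {
  const freeMB = totalVramMB - modelMB - contextMB - overheadMB;
  return Math.max(0, Math.floor(freeMB / kvCachePerSeqMB));
}

// Roughly the 8 GB example: 8000 - 4000 - 1000 - 600 = 2400 MB free.
console.log(maxSequences({
  totalVramMB: 8000,
  modelMB: 4000,
  contextMB: 1000,
  overheadMB: 600,
  kvCachePerSeqMB: 800,
})); // → 3 sequences
```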
### Finding Optimal Sequence Count
```
Too Few (1-2)        Optimal (4-8)       Too Many (16+)
─────────────        ─────────────       ──────────────
GPU underutilized    Balanced use        Memory overflow
        ↓                  ↓                    ↓
Slow throughput      Best performance    Thrashing/crashes
```
**Test your system**:
1. Start with 2 sequences
2. Monitor VRAM usage
3. Increase until performance plateaus
4. Back off if memory issues occur
## Real-World Scenarios
### Scenario 1: Chatbot Service
```
Challenge: 100 users, each waiting 2s per response
Sequential:        100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s
10x speedup!
```
### Scenario 2: Batch Analysis
```
Task: Analyze 1000 documents
Sequential:       1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes
8x speedup!
```
### Scenario 3: Multi-Agent Collaboration
```
Agents: Planner, Analyzer, Executor (all needed)
Sequential: Wait for each β†’ Slow pipeline
Parallel: All work together β†’ Fast decision-making
```
## Limitations & Considerations
### 1. Context Capacity Sharing
```
Problem: Sequences share total context space
───────────────────────────────────────────
Total context: 4096 tokens
2 sequences: Each gets ~2048 tokens max
4 sequences: Each gets ~1024 tokens max
More sequences = Less history per sequence!
```
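The per-sequence budget is just an integer division, assuming the context is split evenly across sequences:

```javascript
// Tokens available to each sequence when the context splits evenly.
const tokensPerSequence = (contextSize, numSequences) =>
  Math.floor(contextSize / numSequences);

console.log(tokensPerSequence(4096, 2)); // → 2048
console.log(tokensPerSequence(4096, 4)); // → 1024
```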
### 2. CPU vs GPU Parallelism
```
With GPU:                    CPU Only:
True parallel processing     Interleaved processing
Multiple CUDA streams        Single-thread context-switching
                             (Still helps throughput!)
```
### 3. Not Always Faster
```
When parallel helps:        When it doesn't:
• Independent requests      • Dependent requests (must wait)
• I/O-bound operations      • Very short prompts (overhead)
• Multiple users            • Single sequential conversation
```
## Best Practices
### 1. Design for Independence
```
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad:  Sequential reasoning steps (use ReAct instead)
```
### 2. Monitor Resources
```
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
```
### 3. Implement Graceful Degradation
```
if (vramExceeded) {
  reduceSequenceCount();
  // or queue requests instead
}
```
### 4. Handle Errors Properly
```javascript
try {
const results = await Promise.all([...]);
} catch (error) {
// One failure doesn't crash all sequences
handlePartialResults();
}
```
## Comparison: Evolution of Performance
```
Stage              Requests/Min   Pattern
─────────────────  ────────────   ───────────────────
1. Basic (intro)   30             Sequential
2. Batch (this)    120            4 sequences
3. Load balanced   240            8 sequences + queue
4. Distributed     1000+          Multiple machines
```
## Key Takeaways
1. **Parallelism is essential** for production AI agent systems
2. **Sequences share model** but maintain independent state
3. **Promise.all** enables concurrent JavaScript execution
4. **Batch size** affects GPU utilization and throughput
5. **Memory is the limit** - more sequences need more VRAM
6. **Not magic** - only helps with independent tasks
## Practical Formula
```
Speedup = min(
  Number_of_Sequences,
  Available_VRAM / Memory_Per_Sequence,
  GPU_Compute_Limit
)
```
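In code form the bound is a `Math.min` over the three limits. The numbers below are illustrative; the real compute limit depends on the GPU:

```javascript
// Effective speedup is capped by the tightest of three limits (values in MB).
function estimateSpeedup({ numSequences, vramMB, memPerSequenceMB, gpuComputeLimit }) {
  return Math.min(
    numSequences,
    Math.floor(vramMB / memPerSequenceMB),
    gpuComputeLimit
  );
}

// 8 sequences requested, but only 4000 MB free at 800 MB each → capped at 5.
console.log(estimateSpeedup({
  numSequences: 8,
  vramMB: 4000,
  memPerSequenceMB: 800,
  gpuComputeLimit: 6,
})); // → 5
```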
Typically: 2-10x speedup for well-designed systems.
This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.