# Concept: Parallel Processing & Performance Optimization
## Overview
This example demonstrates **concurrent execution** of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.
## The Performance Problem
### Sequential Processing (Slow)
Traditional approach processes one request at a time:
```
Request 1 ────────→ Response 1 (2s)
                        ↓
Request 2 ────────→ Response 2 (2s)

Total: 4 seconds
```
### Parallel Processing (Fast)
This example processes multiple requests simultaneously:
```
Request 1 ────────→ Response 1 (2s) ──┐
                                      ├→ Total: 2 seconds
Request 2 ────────→ Response 2 (2s) ──┘

(Both running at the same time)
```
**Performance gain: 2x speedup!**
## Core Concept: Context Sequences
### Single vs. Multiple Sequences
```
┌─────────────────────────────────────────┐
│           Model (Loaded Once)           │
├─────────────────────────────────────────┤
│ Context                                 │
│  ┌──────────────┐  ┌──────────────┐     │
│  │ Sequence 1   │  │ Sequence 2   │     │
│  │              │  │              │     │
│  │ Conversation │  │ Conversation │     │
│  │ History A    │  │ History B    │     │
│  └──────────────┘  └──────────────┘     │
└─────────────────────────────────────────┘
```
**Key insights:**
- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model
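These insights can be sketched in plain JavaScript with a stub model. The names here (`Sequence`, `generate`) are illustrative, not a real library API: the point is that both sequences hold a reference to the *same* model object while keeping separate histories.

```javascript
// Illustrative sketch: one shared model object, independent per-sequence state.
class Sequence {
  constructor(model, id) {
    this.model = model; // shared reference to the same model (no copy)
    this.id = id;
    this.history = [];  // independent conversation history
  }
  async prompt(text) {
    this.history.push({ role: "user", text });
    const reply = await this.model.generate(this.history);
    this.history.push({ role: "assistant", text: reply });
    return reply;
  }
}

// Stub standing in for real inference against shared weights.
const model = {
  async generate(history) {
    return `echo: ${history[history.length - 1].text}`;
  },
};

const seqA = new Sequence(model, 1);
const seqB = new Sequence(model, 2);

// Both sequences share `model`, but their histories never mix.
Promise.all([seqA.prompt("hello from A"), seqB.prompt("hello from B")])
  .then(() => {
    console.log(seqA.history.length); // 2 entries: A's user + assistant turns
    console.log(seqB.history.length); // 2 entries, fully independent of A
  });
```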
## How Parallel Processing Works
### Promise.all Pattern
JavaScript's `Promise.all()` enables concurrent execution:
```
Sequential:
────────────────────────────────────
await fn1();   // Wait 2s
await fn2();   // Wait 2s more
Total: 4s

Parallel:
────────────────────────────────────
await Promise.all([
  fn1(),  // Start immediately
  fn2()   // Start immediately (don't wait!)
]);
Total: 2s (whichever finishes last)
```
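The same pattern in runnable form, with `setTimeout` standing in for a slow LLM request (delays shortened to 200ms for illustration):

```javascript
// Simulate a request that takes `ms` milliseconds to produce a response.
const fakeRequest = (id, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`response ${id}`), ms));

async function runSequential() {
  const start = Date.now();
  await fakeRequest(1, 200); // waits 200ms
  await fakeRequest(2, 200); // waits another 200ms
  return Date.now() - start; // ~400ms
}

async function runParallel() {
  const start = Date.now();
  // Both timers start immediately; we only wait for the slower one.
  await Promise.all([fakeRequest(1, 200), fakeRequest(2, 200)]);
  return Date.now() - start; // ~200ms
}

(async () => {
  console.log(`sequential: ${await runSequential()}ms`);
  console.log(`parallel:   ${await runParallel()}ms`);
})();
```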
### Execution Timeline
```
Time →   0s      1s      2s      3s      4s
         │       │       │       │       │
Seq 1:   ├──Processing───┤
                         └─ Response 1

Seq 2:   ├──Processing───┤
                         └─ Response 2

Both complete at ~2s instead of 4s!
```
## GPU Batch Processing
### Why Batching Matters
Modern GPUs process multiple operations efficiently:
```
Without Batching (Inefficient)
──────────────────────────────
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     └─ GPU underutilized

With Batching (Efficient)
─────────────────────────
GPU: [Tokens 1-1024]  ← Full batch
     └─ GPU fully utilized!
```
**batchSize parameter**: Controls how many tokens process together.
### Trade-offs
```
Small Batch (e.g., 128)     Large Batch (e.g., 2048)
───────────────────────     ────────────────────────
✓ Lower memory              ✓ Better GPU utilization
✓ More flexible             ✓ Faster throughput
✗ Slower throughput         ✗ Higher memory usage
✗ GPU underutilized         ✗ May exceed VRAM
```
**Sweet spot**: Usually 512-1024 for consumer GPUs.
## Architecture Patterns
### Pattern 1: Multi-User Service
```
┌─────────┐  ┌─────────┐  ┌─────────┐
│ User A  │  │ User B  │  │ User C  │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │  Load Balancer │
         └────────────────┘
                  ↓
     ┌────────────┼────────────┐
     ↓            ↓            ↓
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Seq 1  │  │  Seq 2  │  │  Seq 3  │
└────┬────┘  └────┬────┘  └────┬────┘
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │  Shared Model  │
         └────────────────┘
```
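The dispatch layer above can be sketched as a round-robin pool over stub sequences. All names here (`SequencePool`, `process`) are illustrative placeholders, not a real API:

```javascript
// Round-robin load balancer over a fixed set of sequences.
class SequencePool {
  constructor(sequences) {
    this.sequences = sequences;
    this.next = 0;
  }
  acquire() {
    const seq = this.sequences[this.next];
    this.next = (this.next + 1) % this.sequences.length;
    return seq;
  }
  handle(request) {
    return this.acquire().process(request); // returns a Promise
  }
}

// Stubs standing in for real context sequences on a shared model.
const makeSequence = (id) => ({
  async process(request) {
    return `seq ${id} handled request from ${request}`;
  },
});

const pool = new SequencePool([makeSequence(1), makeSequence(2), makeSequence(3)]);

(async () => {
  // Users A, B, C are fanned out across sequences 1-3 concurrently.
  const results = await Promise.all(
    ["User A", "User B", "User C"].map((u) => pool.handle(u))
  );
  console.log(results);
})();
```

Round-robin is the simplest policy; a production dispatcher would also track which sequences are busy and queue overflow requests.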
### Pattern 2: Multi-Agent System
```
        ┌──────────────┐
        │     Task     │
        └──────┬───────┘
               │
    ┌──────────┼───────────┐
    ↓          ↓           ↓
┌────────┐  ┌──────┐  ┌──────────┐
│Planner │  │Critic│  │ Executor │
│ Agent  │  │Agent │  │  Agent   │
└───┬────┘  └──┬───┘  └────┬─────┘
    │          │           │
    └──────────┼───────────┘
               ↓
     (All run in parallel)
```
### Pattern 3: Pipeline Processing
```
Input Queue: [Task1, Task2, Task3, ...]
                  ↓
          ┌───────────────┐
          │  Dispatcher   │
          └───────────────┘
                  ↓
      ┌───────────┼───────────┐
      ↓           ↓           ↓
 Sequence 1  Sequence 2  Sequence 3
      ↓           ↓           ↓
      └───────────┼───────────┘
                  ↓
        Output: [R1, R2, R3]
```
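The dispatcher can be sketched as N concurrent workers draining one shared queue. `runPipeline` and `processTask` are hypothetical names standing in for real sequence evaluation:

```javascript
// Run `tasks` through `numWorkers` concurrent workers sharing one queue.
async function runPipeline(tasks, numWorkers, processTask) {
  const queue = [...tasks];
  const results = [];
  const worker = async () => {
    // JavaScript is single-threaded, so shift() here is race-free.
    while (queue.length > 0) {
      const task = queue.shift();
      results.push(await processTask(task));
    }
  };
  // Promise.all resolves once every worker has drained the queue.
  await Promise.all(Array.from({ length: numWorkers }, worker));
  return results;
}

(async () => {
  const process = async (task) => `R(${task})`;
  const out = await runPipeline(["Task1", "Task2", "Task3"], 3, process);
  console.log(out);
})();
```

Note that completion order depends on how long each task takes, so downstream code should not assume results arrive in input order.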
## Resource Management
### Memory Allocation
Each sequence consumes memory:
```
┌────────────────────────────────────┐
│ Total VRAM: 8 GB                   │
├────────────────────────────────────┤
│ Model Weights:              4.0 GB │
│ Context Base:               1.0 GB │
│ Sequence 1 (KV Cache):      0.8 GB │
│ Sequence 2 (KV Cache):      0.8 GB │
│ Sequence 3 (KV Cache):      0.8 GB │
│ Overhead:                   0.6 GB │
├────────────────────────────────────┤
│ Total Used:                 8.0 GB │
│ Remaining:                  0.0 GB │
└────────────────────────────────────┘

Maximum capacity!
```
**Formula**:
```
Required VRAM = Model + Context + (NumSequences × KVCache)
```
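As a quick calculator, the formula looks like this (values in MB to avoid floating-point surprises; the numbers roughly match the 8 GB breakdown above and are illustrative, not measured):

```javascript
// Estimate how many sequences fit, per the VRAM formula above (values in MB).
function maxSequences({ totalVramMB, modelMB, contextMB, overheadMB, kvCachePerSeqMB }) {
  const freeMB = totalVramMB - modelMB - contextMB - overheadMB;
  return Math.max(0, Math.floor(freeMB / kvCachePerSeqMB));
}

// Roughly the 8 GB example: 8000 - 4000 - 1000 - 600 = 2400 MB free.
console.log(maxSequences({
  totalVramMB: 8000,
  modelMB: 4000,
  contextMB: 1000,
  overheadMB: 600,
  kvCachePerSeqMB: 800,
})); // → 3 sequences
```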
### Finding Optimal Sequence Count
```
Too Few (1-2)        Optimal (4-8)       Too Many (16+)
─────────────        ─────────────       ──────────────
GPU underutilized    Balanced use        Memory overflow
        ↓                  ↓                    ↓
Slow throughput      Best performance    Thrashing/crashes
```
**Test your system**:
1. Start with 2 sequences
2. Monitor VRAM usage
3. Increase until performance plateaus
4. Back off if memory issues occur
## Real-World Scenarios
### Scenario 1: Chatbot Service
```
Challenge: 100 users, each waiting 2s per response
Sequential:        100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s
10x speedup!
```
### Scenario 2: Batch Analysis
```
Task: Analyze 1000 documents
Sequential:       1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes
8x speedup!
```
### Scenario 3: Multi-Agent Collaboration
```
Agents: Planner, Analyzer, Executor (all needed)
Sequential: Wait for each β†’ Slow pipeline
Parallel: All work together β†’ Fast decision-making
```
## Limitations & Considerations
### 1. Context Capacity Sharing
```
Problem: Sequences share total context space
───────────────────────────────────────────
Total context: 4096 tokens
2 sequences: Each gets ~2048 tokens max
4 sequences: Each gets ~1024 tokens max
More sequences = Less history per sequence!
```
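The per-sequence budget is just an integer division, assuming the context is split evenly across sequences:

```javascript
// Tokens available to each sequence when the context splits evenly.
const tokensPerSequence = (contextSize, numSequences) =>
  Math.floor(contextSize / numSequences);

console.log(tokensPerSequence(4096, 2)); // → 2048
console.log(tokensPerSequence(4096, 4)); // → 1024
```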
### 2. CPU vs GPU Parallelism
```
With GPU:                    CPU Only:
True parallel processing     Interleaved processing
Multiple CUDA streams        Single-thread context-switching
                             (Still helps throughput!)
```
### 3. Not Always Faster
```
When parallel helps:        When it doesn't:
• Independent requests      • Dependent requests (must wait)
• I/O-bound operations      • Very short prompts (overhead)
• Multiple users            • Single sequential conversation
```
## Best Practices
### 1. Design for Independence
```
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad:  Sequential reasoning steps (use ReAct instead)
```
### 2. Monitor Resources
```
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
```
### 3. Implement Graceful Degradation
```
if (vramExceeded) {
  reduceSequenceCount();
  // or queue requests instead
}
```
### 4. Handle Errors Properly
```javascript
try {
const results = await Promise.all([...]);
} catch (error) {
// One failure doesn't crash all sequences
handlePartialResults();
}
```
## Comparison: Evolution of Performance
```
Stage              Requests/Min   Pattern
─────────────────  ────────────   ───────────────────
1. Basic (intro)   30             Sequential
2. Batch (this)    120            4 sequences
3. Load balanced   240            8 sequences + queue
4. Distributed     1000+          Multiple machines
```
## Key Takeaways
1. **Parallelism is essential** for production AI agent systems
2. **Sequences share model** but maintain independent state
3. **Promise.all** enables concurrent JavaScript execution
4. **Batch size** affects GPU utilization and throughput
5. **Memory is the limit** - more sequences need more VRAM
6. **Not magic** - only helps with independent tasks
## Practical Formula
```
Speedup = min(
  Number_of_Sequences,
  Available_VRAM / Memory_Per_Sequence,
  GPU_Compute_Limit
)
```
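In code form the bound is a `Math.min` over the three limits. The numbers below are illustrative; the real compute limit depends on the GPU:

```javascript
// Effective speedup is capped by the tightest of three limits (values in MB).
function estimateSpeedup({ numSequences, vramMB, memPerSequenceMB, gpuComputeLimit }) {
  return Math.min(
    numSequences,
    Math.floor(vramMB / memPerSequenceMB),
    gpuComputeLimit
  );
}

// 8 sequences requested, but only 4000 MB free at 800 MB each → capped at 5.
console.log(estimateSpeedup({
  numSequences: 8,
  vramMB: 4000,
  memPerSequenceMB: 800,
  gpuComputeLimit: 6,
})); // → 5
```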
Typically: 2-10x speedup for well-designed systems.
This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.