Concept: Parallel Processing & Performance Optimization
Overview
This example demonstrates concurrent execution of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.
The Performance Problem
Sequential Processing (Slow)
Traditional approach processes one request at a time:
Request 1 --------> Response 1 (2s)
        |
        v
Request 2 --------> Response 2 (2s)

Total: 4 seconds
Parallel Processing (Fast)
This example processes multiple requests simultaneously:
Request 1 --------> Response 1 (2s) --+
                                      +--> Total: 2 seconds
Request 2 --------> Response 2 (2s) --+
(Both running at the same time)
Performance gain: 2x speedup!
Core Concept: Context Sequences
Single vs. Multiple Sequences
+------------------------------------------------+
|               Model (Loaded Once)              |
+------------------------------------------------+
|                    Context                     |
|   +--------------+        +--------------+     |
|   |  Sequence 1  |        |  Sequence 2  |     |
|   |              |        |              |     |
|   | Conversation |        | Conversation |     |
|   |  History A   |        |  History B   |     |
|   +--------------+        +--------------+     |
+------------------------------------------------+
Key insights:
- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model
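The insights above can be sketched as a small simulation: one shared "model" object referenced by every sequence, while each sequence keeps its own history. The `SharedModel` and `Sequence` names are illustrative stand-ins, not a real library API.

```javascript
// Illustrative sketch: shared weights, independent per-sequence state.
class SharedModel {
  constructor(name) {
    this.name = name; // stands in for the (large) weights, loaded once
  }
}

class Sequence {
  constructor(model) {
    this.model = model; // a reference, not a copy: weights are shared
    this.history = [];  // independent conversation history (KV-cache analog)
  }
  prompt(text) {
    this.history.push({ role: "user", content: text });
    const reply = `[${this.model.name}] echo: ${text}`; // fake generation
    this.history.push({ role: "assistant", content: reply });
    return reply;
  }
}

const model = new SharedModel("llama");
const seqA = new Sequence(model);
const seqB = new Sequence(model);

seqA.prompt("Hello from A");
seqB.prompt("Hello from B");

console.log(seqA.model === seqB.model);                 // true: same weights object
console.log(seqA.history.length, seqB.history.length);  // 2 2: separate histories
```

Because both sequences hold a reference to the same `model`, memory for the weights is paid once; only the per-sequence history grows.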
How Parallel Processing Works
Promise.all Pattern
JavaScript's Promise.all() enables concurrent execution:
Sequential:
------------------------------------
await fn1(); // Wait 2s
await fn2(); // Wait 2s more
// Total: 4s

Parallel:
------------------------------------
await Promise.all([
  fn1(), // Start immediately
  fn2()  // Start immediately (don't wait!)
]);
// Total: 2s (whichever finishes last)
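Here is a runnable version of the comparison, with `setTimeout` standing in for model inference (`askModel` is a simulated stand-in, not a real inference call):

```javascript
// Simulated "inference": resolves after `ms` milliseconds.
const askModel = (prompt, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`answer to ${prompt}`), ms));

async function main() {
  // Sequential: second call starts only after the first finishes (~400ms total).
  let start = Date.now();
  await askModel("q1", 200);
  await askModel("q2", 200);
  console.log(`sequential: ~${Date.now() - start}ms`);

  // Parallel: both timers run at once (~200ms total, whichever finishes last).
  start = Date.now();
  const [a1, a2] = await Promise.all([
    askModel("q1", 200),
    askModel("q2", 200),
  ]);
  console.log(`parallel: ~${Date.now() - start}ms`, a1, a2);
}

main();
```

The key detail is that both promises are created (and thus started) before either is awaited; `Promise.all` merely waits for all of them.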
Execution Timeline
Time ->   0s        1s        2s        3s        4s
          |         |         |         |         |
Seq 1:    [=======Processing=======]
                                   \-- Response 1
Seq 2:    [=======Processing=======]
                                   \-- Response 2
Both complete at ~2s instead of 4s!
GPU Batch Processing
Why Batching Matters
Modern GPUs process multiple operations efficiently:
Without Batching (Inefficient)
------------------------------
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     -> GPU underutilized

With Batching (Efficient)
-------------------------
GPU: [Tokens 1-1024]  <- full batch
     -> GPU fully utilized!
batchSize parameter: Controls how many tokens process together.
Trade-offs
Small Batch (e.g., 128)       Large Batch (e.g., 2048)
-----------------------       ------------------------
✓ Lower memory                ✓ Better GPU utilization
✓ More flexible               ✓ Faster throughput
✗ Slower throughput           ✗ Higher memory usage
✗ GPU underutilized           ✗ May exceed VRAM
Sweet spot: Usually 512-1024 for consumer GPUs.
Architecture Patterns
Pattern 1: Multi-User Service
+----------+   +----------+   +----------+
|  User A  |   |  User B  |   |  User C  |
+----+-----+   +----+-----+   +----+-----+
     |              |              |
     +--------------+--------------+
                    |
           +---------------+
           | Load Balancer |
           +---------------+
                    |
     +--------------+--------------+
     |              |              |
+----+-----+   +----+-----+   +----+-----+
|  Seq 1   |   |  Seq 2   |   |  Seq 3   |
+----------+   +----------+   +----------+
     |              |              |
     +--------------+--------------+
                    |
           +---------------+
           | Shared Model  |
           +---------------+
Pattern 2: Multi-Agent System
          +----------+
          |   Task   |
          +----+-----+
               |
     +---------+----------+
     |         |          |
+---------+ +--------+ +----------+
| Planner | | Critic | | Executor |
|  Agent  | | Agent  | |  Agent   |
+----+----+ +---+----+ +----+-----+
     |          |           |
     +----------+-----------+
                |
      (All run in parallel)
Pattern 3: Pipeline Processing
Input Queue: [Task1, Task2, Task3, ...]
        |
        v
  +------------+
  | Dispatcher |
  +------------+
        |
  +-----+------+-----------+
  |            |           |
Sequence 1  Sequence 2  Sequence 3
  |            |           |
  +-----+------+-----------+
        |
        v
Output: [R1, R2, R3]
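The dispatcher pattern above can be sketched as a small worker pool: N "sequences" pull tasks from a shared queue until it is drained. All names here (`processQueue`, `analyze`) are illustrative.

```javascript
// Worker-pool sketch: `numSequences` workers drain a shared task queue.
// Because JavaScript is single-threaded, queue.shift() between awaits is safe.
async function processQueue(tasks, numSequences, handler) {
  const queue = [...tasks];
  const results = [];
  async function worker() {
    while (queue.length > 0) {
      const task = queue.shift();        // dispatcher: take the next pending task
      results.push(await handler(task)); // one in-flight task per sequence
    }
  }
  // Start all workers at once; Promise.all resolves when the queue is empty.
  await Promise.all(Array.from({ length: numSequences }, worker));
  return results;
}

// Usage: three simulated sequences analyzing six documents.
const analyze = (doc) =>
  new Promise((res) => setTimeout(() => res(`summary of ${doc}`), 50));

processQueue(["d1", "d2", "d3", "d4", "d5", "d6"], 3, analyze)
  .then((results) => console.log(results.length, "results"));
```

Note that results arrive in completion order, not input order; if ordering matters, tag each task with its index.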
Resource Management
Memory Allocation
Each sequence consumes memory:
+-----------------------------------+
|          Total VRAM: 8 GB         |
+-----------------------------------+
|  Model Weights:           4.0 GB  |
|  Context Base:            1.0 GB  |
|  Sequence 1 (KV Cache):   0.8 GB  |
|  Sequence 2 (KV Cache):   0.8 GB  |
|  Sequence 3 (KV Cache):   0.8 GB  |
|  Overhead:                0.6 GB  |
+-----------------------------------+
|  Total Used:              8.0 GB  |
|  Remaining:               0.0 GB  |
+-----------------------------------+
At maximum capacity!
Formula:
Required VRAM = Model + Context + (NumSequences × KVCache)
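The formula in code, using the illustrative numbers from the table above (values in MB to keep the arithmetic exact):

```javascript
// Required VRAM = Model + Context + (NumSequences × KVCache) + Overhead, in MB.
function requiredVramMb(modelMb, contextMb, numSequences, kvCacheMb, overheadMb = 0) {
  return modelMb + contextMb + numSequences * kvCacheMb + overheadMb;
}

// Inverse question: how many sequences fit in a given VRAM budget?
function maxSequences(totalMb, modelMb, contextMb, kvCacheMb, overheadMb = 0) {
  return Math.floor((totalMb - modelMb - contextMb - overheadMb) / kvCacheMb);
}

console.log(requiredVramMb(4000, 1000, 3, 800, 600)); // 8000 -- the 8 GB example
console.log(maxSequences(8000, 4000, 1000, 800, 600)); // 3
```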
Finding Optimal Sequence Count
Too Few (1-2)        Optimal (4-8)        Too Many (16+)
-------------        -------------        --------------
GPU underutilized    Balanced use         Memory overflow
Slow throughput      Best performance     Thrashing/crashes
Test your system:
- Start with 2 sequences
- Monitor VRAM usage
- Increase until performance plateaus
- Back off if memory issues occur
Real-World Scenarios
Scenario 1: Chatbot Service
Challenge: 100 users, each waiting 2s per response
Sequential: 100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s
10x speedup!
Scenario 2: Batch Analysis
Task: Analyze 1000 documents
Sequential: 1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes
8x speedup!
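The batch arithmetic in both scenarios follows one rule: with N sequences, T independent tasks run in ceil(T / N) waves. A small helper makes this concrete (names are illustrative):

```javascript
// Wall-clock estimate for `tasks` independent requests on `sequences`
// parallel slots, assuming each request takes `secondsPerTask` and
// waves run back to back.
function parallelSeconds(tasks, sequences, secondsPerTask) {
  return Math.ceil(tasks / sequences) * secondsPerTask;
}

console.log(parallelSeconds(100, 1, 2));  // 200 -- sequential chatbot
console.log(parallelSeconds(100, 10, 2)); // 20  -- 10 sequences
console.log(parallelSeconds(1000, 8, 3)); // 375 seconds = 6.25 minutes
```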
Scenario 3: Multi-Agent Collaboration
Agents: Planner, Analyzer, Executor (all needed)
Sequential: Wait for each -> Slow pipeline
Parallel: All work together -> Fast decision-making
Limitations & Considerations
1. Context Capacity Sharing
Problem: Sequences share total context space.

Total context: 4096 tokens
2 sequences: each gets ~2048 tokens max
4 sequences: each gets ~1024 tokens max

More sequences = less history per sequence!
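The trade-off fits in a one-liner (a sketch of the even-split case; real libraries may also reserve some tokens for overhead):

```javascript
// Tokens of history each sequence can hold when the context is split evenly.
const tokensPerSequence = (contextSize, numSequences) =>
  Math.floor(contextSize / numSequences);

console.log(tokensPerSequence(4096, 2)); // 2048
console.log(tokensPerSequence(4096, 4)); // 1024
```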
2. CPU vs GPU Parallelism
With GPU:                     CPU Only:
True parallel processing      Interleaved processing
Multiple CUDA streams         Single-thread context switching
                              (Still helps throughput!)
3. Not Always Faster
When parallel helps:          When it doesn't:
• Independent requests        • Dependent requests (must wait)
• I/O-bound operations        • Very short prompts (overhead)
• Multiple users              • Single sequential conversation
Best Practices
1. Design for Independence
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad: Sequential reasoning steps (use ReAct instead)
2. Monitor Resources
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
3. Implement Graceful Degradation
// Pseudocode: vramExceeded and reduceSequenceCount() are placeholders
// for your own monitoring and pool-management logic.
if (vramExceeded) {
  reduceSequenceCount();
  // ...or queue incoming requests instead
}
4. Handle Errors Properly
// Promise.all rejects as soon as any input rejects, discarding the other
// results; use Promise.allSettled to keep what the other sequences produced.
const results = await Promise.allSettled([...]);
for (const result of results) {
  if (result.status === "rejected") {
    handlePartialResults(result.reason); // one failure doesn't discard the rest
  }
}
Comparison: Evolution of Performance
Stage Requests/Min Pattern
-----------------     ------------    -------------------
1. Basic (intro) 30 Sequential
2. Batch (this) 120 4 sequences
3. Load balanced 240 8 sequences + queue
4. Distributed 1000+ Multiple machines
Key Takeaways
- Parallelism is essential for production AI agent systems
- Sequences share model but maintain independent state
- Promise.all enables concurrent JavaScript execution
- Batch size affects GPU utilization and throughput
- Memory is the limit - more sequences need more VRAM
- Not magic - only helps with independent tasks
Practical Formula
Speedup = min(
Number_of_Sequences,
Available_VRAM / Memory_Per_Sequence,
GPU_Compute_Limit
)
Typically: 2-10x speedup for well-designed systems.
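The formula as code, with illustrative numbers (the VRAM budget and per-sequence cost are assumptions, not measurements):

```javascript
// Effective speedup is capped by the tightest of three limits:
// requested sequence count, how many KV caches fit in VRAM, and GPU compute.
function speedup(numSequences, availableVramMb, memPerSequenceMb, gpuComputeLimit) {
  return Math.min(
    numSequences,
    Math.floor(availableVramMb / memPerSequenceMb),
    gpuComputeLimit
  );
}

// 8 sequences requested, but only 4 concurrent KV caches fit in VRAM:
console.log(speedup(8, 3200, 800, 16)); // 4 -- memory is the bottleneck
```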
This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.