# Concept: Parallel Processing & Performance Optimization

## Overview

This example demonstrates **concurrent execution** of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.

## The Performance Problem

### Sequential Processing (Slow)

Traditional approach processes one request at a time:

```
Request 1 ────────→ Response 1 (2s)
                        ↓
                    Request 2 ────────→ Response 2 (2s)
                                            ↓
                                        Total: 4 seconds
```

### Parallel Processing (Fast)

This example processes multiple requests simultaneously:

```
Request 1 ────────→ Response 1 (2s) ──┐
                                       β”œβ†’ Total: 2 seconds
Request 2 ────────→ Response 2 (2s) β”€β”€β”˜
     (Both running at the same time)
```

**Performance gain: 2x speedup!**

## Core Concept: Context Sequences

### Single vs. Multiple Sequences

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚             Model (Loaded Once)              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                  Context                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Sequence 1  β”‚         β”‚  Sequence 2  β”‚  β”‚
β”‚  β”‚              β”‚         β”‚              β”‚  β”‚
β”‚  β”‚ Conversation β”‚         β”‚ Conversation β”‚  β”‚
β”‚  β”‚  History A   β”‚         β”‚  History B   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Key insights:**
- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model

## How Parallel Processing Works

### Promise.all Pattern

JavaScript's `Promise.all()` enables concurrent execution:

```
Sequential:
────────────────────────────────────
await fn1();  // Wait 2s
await fn2();  // Wait 2s more
Total: 4s

Parallel:
────────────────────────────────────
await Promise.all([
    fn1(),    // Start immediately
    fn2()     // Start immediately (don't wait!)
]);
Total: 2s (whichever finishes last)
```
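The timing difference is easy to verify in plain JavaScript. The sketch below uses `setTimeout` as a stand-in for a slow LLM request; `delay` and `main` are illustrative names, not part of any library:

```javascript
// delay() simulates a slow LLM request that resolves after `ms` milliseconds.
function delay(ms, value) {
    return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

async function main() {
    const start = Date.now();

    // Both "requests" start immediately; we only wait for the slower one.
    const results = await Promise.all([
        delay(300, "Response 1"),
        delay(300, "Response 2"),
    ]);

    const elapsed = Date.now() - start;
    console.log(results, `took ~${elapsed}ms`); // ~300ms, not ~600ms
    return { results, elapsed };
}

main();
```

Note that `Promise.all` preserves the input order in its result array, regardless of which task finishes first.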

### Execution Timeline

```
Time β†’  0s      1s      2s      3s      4s
        β”‚       β”‚       β”‚       β”‚       β”‚
Seq 1:  β”œβ”€β”€β”€β”€β”€β”€β”€Processing────────
        β”‚                        └─ Response 1
        β”‚
Seq 2:  β”œβ”€β”€β”€β”€β”€β”€β”€Processing────────
                                 └─ Response 2

        Both complete at ~2s instead of 4s!
```

## GPU Batch Processing

### Why Batching Matters

Modern GPUs process multiple operations efficiently:

```
Without Batching (Inefficient)
──────────────────────────────
GPU: [Token 1] ... wait ...
GPU: [Token 2] ... wait ...
GPU: [Token 3] ... wait ...
     └─ GPU underutilized

With Batching (Efficient)
─────────────────────────
GPU: [Tokens 1-1024]  ← Full batch
     └─ GPU fully utilized!
```

**`batchSize` parameter**: controls how many tokens are processed together in a single GPU pass.

### Trade-offs

```
Small Batch (e.g., 128)     Large Batch (e.g., 2048)
───────────────────────     ────────────────────────
βœ“ Lower memory              βœ“ Better GPU utilization
βœ“ More flexible             βœ“ Faster throughput
βœ— Slower throughput         βœ— Higher memory usage
βœ— GPU underutilized         βœ— May exceed VRAM
```

**Sweet spot**: Usually 512-1024 for consumer GPUs.
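In code, both the sequence count and the batch size are typically set once when the context is created. A minimal sketch, assuming the node-llama-cpp v3 API (`getLlama`, `loadModel`, `createContext`, `getSequence`) and a placeholder model path:

```javascript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"});

// One context, several independent sequences, a mid-range batch size.
const context = await model.createContext({
    sequences: 4,   // parallel conversations sharing the same model weights
    batchSize: 512  // tokens evaluated together in one GPU pass
});

// Each call hands out an independent sequence with its own history.
const sequenceA = context.getSequence();
const sequenceB = context.getSequence();
```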

## Architecture Patterns

### Pattern 1: Multi-User Service

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User A  β”‚  β”‚ User B  β”‚  β”‚ User C  β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
     β”‚            β”‚            β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  ↓
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Load Balancer β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  ↓
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     ↓            ↓            ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Seq 1  β”‚  β”‚  Seq 2  β”‚  β”‚  Seq 3  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  ↓
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Shared Model  β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Pattern 2: Multi-Agent System

```
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚     Task     β”‚
         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
       ↓        ↓        ↓
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚Planner β”‚ β”‚Criticβ”‚ β”‚ Executor β”‚
  β”‚ Agent  β”‚ β”‚Agent β”‚ β”‚  Agent   β”‚
  β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
      β”‚         β”‚          β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                ↓
       (All run in parallel)
```

### Pattern 3: Pipeline Processing

```
Input Queue: [Task1, Task2, Task3, ...]
                    ↓
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚  Dispatcher   β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        ↓           ↓           ↓
    Sequence 1  Sequence 2  Sequence 3
        ↓           ↓           ↓
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↓
            Output: [R1, R2, R3]
```
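The dispatcher can be sketched as a small worker pool: each sequence repeatedly pulls the next task from a shared queue until it is empty. Here `processTask` is a hypothetical stand-in for running one prompt on one sequence:

```javascript
// Stand-in for running a single task on one context sequence.
async function processTask(task) {
    return `result of ${task}`;
}

// Process `tasks` with at most `numSequences` running concurrently,
// preserving the input order in the results array.
async function runPipeline(tasks, numSequences) {
    const queue = tasks.map((task, i) => ({ task, i }));
    const results = new Array(tasks.length);

    // Each worker drains the shared queue until nothing is left.
    async function worker() {
        while (queue.length > 0) {
            const { task, i } = queue.shift();
            results[i] = await processTask(task);
        }
    }

    // One worker per sequence; wait until every worker has finished.
    await Promise.all(Array.from({ length: numSequences }, worker));
    return results;
}
```

Because tasks are pulled from the queue as workers free up, a slow task on one sequence never blocks the others.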

## Resource Management

### Memory Allocation

Each sequence consumes memory:

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         Total VRAM: 8GB          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Model Weights:          4.0 GB  β”‚
β”‚  Context Base:           1.0 GB  β”‚
β”‚  Sequence 1 (KV Cache):  0.8 GB  β”‚
β”‚  Sequence 2 (KV Cache):  0.8 GB  β”‚
β”‚  Sequence 3 (KV Cache):  0.8 GB  β”‚
β”‚  Overhead:               0.6 GB  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Total Used:             8.0 GB  β”‚
β”‚  Remaining:              0.0 GB  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        Maximum capacity!
```

**Formula**:
```
Required VRAM = Model + Context + (NumSequences Γ— KVCache) + Overhead
```
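As a sanity check, the formula reproduces the allocation table above (overhead row included). All figures are the illustrative GB values from the table, not measurements:

```javascript
// Mirrors the VRAM formula; all figures are in GB.
function requiredVRAM({ modelGB, contextGB, numSequences, kvCachePerSeqGB, overheadGB = 0 }) {
    return modelGB + contextGB + numSequences * kvCachePerSeqGB + overheadGB;
}

const used = requiredVRAM({
    modelGB: 4.0,
    contextGB: 1.0,
    numSequences: 3,
    kvCachePerSeqGB: 0.8,
    overheadGB: 0.6,
});
console.log(used.toFixed(1), "GB"); // 8.0 GB -- exactly the budget
```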

### Finding Optimal Sequence Count

```
Too Few (1-2)              Optimal (4-8)           Too Many (16+)
─────────────              ─────────────           ──────────────
GPU underutilized          Balanced use            Memory overflow
↓                          ↓                       ↓
Slow throughput            Best performance        Thrashing/crashes
```

**Test your system**:
1. Start with 2 sequences
2. Monitor VRAM usage
3. Increase until performance plateaus
4. Back off if memory issues occur
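The VRAM budget can also be inverted to estimate a starting sequence count before tuning. The helper below is a back-of-the-envelope sketch using the example figures from the allocation table above:

```javascript
// How many sequences fit in the remaining VRAM budget? (All figures in GB.)
function maxSequences({ totalVRAMGB, modelGB, contextGB, overheadGB, kvCachePerSeqGB }) {
    const freeGB = totalVRAMGB - modelGB - contextGB - overheadGB;
    // The small epsilon guards against float round-off
    // (2.4 / 0.8 evaluates to just under 3 in IEEE doubles).
    return Math.max(0, Math.floor(freeGB / kvCachePerSeqGB + 1e-9));
}

const fit = maxSequences({
    totalVRAMGB: 8,
    modelGB: 4,
    contextGB: 1,
    overheadGB: 0.6,
    kvCachePerSeqGB: 0.8,
}); // β†’ 3, matching the three sequences in the allocation table
```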

## Real-World Scenarios

### Scenario 1: Chatbot Service

```
Challenge: 100 users, each waiting 2s per response
Sequential: 100 Γ— 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches Γ— 2s = 20s
                   10x speedup!
```

### Scenario 2: Batch Analysis

```
Task: Analyze 1000 documents
Sequential: 1000 Γ— 3s = 50 minutes
Parallel (8 seq): 125 batches Γ— 3s = 6.25 minutes
                  8x speedup!
```
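Both scenarios follow the same arithmetic: split the requests into batches of `numSequences` and pay the per-request latency once per batch. A quick sketch:

```javascript
// Wall-clock seconds to process `numRequests` with `numSequences` in parallel.
function batchedSeconds(numRequests, numSequences, secondsPerRequest) {
    return Math.ceil(numRequests / numSequences) * secondsPerRequest;
}

batchedSeconds(100, 1, 2);   // 200  -- sequential chatbot (3.3 minutes)
batchedSeconds(100, 10, 2);  // 20   -- 10 parallel sequences
batchedSeconds(1000, 8, 3);  // 375  -- 1000 documents on 8 sequences (6.25 minutes)
```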

### Scenario 3: Multi-Agent Collaboration

```
Agents: Planner, Analyzer, Executor (all needed)
Sequential: Wait for each β†’ Slow pipeline
Parallel: All work together β†’ Fast decision-making
```

## Limitations & Considerations

### 1. Context Capacity Sharing

```
Problem: Sequences share total context space
───────────────────────────────────────────
Total context: 4096 tokens
2 sequences: Each gets ~2048 tokens max
4 sequences: Each gets ~1024 tokens max

More sequences = Less history per sequence!
```
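Under an even split (the simple case illustrated above; real allocation strategies may differ), the per-sequence budget is just a division:

```javascript
// Tokens available to each sequence when the context window is split evenly.
function tokensPerSequence(totalContextTokens, numSequences) {
    return Math.floor(totalContextTokens / numSequences);
}

tokensPerSequence(4096, 2); // 2048
tokensPerSequence(4096, 4); // 1024
```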

### 2. CPU vs GPU Parallelism

```
With GPU:                    CPU Only:
True parallel processing     Interleaved processing
Multiple CUDA streams        Single-thread context switching
                             (Still helps throughput!)
```

### 3. Not Always Faster

```
When parallel helps:         When it doesn't:
β€’ Independent requests       β€’ Dependent requests (must wait)
β€’ I/O-bound operations       β€’ Very short prompts (overhead)
β€’ Multiple users             β€’ Single sequential conversation
```

## Best Practices

### 1. Design for Independence
```
βœ“ Good: Separate user conversations
βœ“ Good: Independent analysis tasks
βœ— Bad: Sequential reasoning steps (use ReAct instead)
```

### 2. Monitor Resources
```
Track:
β€’ VRAM usage per sequence
β€’ Processing time per request
β€’ Queue depths
β€’ Error rates
```

### 3. Implement Graceful Degradation
```javascript
// Sketch: vramExceeded and reduceSequenceCount() are hypothetical helpers.
if (vramExceeded) {
    reduceSequenceCount();
    // ...or queue incoming requests instead of starting new sequences
}
```

### 4. Handle Errors Properly
```javascript
try {
    const results = await Promise.all([...]);
} catch (error) {
    // Note: Promise.all rejects as soon as any one request fails,
    // discarding the results of the requests that succeeded.
    handlePartialResults();
}

// When partial results matter, prefer Promise.allSettled, which never rejects:
const settled = await Promise.allSettled([...]);
const succeeded = settled.filter((r) => r.status === "fulfilled");
```

## Comparison: Evolution of Performance

```
Stage              Requests/Min    Pattern
─────────────────  ─────────────   ───────────────
1. Basic (intro)        30          Sequential
2. Batch (this)        120          4 sequences
3. Load balanced       240          8 sequences + queue
4. Distributed        1000+         Multiple machines
```

## Key Takeaways

1. **Parallelism is essential** for production AI agent systems
2. **Sequences share model** but maintain independent state
3. **Promise.all** enables concurrent JavaScript execution
4. **Batch size** affects GPU utilization and throughput
5. **Memory is the limit** - more sequences need more VRAM
6. **Not magic** - only helps with independent tasks

## Practical Formula

```
Speedup = min(
    Number_of_Sequences,
    Available_VRAM / Memory_Per_Sequence,
    GPU_Compute_Limit
)
```

Typically: 2-10x speedup for well-designed systems.
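The formula translates directly into code. Every input below is an illustrative figure, not a measured one:

```javascript
// Effective speedup is capped by the tightest of three limits.
function estimatedSpeedup({ numSequences, availableVRAMGB, memoryPerSequenceGB, gpuComputeLimit }) {
    return Math.min(
        numSequences,
        Math.floor(availableVRAMGB / memoryPerSequenceGB),
        gpuComputeLimit
    );
}

const speedup = estimatedSpeedup({
    numSequences: 8,
    availableVRAMGB: 3.0,     // VRAM left over after model + context
    memoryPerSequenceGB: 0.8,
    gpuComputeLimit: 6,       // hypothetical compute-bound ceiling
}); // β†’ 3: memory is the bottleneck in this example
```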

This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.