# Tutorial 11: Inference Optimization & Deployment

## Overview

Training a model is only half the battle. Deploying Large Language Models (LLMs) for inference presents unique challenges: high memory bandwidth requirements, latency constraints, and the need to serve multiple users concurrently. This tutorial covers **Quantization**, **KV Caching**, **Speculative Decoding**, and production serving engines like **vLLM** and **TGI**.

## Prerequisites

- Understanding of Transformer architecture (Tutorial 01)
- Basic PyTorch knowledge
- Familiarity with Hugging Face `transformers`

---

## 1. The Inference Bottleneck

Unlike training, which is compute-bound (matrix multiplications), inference is often **memory-bandwidth bound**.

### The Memory Wall

To generate one token:

1. Load all model weights from VRAM into the compute units.
2. Perform the calculation.
3. Store the result.

For a 7B model (14 GB in fp16):

- To generate 1 token at batch size 1, you must read 14 GB of weights.
- On an A100 (~1.5 TB/s bandwidth), this takes ~9 ms just for the memory transfer.
- Compute time is negligible (~0.1 ms).
- **Result**: You can only generate ~100 tokens/sec regardless of compute power.

**Solution**: Reduce the memory footprint (Quantization) and reuse memory (KV Cache).

---

## 2. Quantization: Reducing Precision

Quantization reduces the number of bits used to represent weights and activations.

### Types of Quantization

#### 2.1 Post-Training Quantization (PTQ)

Quantize a pre-trained model without retraining.

- **INT8**: Weights scaled to 8-bit integers. Minimal accuracy loss.
- **FP8**: Newer format with hardware support on H100s. Good balance of range and precision.
- **INT4**: Aggressive compression (e.g., GPTQ, AWQ). Requires careful calibration.

#### 2.2 Quantization-Aware Training (QAT)

Simulate quantization noise during training/fine-tuning.

- The model learns to compensate for the precision loss.
- Best for INT4 or lower, but computationally expensive.

### Implementing INT8/INT4 with BitsAndBytes

The easiest way to quantize for inference using Hugging Face.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit configuration (QLoRA-style inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # Normalized Float 4
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
    bnb_4bit_use_double_quant=True,        # nested quantization of the quantization constants
)

# 8-bit configuration (alternative)
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",               # automatically spread layers across available GPUs
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)

# Inference
input_text = "Explain quantum entanglement."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### GPTQ & AWQ (Calibration-Based 4-bit Quantization)

For INT4, naive round-to-nearest quantization degrades quality noticeably. **GPTQ** and **AWQ** quantize more carefully using calibration data:

- **GPTQ**: Quantizes weights layer by layer, using second-order (Hessian-based) information to compensate for the quantization error.
- **AWQ (Activation-Aware Weight Quantization)**: Identifies salient weight channels (those paired with large activations) and rescales them before quantization so they lose less precision.
- **Tools**: Use `auto-gptq` for GPTQ and `autoawq` for AWQ on GPUs, or `llama.cpp`/`llama-cpp-python` for GGUF formats. A loading sketch for a pre-quantized AWQ checkpoint follows below.
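As a minimal, hedged sketch, a published pre-quantized AWQ checkpoint can be loaded directly through `transformers`. This assumes `pip install autoawq` and a CUDA GPU; the checkpoint name is the one reused in the exercise in Section 7.

```python
# Sketch: load a pre-quantized AWQ checkpoint via transformers (assumes autoawq is installed).
# The quantization config stored inside the checkpoint tells transformers how to run it.
from transformers import AutoModelForCausalLM, AutoTokenizer

awq_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # pre-quantized checkpoint, also used in Section 7
tokenizer = AutoTokenizer.from_pretrained(awq_model_id)
model = AutoModelForCausalLM.from_pretrained(awq_model_id, device_map="auto")

inputs = tokenizer("Summarize attention in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0], skip_special_tokens=True))
```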
To quantize your own model with GPTQ:

```bash
# Install auto-gptq
pip install auto-gptq optimum
```

```python
# Quantization script example (auto-gptq)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)

# ... calibrate on dataset ...
# calibration_data: a list of tokenized examples ({"input_ids": ..., "attention_mask": ...})
model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
```

---

## 3. KV Cache: Speeding Up Generation

Transformers are autoregressive. To generate token $t$, we need all previous tokens $0 \dots t-1$.

Naive approach: re-run attention over the entire history for every new token, so each step costs $O(N^2)$.

**Optimization**: Cache the Key (K) and Value (V) matrices from previous steps.

- At step $t$, only compute Q, K, V for the *new* token.
- Retrieve cached K, V for previous tokens.
- Perform attention, which is now $O(N)$ per step.

### Memory Cost of KV Cache

For a model with $L$ layers, $H$ heads, head dimension $D_{head}$ (hidden size $D = H \times D_{head}$), sequence length $S$, and batch size $B$:

$$ \text{Memory} = 2 \times L \times H \times D_{head} \times S \times B \times \text{precision\_bytes} $$

For Llama-2-7B (32 layers, hidden size 4096, fp16):

- Approximately 0.5 MB per token per batch item.
- For Batch=32, SeqLen=2048: ~32 GB just for the KV cache!
- **Implication**: The KV cache often limits batch size more than the model weights do.

### Paged Attention (vLLM Innovation)

Traditional KV cache allocation is static (pre-allocate the maximum sequence length), which is wasteful.

**vLLM** uses OS-style virtual memory paging:

- Split the KV cache into fixed-size blocks.
- Dynamically allocate blocks as tokens are generated.
- Share blocks between sequences (useful for beam search or a shared prompt prefix).
- **Result**: Near-100% memory utilization, higher throughput.

---

## 4. Speculative Decoding

Idea: Use a small "draft" model to guess the next $K$ tokens, then verify them with the large "target" model in parallel.

1. The draft model (fast) generates $x_1, x_2, x_3$.
2. The target model (slow) computes probabilities for all 3 in one forward pass.
3. Accept/reject logic:
   - Accept $x_i$ with probability $\min\!\left(1, \frac{P_{target}(x_i)}{P_{draft}(x_i)}\right)$.
   - On rejection, resample that token from an adjusted target distribution and discard the remaining drafts.
4. **Speedup**: If the draft is accurate, we generate $K$ tokens in the time of 1 target forward pass.

```python
# Assisted generation (speculative decoding) with Hugging Face transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
).to("cuda")
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Write a python function to sort a list.", return_tensors="pt").to("cuda")

# Assisted generation
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    num_assistant_tokens=5,  # how many tokens to draft per step
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 5. Production Serving Engines

Do not use `model.generate()` in production. It lacks concurrency control, batching, and streaming optimization. Use dedicated servers.

### 5.1 vLLM (High Throughput)

Best for high-throughput scenarios (batch processing, heavy traffic). Implements PagedAttention.
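Besides the HTTP server shown next, vLLM also exposes an offline Python API that is convenient for batch jobs. A minimal sketch, assuming vLLM is installed (see Installation below) and a local GPU is available:

```python
# Offline batch inference with vLLM's Python API (minimal sketch)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # loads weights and pre-allocates paged KV cache blocks
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Explain KV caching briefly.", "What is PagedAttention?"]
for output in llm.generate(prompts, params):  # prompts are batched together internally
    print(output.outputs[0].text)
```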
**Installation**:

```bash
pip install vllm
```

**Running a Server** (the OpenAI-compatible entrypoint, since the client below uses the OpenAI SDK):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9
```

**Client Usage**:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```

**Key Features**:

- Continuous batching: as soon as one request finishes, a new one starts in the same batch slot.
- PagedAttention: efficient memory usage.
- Supports quantization (AWQ, GPTQ, SqueezeLLM).

### 5.2 Text Generation Inference (TGI)

Developed by Hugging Face. Optimized for text generation and supports Tensor Parallelism out of the box. Written in Rust and Python.

**Run with Docker**:

```bash
# Gated models such as Llama-2 also need -e HUGGING_FACE_HUB_TOKEN=<token> before the image name.
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --quantize bitsandbytes
```

**Key Features**:

- Flash Attention integration.
- Token streaming.
- Watermarking (for AI detection).
- Logits warping (temperature, top_p) handled efficiently.

### 5.3 Llama.cpp (CPU/Mac Inference)

Uses the GGUF format (with heavily quantized variants). Runs on CPU, Apple Silicon, or GPU.

- Ideal for local deployment, edge devices, or MacBooks.
- Supports Q4_K_M, Q5_K_M quantizations.

```bash
# Convert a locally downloaded HF checkpoint to GGUF (fp16), then quantize to 4-bit
python convert.py ./Llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf Q4_K_M

# Run inference
./main -m llama-2-7b-q4_k_m.gguf -p "Hello world" -n 128
```

---

## 6. Deployment Strategies

### 6.1 Latency vs. Throughput

- **Latency**: Time to first token (TTFT) + time per token. Critical for chatbots.
  - *Optimization*: Smaller models, fewer layers, speculative decoding.
- **Throughput**: Total tokens generated per second across all users. Critical for batch analysis.
  - *Optimization*: Large batch sizes, vLLM, Tensor Parallelism.

### 6.2 Scaling Topology

1. **Replica Scaling**: Run multiple instances behind a load balancer (Kubernetes). Simplest.
2. **Tensor Parallelism**: Split one model across 4-8 GPUs for a single request. Needed for >30B models.
3. **Pipeline Parallelism**: Rare for inference due to bubble overhead, but useful for massive models on limited GPUs.

### 6.3 Monitoring Metrics

Track these in production:

- **TTFT (Time To First Token)**: User-perceived latency.
- **TPOT (Time Per Output Token)**: Should roughly match reading speed (aim for <50 ms/token).
- **Queue Depth**: How many requests are waiting?
- **GPU Utilization**: Should be high for throughput, moderate for latency-sensitive apps.
- **KV Cache Hit Rate**: Relevant if using prefix caching for system prompts.

---

## 7. Practical Exercise: Deploying a Quantized Model

**Goal**: Serve a 4-bit Llama-2-7B model using vLLM with high concurrency.

**Step 1: Prepare the Quantized Model**

Use `TheBloke` models from HuggingFace (pre-quantized) or quantize your own.

```bash
# Example: Using a pre-quantized AWQ model
MODEL="TheBloke/Llama-2-7B-Chat-AWQ"
```

**Step 2: Launch vLLM**

```bash
# Legacy demo server exposing /generate (the load test below targets this endpoint)
python -m vllm.entrypoints.api_server \
    --model $MODEL \
    --quantization awq \
    --port 8000 \
    --max-num-seqs 128 \
    --max-model-len 4096
```
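Before load testing, it is worth confirming that the server answers a single request. A hedged sketch against the legacy `/generate` endpoint started above:

```python
# Single-request smoke test against the legacy vLLM /generate endpoint (sketch)
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is quantization?", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # the generated text should appear under the "text" key
```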
**Step 3: Load Test**

Use `locust` or a simple script to simulate 50 concurrent users.

```python
# load_test.py
import concurrent.futures
import requests
import time

def send_request(i):
    start = time.time()
    resp = requests.post("http://localhost:8000/generate", json={
        "prompt": f"Question {i}: What is AI?",
        "max_tokens": 50,
    })
    duration = time.time() - start
    return duration

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(send_request, i) for i in range(50)]
    results = [f.result() for f in futures]

print(f"Average Latency: {sum(results)/len(results):.2f}s")
print(f"Total Requests: {len(results)}")
```

---

## 8. Security & Safety in Deployment

1. **Input Validation**: Prevent prompt injection attacks. Filter malicious inputs before sending them to the model.
2. **Output Filtering**: Use classifiers to detect hate speech, PII, or toxicity before returning text to the user.
3. **Rate Limiting**: Protect against DoS and cost overruns.
4. **Model Watermarking**: Embed invisible signals to identify AI-generated text (supported in TGI).

---

## 9. Summary Checklist

| Technique | Benefit | Trade-off | Best For |
| :--- | :--- | :--- | :--- |
| **INT8/FP8** | 2x memory savings, 2x speed | Slight accuracy drop | General inference |
| **INT4 (AWQ/GPTQ)** | 4x memory savings | Calibration needed, HW specific | Consumer GPUs, Edge |
| **KV Cache** | Linear vs. quadratic per-step cost | High VRAM usage | Long context |
| **PagedAttention** | Max VRAM utilization | Implementation complexity | High concurrency |
| **Speculative Decoding** | 2-3x speedup | Needs draft model | Latency sensitive |
| **vLLM/TGI** | Production features | Extra infrastructure | Production APIs |

## Next Steps

In Tutorial 12, we will cover **MLOps, Automation & Governance**, focusing on CI/CD pipelines for models, model registries, compliance tracking, and automated evaluation gates.