# Tutorial 11: Inference Optimization & Deployment

## Overview

Training a model is only half the battle. Deploying Large Language Models (LLMs) for inference presents unique challenges: high memory bandwidth requirements, latency constraints, and the need to serve multiple users concurrently. This tutorial covers **Quantization**, **KV Caching**, **Speculative Decoding**, and production serving engines like **vLLM** and **TGI**.

## Prerequisites

- Understanding of Transformer architecture (Tutorial 01)
- Basic PyTorch knowledge
- Familiarity with Hugging Face `transformers`

---

## 1. The Inference Bottleneck

Unlike training, which is compute-bound (matrix multiplications), inference is often **memory-bandwidth bound**.

### The Memory Wall

To generate one token:

1. Load all model weights from VRAM into the compute units.
2. Perform the calculation.
3. Store the result.

For a 7B model (14 GB in fp16):

- To generate 1 token at batch size 1, you must read 14 GB of weights.
- On an A100 (~1.5 TB/s bandwidth), this takes ~9 ms just for the memory transfer.
- Compute time is negligible (~0.1 ms).
- **Result**: You can only generate ~100 tokens/sec regardless of compute power.

**Solution**: Reduce the memory footprint (Quantization) and reuse memory (KV Cache).

---

## 2. Quantization: Reducing Precision

Quantization reduces the number of bits used to represent weights and activations.

### Types of Quantization

#### 2.1 Post-Training Quantization (PTQ)

Quantize a pre-trained model without retraining.

- **INT8**: Weights scaled to 8-bit integers. Minimal accuracy loss.
- **FP8**: Newer format with hardware support on H100s. Good balance of range and precision.
- **INT4**: Aggressive compression (e.g., GPTQ, AWQ). Requires careful calibration.

#### 2.2 Quantization-Aware Training (QAT)

Simulate quantization noise during training/fine-tuning.

- The model learns to compensate for the precision loss.
- Best for INT4 or lower, but computationally expensive.

### Implementing INT8/INT4 with BitsAndBytes

The easiest way to quantize for inference using Hugging Face.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit configuration (QLoRA-style inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # Normalized Float 4
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
    bnb_4bit_use_double_quant=True,        # nested quantization of the quantization constants
)

# 8-bit configuration (alternative)
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",               # automatically spread layers across available GPUs
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)

# Inference
input_text = "Explain quantum entanglement."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### GPTQ & AWQ (Calibration-Based 4-bit Quantization)

For INT4, naive round-to-nearest quantization degrades quality noticeably. **GPTQ** and **AWQ** quantize more carefully using calibration data:

- **GPTQ**: Quantizes weights layer by layer, using second-order (Hessian-based) information to compensate for the quantization error.
- **AWQ (Activation-Aware Weight Quantization)**: Identifies salient weight channels (those paired with large activations) and rescales them before quantization so they lose less precision.
- **Tools**: Use `auto-gptq` for GPTQ and `autoawq` for AWQ on GPUs, or `llama.cpp`/`llama-cpp-python` for GGUF formats. A loading sketch for a pre-quantized AWQ checkpoint follows below.
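As a minimal, hedged sketch, a published pre-quantized AWQ checkpoint can be loaded directly through `transformers`. This assumes `pip install autoawq` and a CUDA GPU; the checkpoint name is the one reused in the exercise in Section 7.

```python
# Sketch: load a pre-quantized AWQ checkpoint via transformers (assumes autoawq is installed).
# The quantization config stored inside the checkpoint tells transformers how to run it.
from transformers import AutoModelForCausalLM, AutoTokenizer

awq_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # pre-quantized checkpoint, also used in Section 7
tokenizer = AutoTokenizer.from_pretrained(awq_model_id)
model = AutoModelForCausalLM.from_pretrained(awq_model_id, device_map="auto")

inputs = tokenizer("Summarize attention in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0], skip_special_tokens=True))
```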
To quantize your own model with GPTQ:

```bash
# Install auto-gptq
pip install auto-gptq optimum
```

```python
# Quantization script example (auto-gptq)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)

# ... calibrate on dataset ...
# calibration_data: a list of tokenized examples ({"input_ids": ..., "attention_mask": ...})
model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
```

---

## 3. KV Cache: Speeding Up Generation

Transformers are autoregressive. To generate token $t$, we need all previous tokens $0 \dots t-1$.

Naive approach: re-run attention over the entire history for every new token, so each step costs $O(N^2)$.

**Optimization**: Cache the Key (K) and Value (V) matrices from previous steps.

- At step $t$, only compute Q, K, V for the *new* token.
- Retrieve cached K, V for previous tokens.
- Perform attention, which is now $O(N)$ per step.

### Memory Cost of KV Cache

For a model with $L$ layers, $H$ heads, head dimension $D_{head}$ (hidden size $D = H \times D_{head}$), sequence length $S$, and batch size $B$:

$$ \text{Memory} = 2 \times L \times H \times D_{head} \times S \times B \times \text{precision\_bytes} $$

For Llama-2-7B (32 layers, hidden size 4096, fp16):

- Approximately 0.5 MB per token per batch item.
- For Batch=32, SeqLen=2048: ~32 GB just for the KV cache!
- **Implication**: The KV cache often limits batch size more than the model weights do.

### Paged Attention (vLLM Innovation)

Traditional KV cache allocation is static (pre-allocate the maximum sequence length), which is wasteful.

**vLLM** uses OS-style virtual memory paging:

- Split the KV cache into fixed-size blocks.
- Dynamically allocate blocks as tokens are generated.
- Share blocks between sequences (useful for beam search or a shared prompt prefix).
- **Result**: Near-100% memory utilization, higher throughput.

---

## 4. Speculative Decoding

Idea: Use a small "draft" model to guess the next $K$ tokens, then verify them with the large "target" model in parallel.

1. The draft model (fast) generates $x_1, x_2, x_3$.
2. The target model (slow) computes probabilities for all 3 in one forward pass.
3. Accept/reject logic:
   - Accept $x_i$ with probability $\min\!\left(1, \frac{P_{target}(x_i)}{P_{draft}(x_i)}\right)$.
   - On rejection, resample that token from an adjusted target distribution and discard the remaining drafts.
4. **Speedup**: If the draft is accurate, we generate $K$ tokens in the time of 1 target forward pass.

```python
# Assisted generation (speculative decoding) with Hugging Face transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft_model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
).to("cuda")
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Write a python function to sort a list.", return_tensors="pt").to("cuda")

# Assisted generation
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    num_assistant_tokens=5,  # how many tokens to draft per step
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 5. Production Serving Engines

Do not use `model.generate()` in production. It lacks concurrency control, batching, and streaming optimization. Use dedicated servers.

### 5.1 vLLM (High Throughput)

Best for high-throughput scenarios (batch processing, heavy traffic). Implements PagedAttention.
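Besides the HTTP server shown next, vLLM also exposes an offline Python API that is convenient for batch jobs. A minimal sketch, assuming vLLM is installed (see Installation below) and a local GPU is available:

```python
# Offline batch inference with vLLM's Python API (minimal sketch)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # loads weights and pre-allocates paged KV cache blocks
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Explain KV caching briefly.", "What is PagedAttention?"]
for output in llm.generate(prompts, params):  # prompts are batched together internally
    print(output.outputs[0].text)
```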
**Installation**:

```bash
pip install vllm
```

**Running a Server** (the OpenAI-compatible entrypoint, since the client below uses the OpenAI SDK):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9
```

**Client Usage**:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```

**Key Features**:

- Continuous batching: as soon as one request finishes, a new one starts in the same batch slot.
- PagedAttention: efficient memory usage.
- Supports quantization (AWQ, GPTQ, SqueezeLLM).

### 5.2 Text Generation Inference (TGI)

Developed by Hugging Face. Optimized for text generation and supports Tensor Parallelism out of the box. Written in Rust and Python.

**Run with Docker**:

```bash
# Gated models such as Llama-2 also need -e HUGGING_FACE_HUB_TOKEN=<token> before the image name.
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --quantize bitsandbytes
```

**Key Features**:

- Flash Attention integration.
- Token streaming.
- Watermarking (for AI detection).
- Logits warping (temperature, top_p) handled efficiently.

### 5.3 Llama.cpp (CPU/Mac Inference)

Uses the GGUF format (with heavily quantized variants). Runs on CPU, Apple Silicon, or GPU.

- Ideal for local deployment, edge devices, or MacBooks.
- Supports Q4_K_M, Q5_K_M quantizations.

```bash
# Convert a locally downloaded HF checkpoint to GGUF (fp16), then quantize to 4-bit
python convert.py ./Llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf Q4_K_M

# Run inference
./main -m llama-2-7b-q4_k_m.gguf -p "Hello world" -n 128
```

---

## 6. Deployment Strategies

### 6.1 Latency vs. Throughput

- **Latency**: Time to first token (TTFT) + time per token. Critical for chatbots.
  - *Optimization*: Smaller models, fewer layers, speculative decoding.
- **Throughput**: Total tokens generated per second across all users. Critical for batch analysis.
  - *Optimization*: Large batch sizes, vLLM, Tensor Parallelism.

### 6.2 Scaling Topology

1. **Replica Scaling**: Run multiple instances behind a load balancer (Kubernetes). Simplest.
2. **Tensor Parallelism**: Split one model across 4-8 GPUs for a single request. Needed for >30B models.
3. **Pipeline Parallelism**: Rare for inference due to bubble overhead, but useful for massive models on limited GPUs.

### 6.3 Monitoring Metrics

Track these in production:

- **TTFT (Time To First Token)**: User-perceived latency.
- **TPOT (Time Per Output Token)**: Should roughly match reading speed (aim for <50 ms/token).
- **Queue Depth**: How many requests are waiting?
- **GPU Utilization**: Should be high for throughput, moderate for latency-sensitive apps.
- **KV Cache Hit Rate**: Relevant if using prefix caching for system prompts.

---

## 7. Practical Exercise: Deploying a Quantized Model

**Goal**: Serve a 4-bit Llama-2-7B model using vLLM with high concurrency.

**Step 1: Prepare the Quantized Model**

Use `TheBloke` models from HuggingFace (pre-quantized) or quantize your own.

```bash
# Example: Using a pre-quantized AWQ model
MODEL="TheBloke/Llama-2-7B-Chat-AWQ"
```

**Step 2: Launch vLLM**

```bash
# Legacy demo server exposing /generate (the load test below targets this endpoint)
python -m vllm.entrypoints.api_server \
    --model $MODEL \
    --quantization awq \
    --port 8000 \
    --max-num-seqs 128 \
    --max-model-len 4096
```
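Before load testing, it is worth confirming that the server answers a single request. A hedged sketch against the legacy `/generate` endpoint started above:

```python
# Single-request smoke test against the legacy vLLM /generate endpoint (sketch)
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is quantization?", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # the generated text should appear under the "text" key
```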
**Step 3: Load Test**

Use `locust` or a simple script to simulate 50 concurrent users.

```python
# load_test.py
import concurrent.futures
import requests
import time

def send_request(i):
    start = time.time()
    resp = requests.post("http://localhost:8000/generate", json={
        "prompt": f"Question {i}: What is AI?",
        "max_tokens": 50,
    })
    duration = time.time() - start
    return duration

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(send_request, i) for i in range(50)]
    results = [f.result() for f in futures]

print(f"Average Latency: {sum(results)/len(results):.2f}s")
print(f"Total Requests: {len(results)}")
```

---

## 8. Security & Safety in Deployment

1. **Input Validation**: Prevent prompt injection attacks. Filter malicious inputs before sending them to the model.
2. **Output Filtering**: Use classifiers to detect hate speech, PII, or toxicity before returning text to the user.
3. **Rate Limiting**: Protect against DoS and cost overruns.
4. **Model Watermarking**: Embed invisible signals to identify AI-generated text (supported in TGI).

---

## 9. Summary Checklist

| Technique | Benefit | Trade-off | Best For |
| :--- | :--- | :--- | :--- |
| **INT8/FP8** | 2x memory savings, 2x speed | Slight accuracy drop | General inference |
| **INT4 (AWQ/GPTQ)** | 4x memory savings | Calibration needed, HW specific | Consumer GPUs, Edge |
| **KV Cache** | Linear vs. quadratic per-step cost | High VRAM usage | Long context |
| **PagedAttention** | Max VRAM utilization | Implementation complexity | High concurrency |
| **Speculative Decoding** | 2-3x speedup | Needs draft model | Latency sensitive |
| **vLLM/TGI** | Production features | Extra infrastructure | Production APIs |

## Next Steps

In Tutorial 12, we will cover **MLOps, Automation & Governance**, focusing on CI/CD pipelines for models, model registries, compliance tracking, and automated evaluation gates.