Tutorial 11: Inference Optimization & Deployment
Overview
Training a model is only half the battle. Deploying Large Language Models (LLMs) for inference presents unique challenges: high memory bandwidth requirements, latency constraints, and the need to serve multiple users concurrently. This tutorial covers Quantization, KV Caching, Speculative Decoding, and production serving engines like vLLM and TGI.
Prerequisites
- Understanding of Transformer architecture (Tutorial 01)
- Basic PyTorch knowledge
- Familiarity with the Hugging Face transformers library
1. The Inference Bottleneck
Unlike training, which is compute-bound (matrix multiplications), inference is often memory-bandwidth bound.
The Memory Wall
To generate one token:
- Load all model weights from VRAM into the compute units.
- Perform calculation.
- Store result.
For a 7B model (14GB in fp16):
- To generate 1 token at batch size 1, you must read 14GB of data.
- On an A100 (1.5 TB/s bandwidth), this takes ~9ms just for memory transfer.
- Compute time is negligible (~0.1ms).
- Result: You can only generate ~100 tokens/sec regardless of compute power.
Solution: Reduce the memory footprint (quantization) and avoid redundant computation (KV caching).
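A quick back-of-the-envelope check of the numbers above:
# Bandwidth-bound decoding: every token streams all weights through the memory bus
PARAMS = 7e9                # 7B parameters
BYTES_PER_PARAM = 2         # fp16
BANDWIDTH = 1.5e12          # A100 HBM bandwidth in bytes/sec (~1.5 TB/s)

weight_bytes = PARAMS * BYTES_PER_PARAM      # 14 GB read per token
time_per_token = weight_bytes / BANDWIDTH    # ~9.3 ms
print(f"Memory-bound ceiling: {1 / time_per_token:.0f} tokens/sec")  # ~107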
2. Quantization: Reducing Precision
Quantization reduces the number of bits used to represent weights and activations.
Types of Quantization
2.1 Post-Training Quantization (PTQ)
Quantize a pre-trained model without retraining.
- INT8: Weights scaled to 8-bit integers. Minimal accuracy loss (see the sketch after this list).
- FP8: Newer format with native hardware support on NVIDIA H100 GPUs. Good balance of range and precision.
- INT4: Aggressive compression (e.g., GPTQ, AWQ). Requires careful calibration.
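To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest INT8 weight quantization (per-tensor scaling; production libraries use per-channel scales plus calibration data):
import torch

def quantize_int8(w):
    # Map [-max|w|, max|w|] linearly onto the signed 8-bit range [-127, 127]
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print(f"Mean abs error: {(w - dequantize(q, scale)).abs().mean():.5f}")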
2.2 Quantization-Aware Training (QAT)
Simulate quantization noise during training/fine-tuning.
- Model learns to compensate for precision loss.
- Best for INT4 or lower, but computationally expensive.
Implementing 4-bit/8-bit Quantization with bitsandbytes
The easiest way to quantize for inference in the Hugging Face ecosystem.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit configuration (QLoRA-style inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4 (NF4)
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,        # nested quantization of the scales
)

# 8-bit configuration
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",               # automatically spread across available GPUs
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
# Inference
input_text = "Explain quantum entanglement."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
GPTQ & AWQ (GPU-Aware Quantization)
At INT4, naive round-to-nearest quantization degrades quality noticeably, so GPTQ and AWQ bring activation statistics into the quantization process.
- GPTQ: quantizes weights layer by layer, using approximate second-order (Hessian) information to compensate for the error it introduces.
- AWQ (Activation-Aware Weight Quantization): identifies salient weight channels (those multiplied by large activations) and rescales them before quantization so they lose less precision.
- Tooling: use auto-gptq for GPTQ models, or llama-cpp-python for GGUF-format models.
# Install auto-gptq
pip install auto-gptq optimum

# Quantize script example
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,              # quantize to 4-bit
    group_size=128,      # share one scale per group of 128 weights
    damp_percent=0.01,   # dampening for the Hessian inverse
    desc_act=False,      # activation-order quantization off (faster inference)
)

model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)

# ... calibration_data: a small list of tokenized examples
#     (dicts with input_ids / attention_mask) representative of your domain ...
model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
3. KV Cache: Speeding Up Generation
Transformers are autoregressive. To generate token $t$, attention needs all previous tokens $0 \dots t-1$. Naive approach: re-compute the entire sequence history for every new token, so step $t$ costs $O(t^2)$ in attention alone.
Optimization: Cache the Key (K) and Value (V) matrices from previous steps, reducing each step to $O(t)$.
- At step $t$, only compute Q, K, V for the new token.
- Retrieve cached K, V for previous tokens.
- Perform attention over the cached plus new K, V (see the sketch below).
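A minimal sketch of the mechanism for a single attention head (shapes and names are illustrative, not any particular library's API):
import torch
import torch.nn.functional as F

d = 64                           # head dimension
k_cache = torch.empty(0, d)      # cached keys, shape (t, d)
v_cache = torch.empty(0, d)      # cached values, shape (t, d)

def decode_step(x_new, W_q, W_k, W_v):
    # Project only the new token; reuse everything already in the cache
    global k_cache, v_cache
    q = x_new @ W_q
    k_cache = torch.cat([k_cache, x_new @ W_k])
    v_cache = torch.cat([v_cache, x_new @ W_v])
    attn = F.softmax(q @ k_cache.T / d**0.5, dim=-1)
    return attn @ v_cache

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
for _ in range(5):               # decode 5 tokens
    out = decode_step(torch.randn(1, d), W_q, W_k, W_v)
print(k_cache.shape)             # torch.Size([5, 64])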
Memory Cost of KV Cache
For a model with $L$ layers, $H$ heads, hidden size $D$, sequence length $S$, and batch size $B$ (the $H$ heads together span the hidden size, so $H$ drops out):
$$\text{KV cache size} = 2 \times B \times S \times L \times D \times (\text{bytes per element})$$
The factor of 2 accounts for storing both K and V.
For Llama-2-7B (32 layers, 4096 hidden, fp16):
- Approx 0.5 MB per token per batch item.
- For Batch=32, SeqLen=2048: ~32GB just for KV cache!
- Implication: KV cache often limits batch size more than model weights.
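A quick sanity check of these numbers:
# KV cache size = 2 (K and V) * batch * seq_len * layers * hidden * bytes/element
def kv_cache_gb(batch, seq_len, layers=32, hidden=4096, dtype_bytes=2):
    return 2 * batch * seq_len * layers * hidden * dtype_bytes / 1024**3

print(f"{kv_cache_gb(1, 1) * 1024:.2f} MB per token")        # ~0.50 MB
print(f"{kv_cache_gb(32, 2048):.0f} GB for B=32, S=2048")    # 32 GB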
Paged Attention (vLLM Innovation)
Traditional KV cache allocation is static: memory for the maximum sequence length is pre-allocated up front, so most of it sits unused for short sequences. vLLM instead applies OS-style virtual memory paging:
- Split KV cache into fixed-size blocks.
- Dynamically allocate blocks as tokens are generated.
- Share blocks between sequences (useful for beam search or same prompt).
- Result: near-100% memory utilization and substantially higher throughput (a toy sketch of the block bookkeeping follows).
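The core bookkeeping can be sketched in a few lines (a toy allocator for illustration, not vLLM's actual implementation):
BLOCK_SIZE = 16                           # tokens per KV block

class ToyBlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # sequence id -> list of physical blocks

    def append_token(self, seq_id, token_index):
        # Grab a new physical block only when the current one fills up
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())
        return table[token_index // BLOCK_SIZE]

alloc = ToyBlockAllocator(num_blocks=1024)
for t in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("seq-0", t)
print(len(alloc.block_tables["seq-0"]))   # 3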
4. Speculative Decoding
Idea: Use a small "draft" model to guess the next $K$ tokens, then verify them with the large "target" model in parallel.
- Draft model (fast) generates $x_1, x_2, x_3$.
- Target model (slow) computes probabilities for all 3 in one forward pass.
- Accept/Reject logic (sketched below): accept drafted token $x_i$ with probability $\min\!\left(1, P_{target}(x_i) / P_{draft}(x_i)\right)$.
- On the first rejection, resample that token from the adjusted distribution $\max(0, P_{target} - P_{draft})$ (renormalized) and discard the rest of the draft.
- Speedup: If the draft is accurate, we generate up to $K$ tokens in the time of one target forward pass, and the output distribution provably matches the target model's.
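The acceptance rule itself fits in a few lines; the distributions below are random stand-ins for real model outputs:
import torch

def accept_or_resample(p_target, p_draft, x):
    # Accept x with probability min(1, p_target(x) / p_draft(x))
    ratio = (p_target[x] / p_draft[x]).item()
    if torch.rand(()).item() < min(1.0, ratio):
        return x, True
    # On rejection, resample from the residual max(0, p_target - p_draft)
    residual = torch.clamp(p_target - p_draft, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item(), False

vocab = 8
p_draft = torch.softmax(torch.randn(vocab), dim=0)   # draft model distribution
p_target = torch.softmax(torch.randn(vocab), dim=0)  # target model distribution
x = torch.multinomial(p_draft, 1).item()             # draft model's guess
print(accept_or_resample(p_target, p_draft, x))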
Hugging Face transformers implements this as "assisted generation":
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Draft and target models must use compatible tokenizers
draft_model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
).to("cuda")
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Write a python function to sort a list.", return_tensors="pt").to("cuda")

# Assisted generation: the draft proposes, the target verifies in one forward pass
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    num_assistant_tokens=5,  # how many tokens to draft per verification step
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
5. Production Serving Engines
Do not use model.generate() in production. It has no request scheduling, continuous batching, or memory management for concurrent users. Use a dedicated serving engine.
5.1 vLLM (High Throughput)
Best for high-throughput scenarios (batch processing, heavy traffic). Implements PagedAttention.
Installation:
pip install vllm
Running an OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--port 8000 \
--tensor-parallel-size 1 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9
Client Usage:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")  # delta may be empty on some chunks
Key Features:
- Continuous batching: As soon as one request finishes, start a new one in the same batch slot.
- PagedAttention: Efficient memory usage.
- Supports quantization (AWQ, GPTQ, SqueezeLLM).
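For offline batch workloads, vLLM also exposes a Python API; a minimal sketch (prompts and sampling parameters are illustrative):
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
params = SamplingParams(temperature=0.7, max_tokens=64)

# Continuous batching schedules the whole list efficiently in one call
outputs = llm.generate(["What is AI?", "Explain KV caching."], params)
for out in outputs:
    print(out.outputs[0].text)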
5.2 Text Generation Inference (TGI)
Developed by Hugging Face. Optimized for text generation and supports tensor parallelism out of the box; the request router is written in Rust.
Run with Docker:
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --quantize bitsandbytes
For gated models such as Llama-2, also pass your access token with -e HUGGING_FACE_HUB_TOKEN=<token>.
Key Features:
- Flash Attention integration.
- Token streaming.
- Watermarking (for AI detection).
- Logits warping (temperature, top_p) handled efficiently.
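The endpoint above can be queried from Python with huggingface_hub's InferenceClient; a minimal streaming sketch:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Stream tokens as they are generated
for token in client.text_generation(
    "Explain quantum entanglement.", max_new_tokens=50, stream=True
):
    print(token, end="", flush=True)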
5.3 Llama.cpp (CPU/Mac Inference)
Uses GGUF format (highly quantized). Runs on CPU, Apple Silicon, or GPU.
- Ideal for local deployment, edge devices, or MacBooks.
- Supports Q4_K_M, Q5_K_M quantizations.
# Convert a local HF checkpoint to fp16 GGUF (convert.py expects a local model directory)
python convert.py models/llama-2-7b --outtype f16 --outfile llama-2-7b-f16.gguf
# Quantize the fp16 GGUF down to 4-bit
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_K_M.gguf q4_K_M
# Run inference
./main -m llama-2-7b-q4_K_M.gguf -p "Hello world" -n 128
6. Deployment Strategies
6.1 Latency vs. Throughput
- Latency: Time to first token (TTFT) + time per token. Critical for chatbots.
- Optimization: Smaller models, fewer layers, speculative decoding.
- Throughput: Total tokens generated per second across all users. Critical for batch analysis.
- Optimization: Large batch sizes, vLLM, Tensor Parallelism.
6.2 Scaling Topology
- Replica Scaling: Run multiple instances behind a load balancer (Kubernetes). Simplest.
- Tensor Parallelism: Split one model across 4-8 GPUs for a single request. Needed for >30B models.
- Pipeline Parallelism: Rare for inference due to bubble overhead, but useful for massive models on limited GPUs.
6.3 Monitoring Metrics
Track these in production:
- TTFT (Time To First Token): User perceived latency.
- TPOT (Time Per Output Token): should keep pace with human reading speed (aim for <50 ms/token); see the measurement sketch after this list.
- Queue Depth: how many requests are waiting to be scheduled.
- GPU Utilization: Should be high for throughput, moderate for latency-sensitive apps.
- KV Cache Hit Rate: (If using prefix caching for system prompts).
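TTFT and TPOT are easy to measure with a streaming request; a sketch against the OpenAI-compatible vLLM endpoint from Section 5.1:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.time()
timestamps = []
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Explain KV caching."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    timestamps.append(time.time())       # one timestamp per streamed chunk

ttft = timestamps[0] - start
tpot = (timestamps[-1] - timestamps[0]) / max(len(timestamps) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/token")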
7. Practical Exercise: Deploying a Quantized Model
Goal: Serve a 4-bit Llama-2-7B model using vLLM with high concurrency.
Step 1: Prepare the Quantized Model
Use TheBloke models from HuggingFace (pre-quantized) or quantize your own.
# Example: Using a pre-quantized AWQ model
MODEL="TheBloke/Llama-2-7B-Chat-AWQ"
Step 2: Launch vLLM
# Legacy REST server exposing /generate (matches the load-test script below)
python -m vllm.entrypoints.api_server \
--model $MODEL \
--quantization awq \
--port 8000 \
--max-num-seqs 128 \
--max-model-len 4096
Step 3: Load Test
Use locust or a simple script to simulate 50 concurrent users.
# load_test.py
import concurrent.futures
import requests
import time

def send_request(i):
    start = time.time()
    resp = requests.post("http://localhost:8000/generate", json={
        "prompt": f"Question {i}: What is AI?",
        "max_tokens": 50
    })
    duration = time.time() - start
    return duration

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(send_request, i) for i in range(50)]
    results = [f.result() for f in futures]

print(f"Average Latency: {sum(results)/len(results):.2f}s")
print(f"Total Requests: {len(results)}")
8. Security & Safety in Deployment
- Input Validation: Prevent prompt injection attacks. Filter malicious inputs before sending to model.
- Output Filtering: Use classifiers to detect hate speech, PII, or toxicity before returning to user.
- Rate Limiting: Protect against DoS attacks and cost overruns (a minimal sketch follows this list).
- Model Watermarking: Embed invisible signals to identify AI-generated text (supported in TGI).
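To illustrate the rate-limiting point, a minimal in-process token-bucket limiter (real deployments would enforce this at an API gateway or with Redis):
import time
from collections import defaultdict

class TokenBucket:
    # Allow `rate` requests/sec per client, with bursts up to `capacity`
    def __init__(self, rate=2.0, capacity=10):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last = defaultdict(time.monotonic)

    def allow(self, client_id):
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill in proportion to elapsed time, capped at bucket capacity
        self.tokens[client_id] = min(self.capacity,
                                     self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False

limiter = TokenBucket()
print([limiter.allow("user-1") for _ in range(12)])  # burst: roughly the first 10 are True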
9. Summary Checklist
| Technique | Benefit | Trade-off | Best For |
|---|---|---|---|
| INT8/FP8 | 2x memory savings, 2x speed | Slight accuracy drop | General inference |
| INT4 (AWQ/GPTQ) | 4x memory savings | Calibration needed, HW specific | Consumer GPUs, Edge |
| KV Cache | Linear vs Quadratic time | High VRAM usage | Long context |
| PagedAttention | Max VRAM utilization | Implementation complexity | High concurrency |
| Speculative Decoding | 2-3x speedup | Needs draft model | Latency sensitive |
| vLLM/TGI | Production features | Extra infrastructure | Production APIs |
Next Steps
In Tutorial 12, we will cover MLOps, Automation & Governance, focusing on CI/CD pipelines for models, model registries, compliance tracking, and automated evaluation gates.