	# Tutorial 11: Inference Optimization & Deployment

	## Overview
	Training a model is only half the battle. Deploying Large Language Models (LLMs) for inference presents unique challenges: high memory bandwidth requirements, latency constraints, and the need to serve multiple users concurrently. This tutorial covers Quantization, KV Caching, Speculative Decoding, and production serving engines like vLLM and TGI.

	## Prerequisites
	- Understanding of Transformer architecture (Tutorial 01)
	- Basic PyTorch knowledge
	- Familiarity with Hugging Face `transformers`

	---

	## 1. The Inference Bottleneck

	Unlike training, which is compute-bound (matrix multiplications), inference is often memory-bandwidth bound.

	### The Memory Wall
	To generate one token:
	1. Load all model weights from VRAM to Compute Units.
	2. Perform calculation.
	3. Store result.

	For a 7B model (14GB in fp16):
	- To generate 1 token at batch size 1, you must read 14GB of data.
	- On an A100 (1.5 TB/s bandwidth), this takes ~9ms just for memory transfer.
	- Compute time is negligible (~0.1ms).
	- Result: You can only generate ~100 tokens/sec regardless of compute power.
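
The arithmetic above can be reproduced with a short script. This is a rough sketch that only models weight traffic at batch size 1; the helper name `decode_time_ms` is hypothetical, and the 4-bit case assumes an idealized 0.5 bytes per parameter with no quantization metadata.

```python
def decode_time_ms(params_billion: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    """Lower bound on per-token decode time at batch size 1, in milliseconds."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3

fp16_ms = decode_time_ms(7, 2, 1.5)    # 7B params, 2 bytes each, 1.5 TB/s
int4_ms = decode_time_ms(7, 0.5, 1.5)  # same model with idealized 4-bit weights

print(f"fp16: {fp16_ms:.1f} ms/token, about {1000 / fp16_ms:.0f} tokens/s")
print(f"int4: {int4_ms:.1f} ms/token, about {1000 / int4_ms:.0f} tokens/s")
```

Because the bound scales with bytes read, shrinking the weights by 4x raises the bandwidth ceiling by the same factor.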

	Solution: Reduce the bytes read per token (Quantization), avoid recomputing past tokens (KV Cache), and amortize each weight read over more requests (batching).

	---

	## 2. Quantization: Reducing Precision

	Quantization reduces the number of bits used to represent weights and activations.

	### Types of Quantization

	#### 2.1 Post-Training Quantization (PTQ)
	Quantize a pre-trained model without retraining.
	- INT8: Weights scaled to 8-bit integers. Minimal accuracy loss.
	- FP8: New format supported by H100s. Good balance of range and precision.
	- INT4: Aggressive compression (e.g., GPTQ, AWQ). Requires careful calibration.
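
To make the idea of scaling weights to 8-bit integers concrete, here is a minimal sketch of symmetric per-tensor (absmax) quantization in plain Python. It is an illustration only: the function names and sample weights are made up, and real libraries typically quantize per channel or per group.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -1.27, 0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale, "max round-trip error:", max_err)
```

The round-trip error of each weight is at most half the scale, which is why a single outlier (here `-1.27`) costs precision for every small weight that shares its scale.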

	#### 2.2 Quantization-Aware Training (QAT)
	Simulate quantization noise during training/fine-tuning.
	- Model learns to compensate for precision loss.
	- Best for INT4 or lower, but computationally expensive.

	### Implementing 4-bit and 8-bit Loading with BitsAndBytes
	The easiest way to quantize for inference with Hugging Face `transformers`.

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

	model_id = "meta-llama/Llama-2-7b-hf"

	# 4-bit configuration (QLoRA-style inference)
	bnb_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
	    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matmuls
	    bnb_4bit_use_double_quant=True,        # nested quantization of the scales
	)

	# 8-bit configuration (use instead of the block above)
	# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

	tokenizer = AutoTokenizer.from_pretrained(model_id)

	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="auto",               # automatically place layers on available GPUs
	    quantization_config=bnb_config,  # quantize weights while loading
	    torch_dtype=torch.float16,       # dtype for the non-quantized modules
	    low_cpu_mem_usage=True,
	)

	# Inference
	input_text = "Explain quantum entanglement."
	inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

	outputs = model.generate(**inputs, max_new_tokens=50)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

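	### GPTQ & AWQ (Calibration-Based 4-bit Quantization)
	At INT4, naive round-to-nearest quantization often degrades accuracy noticeably. GPTQ and AWQ both use a small calibration set to decide how to quantize the weights.
	- GPTQ: Quantizes weights layer by layer, using approximate second-order information to compensate for the rounding error introduced so far.
	- AWQ (Activation-aware Weight Quantization): Identifies salient weight channels (those that see large activations) and rescales them before quantization so they lose less precision. All weights are still stored in low-bit form.
	- Tools: Use `auto-gptq` for GPTQ and `autoawq` for AWQ. GGUF is a separate format used by llama.cpp (and `llama-cpp-python`).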

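	```python
	# Install first (shell): pip install auto-gptq optimum
	from transformers import AutoTokenizer
	from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

	model_id = "meta-llama/Llama-2-7b-hf"

	quantize_config = BaseQuantizeConfig(
	    bits=4,             # target weight precision
	    group_size=128,     # weights per quantization group
	    damp_percent=0.01,  # Hessian dampening for numerical stability
	    desc_act=False,     # skip activation-order reordering for faster inference
	)

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

	# Calibration data: a list of tokenized samples. One sentence is a placeholder;
	# use a few hundred representative texts in practice.
	calibration_data = [tokenizer("Quantization reduces the precision of model weights.")]
	model.quantize(calibration_data)
	model.save_quantized("llama-2-7b-gptq")
	```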

	---

	## 3. KV Cache: Speeding Up Generation

	Transformers are autoregressive: generating token $t$ requires attending over all previous tokens $0 \dots t-1$.
	Naive approach: Re-run the model over the entire sequence for every new token, so generating $N$ tokens costs $O(N^2)$ token-level forward passes.

	Optimization: Cache the Key (K) and Value (V) matrices from previous steps.
	- At step $t$, only compute Q, K, V for the new token.
	- Retrieve cached K, V for previous tokens.
	- Perform Attention.
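
The steps above can be sketched with a toy single-head attention layer in plain Python. This is a minimal illustration under simplifying assumptions (one head, random weights, no positional encoding); the helper names are hypothetical. It checks that the cached path gives the same outputs as recomputing K and V for the whole prefix at every step.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(d)]

rng = random.Random(0)
D = 4
Wq, Wk, Wv = rand_matrix(D, D, rng), rand_matrix(D, D, rng), rand_matrix(D, D, rng)
tokens = rand_matrix(6, D, rng)             # six toy token embeddings

# Naive: at every step, re-project K and V for the whole prefix.
naive_out = []
for t in range(len(tokens)):
    keys = [matvec(Wk, x) for x in tokens[: t + 1]]
    values = [matvec(Wv, x) for x in tokens[: t + 1]]
    naive_out.append(attend(matvec(Wq, tokens[t]), keys, values))

# Cached: project only the new token and append to the cache.
k_cache, v_cache, cached_out = [], [], []
for x in tokens:
    k_cache.append(matvec(Wk, x))
    v_cache.append(matvec(Wv, x))
    cached_out.append(attend(matvec(Wq, x), k_cache, v_cache))

print("outputs match:", naive_out == cached_out)
```

The naive loop performs a number of K/V projections that grows quadratically with the sequence length, while the cached loop performs exactly one per token.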

	### Memory Cost of KV Cache
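	For a model with $L$ layers, $H$ attention heads, head dimension $D_{head} = D / H$ (where $D$ is the hidden size), sequence length $S$, and batch size $B$:
	$$ \text{Memory} = 2 \times L \times H \times D_{head} \times S \times B \times \text{precision\_bytes} $$
	The leading factor of 2 accounts for storing both K and V.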

	For Llama-2-7B (32 layers, 4096 hidden, fp16):
	- Approx 0.5 MB per token per batch item.
	- For Batch=32, SeqLen=2048: ~32GB just for KV cache!
	- Implication: KV cache often limits batch size more than model weights.
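
The figures above follow directly from the formula. A small sketch, assuming Llama-2-7B's 32 attention heads of dimension 128 (32 × 128 = 4096 hidden) and no grouped-query attention; the function name is hypothetical.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_value

per_token = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=1, batch=1)
full = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=2048, batch=32)

print(per_token / 2**20, "MiB per token")
print(full / 2**30, "GiB for batch 32 at 2048 tokens")
```

Models that use grouped-query or multi-query attention store fewer K/V heads than query heads, so `heads` should then be the number of K/V heads.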

	### Paged Attention (vLLM Innovation)
	Traditional KV cache allocation is static: each request reserves contiguous memory for the maximum sequence length, and most of it goes unused.
	vLLM uses OS-style virtual memory paging:
	- Split KV cache into fixed-size blocks.
	- Dynamically allocate blocks as tokens are generated.
	- Share blocks between sequences (useful for beam search or a shared prompt prefix).
	- Result: Near-full memory utilization (waste is limited to the last, partially filled block of each sequence) and higher throughput.
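
The block-based bookkeeping can be sketched as a toy allocator. This is not vLLM's implementation: the class and method names are hypothetical, it tracks only block ids rather than tensors, and it omits block sharing and copy-on-write.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt a request")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
blocks_in_use = sum(len(t) for t in alloc.block_tables.values())
print("blocks in use:", blocks_in_use, "free:", len(alloc.free_blocks))
```

With static allocation for a 2048-token maximum, the same two sequences would reserve 256 blocks of 16 tokens; here they hold 3.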

	---

	## 4. Speculative Decoding

	Idea: Use a small "draft" model to guess the next $K$ tokens, then verify them with the large "target" model in parallel.

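	1. Draft model (fast) generates $x_1, x_2, x_3$ autoregressively.
	2. Target model (slow) computes probabilities for all 3 positions in one forward pass.
	3. Accept/Reject logic, applied left to right:
	   - Accept $x_i$ with probability $\min(1, P_{target}(x_i) / P_{draft}(x_i))$.
	   - At the first rejection, resample that position from the normalized residual $\max(0, P_{target} - P_{draft})$ and discard the remaining draft tokens.
	4. Speedup: If the draft is accurate, one target forward pass yields up to $K+1$ tokens (the $K$ accepted drafts plus one token sampled by the target itself), at the extra cost of running the draft model.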
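
The accept/reject step is the part that keeps the output distribution identical to the target model's. In the standard speculative sampling rule, a drafted token is accepted with probability $\min(1, P_{target}/P_{draft})$ and otherwise replaced by a sample from the normalized positive part of $P_{target} - P_{draft}$. The sketch below applies that rule to a single position with made-up three-token distributions (`verify_token`, `p_draft`, and `p_target` are illustrative names and values) and checks empirically that the emitted tokens follow the target distribution.

```python
import random

def verify_token(token, p_draft, p_target, rng=random):
    """Return (emitted_token, accepted) for one drafted token."""
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample from the normalized positive part of (target - draft).
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(weights)), weights=weights)[0], False

rng = random.Random(0)
p_draft = [0.3, 0.5, 0.2]   # toy draft distribution over a 3-token vocabulary
p_target = [0.6, 0.3, 0.1]  # toy target distribution
counts = [0, 0, 0]
trials = 20000
for _ in range(trials):
    proposal = rng.choices(range(3), weights=p_draft)[0]
    emitted, accepted = verify_token(proposal, p_draft, p_target, rng)
    counts[emitted] += 1
freqs = [c / trials for c in counts]
print("empirical:", freqs, "target:", p_target)
```

With greedy decoding the rule reduces to accepting a drafted token only when it equals the target model's argmax.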

	```python
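	# Assisted generation (speculative decoding) with Hugging Face transformers
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# The draft model must share the target's tokenizer; TinyLlama uses the Llama tokenizer.
	draft_model = AutoModelForCausalLM.from_pretrained(
	    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
	).to("cuda")
	target_model = AutoModelForCausalLM.from_pretrained(
	    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
	).to("cuda")

	tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
	inputs = tokenizer("Write a python function to sort a list.", return_tensors="pt").to("cuda")

	# How many tokens to draft per step. In the transformers versions we are aware of,
	# this is read from the assistant model's generation config.
	draft_model.generation_config.num_assistant_tokens = 5

	outputs = target_model.generate(
	    **inputs,
	    assistant_model=draft_model,
	    max_new_tokens=50,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))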
	```

	---

	## 5. Production Serving Engines

	Do not call `model.generate()` directly behind a production API. It has no request scheduling, continuous batching, or KV-cache memory management across concurrent users. Use dedicated servers.

	### 5.1 vLLM (High Throughput)
	Best for high-throughput scenarios (batch processing, heavy traffic). Implements PagedAttention.

	Installation:
	```bash
	pip install vllm
	```

	Running a Server:
	```bash
	# OpenAI-compatible server (serves /v1/completions and /v1/chat/completions)
	python -m vllm.entrypoints.openai.api_server \
	    --model meta-llama/Llama-2-7b-hf \
	    --port 8000 \
	    --tensor-parallel-size 1 \
	    --max-num-seqs 256 \
	    --gpu-memory-utilization 0.9
	```

	Client Usage:
	```python
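	from openai import OpenAI

	# Requires vLLM's OpenAI-compatible server (vllm.entrypoints.openai.api_server).
	# The key is a placeholder; a local server does not need a real one by default.
	client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

	# Llama-2-7b-hf is a base model without a chat template, so use the
	# plain completions endpoint rather than chat completions.
	response = client.completions.create(
	    model="meta-llama/Llama-2-7b-hf",
	    prompt="Hello!",
	    max_tokens=64,
	    stream=True,
	)

	for chunk in response:
	    print(chunk.choices[0].text, end="", flush=True)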
	```

	Key Features:
	- Continuous batching: As soon as one request finishes, start a new one in the same batch slot.
	- PagedAttention: Efficient memory usage.
	- Supports quantization (AWQ, GPTQ, SqueezeLLM).
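
The benefit of continuous batching can be illustrated with a toy step counter. This is a simplified simulation, not vLLM's scheduler: it assumes every decode step takes the same time, ignores prefill, and uses a made-up workload in which two long requests are mixed with short ones.

```python
def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """A waiting request takes over a slot as soon as its occupant finishes."""
    queue = list(lengths)
    active = []  # remaining output tokens per running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        steps += 1                                 # one decode step for all active requests
        active = [n - 1 for n in active if n > 1]  # drop requests that just emitted their last token
    return steps

lengths = [10, 2, 2, 2, 10, 2, 2, 2]  # output tokens per request (made-up workload)
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print("static:", static_steps, "steps; continuous:", continuous_steps, "steps")
```

In the static case, short requests leave their slots idle while the long request in the same batch finishes; the continuous scheduler fills those slots immediately.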

	### 5.2 Text Generation Inference (TGI)
	Developed by Hugging Face. Optimized for text generation, supports Tensor Parallelism out of the box. The router and launcher are written in Rust; the model server is Python with custom CUDA kernels.

	Run with Docker:
	```bash
	docker run --gpus all --shm-size 1g -p 8080:80 \
	-e MODEL_ID=meta-llama/Llama-2-7b-hf \
	-e QUANTIZE=bitsandbytes \
	ghcr.io/huggingface/text-generation-inference:latest
	```

	Key Features:
	- Flash Attention integration.
	- Token streaming.
	- Watermarking (for AI detection).
	- Logits warping (temperature, top_p) handled efficiently.

	### 5.3 Llama.cpp (CPU/Mac Inference)
	Uses the GGUF file format, which stores weights in a range of quantized types. Runs on CPU, Apple Silicon, or GPU.
	- Ideal for local deployment, edge devices, or MacBooks.
	- Supports Q4_K_M, Q5_K_M quantizations.

	```bash
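	# Convert a locally downloaded HF checkpoint to GGUF (f16).
	# convert.py does not emit 4-bit types directly; quantize in a second step.
	python convert.py path/to/Llama-2-7b-hf --outfile llama-2-7b-f16.gguf --outtype f16

	# Quantize to 4-bit
	./quantize llama-2-7b-f16.gguf llama-2-7b-q4_k_m.gguf Q4_K_M

	# Run inference (newer llama.cpp releases rename these tools, e.g. llama-quantize and llama-cli)
	./main -m llama-2-7b-q4_k_m.gguf -p "Hello world" -n 128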
	```

	---

	## 6. Deployment Strategies

	### 6.1 Latency vs. Throughput
	- Latency: Time to first token (TTFT) + time per token. Critical for chatbots.
	  - Optimization: Smaller models, fewer layers, speculative decoding.
	- Throughput: Total tokens generated per second across all users. Critical for batch analysis.
	  - Optimization: Large batch sizes, vLLM, Tensor Parallelism.

	### 6.2 Scaling Topology
	1. Replica Scaling: Run multiple instances behind a load balancer (Kubernetes). Simplest.
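	2. Tensor Parallelism: Split each layer's weights across multiple GPUs (commonly 2-8) so that they cooperate on every request. Needed when the model does not fit on one GPU, which in fp16 typically starts around 30B parameters.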
	3. Pipeline Parallelism: Rare for inference due to bubble overhead, but useful for massive models on limited GPUs.
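
To show what tensor parallelism splits, here is a minimal pure-Python sketch of a column-parallel linear layer: each shard holds a slice of the weight matrix's output columns, computes its part of the output independently, and the slices are concatenated (an all-gather on real hardware). The helper names and the tiny matrix are illustrative only.

```python
def matmul(x, W):
    """x: vector of length n, W: n x m matrix as a list of rows. Returns a length-m vector."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def split_columns(W, shards):
    """Split W into `shards` matrices, each holding a contiguous slice of columns."""
    width = len(W[0]) // shards
    return [[row[s * width:(s + 1) * width] for row in W] for s in range(shards)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, 2)
partials = [matmul(x, shard) for shard in shards]  # each would run on its own GPU
gathered = [v for part in partials for v in part]  # all-gather of the output slices

print("sharded:", gathered, "reference:", matmul(x, W))
```

Each device reads only its own slice of the weights per token, which is why tensor parallelism also lowers per-token latency for bandwidth-bound decoding, at the cost of communication after every sharded layer.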

	### 6.3 Monitoring Metrics
	Track these in production:
	- TTFT (Time To First Token): User perceived latency.
	- TPOT (Time Per Output Token): Streaming speed after the first token. Aim to stay ahead of reading speed (<50ms/token).
	- Queue Depth: How many requests waiting?
	- GPU Utilization: Should be high for throughput, moderate for latency-sensitive apps.
	- KV Cache Hit Rate: (If using prefix caching for system prompts).
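
TTFT and TPOT can be derived from the arrival timestamps of streamed tokens. The sketch below is one possible way to compute them, together with a nearest-rank percentile for tail latency; the function names and the sample timestamps are hypothetical.

```python
import math

def latency_metrics(request_start, token_times):
    """TTFT and mean TPOT (seconds) from one request's token arrival timestamps."""
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None  # TPOT is undefined for a single-token response
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up measurements: request sent at t=0.0, tokens arriving afterwards.
ttft, tpot = latency_metrics(0.0, [0.25, 0.29, 0.33, 0.37])
latencies = [0.2, 0.3, 0.25, 1.5, 0.22]  # per-request latencies in seconds
p50, p95 = percentile(latencies, 50), percentile(latencies, 95)

print(f"TTFT={ttft:.2f}s TPOT={tpot * 1000:.0f}ms p50={p50}s p95={p95}s")
```

Reporting percentiles alongside the mean matters because a single slow request (here 1.5 s) barely moves the median but dominates the tail.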

	---

	## 7. Practical Exercise: Deploying a Quantized Model

	Goal: Serve a 4-bit Llama-2-7B model using vLLM with high concurrency.

	Step 1: Prepare the Quantized Model
	Use pre-quantized `TheBloke` models from the Hugging Face Hub or quantize your own.
	```bash
	# Example: Using a pre-quantized AWQ model
	MODEL="TheBloke/Llama-2-7B-Chat-AWQ"
	```

	Step 2: Launch vLLM
	```bash
	python -m vllm.entrypoints.api_server \
	--model $MODEL \
	--quantization awq \
	--port 8000 \
	--max-num-seqs 128 \
	--max-model-len 4096
	```

	Step 3: Load Test
	Use `locust` or a simple script to simulate 50 concurrent users.
	```python
	# load_test.py
	import concurrent.futures
	import time

	import requests

	URL = "http://localhost:8000/generate"  # endpoint of vllm.entrypoints.api_server

	def send_request(i):
	    """Send one generation request and return its wall-clock latency in seconds."""
	    start = time.time()
	    resp = requests.post(
	        URL,
	        json={"prompt": f"Question {i}: What is AI?", "max_tokens": 50},
	        timeout=120,
	    )
	    resp.raise_for_status()  # fail loudly instead of timing error responses
	    return time.time() - start

	# 50 worker threads, one per simulated concurrent user
	with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
	    futures = [executor.submit(send_request, i) for i in range(50)]
	    results = [f.result() for f in futures]

	print(f"Average Latency: {sum(results)/len(results):.2f}s")
	print(f"Total Requests: {len(results)}")
	```

	---

	## 8. Security & Safety in Deployment

	1. Input Validation: Mitigate prompt injection attacks. Validate and filter inputs before sending them to the model; filtering alone cannot fully prevent injection.
	2. Output Filtering: Use classifiers to detect hate speech, PII, or toxicity before returning to user.
	3. Rate Limiting: Protect against DoS and cost overruns.
	4. Model Watermarking: Embed invisible signals to identify AI-generated text (supported in TGI).
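
Rate limiting is commonly implemented with a token bucket: each client has a bucket that refills at a fixed rate, and a request is served only if enough tokens remain. The following is a minimal single-process sketch; the class name and parameters are hypothetical, and a production deployment would usually keep this state in a shared store or enforce it at the API gateway. The clock is injectable so the example runs deterministically.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

fake_now = [0.0]  # manual clock so the demo does not depend on real time
bucket = TokenBucket(rate_per_sec=2, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(4)]       # 3 requests pass, the 4th is rejected
fake_now[0] = 1.0                                # one second later: 2 tokens refilled
after_wait = [bucket.allow() for _ in range(3)]
print(burst, after_wait)
```

For LLM serving, `cost` can be set to the number of requested output tokens instead of 1 so that the limit tracks GPU time rather than request count.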

	---

	## 9. Summary Checklist

	| Technique | Benefit | Trade-off | Best For |
	| :--- | :--- | :--- | :--- |
	| INT8/FP8 | 2x memory savings, 2x speed | Slight accuracy drop | General inference |
	| INT4 (AWQ/GPTQ) | 4x memory savings | Calibration needed, HW specific | Consumer GPUs, Edge |
	| KV Cache | Linear vs Quadratic time | High VRAM usage | Long context |
	| PagedAttention | Max VRAM utilization | Implementation complexity | High concurrency |
	| Speculative Decoding | 2-3x speedup | Needs draft model | Latency sensitive |
	| vLLM/TGI | Production features | Extra infrastructure | Production APIs |

	## Next Steps
	In Tutorial 12, we will cover MLOps, Automation & Governance, focusing on CI/CD pipelines for models, model registries, compliance tracking, and automated evaluation gates.