Tutorial 11: Inference Optimization & Deployment
Overview
Training a model is only half the battle. Deploying Large Language Models (LLMs) for inference presents unique challenges: high memory bandwidth requirements, latency constraints, and the need to serve multiple users concurrently. This tutorial covers Quantization, KV Caching, Speculative Decoding, and production serving engines like vLLM and TGI.
Prerequisites
- Understanding of Transformer architecture (Tutorial 01)
- Basic PyTorch knowledge
- Familiarity with the Hugging Face transformers library
1. The Inference Bottleneck
Unlike training, which is compute-bound (matrix multiplications), inference is often memory-bandwidth bound.
The Memory Wall
To generate one token:
- Load all model weights from VRAM into the compute units.
- Perform calculation.
- Store result.
For a 7B model (14GB in fp16):
- To generate 1 token at batch size 1, you must read 14GB of data.
- On an A100 (1.5 TB/s bandwidth), this takes ~9ms just for memory transfer.
- Compute time is negligible (~0.1ms).
- Result: You can only generate ~100 tokens/sec regardless of compute power.
Solution: Reduce the memory footprint (quantization) and avoid redundant computation (KV caching).
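A quick back-of-the-envelope check of the numbers above:
# Bandwidth-bound decoding: every token streams all weights through the memory bus
PARAMS = 7e9                # 7B parameters
BYTES_PER_PARAM = 2         # fp16
BANDWIDTH = 1.5e12          # A100 HBM bandwidth in bytes/sec (~1.5 TB/s)

weight_bytes = PARAMS * BYTES_PER_PARAM      # 14 GB read per token
time_per_token = weight_bytes / BANDWIDTH    # ~9.3 ms
print(f"Memory-bound ceiling: {1 / time_per_token:.0f} tokens/sec")  # ~107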
2. Quantization: Reducing Precision
Quantization reduces the number of bits used to represent weights and activations.
Types of Quantization
2.1 Post-Training Quantization (PTQ)
Quantize a pre-trained model without retraining.
- INT8: Weights scaled to 8-bit integers. Minimal accuracy loss (see the sketch after this list).
- FP8: Newer format with native hardware support on NVIDIA H100 GPUs. Good balance of range and precision.
- INT4: Aggressive compression (e.g., GPTQ, AWQ). Requires careful calibration.
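To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest INT8 weight quantization (per-tensor scaling; production libraries use per-channel scales plus calibration data):
import torch

def quantize_int8(w):
    # Map [-max|w|, max|w|] linearly onto the signed 8-bit range [-127, 127]
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print(f"Mean abs error: {(w - dequantize(q, scale)).abs().mean():.5f}")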
2.2 Quantization-Aware Training (QAT)
Simulate quantization noise during training/fine-tuning.
- Model learns to compensate for precision loss.
- Best for INT4 or lower, but computationally expensive.
Implementing 4-bit/8-bit Quantization with bitsandbytes
The easiest way to quantize for inference in the Hugging Face ecosystem.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit configuration (QLoRA-style inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4 (NF4)
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,        # nested quantization of the scales
)

# 8-bit configuration
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",               # automatically spread across available GPUs
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
# Inference
input_text = "Explain quantum entanglement."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
GPTQ & AWQ (GPU-Aware Quantization)
At INT4, naive round-to-nearest quantization degrades quality noticeably, so GPTQ and AWQ bring activation statistics into the quantization process.
- GPTQ: quantizes weights layer by layer, using approximate second-order (Hessian) information to compensate for the error it introduces.
- AWQ (Activation-Aware Weight Quantization): identifies salient weight channels (those multiplied by large activations) and rescales them before quantization so they lose less precision.
- Tooling: use auto-gptq for GPTQ models, or llama-cpp-python for GGUF-format models.
# Install auto-gptq
pip install auto-gptq optimum

# Quantize script example
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,              # quantize to 4-bit
    group_size=128,      # share one scale per group of 128 weights
    damp_percent=0.01,   # dampening for the Hessian inverse
    desc_act=False,      # activation-order quantization off (faster inference)
)

model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", quantize_config)

# ... calibration_data: a small list of tokenized examples
#     (dicts with input_ids / attention_mask) representative of your domain ...
model.quantize(calibration_data)
model.save_quantized("llama-2-7b-gptq")
3. KV Cache: Speeding Up Generation
Transformers are autoregressive. To generate token $t$, attention needs all previous tokens $0 \dots t-1$. Naive approach: re-compute the entire sequence history for every new token, so step $t$ costs $O(t^2)$ in attention alone.
Optimization: Cache the Key (K) and Value (V) matrices from previous steps, reducing each step to $O(t)$.
- At step $t$, only compute Q, K, V for the new token.
- Retrieve cached K, V for previous tokens.
- Perform attention over the cached plus new K, V (see the sketch below).
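A minimal sketch of the mechanism for a single attention head (shapes and names are illustrative, not any particular library's API):
import torch
import torch.nn.functional as F

d = 64                           # head dimension
k_cache = torch.empty(0, d)      # cached keys, shape (t, d)
v_cache = torch.empty(0, d)      # cached values, shape (t, d)

def decode_step(x_new, W_q, W_k, W_v):
    # Project only the new token; reuse everything already in the cache
    global k_cache, v_cache
    q = x_new @ W_q
    k_cache = torch.cat([k_cache, x_new @ W_k])
    v_cache = torch.cat([v_cache, x_new @ W_v])
    attn = F.softmax(q @ k_cache.T / d**0.5, dim=-1)
    return attn @ v_cache

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
for _ in range(5):               # decode 5 tokens
    out = decode_step(torch.randn(1, d), W_q, W_k, W_v)
print(k_cache.shape)             # torch.Size([5, 64])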
Memory Cost of KV Cache
For a model with $L$ layers, $H$ heads, hidden size $D$, sequence length $S$, and batch size $B$ (the $H$ heads together span the hidden size, so $H$ drops out):
$$\text{KV cache size} = 2 \times B \times S \times L \times D \times (\text{bytes per element})$$
The factor of 2 accounts for storing both K and V.
For Llama-2-7B (32 layers, 4096 hidden, fp16):
- Approx 0.5 MB per token per batch item.
- For Batch=32, SeqLen=2048: ~32GB just for KV cache!
- Implication: KV cache often limits batch size more than model weights.
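A quick sanity check of these numbers:
# KV cache size = 2 (K and V) * batch * seq_len * layers * hidden * bytes/element
def kv_cache_gb(batch, seq_len, layers=32, hidden=4096, dtype_bytes=2):
    return 2 * batch * seq_len * layers * hidden * dtype_bytes / 1024**3

print(f"{kv_cache_gb(1, 1) * 1024:.2f} MB per token")        # ~0.50 MB
print(f"{kv_cache_gb(32, 2048):.0f} GB for B=32, S=2048")    # 32 GB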
Paged Attention (vLLM Innovation)
Traditional KV cache allocation is static: memory for the maximum sequence length is pre-allocated up front, so most of it sits unused for short sequences. vLLM instead applies OS-style virtual memory paging:
- Split KV cache into fixed-size blocks.
- Dynamically allocate blocks as tokens are generated.
- Share blocks between sequences (useful for beam search or same prompt).
- Result: near-100% memory utilization and substantially higher throughput (a toy sketch of the block bookkeeping follows).
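The core bookkeeping can be sketched in a few lines (a toy allocator for illustration, not vLLM's actual implementation):
BLOCK_SIZE = 16                           # tokens per KV block

class ToyBlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # sequence id -> list of physical blocks

    def append_token(self, seq_id, token_index):
        # Grab a new physical block only when the current one fills up
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())
        return table[token_index // BLOCK_SIZE]

alloc = ToyBlockAllocator(num_blocks=1024)
for t in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("seq-0", t)
print(len(alloc.block_tables["seq-0"]))   # 3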
4. Speculative Decoding
Idea: Use a small "draft" model to guess the next $K$ tokens, then verify them with the large "target" model in parallel.
- Draft model (fast) generates $x_1, x_2, x_3$.
- Target model (slow) computes probabilities for all 3 in one forward pass.
- Accept/Reject logic (sketched below): accept drafted token $x_i$ with probability $\min\!\left(1, P_{target}(x_i) / P_{draft}(x_i)\right)$.
- On the first rejection, resample that token from the adjusted distribution $\max(0, P_{target} - P_{draft})$ (renormalized) and discard the rest of the draft.
- Speedup: If the draft is accurate, we generate up to $K$ tokens in the time of one target forward pass, and the output distribution provably matches the target model's.
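The acceptance rule itself fits in a few lines; the distributions below are random stand-ins for real model outputs:
import torch

def accept_or_resample(p_target, p_draft, x):
    # Accept x with probability min(1, p_target(x) / p_draft(x))
    ratio = (p_target[x] / p_draft[x]).item()
    if torch.rand(()).item() < min(1.0, ratio):
        return x, True
    # On rejection, resample from the residual max(0, p_target - p_draft)
    residual = torch.clamp(p_target - p_draft, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item(), False

vocab = 8
p_draft = torch.softmax(torch.randn(vocab), dim=0)   # draft model distribution
p_target = torch.softmax(torch.randn(vocab), dim=0)  # target model distribution
x = torch.multinomial(p_draft, 1).item()             # draft model's guess
print(accept_or_resample(p_target, p_draft, x))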
Hugging Face transformers implements this as "assisted generation":
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Draft and target models must use compatible tokenizers
draft_model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
).to("cuda")
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Write a python function to sort a list.", return_tensors="pt").to("cuda")

# Assisted generation: the draft proposes, the target verifies in one forward pass
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    num_assistant_tokens=5,  # how many tokens to draft per verification step
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
5. Production Serving Engines
Do not use model.generate() in production. It has no request scheduling, continuous batching, or memory management for concurrent users. Use a dedicated serving engine.
5.1 vLLM (High Throughput)
Best for high-throughput scenarios (batch processing, heavy traffic). Implements PagedAttention.
Installation:
pip install vllm
Running an OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--port 8000 \
--tensor-parallel-size 1 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.9
Client Usage:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")  # delta may be empty on some chunks
Key Features:
- Continuous batching: As soon as one request finishes, start a new one in the same batch slot.
- PagedAttention: Efficient memory usage.
- Supports quantization (AWQ, GPTQ, SqueezeLLM).
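For offline batch workloads, vLLM also exposes a Python API; a minimal sketch (prompts and sampling parameters are illustrative):
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
params = SamplingParams(temperature=0.7, max_tokens=64)

# Continuous batching schedules the whole list efficiently in one call
outputs = llm.generate(["What is AI?", "Explain KV caching."], params)
for out in outputs:
    print(out.outputs[0].text)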
5.2 Text Generation Inference (TGI)
Developed by Hugging Face. Optimized for text generation and supports tensor parallelism out of the box; the request router is written in Rust.
Run with Docker:
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-hf \
    --quantize bitsandbytes
For gated models such as Llama-2, also pass your access token with -e HUGGING_FACE_HUB_TOKEN=<token>.
Key Features:
- Flash Attention integration.
- Token streaming.
- Watermarking (for AI detection).
- Logits warping (temperature, top_p) handled efficiently.
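The endpoint above can be queried from Python with huggingface_hub's InferenceClient; a minimal streaming sketch:
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Stream tokens as they are generated
for token in client.text_generation(
    "Explain quantum entanglement.", max_new_tokens=50, stream=True
):
    print(token, end="", flush=True)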
5.3 Llama.cpp (CPU/Mac Inference)
Uses GGUF format (highly quantized). Runs on CPU, Apple Silicon, or GPU.
- Ideal for local deployment, edge devices, or MacBooks.
- Supports Q4_K_M, Q5_K_M quantizations.
# Convert a local HF checkpoint to fp16 GGUF (convert.py expects a local model directory)
python convert.py models/llama-2-7b --outtype f16 --outfile llama-2-7b-f16.gguf
# Quantize the fp16 GGUF down to 4-bit
./quantize llama-2-7b-f16.gguf llama-2-7b-q4_K_M.gguf q4_K_M
# Run inference
./main -m llama-2-7b-q4_K_M.gguf -p "Hello world" -n 128
6. Deployment Strategies
6.1 Latency vs. Throughput
- Latency: Time to first token (TTFT) + time per token. Critical for chatbots.
- Optimization: Smaller models, fewer layers, speculative decoding.
- Throughput: Total tokens generated per second across all users. Critical for batch analysis.
- Optimization: Large batch sizes, vLLM, Tensor Parallelism.
6.2 Scaling Topology
- Replica Scaling: Run multiple instances behind a load balancer (Kubernetes). Simplest.
- Tensor Parallelism: Split one model across 4-8 GPUs for a single request. Needed for >30B models.
- Pipeline Parallelism: Rare for inference due to bubble overhead, but useful for massive models on limited GPUs.
6.3 Monitoring Metrics
Track these in production:
- TTFT (Time To First Token): User perceived latency.
- TPOT (Time Per Output Token): should keep pace with human reading speed (aim for <50 ms/token); see the measurement sketch after this list.
- Queue Depth: how many requests are waiting to be scheduled.
- GPU Utilization: Should be high for throughput, moderate for latency-sensitive apps.
- KV Cache Hit Rate: (If using prefix caching for system prompts).
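TTFT and TPOT are easy to measure with a streaming request; a sketch against the OpenAI-compatible vLLM endpoint from Section 5.1:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.time()
timestamps = []
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Explain KV caching."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    timestamps.append(time.time())       # one timestamp per streamed chunk

ttft = timestamps[0] - start
tpot = (timestamps[-1] - timestamps[0]) / max(len(timestamps) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/token")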
7. Practical Exercise: Deploying a Quantized Model
Goal: Serve a 4-bit Llama-2-7B model using vLLM with high concurrency.
Step 1: Prepare the Quantized Model
Use TheBloke models from HuggingFace (pre-quantized) or quantize your own.
# Example: Using a pre-quantized AWQ model
MODEL="TheBloke/Llama-2-7B-Chat-AWQ"
Step 2: Launch vLLM
# Legacy REST server exposing /generate (matches the load-test script below)
python -m vllm.entrypoints.api_server \
--model $MODEL \
--quantization awq \
--port 8000 \
--max-num-seqs 128 \
--max-model-len 4096
Step 3: Load Test
Use locust or a simple script to simulate 50 concurrent users.
# load_test.py
import concurrent.futures
import requests
import time

def send_request(i):
    start = time.time()
    resp = requests.post("http://localhost:8000/generate", json={
        "prompt": f"Question {i}: What is AI?",
        "max_tokens": 50
    })
    duration = time.time() - start
    return duration

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(send_request, i) for i in range(50)]
    results = [f.result() for f in futures]

print(f"Average Latency: {sum(results)/len(results):.2f}s")
print(f"Total Requests: {len(results)}")
8. Security & Safety in Deployment
- Input Validation: Prevent prompt injection attacks. Filter malicious inputs before sending to model.
- Output Filtering: Use classifiers to detect hate speech, PII, or toxicity before returning to user.
- Rate Limiting: Protect against DoS attacks and cost overruns (a minimal sketch follows this list).
- Model Watermarking: Embed invisible signals to identify AI-generated text (supported in TGI).
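To illustrate the rate-limiting point, a minimal in-process token-bucket limiter (real deployments would enforce this at an API gateway or with Redis):
import time
from collections import defaultdict

class TokenBucket:
    # Allow `rate` requests/sec per client, with bursts up to `capacity`
    def __init__(self, rate=2.0, capacity=10):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)
        self.last = defaultdict(time.monotonic)

    def allow(self, client_id):
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill in proportion to elapsed time, capped at bucket capacity
        self.tokens[client_id] = min(self.capacity,
                                     self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False

limiter = TokenBucket()
print([limiter.allow("user-1") for _ in range(12)])  # burst: roughly the first 10 are True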
9. Summary Checklist
| Technique | Benefit | Trade-off | Best For |
|---|---|---|---|
| INT8/FP8 | 2x memory savings, 2x speed | Slight accuracy drop | General inference |
| INT4 (AWQ/GPTQ) | 4x memory savings | Calibration needed, HW specific | Consumer GPUs, Edge |
| KV Cache | Linear vs Quadratic time | High VRAM usage | Long context |
| PagedAttention | Max VRAM utilization | Implementation complexity | High concurrency |
| Speculative Decoding | 2-3x speedup | Needs draft model | Latency sensitive |
| vLLM/TGI | Production features | Extra infrastructure | Production APIs |
Next Steps
In Tutorial 12, we will cover MLOps, Automation & Governance, focusing on CI/CD pipelines for models, model registries, compliance tracking, and automated evaluation gates.