# LLM Compressor & vLLM Advanced Features

This document outlines advanced features from LLM Compressor and vLLM that can be leveraged for better performance and optimization.

## LLM Compressor Features

### 1. Quantization Modifiers

LLM Compressor supports multiple quantization methods beyond AWQ:

#### AWQModifier (Activation-aware Weight Quantization)
```python
from llmcompressor.modifiers.awq import AWQModifier

# 4-bit, group-wise, zero-point (asymmetric) weight quantization.
# Recent llmcompressor releases take compressed-tensors scheme strings
# rather than raw bit/group-size kwargs; adjust for your installed version.
AWQModifier(
    scheme="W4A16_ASYM",  # 4-bit weights, 16-bit activations, asymmetric
    targets=["Linear"],   # Quantize all Linear layers
    ignore=["lm_head"]    # Keep the output head in full precision
)
```
#### GPTQModifier (GPTQ Quantization)

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot GPTQ weight quantization; needs a small calibration dataset.
GPTQModifier(
    scheme="W4A16",      # 4-bit weights, 16-bit activations
    targets=["Linear"],  # Quantize all Linear layers
    ignore=["lm_head"]   # Keep the output head in full precision
)
```
#### QuantizationModifier (INT8 / W8A8 Quantization)

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Round-to-nearest 8-bit quantization of weights and activations.
# Scheme names follow compressed-tensors presets (e.g. W8A8).
QuantizationModifier(
    scheme="W8A8",
    targets=["Linear"],
    ignore=["lm_head"]
)
```
### 2. Pruning Modifiers

#### MagnitudePruningModifier

```python
from llmcompressor.modifiers.pruning import MagnitudePruningModifier

# Unstructured magnitude pruning to 50% sparsity.
# Argument names vary across llmcompressor versions; check the pruning
# modifier docs for the release you have installed.
MagnitudePruningModifier(
    sparsity=0.5,      # 50% of weights pruned
    structured=False   # Unstructured (per-weight) pruning
)
```
### 3. Combined Modifiers

You can combine multiple modifiers in a single recipe for maximum compression:
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.pruning import MagnitudePruningModifier

# oneshot applies the recipe in a single calibration pass; AWQ needs a
# small calibration dataset to compute activation scales.
oneshot(
    model="Alovestocode/router-qwen3-32b-merged",
    dataset="open_platypus",               # Calibration data for AWQ scales
    output_dir="./router-qwen3-compressed",
    recipe=[
        AWQModifier(scheme="W4A16_ASYM", targets=["Linear"], ignore=["lm_head"]),
        MagnitudePruningModifier(sparsity=0.1)  # Light 10% pruning on top of AWQ
    ]
)
```
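The compressed checkpoint can then be served with vLLM. A minimal sketch, assuming vLLM auto-detects the compressed-tensors quantization config written into the output directory:

```python
from vllm import LLM

# Point vLLM at the oneshot output; the quantization scheme is read
# from the checkpoint's config, no extra flags required.
llm = LLM(model="./router-qwen3-compressed")
```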
## vLLM Advanced Features

### 1. FP8 Quantization (Latest)

vLLM supports FP8 quantization for even better performance:
```python
from vllm import LLM

# Dynamic FP8 weight quantization (best on FP8-capable GPUs, e.g. Hopper/Ada).
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    quantization="fp8",            # FP8 quantization
    gpu_memory_utilization=0.95
)
```
**Benefits:**
- Often faster than AWQ kernels on GPUs with native FP8 support
- Lower memory usage
- Typically better quality retention than 4-bit methods
### 2. FP8 KV Cache

Reduce KV cache memory usage with FP8:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    quantization="awq",
    kv_cache_dtype="fp8",  # FP8 KV cache
    gpu_memory_utilization=0.90
)
```
**Benefits:**
- ~50% reduction in KV cache memory vs FP16
- Enables longer context windows
- Minimal quality impact
### 3. Chunked Prefill (Already Implemented)

```python
enable_chunked_prefill=True  # ✅ Already in our config
```

**Benefits:**
- Better handling of long prompts
- Reduced memory spikes
- Improved throughput
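A minimal sketch of the flag in context; `max_num_batched_tokens` is an assumed tuning knob for the per-step token budget, not a value from our current config:

```python
from vllm import LLM

# Chunked prefill splits long prompt prefills into smaller token batches,
# so decode steps of other requests are not starved behind one huge prefill.
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192  # Per-step token budget (tune per GPU)
)
```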
### 4. Prefix Caching (Already Implemented)

```python
enable_prefix_caching=True  # ✅ Already in our config
```

**Benefits:**
- Faster time-to-first-token (TTFT)
- Reuses common prefixes
- Better for repeated prompts
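A sketch of the reuse pattern: requests that share the same leading text (such as the router system prompt) reuse cached KV blocks for that prefix. The system prompt string below is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    enable_prefix_caching=True
)

system = "You are the Router Agent. Decide which specialist handles the task.\n"

# Both prompts start with the same prefix, so the second request reuses
# the KV cache blocks computed while serving the first.
outputs = llm.generate(
    [system + "Task: solve 2x+3=7", system + "Task: implement binary search"],
    SamplingParams(temperature=0.2, max_tokens=256)
)
```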
### 5. Continuous Batching (Already Implemented)

Continuous batching is on by default in vLLM; `max_num_seqs` caps how many sequences are scheduled concurrently in each step.

```python
max_num_seqs=256  # ✅ Already in our config
```

**Benefits:**
- Dynamic batching
- Better GPU utilization
- Lower latency
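A sketch of what this buys in practice: submit many prompts at once and let the scheduler interleave them. The prompt list is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    max_num_seqs=256  # Upper bound on concurrently scheduled sequences
)

prompts = [f"You are the Router Agent. Test task #{i}" for i in range(64)]

# vLLM interleaves prefill and decode across these requests
# (continuous batching) instead of waiting for a fixed-size static batch.
outputs = llm.generate(prompts, SamplingParams(temperature=0.2, max_tokens=128))
```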
### 6. Tensor Parallelism

For multi-GPU setups:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    tensor_parallel_size=2,    # Shard weights across 2 GPUs
    pipeline_parallel_size=1   # Pipeline parallelism (1 = disabled)
)
```
### 7. Speculative Decoding

For faster inference with a draft model:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    speculative_model="small-draft-model",  # Placeholder: a small model sharing the target's tokenizer
    num_speculative_tokens=5                # Tokens to speculate per step
)
```

Note: newer vLLM releases group these options under a `speculative_config` argument; check the version you are running.
### 8. LoRA Adapters

To serve LoRA adapters on top of the base model:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    enable_lora=True,  # LoRA support
    max_lora_rank=16
)
```
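A sketch of routing a request through a specific adapter once LoRA is enabled; the adapter name and path are placeholders:

```python
from vllm import SamplingParams
from vllm.lora.request import LoRARequest

# Generate with a named LoRA adapter loaded from local disk.
outputs = llm.generate(
    ["You are the Router Agent. Test task: solve 2x+3=7"],
    SamplingParams(temperature=0.2, max_tokens=128),
    lora_request=LoRARequest("router-adapter", 1, "/path/to/adapter")
)
```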
## Recommended Optimizations for Our Use Case

### Current Setup (Good)

- ✅ AWQ 4-bit quantization
- ✅ Continuous batching (max_num_seqs=256)
- ✅ Prefix caching
- ✅ Chunked prefill
- ✅ FlashAttention-2
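For reference, a sketch of the current setup assembled into a single constructor call; values mirror the checklist above and `repo` stands in for the merged model ID:

```python
from vllm import LLM

repo = "Alovestocode/router-qwen3-32b-merged"

# Current baseline: AWQ 4-bit weights plus the scheduling features above.
# FlashAttention-2 is picked automatically by vLLM when it is available.
llm = LLM(
    model=repo,
    quantization="awq",
    max_num_seqs=256,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.90
)
```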
### Additional Optimizations to Consider

#### 1. FP8 KV Cache (High Impact)

```python
llm_kwargs = {
    "model": repo,
    "quantization": "awq",
    "kv_cache_dtype": "fp8",          # Add this
    "gpu_memory_utilization": 0.95,   # Can increase with FP8 KV cache
    # ... rest of config
}
```

**Impact:** ~50% KV cache memory reduction, longer contexts
#### 2. FP8 Quantization (If Available)

```python
llm_kwargs = {
    "model": repo,
    "quantization": "fp8",  # Instead of AWQ
    # ... rest of config
}
```

**Impact:** Faster inference on FP8-capable GPUs, typically better quality retention than 4-bit AWQ
#### 3. Optimized Sampling Parameters

```python
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.2,
    top_p=0.9,
    max_tokens=20000,
    stop=["<|end_of_plan|>"],
    skip_special_tokens=False,           # Keep special tokens for parsing
    spaces_between_special_tokens=False
)
```
#### 4. Model Warmup with Real Prompts

```python
from vllm import SamplingParams

def warm_vllm_model(llm, tokenizer):
    """Warm up with actual router prompts."""
    warmup_prompts = [
        "You are the Router Agent. Test task: solve 2x+3=7",
        "You are the Router Agent. Test task: implement binary search",
    ]
    for prompt in warmup_prompts:
        llm.generate(
            [prompt],
            SamplingParams(max_tokens=10, temperature=0)
        )
```
## Implementation Priority

1. **High Priority:**
   - FP8 KV cache (easy, high impact)
   - Optimized sampling parameters (easy)

2. **Medium Priority:**
   - FP8 quantization (if models support it)
   - Better warmup strategy

3. **Low Priority:**
   - Tensor parallelism (requires multi-GPU)
   - Speculative decoding (requires a draft model)
## References

- [vLLM Quantization Docs](https://docs.vllm.ai/en/latest/features/quantization/)
- [LLM Compressor Docs](https://docs.vllm.ai/projects/llm-compressor/)
- [vLLM Performance Guide](https://docs.vllm.ai/en/latest/performance/)
- [vLLM Paper: Efficient Memory Management with PagedAttention](https://arxiv.org/abs/2309.06180)