# LLM Compressor & vLLM Advanced Features

This document outlines advanced features of LLM Compressor and vLLM that can be leveraged for better inference performance and lower memory use.

## LLM Compressor Features

### 1. Quantization Modifiers

LLM Compressor supports multiple quantization methods beyond AWQ:

#### AWQModifier (Activation-aware Weight Quantization)
```python
from llmcompressor.modifiers.awq import AWQModifier

# 4-bit weight-only (W4A16) quantization with asymmetric zero points, applied
# to Linear layers and skipping the output head. Exact argument names can
# shift between llmcompressor releases; check the installed version's docs.
AWQModifier(
    scheme="W4A16_ASYM",
    targets=["Linear"],
    ignore=["lm_head"],
)
```

#### GPTQModifier (GPTQ Quantization)
```python
from llmcompressor.modifiers.quantization import GPTQModifier

# 4-bit weight-only (W4A16) GPTQ quantization of Linear layers,
# skipping the output head.
GPTQModifier(
    scheme="W4A16",
    targets="Linear",
    ignore=["lm_head"],
)
```

#### INT8 (W8A8) Quantization
```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# 8-bit weight and activation quantization (W8A8). SmoothQuant shifts
# activation outliers into the weights so both can be quantized to INT8.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]
```

### 2. Pruning Modifiers

#### MagnitudePruningModifier
```python
from llmcompressor.modifiers.pruning import MagnitudePruningModifier

# 50% unstructured magnitude pruning. Parameter names vary across
# llmcompressor versions, so treat this as illustrative and check the
# pruning modifier docs for the installed release.
MagnitudePruningModifier(
    sparsity=0.5,         # 50% of weights zeroed
    structured=False      # Unstructured (per-weight) pruning
)
```

### 3. Combined Modifiers

You can combine multiple modifiers for maximum compression:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.pruning import MagnitudePruningModifier

oneshot(
    model="Alovestocode/router-qwen3-32b-merged",
    dataset="open_platypus",              # example calibration dataset for activation-aware methods
    output_dir="./router-qwen3-compressed",
    recipe=[
        AWQModifier(scheme="W4A16_ASYM", targets=["Linear"], ignore=["lm_head"]),
        MagnitudePruningModifier(sparsity=0.1),  # light 10% pruning on top of AWQ
    ],
)
```
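
Once `oneshot` finishes, the compressed checkpoint can usually be loaded straight into vLLM, which reads the quantization settings saved in the model config. A minimal sketch, assuming the `output_dir` used above:

```python
from vllm import LLM

# Point vLLM at the compressed output directory; the quantization format
# recorded by llmcompressor is picked up from the saved config.
llm = LLM(model="./router-qwen3-compressed")
```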

## vLLM Advanced Features

### 1. FP8 Quantization (Latest)

vLLM supports FP8 quantization for even better performance:

```python
from vllm import LLM

llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    quantization="fp8",           # online FP8 weight quantization (best on GPUs with native FP8, e.g. Hopper/Ada)
    gpu_memory_utilization=0.95
)
```

**Benefits:**
- Often faster than AWQ 4-bit for compute-bound workloads on FP8-capable GPUs
- Roughly half the weight memory of FP16, though more than 4-bit AWQ (rough numbers sketched below)
- Quality retention typically closer to FP16 than 4-bit methods
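
To put rough numbers behind these claims, a back-of-envelope comparison of weight memory for a 32B-parameter model (the parameter count is an assumption; real checkpoints add embeddings, quantization scales, and runtime overhead):

```python
params = 32e9  # assumed parameter count for a Qwen3-32B-class model

# Approximate bytes per parameter for each weight format (group scales and
# zero points for 4-bit formats add a small overhead, ignored here).
fp16_gb = params * 2 / 1e9    # ~64 GB
fp8_gb  = params * 1 / 1e9    # ~32 GB
awq4_gb = params * 0.5 / 1e9  # ~16 GB

print(f"FP16: {fp16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, AWQ 4-bit: {awq4_gb:.0f} GB")
```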

### 2. FP8 KV Cache

Reduce KV cache memory usage with FP8:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    quantization="awq",
    kv_cache_dtype="fp8",         # FP8 KV cache
    gpu_memory_utilization=0.90
)
```

**Benefits:**
- Roughly 50% reduction in KV cache memory versus FP16 (see the sketch below)
- Enables longer context windows or more concurrent sequences
- Minimal quality impact in most workloads
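
As a rough illustration of the savings, per-token KV cache size is `2 * num_layers * num_kv_heads * head_dim * bytes_per_value`. The shape values below are assumptions for a Qwen3-32B-class model with grouped-query attention, not numbers read from the actual config:

```python
# Assumed model shape (not taken from the real config): 64 layers,
# 8 KV heads (grouped-query attention), head_dim of 128.
num_layers, num_kv_heads, head_dim = 64, 8, 128
values_per_token = 2 * num_layers * num_kv_heads * head_dim  # keys + values

fp16_kb = values_per_token * 2 / 1024  # ~256 KiB per token
fp8_kb  = values_per_token * 1 / 1024  # ~128 KiB per token

context = 32_768
print(f"32k-token KV cache: FP16 ~{fp16_kb * context / 1024**2:.1f} GiB, "
      f"FP8 ~{fp8_kb * context / 1024**2:.1f} GiB")
```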

### 3. Chunked Prefill (Already Implemented)

```python
enable_chunked_prefill=True  # ✅ Already in our config
```

**Benefits:**
- Better handling of long prompts
- Reduced memory spikes during prefill
- Improved throughput when long and short requests are mixed (a config sketch follows)
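
A hedged sketch of how the chunking budget can be tuned alongside the existing flag; `max_num_batched_tokens` caps how many prompt tokens are processed per scheduler step, and the value shown is an arbitrary example rather than a recommendation:

```python
from vllm import LLM

llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,  # per-step token budget; longer prompts are split into chunks
)
```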

### 4. Prefix Caching (Already Implemented)

```python
enable_prefix_caching=True  # ✅ Already in our config
```

**Benefits:**
- Faster time-to-first-token (TTFT)
- Reuses KV cache for common prompt prefixes
- Better for repeated prompts that share a system preamble (example below)
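
This matters for the router use case because every request shares the same system preamble: with prefix caching on, the shared prefix is prefilled once and reused. An illustrative sketch (the prompt text is a placeholder, not the real router prompt):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Alovestocode/router-qwen3-32b-merged", enable_prefix_caching=True)

system = "You are the Router Agent. Decide which specialist should handle the task.\n\n"
prompts = [system + "Task: solve 2x+3=7", system + "Task: implement binary search"]

# The KV cache for the shared `system` prefix is computed for the first
# request and reused for the second, improving TTFT.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.2))
```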

### 5. Continuous Batching (Already Implemented)

```python
max_num_seqs=256  # ✅ Already in our config (caps concurrent sequences; continuous batching itself is on by default)
```

**Benefits:**
- Dynamic batching
- Better GPU utilization
- Lower latency

### 6. Tensor Parallelism

For multi-GPU setups:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    tensor_parallel_size=2,       # Shard weights across 2 GPUs
    pipeline_parallel_size=1      # Number of pipeline stages (1 = disabled)
)
```

### 7. Speculative Decoding

For faster inference with draft models:

```python
# `speculative_model` / `num_speculative_tokens` are the classic engine
# arguments; newer vLLM releases group these under `speculative_config`.
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    speculative_model="small-draft-model",  # placeholder draft model name
    num_speculative_tokens=5                # Tokens drafted per step
)
```

### 8. LoRA Adapters

Serve fine-tuned LoRA adapters on top of the base model without merging:

```python
llm = LLM(
    model="Alovestocode/router-qwen3-32b-merged",
    enable_lora=True,              # LoRA support
    max_lora_rank=16
)
```

## Recommended Optimizations for Our Use Case

### Current Setup (Good)
- ✅ AWQ 4-bit quantization
- ✅ Continuous batching (max_num_seqs=256)
- ✅ Prefix caching
- ✅ Chunked prefill
- ✅ FlashAttention-2

### Additional Optimizations to Consider

#### 1. FP8 KV Cache (High Impact)
```python
llm_kwargs = {
    "model": repo,
    "quantization": "awq",
    "kv_cache_dtype": "fp8",      # Add this
    "gpu_memory_utilization": 0.95,  # Can increase with FP8 KV
    # ... rest of config
}
```

**Impact:** 50% KV cache memory reduction, longer contexts

#### 2. FP8 Quantization (If Available)
```python
llm_kwargs = {
    "model": repo,
    "quantization": "fp8",        # Instead of AWQ
    "dtype": "float8_e5m2",
    # ... rest of config
}
```

**Impact:** Potentially faster inference on FP8-capable GPUs, with quality typically closer to FP16 than 4-bit AWQ

#### 3. Optimized Sampling Parameters
```python
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.2,
    top_p=0.9,
    max_tokens=20000,
    stop=["<|end_of_plan|>"],
    skip_special_tokens=False,   # Keep special tokens for parsing
    spaces_between_special_tokens=False
)
```

#### 4. Model Warmup with Real Prompts
```python
from vllm import SamplingParams

def warm_vllm_model(llm):
    """Warm up the engine with representative router prompts."""
    warmup_prompts = [
        "You are the Router Agent. Test task: solve 2x+3=7",
        "You are the Router Agent. Test task: implement binary search",
    ]
    # A short greedy generation per prompt exercises the full generate
    # path (and populates the prefix cache) before real traffic arrives.
    llm.generate(warmup_prompts, SamplingParams(max_tokens=10, temperature=0))
```

## Implementation Priority

1. **High Priority:**
   - FP8 KV cache (easy, high impact)
   - Optimized sampling parameters (easy; consolidated sketch after this list)

2. **Medium Priority:**
   - FP8 quantization (if models support it)
   - Better warmup strategy

3. **Low Priority:**
   - Tensor parallelism (requires multi-GPU)
   - Speculative decoding (requires draft model)
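
A hedged sketch combining the two high-priority items with the settings already listed under "Current Setup"; `repo` and the carried-over flags are assumptions about that config rather than verified values:

```python
from vllm import LLM, SamplingParams

repo = "Alovestocode/router-qwen3-32b-merged"  # assumed model repo

llm = LLM(
    model=repo,
    quantization="awq",
    kv_cache_dtype="fp8",           # high priority: FP8 KV cache
    gpu_memory_utilization=0.95,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
    max_num_seqs=256,
)

sampling_params = SamplingParams(   # high priority: tuned sampling defaults
    temperature=0.2,
    top_p=0.9,
    max_tokens=20000,
    stop=["<|end_of_plan|>"],
    skip_special_tokens=False,
)
```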

## References

- [vLLM Quantization Docs](https://docs.vllm.ai/en/latest/features/quantization/)
- [LLM Compressor Docs](https://docs.vllm.ai/projects/llm-compressor/)
- [vLLM Performance Guide](https://docs.vllm.ai/en/latest/performance/)
- [vLLM: Efficient Memory Management for LLM Serving with PagedAttention (arXiv:2309.06180)](https://arxiv.org/abs/2309.06180)