# Stack 2.9 Optimization Guide
This guide covers optimizing Stack 2.9 for fast, efficient inference while maintaining quality.
## Overview
Stack 2.9 can be quantized from 64GB (bfloat16) down to ~18GB (4-bit) with minimal quality loss, enabling deployment on consumer GPUs.
## Quick Start
```bash
# 1. Quantize the model
python quantize.py \
--model-path ./output/stack-2.9-merged \
--output-path ./output/stack-2.9-quantized \
--method bnb \
--bits 4
# 2. Benchmark the optimized model
python benchmark_optimized.py \
--optimized-model ./output/stack-2.9-quantized
# 3. Upload to HuggingFace
python upload_hf.py \
--model-path ./output/stack-2.9-quantized \
--repo-id your-username/stack-2.9
```
## Quantization Methods
### 1. BitsAndBytes (Recommended)
Most compatible, good quality, fast inference.
```bash
python quantize.py --method bnb --bits 4
```
**Pros:**
- Works on any GPU
- Fast inference
- No calibration data needed
- Good quality preservation
**Cons:**
- ~4x compression (not the highest compression of the three methods)
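If you want to load the merged checkpoint in 4-bit on the fly rather than running the script, a minimal sketch with transformers + bitsandbytes looks like the following. NF4 with bf16 compute and double quantization are common defaults; the exact settings `quantize.py` uses may differ.
```python
# Minimal on-the-fly 4-bit load with transformers + bitsandbytes (sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "./output/stack-2.9-merged",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./output/stack-2.9-merged")
```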
### 2. AWQ (Activation-Aware Weight Quantization)
Best quality/performance ratio, but requires specific hardware.
```bash
python quantize.py --method awq
```
**Pros:**
- Best quality preservation
- Activation-aware scaling protects the most salient weights
- Good for specific tasks
**Cons:**
- Requires recent GPU
- Requires calibration data
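As a rough illustration of what an AWQ pass involves, here is a sketch using the AutoAWQ library directly (not necessarily what `quantize.py` does internally; the output path is illustrative):
```python
# AWQ quantization sketch with the AutoAWQ library; calibration runs inside quantize().
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./output/stack-2.9-merged"
quant_path = "./output/stack-2.9-awq"          # illustrative output location
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # uses a default calibration set
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```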
### 3. GPTQ
Good compression, slower inference.
```bash
python quantize.py --method gptq --bits 4
```
**Pros:**
- Excellent compression
- Well-studied method
**Cons:**
- Requires calibration
- Slower inference than AWQ/BNB
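For reference, the same step expressed through transformers' `GPTQConfig` (a sketch; the calibration dataset choice and output path here are illustrative):
```python
# GPTQ sketch via transformers; quantization happens during from_pretrained
# using calibration data (the built-in "c4" option here).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_path = "./output/stack-2.9-merged"
tokenizer = AutoTokenizer.from_pretrained(model_path)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("./output/stack-2.9-gptq")      # illustrative output location
tokenizer.save_pretrained("./output/stack-2.9-gptq")
```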
## Model Sizes
| Precision | Size | Min GPU VRAM | Quality |
|------------|------|--------------|---------|
| bfloat16 | 64 GB | 80 GB | 100% |
| float16 | 64 GB | 64 GB | 99% |
| int8 | 32 GB | 40 GB | 95% |
| int4 | 18 GB | 24 GB | 90-95% |
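These sizes follow roughly from parameter count times bytes per weight (the base model benchmarked below is ~32B parameters); runtime VRAM is higher because of activations and the KV cache:
```python
# Back-of-envelope weight sizes for a ~32B-parameter model.
params = 32e9
for name, bytes_per_weight in {"bf16/fp16": 2.0, "int8": 1.0, "int4": 0.5}.items():
    print(f"{name:>9}: ~{params * bytes_per_weight / 1e9:.0f} GB of weights")
# int4 lands near 16 GB; quantization metadata and unquantized layers push it to ~18 GB.
```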
## Benchmarking
Compare optimized vs base model:
```bash
python benchmark_optimized.py \
--base-model Qwen/Qwen2.5-Coder-32B \
--optimized-model ./output/stack-2.9-quantized \
--num-runs 5 \
--test-mmlu
```
Expected results (int4 vs bf16):
- **Speed**: 2-3x faster
- **Memory**: 60-70% reduction
- **Quality**: ~92-95% preserved
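For a quick sanity check outside the benchmark script, a minimal tokens/sec and peak-memory probe might look like this (a sketch; `benchmark_optimized.py` remains the authoritative harness):
```python
# Quick tokens/sec + peak-memory probe for the quantized checkpoint (single GPU assumed).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output/stack-2.9-quantized"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer("Write a binary search in Python.", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
print(f"{torch.cuda.max_memory_allocated() / 1e9:.1f} GB peak GPU memory")
```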
## API Server
Deploy an OpenAI-compatible API:
```bash
# Install dependencies
pip install fastapi uvicorn transformers torch
# Start server
python convert_openai.py \
--model-path ./output/stack-2.9-quantized \
--port 8000
# Test
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "stack-2.9",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
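The same request from Python, assuming the server exposes the standard `/v1/chat/completions` route shown in the curl example (requires `pip install openai`):
```python
# Minimal Python client against the local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="stack-2.9",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```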
## vLLM Deployment
For production, use vLLM:
```bash
pip install vllm
vllm serve ./output/stack-2.9-quantized \
--dtype half \
--tensor-parallel-size 2 \
--max-model-len 32768
```
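For offline batch generation, vLLM's Python API mirrors the serve flags above (a sketch; `tensor_parallel_size=2` assumes two GPUs, as in the serve command):
```python
# Offline batch inference with vLLM, using the same settings as `vllm serve` above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./output/stack-2.9-quantized",
    dtype="half",
    tensor_parallel_size=2,     # assumes two GPUs
    max_model_len=32768,
)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Write a Python function that reverses a linked list."], params)
print(outputs[0].outputs[0].text)
```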
## HuggingFace Upload
```bash
# Upload model
python upload_hf.py \
--model-path ./output/stack-2.9-quantized \
--repo-id your-username/stack-2.9 \
--token hf_your_token
# Upload with Gradio Spaces demo
python upload_hf.py \
--model-path ./output/stack-2.9-quantized \
--repo-id your-username/stack-2.9 \
--add-spaces
```
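If you prefer not to use `upload_hf.py`, the equivalent upload with `huggingface_hub` directly is roughly:
```python
# Direct upload with huggingface_hub (an alternative to upload_hf.py).
from huggingface_hub import HfApi

api = HfApi(token="hf_your_token")   # or authenticate via `huggingface-cli login`
api.create_repo("your-username/stack-2.9", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./output/stack-2.9-quantized",
    repo_id="your-username/stack-2.9",
)
```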
## Expected Performance
With int4 quantization:
| Metric | Value |
|--------|-------|
| Tokens/sec | 30-50 |
| Memory (GPU) | 18-22 GB |
| Model size | ~18 GB |
| Cold start | 10-20s |
## Quality Preservation
Stack 2.9 maintains ~92-95% quality after int4 quantization:
- Code generation: ~95% (excellent for most tasks)
- Reasoning: ~90% (may struggle with complex logic)
- General knowledge: ~92%
## Troubleshooting
### Out of Memory
```bash
# Try int8 instead of int4
python quantize.py --method bnb --bits 8
# Or use CPU offloading
python convert_openai.py --device-map cpu
```
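As an illustration of CPU offloading in plain transformers (the memory caps below are placeholders, not tuned values):
```python
# CPU-offload sketch: cap GPU usage and let accelerate spill remaining layers to RAM.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./output/stack-2.9-merged",
    device_map="auto",                          # place layers on GPU first, then CPU
    max_memory={0: "20GiB", "cpu": "96GiB"},    # placeholder caps, adjust to your machine
    torch_dtype=torch.bfloat16,
)
```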
### Slow Inference
- Use vLLM for 2-3x speedup
- Enable flash attention (if supported; see the sketch after this list)
- Use shorter context
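One way to enable FlashAttention-2 when loading with transformers (requires `pip install flash-attn` and an Ampere-or-newer GPU):
```python
# Load with FlashAttention-2 kernels enabled (only if flash-attn is installed
# and the GPU supports it).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./output/stack-2.9-quantized",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```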
### Quality Issues
- Try GPTQ instead of BNB
- Use int8 instead of int4
- Increase tokens per generation
## Production Checklist
- [ ] Quantize model
- [ ] Benchmark against base
- [ ] Run quality tests
- [ ] Test API endpoints
- [ ] Set up monitoring
- [ ] Configure rate limiting
- [ ] Set up autoscaling
- [ ] Document deployment
## Resources
- [AWQ Paper](https://arxiv.org/abs/2306.00978)
- [GPTQ Paper](https://arxiv.org/abs/2210.17323)
- [vLLM Documentation](https://docs.vllm.ai/)
- [HuggingFace Hub](https://huggingface.co/docs/hub/) |