# Serving CloverLM with vLLM (Quartet II NVFP4)
## Prerequisites
First set up the environment as described in `lm_eval/README.md`. You will also need:
- NVIDIA Blackwell GPU (B300 / B200 / RTX 5090) for real Quartet II NVFP4 kernels
- CUDA 13.0+
- Python 3.11+
- The Quartet II kernels (`quartet2` package) installed
## 1. Environment Setup
```bash
# Activate the existing environment
source .venv/bin/activate
# Set CUDA paths
export CUDA_HOME=/usr/local/cuda-13.0/
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
```
## 2. Install vLLM
```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest \
  | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=130
export CPU_ARCH=$(uname -m)
uv pip install \
  "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
```
## 3. Serve the Model
### Offline inference (quick test)
```bash
cd CloverLM/vllm_plugin
python serve.py
```
### OpenAI-compatible API server
```bash
cd CloverLM/vllm_plugin
python serve.py --api --port 8000
```
Then query:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "path/to/CloverLM",
    "prompt": "The capital of France is",
    "max_tokens": 64,
    "temperature": 0.8
  }'
```
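The same request can be issued from Python using only the standard library. A minimal sketch — the helper name `build_completion_request` is ours, not part of vLLM:

```python
import json
from urllib import request

def build_completion_request(base_url, model, prompt, max_tokens=64, temperature=0.8):
    """Build an OpenAI-style /v1/completions request (hypothetical helper)."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the server from above running:
# req = build_completion_request("http://localhost:8000", "path/to/CloverLM",
#                                "The capital of France is")
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```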
### Options
| Flag | Default | Description |
|------|---------|-------------|
| `--model` | `../` (CloverLM dir) | Path to CloverLM model directory |
| `--api` | off | Start OpenAI-compatible API server |
| `--port` | 8000 | API server port |
| `--host` | 0.0.0.0 | API server host |
| `--tp` | 1 | Tensor parallel size |
| `--max-model-len` | 1024 | Maximum context length |
| `--gpu-memory-utilization` | 0.9 | GPU memory fraction to use |
## Architecture
The vLLM integration consists of three components:
1. **`quartet2_quant.py`** -- Quartet II quantization plugin registered as `"quartet2"`.
Wraps the Quartet II on-the-fly FP4 quantization (`quant_fp4` + `flashinfer.mm_fp4`)
into vLLM's `LinearMethodBase` interface. Weights stay in bf16; quantization happens
at each forward pass.
2. **`cloverlm_vllm.py`** -- Full vLLM model implementation with paged KV cache.
Reimplements CloverLM's architecture using vLLM primitives:
- `ColumnParallelLinear` / `RowParallelLinear` for Q/K/V/O and MLP projections
- vLLM `Attention` for paged KV caching and efficient attention
- Custom RoPE (base 1024, repeat_interleave pattern)
- Sphere normalization on Q/K before attention
- Per-head learnable scale parameter
- Squared ReLU activation in MLP
- Post-sublayer RMSNorm (not pre-norm)
3. **`serve.py`** -- Entry point that registers both the quantization plugin and model,
then launches vLLM in offline or API mode.
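To make the on-the-fly path in `quartet2_quant.py` concrete, here is a plain-Python sketch of block-scaled E2M1 (FP4) rounding. It is an illustration only: the function names are ours, and the real NVFP4 layout used by `quant_fp4`/`mm_fp4` (small blocks with FP8 scales) differs in detail.

```python
import math

# The eight non-negative magnitudes representable in E2M1 (FP4).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Round one block of values to E2M1 under a single shared scale (sketch)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0  # map the block's largest magnitude onto 6.0
    quantized = [
        math.copysign(min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g)), v)
        for v in values
    ]
    return quantized, scale

def dequantize_block(quantized, scale):
    """Recover approximate values from E2M1 codes plus the block scale."""
    return [q * scale for q in quantized]
```

Because the weights stay in bf16, this rounding happens on every forward pass; the FP4 matmul itself is then dispatched to `flashinfer.mm_fp4`.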
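Similarly, the non-standard numerics in `cloverlm_vllm.py` — interleaved RoPE with base 1024, sphere normalization, and squared ReLU — can be sketched as scalar reference code (plain Python, not the vLLM kernels; the helper names are ours):

```python
import math

def rope_interleaved(x, pos, base=1024.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... — the
    repeat_interleave layout applies each frequency to one adjacent pair."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # inv_freq = base^(-2k/d), k = i // 2
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out

def sphere_norm(x, head_scale=1.0, eps=1e-6):
    """Project onto the unit sphere, then apply the learnable per-head scale."""
    norm = math.sqrt(sum(v * v for v in x)) + eps
    return [head_scale * v / norm for v in x]

def squared_relu(x):
    """max(0, x)^2, the MLP activation."""
    return [max(0.0, v) ** 2 for v in x]
```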
## Known Limitations
- **CUDA graphs**: Currently `enforce_eager=True` is required because the Quartet II
on-the-fly quantization kernels (`quant_fp4` + `mm_fp4`) are not compatible with
CUDA graph capture. This means slightly higher per-token latency compared to
CUDA-graph-enabled models. A future update to the Quartet II kernels could remove
this limitation.
## Troubleshooting
**"No module named 'quartet2'"**: Ensure the Quartet II kernels are installed:
```bash
uv pip install "quartet2 @ git+https://github.com/IST-DASLab/Quartet-II.git#subdirectory=kernels"
```
**CUDA errors**: Make sure `CUDA_HOME` points to CUDA 13.0+ and `TRITON_PTXAS_PATH` is set.
**Out of memory**: Reduce `--gpu-memory-utilization` or use `--tp 2` for tensor parallelism.