| # Serving CloverLM with vLLM (Quartet II NVFP4) |
|
|
| ## Prerequisites |
|
|
| Before following this guide, set up the environment as described in `lm_eval/README.md`. |
|
|
| - NVIDIA Blackwell GPU (B300 / B200 / RTX 5090) for real Quartet II NVFP4 kernels |
| - CUDA 13.0+ |
| - Python 3.11+ |
| - The Quartet II kernels (`quartet2` package) installed |
|
|
| ## 1. Environment Setup |
|
|
| ```bash |
| # Activate the existing environment |
| source .venv/bin/activate |
| |
| # Set CUDA paths |
| export CUDA_HOME=/usr/local/cuda-13.0 |
| export TRITON_PTXAS_PATH="${CUDA_HOME}/bin/ptxas" |
| export PATH="${CUDA_HOME}/bin:$PATH" |
| export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH:-}" |
| ``` |
|
|
| ## 2. Install vLLM |
|
|
| ```bash |
| # Resolve the latest vLLM release tag, without the leading "v" (requires curl and jq) |
| export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest \ |
| | jq -r .tag_name | sed 's/^v//') |
| export CUDA_VERSION=130 |
| export CPU_ARCH=$(uname -m) |
| |
| uv pip install \ |
| "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_35_${CPU_ARCH}.whl" \ |
| --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION} |
| ``` |
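
The install command above always resolves the latest release. For a reproducible setup it can help to pin a version and compose the wheel URL explicitly. The sketch below mirrors the URL pattern from the shell snippet; the helper name `vllm_wheel_url` is hypothetical, and the version in the usage line is a made-up placeholder, not a recommended release.

```python
import platform
from typing import Optional


def vllm_wheel_url(version: str, cuda_version: str = "130",
                   cpu_arch: Optional[str] = None) -> str:
    """Compose the release-wheel URL following the pattern in the install step.

    `version` may be given with or without a leading "v".
    `cpu_arch` defaults to the local machine type (what `uname -m` prints).
    """
    arch = cpu_arch or platform.machine()
    version = version[1:] if version.startswith("v") else version
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{version}/vllm-{version}+cu{cuda_version}"
        f"-cp38-abi3-manylinux_2_35_{arch}.whl"
    )


if __name__ == "__main__":
    # "1.2.3" is a placeholder version used only for illustration.
    print(vllm_wheel_url("1.2.3", cpu_arch="x86_64"))
```

The printed URL can be passed to `uv pip install` in place of the interpolated one.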
|
|
| ## 3. Serve the Model |
|
|
| ### Offline inference (quick test) |
|
|
| ```bash |
| cd CloverLM/vllm_plugin |
| python serve.py |
| ``` |
|
|
| ### OpenAI-compatible API server |
|
|
| ```bash |
| cd CloverLM/vllm_plugin |
| python serve.py --api --port 8000 |
| ``` |
|
|
| Then query: |
|
|
| ```bash |
| curl http://localhost:8000/v1/completions \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "model": "path/to/CloverLM", |
| "prompt": "The capital of France is", |
| "max_tokens": 64, |
| "temperature": 0.8 |
| }' |
| ``` |
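
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server follows the usual OpenAI completions response shape (`choices[0].text`); the helper names `build_completion_request` and `complete` are illustrative and not part of `serve.py`.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt,
                             max_tokens=64, temperature=0.8):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60.0):
    """Send the request and return the generated text.

    Assumes an OpenAI-style response body with a `choices` list.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    req = build_completion_request(
        "http://localhost:8000", "path/to/CloverLM", "The capital of France is"
    )
    print(complete(req))  # requires the API server to be running
```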
|
|
| ### Options |
|
|
| | Flag | Default | Description | |
| |------|---------|-------------| |
| | `--model` | `../` (CloverLM dir) | Path to CloverLM model directory | |
| | `--api` | off | Start OpenAI-compatible API server | |
| | `--port` | 8000 | API server port | |
| | `--host` | 0.0.0.0 | API server host | |
| | `--tp` | 1 | Tensor parallel size | |
| | `--max-model-len` | 1024 | Maximum context length | |
| | `--gpu-memory-utilization` | 0.9 | GPU memory fraction to use | |
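
As a reading aid, the table above corresponds to an argument parser along the following lines. This is a sketch reconstructed from the documented flags and defaults only; the actual parser in `serve.py` may be structured differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the documented flags (not the real serve.py)."""
    p = argparse.ArgumentParser(description="Serve CloverLM with vLLM")
    p.add_argument("--model", default="../",
                   help="Path to CloverLM model directory")
    p.add_argument("--api", action="store_true",
                   help="Start OpenAI-compatible API server")
    p.add_argument("--port", type=int, default=8000, help="API server port")
    p.add_argument("--host", default="0.0.0.0", help="API server host")
    p.add_argument("--tp", type=int, default=1, help="Tensor parallel size")
    p.add_argument("--max-model-len", type=int, default=1024,
                   help="Maximum context length")
    p.add_argument("--gpu-memory-utilization", type=float, default=0.9,
                   help="GPU memory fraction to use")
    return p


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With no arguments this corresponds to the offline quick test; adding `--api` switches to the server mode described above.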
|
|
| ## Architecture |
|
|
| The vLLM integration consists of three components: |
|
|
| 1. **`quartet2_quant.py`** -- Quartet II quantization plugin registered as `"quartet2"`. |
| Wraps the Quartet II on-the-fly FP4 quantization (`quant_fp4` + `flashinfer.mm_fp4`) |
| into vLLM's `LinearMethodBase` interface. Weights stay in bf16; quantization happens |
| at each forward pass. |
| |
| 2. **`cloverlm_vllm.py`** -- Full vLLM model implementation with paged KV cache. |
| Reimplements CloverLM's architecture using vLLM primitives: |
| - `ColumnParallelLinear` / `RowParallelLinear` for Q/K/V/O and MLP projections |
| - vLLM `Attention` for paged KV caching and efficient attention |
| - Custom RoPE (base 1024, repeat_interleave pattern) |
| - Sphere normalization on Q/K before attention |
| - Per-head learnable scale parameter |
| - Squared ReLU activation in MLP |
| - Post-sublayer RMSNorm (not pre-norm) |
| |
| 3. **`serve.py`** -- Entry point that registers both the quantization plugin and model, |
| then launches vLLM in offline or API mode. |
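
To make the architectural pieces in item 2 concrete, here is a dependency-free sketch of three of them on a single vector. It is not the code in `cloverlm_vllm.py`: the function names are hypothetical, "sphere normalization" is read here as scaling a vector to unit L2 norm, and the "repeat_interleave pattern" is read as rotating adjacent pairs `(x[2i], x[2i+1])` by a shared angle. Both readings are assumptions.

```python
import math


def squared_relu(x):
    """Squared ReLU: max(v, 0)^2 applied elementwise."""
    return [max(v, 0.0) ** 2 for v in x]


def sphere_normalize(x, eps=1e-6):
    """Scale a vector onto the unit sphere (assumed meaning of sphere norm)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


def rope_interleaved(x, position, base=1024.0):
    """Rotary embedding where adjacent pairs (x[2i], x[2i+1]) share one angle.

    The angle for pair i is position * base^(-2i/d), with base 1024 as
    stated in the component list; the pairing scheme is an assumption.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


if __name__ == "__main__":
    q = sphere_normalize([3.0, 4.0, 0.0, 0.0])
    print(rope_interleaved(q, position=5))
    print(squared_relu([-1.0, 0.5, 2.0]))
```

Because the rotation preserves length, applying RoPE after sphere normalization keeps Q and K at unit norm, so attention logits are bounded before any per-head scale is applied.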
| |
| ## Known Limitations |
| |
| - **CUDA graphs**: `enforce_eager=True` is currently required because the Quartet II |
| on-the-fly quantization kernels (`quant_fp4` + `mm_fp4`) are not compatible with |
| CUDA graph capture. Per-token latency is therefore somewhat higher than for models |
| that run with CUDA graphs enabled. A future update to the Quartet II kernels could |
| remove this limitation. |
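
When constructing the engine by hand instead of through `serve.py`, the eager-mode requirement translates into an engine argument. The sketch below collects the relevant keyword arguments for vLLM's `LLM` class. It assumes the `"quartet2"` quantization method and the CloverLM model class have already been registered (which `serve.py` does), and that `serve.py` passes roughly these values; `engine_kwargs` and `build_llm` are hypothetical helper names.

```python
def engine_kwargs(model_dir="../", tp=1, max_model_len=1024,
                  gpu_memory_utilization=0.9):
    """Keyword arguments for vllm.LLM with CUDA graph capture disabled."""
    return {
        "model": model_dir,
        "quantization": "quartet2",   # plugin name from quartet2_quant.py
        "enforce_eager": True,        # required: kernels cannot be graph-captured
        "tensor_parallel_size": tp,
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def build_llm(**overrides):
    """Create the engine; needs vLLM, a Blackwell GPU, and prior plugin registration."""
    from vllm import LLM  # imported lazily so the helper above works without vLLM

    kwargs = engine_kwargs()
    kwargs.update(overrides)
    return LLM(**kwargs)
```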
|
|
| ## Troubleshooting |
|
|
| **"No module named 'quartet2'"**: Ensure the Quartet II kernels are installed: |
| ```bash |
| uv pip install "quartet2 @ git+https://github.com/IST-DASLab/Quartet-II.git#subdirectory=kernels" |
| ``` |
|
|
| **CUDA errors**: Make sure `CUDA_HOME` points to CUDA 13.0+ and `TRITON_PTXAS_PATH` is set. |
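
A quick way to rule out a misconfigured environment is to check the two variables programmatically. The following sketch only verifies that the paths exist and that `ptxas` is executable; it does not check the CUDA version itself. The function name `check_cuda_env` is hypothetical.

```python
import os
from pathlib import Path


def check_cuda_env(env=None):
    """Return a list of problems found with CUDA_HOME and TRITON_PTXAS_PATH."""
    env = os.environ if env is None else env
    problems = []

    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    elif not Path(cuda_home).is_dir():
        problems.append(f"CUDA_HOME is not a directory: {cuda_home}")

    ptxas = env.get("TRITON_PTXAS_PATH")
    if not ptxas:
        problems.append("TRITON_PTXAS_PATH is not set")
    elif not (Path(ptxas).is_file() and os.access(ptxas, os.X_OK)):
        problems.append(f"TRITON_PTXAS_PATH is not an executable file: {ptxas}")

    return problems


if __name__ == "__main__":
    issues = check_cuda_env()
    print("\n".join(issues) if issues else "CUDA environment variables look OK")
```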
|
|
| **Out of memory**: Reduce `--gpu-memory-utilization` or use `--tp 2` for tensor parallelism. |
|
|