sakamakismile's picture
BLS-Mini-Code-1.0 NVFP4 (W4A4) for vLLM: 17.97GB, TP=2 ~104 t/s, model card w/ flags+TPS+benchmarks
f6c7144 verified
---
base_model: CohereLabs/BLS-Mini-Code-1.0
base_model_relation: quantized
pipeline_tag: text-generation
library_name: vllm
tags:
- cohere2_moe
- nvfp4
- compressed-tensors
- llm-compressor
- code
- vllm
- blackwell
- w4a4
language:
- en
license: cc-by-nc-4.0
---
# BLS-Mini-Code-1.0-NVFP4
**NVFP4 (W4A4) quantization** of [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) — a `cohere2_moe` code-specialist MoE — packed to **17.97 GB** (from ~61 GB BF16) so it runs a 30B-class code model on **two 16 GB GPUs** at **~104 tok/s** on vLLM.
> Made by quantizing the original CohereLabs release with [llm-compressor](https://github.com/vllm-project/llm-compressor) + [compressed-tensors](https://github.com/neuralmagic/compressed-tensors). All credit for the model itself goes to **Cohere Labs**; this repo only adds the FP4 quantization. Please review the original model's terms before use.
## What it is
| | |
|---|---|
| Architecture | `Cohere2MoeForCausalLM` (`cohere2_moe`) |
| Total / active params | 30.5B total · **~3–4B active** (128 experts, top-8, sigmoid router; layer-0 dense) |
| Context | 500k positions (rope θ=50000), sliding-window 4096 + interleaved full attention |
| Quantization | **NVFP4** `nvfp4-pack-quantized`, W4A4, group size 16, FP8-E4M3 scales |
| What's quantized | all **18,631** Linear layers (attention + dense layer-0 + **18,432 experts**); router/embeddings/norms kept BF16 |
| Size | **17.97 GB** (single `model.safetensors`) |
| EOS / chat | `<|END_OF_TURN_TOKEN|>` (255001); `chat_template.jinja` included; optional thinking mode |
## Serving with vLLM
Requires **vLLM ≥ 0.21** (native `Cohere2MoeForCausalLM` + compressed-tensors NVFP4 auto-detect — **no `--quantization` flag needed**) and `transformers ≥ 5.8` (where `cohere2_moe` lives). Tested on RTX PRO 2000 Blackwell (sm_120), torch cu128.
```bash
# TP=2 across two GPUs. On a box WITHOUT NVLink/P2P, the two NCCL flags +
# --disable-custom-all-reduce are MANDATORY (else it hangs at the first all-reduce).
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
vllm serve sakamakismile/BLS-Mini-Code-1.0-NVFP4 \
--tensor-parallel-size 2 \
--disable-custom-all-reduce \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--trust-remote-code \
--port 8000
```
**Clean output / reasoning split (optional).** The model wraps answers in `<|START_TEXT|>…<|END_TEXT|>` (and `<|START_THINKING|>…<|END_THINKING|>` when thinking). To get clean `content` + separate `reasoning_content` server-side, add `--reasoning-parser cohere_command4` (requires `pip install cohere_melody`). Otherwise strip the tags client-side. Toggle thinking per request with `chat_template_kwargs={"reasoning": true|false}`.
> **NVLink boxes:** drop `NCCL_P2P_DISABLE`/`NCCL_IB_DISABLE`/`--disable-custom-all-reduce`. They are only needed on PCIe-only / no-P2P machines.
## Performance (throughput)
Measured on 2× RTX PRO 2000 Blackwell (16 GB), TP=2, 256-token outputs. **TPS is a latency↔throughput curve** — fastest alone, more aggregate under load:
| concurrent requests | per-request tok/s | aggregate tok/s |
|---:|---:|---:|
| 1 | **111** | 111 |
| 8 | 81 | 650 |
| 16 | 61 | 966 |
| 32 | 50 | 1609 |
| 64 | 33 | 2091 |
| 384 | 9 | 2606 (saturated) |
On a **no-P2P box, scale-OUT beats scale-UP**: TP=2 (2 GPU) peaks ~2606 tok/s and is the most GPU-efficient (1303 tok/s/GPU); TP=4 (4 GPU) is *slower* (2494, all-reduce-bound); data-parallel 2×TP2 (4 GPU) wins only at very high concurrency (~3343 tok/s).
## Coding benchmarks
EvalPlus-hardened **HumanEval+ (163)** and **MBPP+ (160)** test suites, single sample, code extracted from one ```python block. *(Local harness prompt/sampling; not identical to the official EvalPlus greedy methodology, so treat as tier-indicative.)*
| mode | HumanEval+ /163 | MBPP+ /160 |
|---|---:|---:|
| **greedy (temp 0, no-think)** — best | **145 (89.0%)** | **120 (75.0%)** |
| temp 0.4, no-think | 138 (84.7%) | 116 (72.5%) |
| reasoning ON | 131 (80.4%) | 112 (70.0%) |
**Notes from testing:** greedy no-think is this model's sweet spot (sampling noise and long reasoning both *lowered* pass@1 here — possibly a token-budget effect for reasoning). For a ~3–4B-active model this lands in the same HumanEval+ tier as much larger "Flash"-class models, at a fraction of the compute.
## Quantization recipe
`llm-compressor 0.11.0` + `compressed-tensors 0.16.0`, scheme `NVFP4`, `targets="Linear"`. Because `cohere2_moe` stores experts as **fused 3-D parameters** (`gate_up_proj`/`down_proj`) rather than `nn.Linear`, a custom MoE-calibration module was needed to unpack experts into per-expert linears before quantization (mirrors llm-compressor's gemma4 handler; verified numerically identical to the HF forward). Calibration: `neuralmagic/calibration` (LLM split), 512 samples, seq len 1024, `calibrate_all_experts=True`. Ignore: router (`re:.*mlp\.gate$`), all norms, `embed_tokens`, `lm_head`.
## Attribution & license
- **Base model:** [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) by **Cohere Labs** — all model capabilities and credit belong to them. This repo is a quantization only.
- License: **CC-BY-NC-4.0** (non-commercial). The base model does not state an explicit license upstream; this tag follows Cohere Labs' usual research-weight convention. **Please refer to the [original repository](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) for the authoritative terms of use** — if Cohere Labs specifies different terms, those govern the base weights.
Feedback welcome — open a discussion if you hit issues or have throughput/quality numbers to share. 🙏