BLS-Mini-Code-1.0 NVFP4 (W4A4) for vLLM: 17.97GB, TP=2 ~104 t/s, model card w/ flags+TPS+benchmarks

f6c7144 verified 1 day ago

5.73 kB

	---
	base_model: CohereLabs/BLS-Mini-Code-1.0
	base_model_relation: quantized
	pipeline_tag: text-generation
	library_name: vllm
	tags:
	- cohere2_moe
	- nvfp4
	- compressed-tensors
	- llm-compressor
	- code
	- vllm
	- blackwell
	- w4a4
	language:
	- en
	license: cc-by-nc-4.0
	---

	# BLS-Mini-Code-1.0-NVFP4

	NVFP4 (W4A4) quantization of [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) — a `cohere2_moe` code-specialist MoE — packed to 17.97 GB (from ~61 GB BF16) so it runs a 30B-class code model on two 16 GB GPUs at ~104 tok/s on vLLM.

	> Made by quantizing the original CohereLabs release with [llm-compressor](https://github.com/vllm-project/llm-compressor) + [compressed-tensors](https://github.com/neuralmagic/compressed-tensors). All credit for the model itself goes to Cohere Labs; this repo only adds the FP4 quantization. Please review the original model's terms before use.

	## What it is

	\| \| \|
	\|---\|---\|
	\| Architecture \| `Cohere2MoeForCausalLM` (`cohere2_moe`) \|
	\| Total / active params \| 30.5B total · ~3–4B active (128 experts, top-8, sigmoid router; layer-0 dense) \|
	\| Context \| 500k positions (rope θ=50000), sliding-window 4096 + interleaved full attention \|
	\| Quantization \| NVFP4 `nvfp4-pack-quantized`, W4A4, group size 16, FP8-E4M3 scales \|
	\| What's quantized \| all 18,631 Linear layers (attention + dense layer-0 + 18,432 experts); router/embeddings/norms kept BF16 \|
	\| Size \| 17.97 GB (single `model.safetensors`) \|
	\| EOS / chat \| `<\|END_OF_TURN_TOKEN\|>` (255001); `chat_template.jinja` included; optional thinking mode \|

	## Serving with vLLM

	Requires vLLM ≥ 0.21 (native `Cohere2MoeForCausalLM` + compressed-tensors NVFP4 auto-detect — no `--quantization` flag needed) and `transformers ≥ 5.8` (where `cohere2_moe` lives). Tested on RTX PRO 2000 Blackwell (sm_120), torch cu128.

	```bash
	# TP=2 across two GPUs. On a box WITHOUT NVLink/P2P, the two NCCL flags +
	# --disable-custom-all-reduce are MANDATORY (else it hangs at the first all-reduce).
	NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
	vllm serve sakamakismile/BLS-Mini-Code-1.0-NVFP4 \
	--tensor-parallel-size 2 \
	--disable-custom-all-reduce \
	--max-model-len 16384 \
	--gpu-memory-utilization 0.90 \
	--trust-remote-code \
	--port 8000
	```

	Clean output / reasoning split (optional). The model wraps answers in `<\|START_TEXT\|>…<\|END_TEXT\|>` (and `<\|START_THINKING\|>…<\|END_THINKING\|>` when thinking). To get clean `content` + separate `reasoning_content` server-side, add `--reasoning-parser cohere_command4` (requires `pip install cohere_melody`). Otherwise strip the tags client-side. Toggle thinking per request with `chat_template_kwargs={"reasoning": true\|false}`.

	> NVLink boxes: drop `NCCL_P2P_DISABLE`/`NCCL_IB_DISABLE`/`--disable-custom-all-reduce`. They are only needed on PCIe-only / no-P2P machines.

	## Performance (throughput)

	Measured on 2× RTX PRO 2000 Blackwell (16 GB), TP=2, 256-token outputs. TPS is a latency↔throughput curve — fastest alone, more aggregate under load:

	\| concurrent requests \| per-request tok/s \| aggregate tok/s \|
	\|---:\|---:\|---:\|
	\| 1 \| 111 \| 111 \|
	\| 8 \| 81 \| 650 \|
	\| 16 \| 61 \| 966 \|
	\| 32 \| 50 \| 1609 \|
	\| 64 \| 33 \| 2091 \|
	\| 384 \| 9 \| 2606 (saturated) \|

	On a no-P2P box, scale-OUT beats scale-UP: TP=2 (2 GPU) peaks ~2606 tok/s and is the most GPU-efficient (1303 tok/s/GPU); TP=4 (4 GPU) is slower (2494, all-reduce-bound); data-parallel 2×TP2 (4 GPU) wins only at very high concurrency (~3343 tok/s).

	## Coding benchmarks

	EvalPlus-hardened HumanEval+ (163) and MBPP+ (160) test suites, single sample, code extracted from one ```python block. (Local harness prompt/sampling; not identical to the official EvalPlus greedy methodology, so treat as tier-indicative.)

	\| mode \| HumanEval+ /163 \| MBPP+ /160 \|
	\|---\|---:\|---:\|
	\| greedy (temp 0, no-think) — best \| 145 (89.0%) \| 120 (75.0%) \|
	\| temp 0.4, no-think \| 138 (84.7%) \| 116 (72.5%) \|
	\| reasoning ON \| 131 (80.4%) \| 112 (70.0%) \|

	Notes from testing: greedy no-think is this model's sweet spot (sampling noise and long reasoning both lowered pass@1 here — possibly a token-budget effect for reasoning). For a ~3–4B-active model this lands in the same HumanEval+ tier as much larger "Flash"-class models, at a fraction of the compute.

	## Quantization recipe

	`llm-compressor 0.11.0` + `compressed-tensors 0.16.0`, scheme `NVFP4`, `targets="Linear"`. Because `cohere2_moe` stores experts as fused 3-D parameters (`gate_up_proj`/`down_proj`) rather than `nn.Linear`, a custom MoE-calibration module was needed to unpack experts into per-expert linears before quantization (mirrors llm-compressor's gemma4 handler; verified numerically identical to the HF forward). Calibration: `neuralmagic/calibration` (LLM split), 512 samples, seq len 1024, `calibrate_all_experts=True`. Ignore: router (`re:.*mlp\.gate$`), all norms, `embed_tokens`, `lm_head`.

	## Attribution & license

	- Base model: [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) by Cohere Labs — all model capabilities and credit belong to them. This repo is a quantization only.
	- License: CC-BY-NC-4.0 (non-commercial). The base model does not state an explicit license upstream; this tag follows Cohere Labs' usual research-weight convention. Please refer to the [original repository](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) for the authoritative terms of use — if Cohere Labs specifies different terms, those govern the base weights.

	Feedback welcome — open a discussion if you hit issues or have throughput/quality numbers to share. 🙏