BLS-Mini-Code-1.0 NVFP4 (W4A4) for vLLM: 17.97GB, TP=2 ~104 t/s, model card w/ flags+TPS+benchmarks

f6c7144 verified 1 day ago

5.73 kB

base_model: CohereLabs/BLS-Mini-Code-1.0
base_model_relation: quantized
pipeline_tag: text-generation
library_name: vllm
tags:
  - cohere2_moe
  - nvfp4
  - compressed-tensors
  - llm-compressor
  - code
  - vllm
  - blackwell
  - w4a4
language:
  - en
license: cc-by-nc-4.0

BLS-Mini-Code-1.0-NVFP4

NVFP4 (W4A4) quantization of CohereLabs/BLS-Mini-Code-1.0 — a cohere2_moe code-specialist MoE — packed to 17.97 GB (from 61 GB BF16) so it runs a 30B-class code model on two 16 GB GPUs at **104 tok/s** on vLLM.

Made by quantizing the original CohereLabs release with llm-compressor + compressed-tensors. All credit for the model itself goes to Cohere Labs; this repo only adds the FP4 quantization. Please review the original model's terms before use.

What it is


Architecture	`Cohere2MoeForCausalLM` (`cohere2_moe`)
Total / active params	30.5B total · ~3–4B active (128 experts, top-8, sigmoid router; layer-0 dense)
Context	500k positions (rope θ=50000), sliding-window 4096 + interleaved full attention
Quantization	NVFP4 `nvfp4-pack-quantized`, W4A4, group size 16, FP8-E4M3 scales
What's quantized	all 18,631 Linear layers (attention + dense layer-0 + 18,432 experts); router/embeddings/norms kept BF16
Size	17.97 GB (single `model.safetensors`)
EOS / chat	`<

Serving with vLLM

Requires vLLM ≥ 0.21 (native Cohere2MoeForCausalLM + compressed-tensors NVFP4 auto-detect — no --quantization flag needed) and transformers ≥ 5.8 (where cohere2_moe lives). Tested on RTX PRO 2000 Blackwell (sm_120), torch cu128.

# TP=2 across two GPUs. On a box WITHOUT NVLink/P2P, the two NCCL flags +
# --disable-custom-all-reduce are MANDATORY (else it hangs at the first all-reduce).
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
vllm serve sakamakismile/BLS-Mini-Code-1.0-NVFP4 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --port 8000

Clean output / reasoning split (optional). The model wraps answers in <|START_TEXT|>…<|END_TEXT|> (and <|START_THINKING|>…<|END_THINKING|> when thinking). To get clean content + separate reasoning_content server-side, add --reasoning-parser cohere_command4 (requires pip install cohere_melody). Otherwise strip the tags client-side. Toggle thinking per request with chat_template_kwargs={"reasoning": true|false}.

NVLink boxes: drop NCCL_P2P_DISABLE/NCCL_IB_DISABLE/--disable-custom-all-reduce. They are only needed on PCIe-only / no-P2P machines.

Performance (throughput)

Measured on 2× RTX PRO 2000 Blackwell (16 GB), TP=2, 256-token outputs. TPS is a latency↔throughput curve — fastest alone, more aggregate under load:

concurrent requests	per-request tok/s	aggregate tok/s
1	111	111
8	81	650
16	61	966
32	50	1609
64	33	2091
384	9	2606 (saturated)

On a no-P2P box, scale-OUT beats scale-UP: TP=2 (2 GPU) peaks ~~2606 tok/s and is the most GPU-efficient (1303 tok/s/GPU); TP=4 (4 GPU) is slower (2494, all-reduce-bound); data-parallel 2×TP2 (4 GPU) wins only at very high concurrency (~~3343 tok/s).

Coding benchmarks

EvalPlus-hardened HumanEval+ (163) and MBPP+ (160) test suites, single sample, code extracted from one ```python block. (Local harness prompt/sampling; not identical to the official EvalPlus greedy methodology, so treat as tier-indicative.)

mode	HumanEval+ /163	MBPP+ /160
greedy (temp 0, no-think) — best	145 (89.0%)	120 (75.0%)
temp 0.4, no-think	138 (84.7%)	116 (72.5%)
reasoning ON	131 (80.4%)	112 (70.0%)

Notes from testing: greedy no-think is this model's sweet spot (sampling noise and long reasoning both lowered pass@1 here — possibly a token-budget effect for reasoning). For a ~3–4B-active model this lands in the same HumanEval+ tier as much larger "Flash"-class models, at a fraction of the compute.

Quantization recipe

llm-compressor 0.11.0 + compressed-tensors 0.16.0, scheme NVFP4, targets="Linear". Because cohere2_moe stores experts as fused 3-D parameters (gate_up_proj/down_proj) rather than nn.Linear, a custom MoE-calibration module was needed to unpack experts into per-expert linears before quantization (mirrors llm-compressor's gemma4 handler; verified numerically identical to the HF forward). Calibration: neuralmagic/calibration (LLM split), 512 samples, seq len 1024, calibrate_all_experts=True. Ignore: router (re:.*mlp\.gate$), all norms, embed_tokens, lm_head.

Attribution & license

Base model: CohereLabs/BLS-Mini-Code-1.0 by Cohere Labs — all model capabilities and credit belong to them. This repo is a quantization only.
License: CC-BY-NC-4.0 (non-commercial). The base model does not state an explicit license upstream; this tag follows Cohere Labs' usual research-weight convention. Please refer to the original repository for the authoritative terms of use — if Cohere Labs specifies different terms, those govern the base weights.

Feedback welcome — open a discussion if you hit issues or have throughput/quality numbers to share. 🙏