BLS-Mini-Code-1.0-NVFP4
NVFP4 (W4A4) quantization of CohereLabs/BLS-Mini-Code-1.0 — a cohere2_moe code-specialist MoE — packed to 17.97 GB (from 61 GB BF16) so it runs a 30B-class code model on two 16 GB GPUs at **104 tok/s** on vLLM.
Made by quantizing the original CohereLabs release with llm-compressor + compressed-tensors. All credit for the model itself goes to Cohere Labs; this repo only adds the FP4 quantization. Please review the original model's terms before use.
What it is
| Architecture | Cohere2MoeForCausalLM (cohere2_moe) |
| Total / active params | 30.5B total · ~3–4B active (128 experts, top-8, sigmoid router; layer-0 dense) |
| Context | 500k positions (rope θ=50000), sliding-window 4096 + interleaved full attention |
| Quantization | NVFP4 nvfp4-pack-quantized, W4A4, group size 16, FP8-E4M3 scales |
| What's quantized | all 18,631 Linear layers (attention + dense layer-0 + 18,432 experts); router/embeddings/norms kept BF16 |
| Size | 17.97 GB (single model.safetensors) |
| EOS / chat | `< |
Serving with vLLM
Requires vLLM ≥ 0.21 (native Cohere2MoeForCausalLM + compressed-tensors NVFP4 auto-detect — no --quantization flag needed) and transformers ≥ 5.8 (where cohere2_moe lives). Tested on RTX PRO 2000 Blackwell (sm_120), torch cu128.
# TP=2 across two GPUs. On a box WITHOUT NVLink/P2P, the two NCCL flags +
# --disable-custom-all-reduce are MANDATORY (else it hangs at the first all-reduce).
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
vllm serve sakamakismile/BLS-Mini-Code-1.0-NVFP4 \
--tensor-parallel-size 2 \
--disable-custom-all-reduce \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--trust-remote-code \
--port 8000
Clean output / reasoning split (optional). The model wraps answers in <|START_TEXT|>…<|END_TEXT|> (and <|START_THINKING|>…<|END_THINKING|> when thinking). To get clean content + separate reasoning_content server-side, add --reasoning-parser cohere_command4 (requires pip install cohere_melody). Otherwise strip the tags client-side. Toggle thinking per request with chat_template_kwargs={"reasoning": true|false}.
NVLink boxes: drop
NCCL_P2P_DISABLE/NCCL_IB_DISABLE/--disable-custom-all-reduce. They are only needed on PCIe-only / no-P2P machines.
Performance (throughput)
Measured on 2× RTX PRO 2000 Blackwell (16 GB), TP=2, 256-token outputs. TPS is a latency↔throughput curve — fastest alone, more aggregate under load:
| concurrent requests | per-request tok/s | aggregate tok/s |
|---|---|---|
| 1 | 111 | 111 |
| 8 | 81 | 650 |
| 16 | 61 | 966 |
| 32 | 50 | 1609 |
| 64 | 33 | 2091 |
| 384 | 9 | 2606 (saturated) |
On a no-P2P box, scale-OUT beats scale-UP: TP=2 (2 GPU) peaks 2606 tok/s and is the most GPU-efficient (1303 tok/s/GPU); TP=4 (4 GPU) is slower (2494, all-reduce-bound); data-parallel 2×TP2 (4 GPU) wins only at very high concurrency (3343 tok/s).
Coding benchmarks
EvalPlus-hardened HumanEval+ (163) and MBPP+ (160) test suites, single sample, code extracted from one ```python block. (Local harness prompt/sampling; not identical to the official EvalPlus greedy methodology, so treat as tier-indicative.)
| mode | HumanEval+ /163 | MBPP+ /160 |
|---|---|---|
| greedy (temp 0, no-think) — best | 145 (89.0%) | 120 (75.0%) |
| temp 0.4, no-think | 138 (84.7%) | 116 (72.5%) |
| reasoning ON | 131 (80.4%) | 112 (70.0%) |
Notes from testing: greedy no-think is this model's sweet spot (sampling noise and long reasoning both lowered pass@1 here — possibly a token-budget effect for reasoning). For a ~3–4B-active model this lands in the same HumanEval+ tier as much larger "Flash"-class models, at a fraction of the compute.
Quantization recipe
llm-compressor 0.11.0 + compressed-tensors 0.16.0, scheme NVFP4, targets="Linear". Because cohere2_moe stores experts as fused 3-D parameters (gate_up_proj/down_proj) rather than nn.Linear, a custom MoE-calibration module was needed to unpack experts into per-expert linears before quantization (mirrors llm-compressor's gemma4 handler; verified numerically identical to the HF forward). Calibration: neuralmagic/calibration (LLM split), 512 samples, seq len 1024, calibrate_all_experts=True. Ignore: router (re:.*mlp\.gate$), all norms, embed_tokens, lm_head.
Attribution & license
- Base model:
CohereLabs/BLS-Mini-Code-1.0by Cohere Labs — all model capabilities and credit belong to them. This repo is a quantization only. - License: CC-BY-NC-4.0 (non-commercial). The base model does not state an explicit license upstream; this tag follows Cohere Labs' usual research-weight convention. Please refer to the original repository for the authoritative terms of use — if Cohere Labs specifies different terms, those govern the base weights.
Feedback welcome — open a discussion if you hit issues or have throughput/quality numbers to share. 🙏
- Downloads last month
- 12
Model tree for sakamakismile/BLS-Mini-Code-1.0-NVFP4
Base model
CohereLabs/BLS-Mini-Code-1.0