---
base_model: CohereLabs/BLS-Mini-Code-1.0
base_model_relation: quantized
pipeline_tag: text-generation
library_name: vllm
tags:
- cohere2_moe
- nvfp4
- compressed-tensors
- llm-compressor
- code
- vllm
- blackwell
- w4a4
language:
- en
license: cc-by-nc-4.0
---

# BLS-Mini-Code-1.0-NVFP4

**NVFP4 (W4A4) quantization** of [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0), a `cohere2_moe` code-specialist MoE, packed to **17.97 GB** (from ~61 GB BF16) so it runs a 30B-class code model on **two 16 GB GPUs** at **~104 tok/s** on vLLM.

> Made by quantizing the original CohereLabs release with [llm-compressor](https://github.com/vllm-project/llm-compressor) + [compressed-tensors](https://github.com/neuralmagic/compressed-tensors). All credit for the model itself goes to **Cohere Labs**; this repo only adds the FP4 quantization. Please review the original model's terms before use.

## What it is

| | |
|---|---|
| Architecture | `Cohere2MoeForCausalLM` (`cohere2_moe`) |
| Total / active params | 30.5B total · **~3–4B active** (128 experts, top-8, sigmoid router; layer-0 dense) |
| Context | 500k positions (rope θ=50000), sliding-window 4096 + interleaved full attention |
| Quantization | **NVFP4** `nvfp4-pack-quantized`, W4A4, group size 16, FP8-E4M3 scales |
| What's quantized | all **18,631** Linear layers (attention + dense layer-0 + **18,432 experts**); router/embeddings/norms kept BF16 |
| Size | **17.97 GB** (single `model.safetensors`) |
| EOS / chat | `<\|END_OF_TURN_TOKEN\|>` (255001); `chat_template.jinja` included; optional thinking mode |

## Serving with vLLM

Requires **vLLM ≥ 0.21** (native `Cohere2MoeForCausalLM` + compressed-tensors NVFP4 auto-detect; **no `--quantization` flag needed**) and `transformers ≥ 5.8` (where `cohere2_moe` lives). Tested on RTX PRO 2000 Blackwell (sm_120), torch cu128.

```bash
# TP=2 across two GPUs. On a box WITHOUT NVLink/P2P, the two NCCL flags +
# --disable-custom-all-reduce are MANDATORY (else it hangs at the first all-reduce).
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 \
vllm serve sakamakismile/BLS-Mini-Code-1.0-NVFP4 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --port 8000
```

**Clean output / reasoning split (optional).** The model wraps answers in `<|START_TEXT|>…<|END_TEXT|>` (and `<|START_THINKING|>…<|END_THINKING|>` when thinking). To get clean `content` + separate `reasoning_content` server-side, add `--reasoning-parser cohere_command4` (requires `pip install cohere_melody`). Otherwise strip the tags client-side. Toggle thinking per request with `chat_template_kwargs={"reasoning": true|false}`.

> **NVLink boxes:** drop `NCCL_P2P_DISABLE`/`NCCL_IB_DISABLE`/`--disable-custom-all-reduce`. They are only needed on PCIe-only / no-P2P machines.

## Performance (throughput)

Measured on 2× RTX PRO 2000 Blackwell (16 GB), TP=2, 256-token outputs. **TPS is a latency↔throughput curve**: a single request is fastest, and aggregate throughput rises under load.

| concurrent requests | per-request tok/s | aggregate tok/s |
|---:|---:|---:|
| 1 | **111** | 111 |
| 8 | 81 | 650 |
| 16 | 61 | 966 |
| 32 | 50 | 1609 |
| 64 | 33 | 2091 |
| 384 | 9 | 2606 (saturated) |

On a **no-P2P box, scale-OUT beats scale-UP**: TP=2 (2 GPU) peaks ~2606 tok/s and is the most GPU-efficient (1303 tok/s/GPU); TP=4 (4 GPU) is *slower* (2494, all-reduce-bound); data-parallel 2×TP2 (4 GPU) wins only at very high concurrency (~3343 tok/s).

## Coding benchmarks

EvalPlus-hardened **HumanEval+ (163)** and **MBPP+ (160)** test suites, single sample, code extracted from a single fenced `python` block.
*(Local harness prompt/sampling; not identical to the official EvalPlus greedy methodology, so treat as tier-indicative.)*

| mode | HumanEval+ /163 | MBPP+ /160 |
|---|---:|---:|
| **greedy (temp 0, no-think)** (best) | **145 (89.0%)** | **120 (75.0%)** |
| temp 0.4, no-think | 138 (84.7%) | 116 (72.5%) |
| reasoning ON | 131 (80.4%) | 112 (70.0%) |

**Notes from testing:** greedy no-think is this model's sweet spot (sampling noise and long reasoning both *lowered* pass@1 here; possibly a token-budget effect for reasoning). For a ~3–4B-active model this lands in the same HumanEval+ tier as much larger "Flash"-class models, at a fraction of the compute.

## Quantization recipe

`llm-compressor 0.11.0` + `compressed-tensors 0.16.0`, scheme `NVFP4`, `targets="Linear"`. Because `cohere2_moe` stores experts as **fused 3-D parameters** (`gate_up_proj`/`down_proj`) rather than `nn.Linear`, a custom MoE-calibration module was needed to unpack experts into per-expert linears before quantization (mirrors llm-compressor's gemma4 handler; verified numerically identical to the HF forward).

Calibration: `neuralmagic/calibration` (LLM split), 512 samples, seq len 1024, `calibrate_all_experts=True`. Ignore: router (`re:.*mlp\.gate$`), all norms, `embed_tokens`, `lm_head`.

## Attribution & license

- **Base model:** [`CohereLabs/BLS-Mini-Code-1.0`](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) by **Cohere Labs**. All model capabilities and credit belong to them. This repo is a quantization only.
- License: **CC-BY-NC-4.0** (non-commercial). The base model does not state an explicit license upstream; this tag follows Cohere Labs' usual research-weight convention. **Please refer to the [original repository](https://huggingface.co/CohereLabs/BLS-Mini-Code-1.0) for the authoritative terms of use**; if Cohere Labs specifies different terms, those govern the base weights.

Feedback welcome: open a discussion if you hit issues or have throughput/quality numbers to share. 🙏
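
For the client-side tag stripping mentioned under "Serving with vLLM" (when the server is started without `--reasoning-parser`), a minimal sketch might look like the following. The tag names come from this card; the helper name `split_reasoning` is hypothetical, and the handling of truncated output (a missing `<|END_TEXT|>`) is an assumption about how raw completions can look rather than documented behavior.

```python
import re

# Tag names are taken from the model card. How raw output looks when tags are
# missing or the generation is cut off is an assumption of this sketch.
THINK_RE = re.compile(r"<\|START_THINKING\|>(.*?)<\|END_THINKING\|>", re.DOTALL)
TEXT_RE = re.compile(r"<\|START_TEXT\|>(.*?)(?:<\|END_TEXT\|>|$)", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (content, reasoning) extracted from a raw completion string."""
    think = THINK_RE.search(raw)
    reasoning = think.group(1).strip() if think else ""

    text = TEXT_RE.search(raw)
    if text:
        # Answer wrapper found; tolerate a missing closing tag (truncation).
        content = text.group(1).strip()
    else:
        # No answer wrapper: drop any thinking block and keep the remainder.
        content = THINK_RE.sub("", raw).strip()
    return content, reasoning


if __name__ == "__main__":
    raw = "<|START_THINKING|>plan<|END_THINKING|><|START_TEXT|>print(1)<|END_TEXT|>"
    content, reasoning = split_reasoning(raw)
    print(repr(content), repr(reasoning))
```

In practice the `raw` string would be the message content returned by the server's OpenAI-compatible chat endpoint; with `--reasoning-parser` enabled this post-processing should not be needed.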