---
license: other
license_name: cohere-license
license_link: https://huggingface.co/CohereLabs/command-a-plus-05-2026
base_model: CohereLabs/command-a-plus-05-2026
tags:
- quantization
- int2
- int4
- mixture-of-experts
- command-a-plus
library_name: command-a-plus-lite
---

# Command-A-Plus-Lite (int2 experts / int4 resident)

Pre-quantized weights for running Cohere's **Command-A-Plus** (218B-parameter
Mixture-of-Experts, 25B active) on a **single 24 GB GPU**.

| Component | Precision | Where |
|---|---|---|
| Routed experts (128/layer) | **int2**, group-wise (g=64) | CPU RAM, streamed per active expert |
| Attention q/k/v/o + shared experts + embedding | **int4**, group-wise (g=64) | GPU-resident |
| Router gate / layernorms | fp16 | GPU-resident |

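
For readers unfamiliar with the scheme named in the table above, here is a minimal NumPy sketch of group-wise round-to-nearest (RTN) quantization with a group size of 64. It is an illustration only: the asymmetric min/max scaling, the function names `quantize_rtn` and `dequantize_rtn`, and the unpacked `uint8` storage are assumptions made for this example, not the runtime's actual kernels or on-disk format.

```python
import numpy as np


def quantize_rtn(w, bits=2, group_size=64):
    """Asymmetric group-wise RTN: one scale and one offset per group."""
    groups = w.astype(np.float32).reshape(-1, group_size)
    qmax = 2**bits - 1
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Guard against constant groups, where hi == lo.
    scale = np.maximum(hi - lo, 1e-8) / qmax
    # Round each weight to the nearest of the 2**bits levels in its group.
    q = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo


def dequantize_rtn(q, scale, lo, shape):
    """Map integer codes back to approximate float weights."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)


# Random stand-in weights; the element count must be divisible by 64.
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)
for bits in (2, 4):
    q, scale, lo = quantize_rtn(w, bits=bits)
    err = np.abs(dequantize_rtn(q, scale, lo, w.shape) - w).mean()
    print(f"int{bits}: mean abs error {err:.4f}")
```

With only four levels per group, the int2 reconstruction error on the same weights is larger than the int4 error, which is why the lower precision is reserved for the routed experts that have to fit in host RAM.
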
```
weights on disk     ~67 GB
resident VRAM       ~8.4 GB
host RAM (pinned)   ~61 GB (peaks ~108 GB during load)
decode speed        ~0.3 tok/s (single 24 GB GPU, --pin --gemlite)
```

Decode is **transfer-bound**: CPU→GPU expert streaming dominates the time per
token. This is therefore a capacity play (fitting a 218B model on one 24 GB
card), not a throughput one.

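
To make the transfer-bound claim concrete, a back-of-envelope bound divides the host-to-device bandwidth by the bytes copied per generated token. The sketch below is illustrative only: the function name `max_tokens_per_second` and both input values are hypothetical placeholders, not measurements of this model.

```python
def max_tokens_per_second(bytes_streamed_per_token: float,
                          host_to_device_bytes_per_s: float) -> float:
    """Upper bound on decode rate when weight transfer is the bottleneck."""
    if bytes_streamed_per_token <= 0:
        raise ValueError("bytes_streamed_per_token must be positive")
    return host_to_device_bytes_per_s / bytes_streamed_per_token


# Hypothetical inputs: 6 GB of expert weights copied per token over a
# link that sustains 12 GB/s. Substitute values measured on your system.
bound = max_tokens_per_second(6e9, 12e9)
print(f"upper bound: {bound:.1f} tok/s")  # upper bound: 2.0 tok/s
```

Real throughput sits below such a bound, since many small per-expert copies add latency and the GPU still has to run the matrix multiplications.
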
## Usage

Install the runtime from <https://github.com/kizuna-intelligence/Command-A-Plus-Lite>,
then download the weights:

```bash
git clone https://github.com/kizuna-intelligence/Command-A-Plus-Lite
cd Command-A-Plus-Lite
pip install -e ".[gemlite]"
hf download kizuna-intelligence/Command-A-Plus-Lite --local-dir ./cmda_int4
```

```python
import torch
from command_a_plus_lite import load_quantized

# The path is the --local-dir used in the download step.
model = load_quantized(
    "./cmda_int4",
    device="cuda:0",
    dtype=torch.float16,
    pin_experts=True,
    use_gemlite=True,
)
```

The tokenizer is **not** included here; use the one from the base model,
[`CohereLabs/command-a-plus-05-2026`](https://huggingface.co/CohereLabs/command-a-plus-05-2026).

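
A minimal sketch of fetching the base model's tokenizer with Hugging Face `transformers`, assuming that package is installed and that the base repository is accessible from your account. The helper name `load_base_tokenizer` is hypothetical, and how the resulting token IDs are passed to the quantized model depends on the runtime and is not shown here.

```python
BASE_MODEL_ID = "CohereLabs/command-a-plus-05-2026"


def load_base_tokenizer(repo_id: str = BASE_MODEL_ID):
    """Download (or read from the local cache) the base model's tokenizer."""
    # Imported lazily so the helper can be defined without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)
```

Calling `load_base_tokenizer()` needs network access the first time; later calls read the files from the local Hugging Face cache.
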
## License

The model weights are governed by **Cohere's license** for Command-A-Plus.
The runtime code is MIT-licensed (see the GitHub repository).

The int2 routed experts are quantized with blind round-to-nearest (RTN), with
no calibration data, so quality is below the bf16 original.