	---
	license: other
	license_name: cohere-license
	license_link: https://huggingface.co/CohereLabs/command-a-plus-05-2026
	base_model: CohereLabs/command-a-plus-05-2026
	tags:
	- quantization
	- int2
	- int4
	- mixture-of-experts
	- command-a-plus
	library_name: command-a-plus-lite
	---

	# Command-A-Plus-Lite (int2 experts / int4 resident)

	Pre-quantized weights for running Cohere's Command-A-Plus (218B-parameter
	Mixture-of-Experts, 25B active) on a single 24GB GPU.

	| Component | Precision | Location |
	|---|---|---|
	| Routed experts (128/layer) | int2, group-wise (g=64) | CPU RAM, streamed per active expert |
	| Attention q/k/v/o + shared experts + embedding | int4, group-wise (g=64) | GPU-resident |
	| Router gate / layernorms | fp16 | GPU-resident |
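To make "int2, group-wise (g=64)" concrete, here is a minimal pure-Python sketch of round-to-nearest quantization over one group of 64 weights. It assumes an asymmetric min/max scheme with one scale and one offset per group; the names `quantize_group`, `dequantize_group`, and `bits_per_weight` are hypothetical, and the runtime's actual packing and kernels may differ.

```python
import random

GROUP_SIZE = 64  # g=64, as listed in the table
BITS = 2         # int2 gives 4 code values per weight


def quantize_group(values, bits=BITS):
    """Round-to-nearest over one group (assumed asymmetric min/max scheme)."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-8) / levels  # guard against a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def bits_per_weight(bits=BITS, group=GROUP_SIZE, meta_bits=32):
    """Storage cost per weight, assuming one fp16 scale and one fp16 offset per group."""
    return bits + meta_bits / group


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(GROUP_SIZE)]
codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

With only four levels per group, the per-weight error can reach half a quantization step (`scale / 2`), which is why int2 without calibration costs noticeably more quality than int4.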

	```
	weights on disk ~67 GB
	resident VRAM ~8.4 GB
	host RAM (pinned) ~61 GB (peaks ~108 GB during load)
	decode speed ~0.3 tok/s (single 24GB GPU, --pin --gemlite)
	```
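The figures above can be turned into a quick pre-flight check before downloading. This is a sketch only: `Footprint` and `check_host` are hypothetical helpers, not part of the runtime, and the check treats the ~108 GB load-time peak as a hard host-RAM requirement, which is an assumption (swap may be able to absorb part of the peak).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Approximate requirements copied from the figures in this README."""
    disk_gb: float = 67.0
    vram_gb: float = 8.4
    host_ram_steady_gb: float = 61.0
    host_ram_peak_gb: float = 108.0


def check_host(free_disk_gb, vram_gb, host_ram_gb, fp=Footprint()):
    """Return a list of human-readable problems; an empty list means the host looks adequate."""
    problems = []
    if free_disk_gb < fp.disk_gb:
        problems.append(f"disk: {free_disk_gb} GB free < {fp.disk_gb} GB of weights")
    if vram_gb < fp.vram_gb:
        problems.append(f"VRAM: {vram_gb} GB < {fp.vram_gb} GB resident")
    if host_ram_gb < fp.host_ram_peak_gb:
        # Assumption: the load-time peak must fit in physical RAM.
        problems.append(f"host RAM: {host_ram_gb} GB < {fp.host_ram_peak_gb} GB load-time peak")
    return problems


# A 24 GB GPU with 64 GB of host RAM clears the steady state but not the load peak.
issues = check_host(free_disk_gb=200, vram_gb=24, host_ram_gb=64)
```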

	Decode is transfer-bound: streaming experts from CPU to GPU dominates the
	per-token time. This release therefore targets capacity (fitting a 218B model
	on one 24GB card), not throughput.
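For planning purposes, the reported decode speed translates into wall-clock time as sketched below. `generation_seconds` uses the ~0.3 tok/s figure from this README; `transfer_bound_tokens_per_second` is a generic upper-bound formula whose inputs (bytes streamed per token, sustained link bandwidth) are not stated here and would have to be measured on your own system.

```python
def generation_seconds(new_tokens, tokens_per_second=0.3):
    """Wall-clock decode time at the reported single-GPU speed."""
    return new_tokens / tokens_per_second


def transfer_bound_tokens_per_second(bytes_streamed_per_token, link_bytes_per_second):
    """Upper bound on decode rate when CPU-to-GPU transfer is the only cost.

    Ignores per-transfer latency and compute, so real throughput is lower.
    """
    return link_bytes_per_second / bytes_streamed_per_token


# A 256-token reply at ~0.3 tok/s takes roughly 14 minutes.
reply_minutes = generation_seconds(256) / 60
```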

	## Usage

	Install the runtime: <https://github.com/kizuna-intelligence/Command-A-Plus-Lite>

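	```bash
	# The editable install needs a local checkout of the runtime repository.
	git clone https://github.com/kizuna-intelligence/Command-A-Plus-Lite
	cd Command-A-Plus-Lite
	pip install -e ".[gemlite]"
	hf download kizuna-intelligence/Command-A-Plus-Lite --local-dir ./cmda_int4
	```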

	```python
	import torch
	from command_a_plus_lite import load_quantized

	model = load_quantized(
	    "./cmda_int4",
	    device="cuda:0",
	    dtype=torch.float16,
	    pin_experts=True,
	    use_gemlite=True,
	)
	```

	The tokenizer is not included here — use the one from the base model
	[`CohereLabs/command-a-plus-05-2026`](https://huggingface.co/CohereLabs/command-a-plus-05-2026).
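One way to fetch the base model's tokenizer is through Hugging Face `transformers`, sketched below. This assumes the base repository ships a `transformers`-compatible tokenizer and that your account has access to it if the repository is gated; `load_base_tokenizer` is a hypothetical helper, not part of the runtime.

```python
BASE_MODEL_ID = "CohereLabs/command-a-plus-05-2026"


def load_base_tokenizer(model_id: str = BASE_MODEL_ID):
    """Load the tokenizer from the base model repository (downloads on first use)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


# Usage (requires network access, and possibly `hf auth login` for gated repos):
#   tokenizer = load_base_tokenizer()
#   input_ids = tokenizer("Hello", return_tensors="pt")["input_ids"]
```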

	## License

	The model weights are governed by Cohere's license for Command-A-Plus.
	The runtime code is MIT (see the GitHub repository).

	The int2 routed experts use plain round-to-nearest (RTN) quantization with no
	calibration data, so output quality is below the bf16 original.