Image-Text-to-Text
Transformers
Safetensors
gsq
gumbel-softmax
quantization
ptq
Mixture of Experts
kimi
vllm
humming
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -21,9 +21,9 @@ tags:
 # Kimi-K2.5 — 2-bit GSQ
 
 2-bit quantization of [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
-(MoE, 384 experts,
-(Gumbel-Softmax Quantization). The model is compressed from
-**
+(MoE, 384 experts, ≈260 GB FP) produced with **GSQ**
+(Gumbel-Softmax Quantization). The model is compressed from ≈4.5 bpp down to
+**≈2.13 bpp** while preserving most of the base model's reasoning, coding,
 and long-context behaviour — and slightly *exceeds* the FP base on MATH 500
 and LiveCodeBench v6 under our evaluation pipeline.
 
@@ -35,7 +35,7 @@ and LiveCodeBench v6 under our evaluation pipeline.
 ## Quantization details
 
 - **Base model:** [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
-- **Bits / weight (effective):**
+- **Bits / weight (effective):** ≈2.13 bpp
 - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
 - **Group size:** 128
 - **Format:** `compressed-tensors` (auto-detected by vLLM)
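The quantization details in the README can be illustrated with a small sketch. It shows how a 2-bit code plus one per-group scale could add up to the stated effective bit width, and how a group of weights maps onto the `{-2, -1, 0, +1} × scale` codebook. Two things here are assumptions rather than facts from the README: the 16-bit scale per group (chosen because 2 + 16/128 = 2.125 is consistent with the reported ≈2.13 bpp), and the round-to-nearest rule with a max-magnitude scale, which is a plain baseline and not the GSQ procedure. The function names are hypothetical.

```python
from typing import List, Tuple

LEVELS = (-2, -1, 0, 1)  # codebook levels listed in the README
GROUP_SIZE = 128         # group size listed in the README


def effective_bpp(code_bits: int = 2, scale_bits: int = 16,
                  group_size: int = GROUP_SIZE) -> float:
    """Bits per weight when every group stores one scale.

    scale_bits=16 is an assumption; the README only states the total.
    """
    return code_bits + scale_bits / group_size


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Round-to-nearest onto LEVELS * scale (baseline, not GSQ).

    Assumed scale rule: the largest magnitude in the group maps to level 2.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 2.0 if max_abs > 0 else 1.0
    # Round each weight to the nearest integer level, then clip to the codebook.
    codes = [min(max(round(w / scale), LEVELS[0]), LEVELS[-1]) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from codes and the group scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    print(effective_bpp())  # 2 code bits + 16/128 scale bits per weight
    # A tiny 4-element group for readability; real groups hold 128 weights.
    codes, scale = quantize_group([0.9, -1.0, 0.1, 0.4])
    print(codes, scale, dequantize_group(codes, scale))
```

Because the codebook is asymmetric around zero in level count, the largest positive weight in a group is clipped to `+1 × scale` under this baseline, while the largest negative weight is represented exactly. A learned method may choose scales and codes differently.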