Upload README.md with huggingface_hub
README.md
CHANGED
@@ -16,7 +16,9 @@ tags:
 
 # Kimi-K2.5 — 2-bit GSQ Quantization
 
-This is a 2-bit quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+
+> **Note — Simulated quantization:** GSQ optimizes quantized weight values at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves only use 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
 
 ## Model Details
 
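The storage arithmetic in the added note can be illustrated with a small sketch. It packs weight codes that take only 4 distinct levels into 4-bit slots, 8 per `int32` word, and compares the footprint with a hypothetical true 2-bit packing. The helper names `pack_4bit` and `unpack_4bit` and the low-nibble-first ordering are assumptions made for illustration; they are not the compressed-tensors API and may not match its actual bit layout.

```python
import numpy as np

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes into int32 words, 8 codes per word.

    Assumption: code i of each group goes into bits [4*i, 4*i + 4).
    """
    assert codes.ndim == 1 and codes.size % 8 == 0
    assert codes.min() >= 0 and codes.max() < 16
    nibbles = codes.astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    words = np.bitwise_or.reduce(nibbles << shifts, axis=1)
    return words.view(np.int32)  # reinterpret the bits as int32

def unpack_4bit(words: np.ndarray) -> np.ndarray:
    """Inverse of pack_4bit: recover the 4-bit codes as uint8."""
    shifts = np.arange(8, dtype=np.uint32) * 4
    raw = words.view(np.uint32)
    return ((raw[:, None] >> shifts) & 0xF).reshape(-1).astype(np.uint8)

# 64 weight codes that use only 4 levels, as a true 2-bit quantizer would.
rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=64, dtype=np.uint8)

packed = pack_4bit(codes)
restored = unpack_4bit(packed)

bytes_4bit_container = packed.nbytes    # 64 codes * 4 bits / 8
bytes_true_2bit = codes.size * 2 // 8   # what dense 2-bit packing would need

print(bytes_4bit_container, bytes_true_2bit)
```

Under these assumptions the 4-bit container needs twice the bytes of a dense 2-bit layout for the same 4-level codes, which is the gap the note describes.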