daslab-testing/Kimi-K2.5-2bit-GSQ

compressed-tensors

Mixture of Experts

soroushtabesh committed on Mar 24

Commit 74b3ecc · verified · 1 Parent(s): 7aaaf63

Upload README.md with huggingface_hub

Files changed (1)

README.md +3 -1

@@ -16,7 +16,9 @@ tags:
 # Kimi-K2.5 — 2-bit GSQ Quantization
-This is a 2-bit quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+> **Note — Simulated quantization:** GSQ optimizes quantized weight values at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves only use 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
 ## Model Details
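
The storage claim in the added note can be checked with a small sketch. This is an illustration only, not the compressed-tensors packing code: the helper names `pack_int4` and `unpack_int4`, the nibble order (lowest nibble first), and the unsigned code range are all assumptions made for the example. It shows that codes restricted to 4 levels still cost 4 bits each once they are packed 8 per 32-bit word.

```python
import numpy as np

# Hypothetical helpers for illustration; the real compressed-tensors
# layout (nibble order, signedness, zero points) may differ.

def pack_int4(codes: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit codes (0..15) into 32-bit words, 8 codes per word."""
    codes = np.asarray(codes, dtype=np.uint32)
    if codes.size % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    if codes.size and codes.max() > 0xF:
        raise ValueError("codes must fit in 4 bits")
    shifts = np.arange(8, dtype=np.uint32) * 4      # 0, 4, ..., 28
    nibbles = codes.reshape(-1, 8) << shifts        # move each code to its slot
    return np.bitwise_or.reduce(nibbles, axis=1)    # merge 8 slots into one word


def unpack_int4(words: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the 4-bit codes from packed words."""
    words = np.asarray(words, dtype=np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((words[:, None] >> shifts) & 0xF).reshape(-1)


# Simulated 2-bit weights: only 4 distinct code values are ever used.
rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=64, dtype=np.uint32)

packed = pack_int4(codes)
restored = unpack_int4(packed)

num_levels = np.unique(restored).size               # at most 4
bits_per_weight = packed.nbytes * 8 / codes.size    # container cost per weight
true_2bit_words = codes.size // 16                  # a 2-bit container fits 16 per word

print(num_levels, bits_per_weight, packed.size, true_2bit_words)
```

Under these assumptions the 64 codes occupy 8 words in the 4-bit container, whereas a true 2-bit container would need 4, which is the factor of two the note refers to.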