soroushtabesh committed on
Commit 74b3ecc · verified · 1 parent: 7aaaf63

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -16,7 +16,9 @@ tags:
 
 # Kimi-K2.5 — 2-bit GSQ Quantization
 
-This is a 2-bit quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+This is a **simulated 2-bit** quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using **GSQ** (Gumbel Softmax Quantization), a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.
+
+> **Note — Simulated quantization:** GSQ optimizes quantized weight values at 2-bit precision during training, but the resulting weights are serialized into a 4-bit packed integer format (`int32` with 8 values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values themselves only use 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit — there is no memory or storage saving beyond INT4 in this checkpoint.
 
 ## Model Details
 
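The storage arithmetic in the "Simulated quantization" note can be illustrated with a small sketch. This example is an illustration only and makes several assumptions. It uses the signed 2-bit grid {-2, -1, 0, 1}. It uses a 4-bit container that stores signed values in [-8, 7] as unsigned nibbles with an offset of 8. Element i of each group of 8 sits in bits 4i to 4i+3. None of these choices is taken from GSQ or compressed-tensors, whose real level grid and bit order may differ. The helper names `pack_int4`, `unpack_int4` and `dequantize` are hypothetical, and so is the single scale factor.

```python
# Illustrative sketch only. Assumptions (not taken from GSQ or compressed-tensors):
#   * the 2-bit grid is the signed integers {-2, -1, 0, 1};
#   * the 4-bit container stores signed values in [-8, 7] as unsigned nibbles (offset +8);
#   * element i of each group of 8 occupies bits 4*i .. 4*i+3 (lowest nibble first).
# compressed-tensors' real bit layout may differ.

CONTAINER_BITS = 4
PACK_FACTOR = 32 // CONTAINER_BITS   # 8 values per 32-bit word
OFFSET = 1 << (CONTAINER_BITS - 1)   # 8: maps [-8, 7] onto [0, 15]
LEVELS_2BIT = (-2, -1, 0, 1)         # hypothetical 2-bit integer grid


def pack_int4(values):
    """Pack signed integers in [-8, 7] into 32-bit words, 8 values per word."""
    if len(values) % PACK_FACTOR:
        raise ValueError("number of values must be a multiple of 8")
    words = []
    for start in range(0, len(values), PACK_FACTOR):
        word = 0
        for i, v in enumerate(values[start:start + PACK_FACTOR]):
            if not -OFFSET <= v < OFFSET:
                raise ValueError(f"{v} does not fit in a 4-bit container")
            word |= (v + OFFSET) << (CONTAINER_BITS * i)
        # Kept as a non-negative Python int; an int32 tensor would hold the same 32 bits.
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the signed integer levels."""
    values = []
    for word in words:
        for i in range(PACK_FACTOR):
            nibble = (word >> (CONTAINER_BITS * i)) & 0xF
            values.append(nibble - OFFSET)
    return values


def dequantize(levels, scale):
    """Map integer levels back to real weights with a single (hypothetical) scale."""
    return [level * scale for level in levels]


if __name__ == "__main__":
    # 16 weights that only ever take 2-bit levels.
    q = [-2, -1, 0, 1, 1, 0, -1, -2, 0, 0, 1, 1, -2, -2, -1, -1]
    packed = pack_int4(q)
    restored = unpack_int4(packed)
    assert restored == q
    print([hex(w) for w in packed])                  # ['0x67899876', '0x77669988']
    print(sorted(set(restored)))                     # [-2, -1, 0, 1]: 4 distinct levels
    print(32 * len(packed) / len(q), "bits/weight")  # 4.0 bits/weight
    print(dequantize(restored[:4], 0.5))             # [-1.0, -0.5, 0.0, 0.5]
```

The 16 weights take only four distinct values, but packing them still uses two 32-bit words, which is 4 bits per weight. A true 2-bit packing would fit the same 16 values into one word. This is the gap the note describes.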