ISTA-DASLab/Kimi-K2.5-2Bit-GSQ

Image-Text-to-Text

Mixture of Experts


soroushtabesh committed 5 days ago

Commit 95b9b99 · verified · 1 Parent(s): 81de335

Add humming instructions

Files changed (1)

README.md +9 -4

@@ -84,10 +84,9 @@ weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
 }
 ```
-Loading this checkpoint requires a vLLM build with the
-[`humming`](https://github.com/inclusionAI/humming) MoE kernel installed (see
-the [GSQ repo](https://github.com/IST-DASLab/GSQ) `scripts/setup_env.sh` for
-the exact install line).
 > Note: GSQ training first writes shards in `compressed-tensors`
 > `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit
@@ -97,6 +96,12 @@ the exact install line).
 ## Serving with vLLM
 Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
 valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).

 }
 ```
+Loading this checkpoint requires vLLM plus the
+[`humming`](https://github.com/inclusionAI/humming) MoE kernels (`pip install
+humming-kernels`). See **Serving with vLLM** below.
 > Note: GSQ training first writes shards in `compressed-tensors`
 > `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit
 ## Serving with vLLM
+Install the Humming kernels (required for vLLM to load this checkpoint):
+```bash
+pip install humming-kernels
+```
 Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
 valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).
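
The TP constraint stated in the README can be checked before launching a server. The following is a minimal sketch, not part of the commit: `build_serve_cmd` is a hypothetical helper, and the model ID is assumed from this repository's name. It only prints a `vllm serve` command using vLLM's `--tensor-parallel-size` option; any further flags this checkpoint may need are not covered here.

```bash
# Assumption: the model ID matches this repository's name on the Hub.
MODEL_ID="ISTA-DASLab/Kimi-K2.5-2Bit-GSQ"

# Hypothetical helper: reject TP sizes the README does not list as valid,
# otherwise print the corresponding vLLM launch command.
build_serve_cmd() {
  local tp="$1"
  case "$tp" in
    1|2|4|8) ;;  # valid TP sizes per the README
    *)
      echo "unsupported TP size: $tp (expected 1, 2, 4, or 8)" >&2
      return 1
      ;;
  esac
  echo "vllm serve ${MODEL_ID} --tensor-parallel-size ${tp}"
}

# Print the command for an 8-GPU node; run it manually once humming-kernels is installed.
build_serve_cmd 8
```

Printing the command rather than executing it keeps the check usable on machines without GPUs or vLLM installed.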