soroushtabesh committed on
Commit 95b9b99 · verified · 1 Parent(s): 81de335

Add humming instructions

Files changed (1)
  1. README.md +9 -4
README.md CHANGED
@@ -84,10 +84,9 @@ weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
  }
  ```
 
- Loading this checkpoint requires a vLLM build with the
- [`humming`](https://github.com/inclusionAI/humming) MoE kernel installed (see
- the [GSQ repo](https://github.com/IST-DASLab/GSQ) `scripts/setup_env.sh` for
- the exact install line).
 
  > Note: GSQ training first writes shards in `compressed-tensors`
  > `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit
@@ -97,6 +96,12 @@ the exact install line).
 
  ## Serving with vLLM
 
  Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
  valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).
 
 
  }
  ```
 
+ Loading this checkpoint requires vLLM plus the
+ [`humming`](https://github.com/inclusionAI/humming) MoE kernels (`pip install
+ humming-kernels`). See **Serving with vLLM** below.
 
  > Note: GSQ training first writes shards in `compressed-tensors`
  > `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit
 
 
  ## Serving with vLLM
 
+ Install the Humming kernels (required for vLLM to load this checkpoint):
+
+ ```bash
+ pip install humming-kernels
+ ```
+
  Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
  valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).
 
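
The first hunk header quotes the README's storage estimate, `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. As a quick sanity check of that arithmetic, here is a minimal sketch. The function name `bits_per_weight` and its parameters are illustrative only and are not part of GSQ, vLLM, or humming.

```python
def bits_per_weight(code_bits: int = 2, scale_bits: int = 16, group_size: int = 128) -> float:
    """Average storage per weight: packed code bits plus one scale shared by a group.

    Defaults mirror the figures quoted in the README hunk header
    (2-bit codes, one 16-bit scale per group of 128 weights).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Each weight stores its own code; the group scale is amortized over the group.
    return code_bits + scale_bits / group_size


if __name__ == "__main__":
    # Three decimals are shown so the exact value is visible before rounding.
    print(f"{bits_per_weight():.3f} bits per weight")
```

With the defaults this evaluates to exactly 2.125, which the README rounds to 2.13.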