soroushtabesh committed
Commit 1802d56 · verified · 1 Parent(s): e1cc914

Add humming instructions

Files changed (1)
  1. README.md +9 -4
README.md CHANGED
@@ -82,10 +82,9 @@ weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
  }
  ```

- Loading this checkpoint requires a vLLM build with the
- [`humming`](https://github.com/inclusionAI/humming) MoE kernel installed (see
- the [GSQ repo](https://github.com/IST-DASLab/GSQ) `scripts/setup_env.sh` for
- the exact install line).
+ Loading this checkpoint requires vLLM plus the
+ [`humming`](https://github.com/inclusionAI/humming) MoE kernels (`pip install
+ humming-kernels`). See **Serving with vLLM** below.

  > Note: GSQ training first writes shards in `compressed-tensors`
  > `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit
@@ -95,6 +94,12 @@ the exact install line).

  ## Serving with vLLM

+ Install the Humming kernels (required for vLLM to load this checkpoint):
+
+ ```bash
+ pip install humming-kernels
+ ```
+
  Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving.

  ```bash