soroushtabesh committed on
Commit
b458ae6
·
verified ·
1 Parent(s): 4ba99ec

Add humming instructions

Files changed (1)
  1. README.md +9 -4
README.md CHANGED
@@ -71,10 +71,9 @@ weight is `3 bits (packed) + 16 bits / 128 (group scale) ≈ 3.13 bpp`. The
71
  }
72
  ```
73
 
74
- Loading this checkpoint requires a vLLM build with the
75
- [`humming`](https://github.com/inclusionAI/humming) kernel installed (see the
76
- [GSQ repo](https://github.com/IST-DASLab/GSQ) `scripts/setup_env.sh` for the
77
- exact install line).
78
 
79
  > Note: GSQ training first writes shards in `compressed-tensors`
80
  > `pack-quantized` format (where a 3-bit codebook is padded into a 4-bit
@@ -85,6 +84,12 @@ exact install line).
85
 
86
  ## Serving with vLLM
87
 
 
88
  ```bash
89
  vllm serve ISTA-DASLab/Llama-3.1-70B-Instruct-3Bit-GSQ \
90
  --tensor-parallel-size 2
 
71
  }
72
  ```
73
 
74
+ Loading this checkpoint requires vLLM plus the
75
+ [`humming`](https://github.com/inclusionAI/humming) kernels (`pip install
76
+ humming-kernels`). See **Serving with vLLM** below.
 
77
 
78
  > Note: GSQ training first writes shards in `compressed-tensors`
79
  > `pack-quantized` format (where a 3-bit codebook is padded into a 4-bit
 
84
 
85
  ## Serving with vLLM
86
 
87
+ Install the Humming kernels (required for vLLM to load this checkpoint):
88
+
89
+ ```bash
90
+ pip install humming-kernels
91
+ ```
92
+
93
  ```bash
94
  vllm serve ISTA-DASLab/Llama-3.1-70B-Instruct-3Bit-GSQ \
95
  --tensor-parallel-size 2
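
The first hunk header quotes the README's storage estimate, `3 bits (packed) + 16 bits / 128 (group scale) ≈ 3.13 bpp`. As a quick sanity check of that arithmetic, here is a minimal sketch. It assumes one 16-bit scale shared by each group of 128 weights, as the quoted formula suggests. The function and parameter names are illustrative and are not taken from the GSQ code.

```python
def bits_per_parameter(weight_bits: int, scale_bits: int, group_size: int) -> float:
    """Average storage per weight: the packed code plus the amortized group scale."""
    # Each weight pays its own packed bits, plus an equal share of the
    # single scale stored for its group.
    return weight_bits + scale_bits / group_size


# Values taken from the formula quoted in the hunk header.
bpp = bits_per_parameter(weight_bits=3, scale_bits=16, group_size=128)
print(f"{bpp:.3f} bpp")  # 3.125 bpp, which the README rounds to 3.13
```

The per-group scale adds only 0.125 bits per weight at this group size, so nearly all of the storage cost comes from the 3-bit codes themselves.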