soroushtabesh committed on
Commit
5893d4a
·
verified ·
1 Parent(s): 762bccb

Add storage layout details

  1. README.md +47 -2
README.md CHANGED
@@ -12,7 +12,7 @@ tags:
12
  - llama
13
  - llama-3.1
14
  - vllm
15
- - compressed-tensors
16
  - arxiv:2604.18556
17
  ---
18
 
@@ -46,9 +46,54 @@ standard zero-shot suite (ARC-C/E, HellaSwag, PIQA, Winogrande):
46
  - **Bits / weight (effective):** ≈2.13 bpp
47
  - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
48
  - **Group size:** 128
49
- - **Format:** `compressed-tensors` (auto-detected by vLLM)
50
  - **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
51
 
52
  ## Serving with vLLM
53
 
54
  ```bash
 
12
  - llama
13
  - llama-3.1
14
  - vllm
15
+ - humming
16
  - arxiv:2604.18556
17
  ---
18
 
 
46
  - **Bits / weight (effective):** ≈2.13 bpp
47
  - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
48
  - **Group size:** 128
49
+ - **Format:** [Humming](https://github.com/IST-DASLab/humming) (`quant_method: "humming"`, `b_dtype: "uint2"`)
50
  - **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
51
 
52
+ ### Storage layout (why the HF UI shows I32 + BF16)
53
+
54
+ The Hugging Face "Tensor types" widget reports the **container dtype** of each
55
+ tensor stored in the safetensors files, not the effective precision of the underlying weights.
56
+ This checkpoint uses the **Humming** on-disk layout (exact-width packing — no
57
+ sub-byte values are padded into a wider container). For every quantized
58
+ `Linear` layer with original weight shape `[out_features, in_features]`, the
59
+ following tensors are stored:
60
+
61
+ | Tensor | Dtype | Shape on disk | Meaning |
62
+ |------------------------------|-------|-------------------------------------|-------------------------------------------------------------------------------|
63
+ | `<layer>.weight` | I32 | `[out_features, in_features × 2 / 32]` = `[out_features, in_features / 16]` | 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. |
64
+ | `<layer>.weight_scale` | BF16 | `[out_features, in_features / 128]` | One symmetric scale per group of `group_size = 128` weights along the input dim. |
65
+ | Attention / norms / embed / LM-head | BF16 | unchanged | Not quantized; copied from the base checkpoint. |
66
+
67
+ **Example** (`model.layers.0.mlp.gate_proj`, original `[28672, 8192]`):
68
+ `weight` = `[28672, 512]` I32 (since `8192 × 2 / 32 = 512`),
69
+ `weight_scale` = `[28672, 64]` BF16 (since `8192 / 128 = 64`).
70
+
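Reviewer sketch (not part of the commit): the shape arithmetic in the table and example above can be checked with a few lines of Python. The helper name `packed_shapes` and its parameters are hypothetical and exist only for illustration. The 32-bit word size, 2-bit width and group size of 128 are taken from the README text.

```python
def packed_shapes(out_features, in_features, bits=2, group_size=128, word_bits=32):
    """Return (weight_shape, scale_shape) for one quantized Linear layer.

    Hypothetical helper: it mirrors the README's arithmetic, not any Humming API.
    """
    if (in_features * bits) % word_bits:
        raise ValueError("in_features * bits must fill whole 32-bit words")
    if in_features % group_size:
        raise ValueError("in_features must be a multiple of group_size")
    # Each word holds word_bits // bits weights along the input dimension.
    weight_shape = (out_features, in_features * bits // word_bits)
    # One scale per group of group_size weights along the input dimension.
    scale_shape = (out_features, in_features // group_size)
    return weight_shape, scale_shape


# The README's example layer: gate_proj with original shape [28672, 8192].
weight_shape, scale_shape = packed_shapes(28672, 8192)
print(weight_shape, scale_shape)
```

For the example layer this gives a 512-column I32 weight tensor and a 64-column BF16 scale tensor, matching the figures quoted in the README.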
71
+ So although the UI says "I32 + BF16", the **effective storage** per quantized
72
+ weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
73
+ `quantization_config` block in `config.json` is:
74
+
75
+ ```json
76
+ {
77
+ "quant_method": "humming",
78
+ "b_dtype": "uint2",
79
+ "weight_scale_group_size": 128,
80
+ "weight_scale_type": "group",
81
+ "has_zero_point": false,
82
+ "ignore": ["lm_head", "embed_tokens"]
83
+ }
84
+ ```
85
+
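Reviewer sketch (not part of the commit): the effective bits-per-weight figure can be derived from the `quantization_config` block shown above. The function name `effective_bits_per_weight` is hypothetical. Reading the bit width from the trailing digits of `b_dtype` and using 16 bits per BF16 scale are assumptions made for this illustration, not documented Humming behavior.

```python
import json
import re

# The quantization_config block exactly as quoted in the README.
CONFIG = json.loads("""
{
  "quant_method": "humming",
  "b_dtype": "uint2",
  "weight_scale_group_size": 128,
  "weight_scale_type": "group",
  "has_zero_point": false,
  "ignore": ["lm_head", "embed_tokens"]
}
""")


def effective_bits_per_weight(cfg, scale_bits=16):
    """Packed weight bits plus the per-group scale cost amortized per weight."""
    # Assumption: b_dtype has the form "uint<N>", where N is the weight bit width.
    match = re.fullmatch(r"uint(\d+)", cfg["b_dtype"])
    if match is None:
        raise ValueError(f"unexpected b_dtype: {cfg['b_dtype']!r}")
    weight_bits = int(match.group(1))
    # Assumption: one BF16 (16-bit) scale is shared by each group of weights.
    return weight_bits + scale_bits / cfg["weight_scale_group_size"]


bpw = effective_bits_per_weight(CONFIG)
print(bpw)
```

With these inputs the result is `2 + 16 / 128 = 2.125`, which the README reports as ≈2.13 bpp.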
86
+ Loading this checkpoint requires a vLLM build with the
87
+ [`humming`](https://github.com/IST-DASLab/humming) MoE kernel installed (see
88
+ the [GSQ repo](https://github.com/IST-DASLab/GSQ) `scripts/setup_env.sh` for
89
+ the exact install line).
90
+
91
+ > Note: GSQ training first writes shards in `compressed-tensors`
92
+ > `pack-quantized` format (where each sub-4-bit code is padded to 4 bits before
93
+ > being packed into INT32 words). The published checkpoint here has been re-packed via
94
+ > `convert_to_humming.py` into exact-width 2-bit Humming storage, hence the
95
+ > `2 / 32` shape factor you see above.
96
+
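Reviewer sketch (not part of the commit): a pure-Python illustration of LSB-first packing that compares exact-width 2-bit storage with a 4-bit-padded container. This is not the code of `convert_to_humming.py`, and all function names are hypothetical. The mapping of stored code `c` to level `c - 2` is an assumption chosen to reproduce the codebook `{-2, -1, 0, +1}`. The 4-bit variant only illustrates the container width, not the exact bit order used by `compressed-tensors`.

```python
def pack_lsb_first(codes, bits=2, word_bits=32):
    """Pack unsigned codes into words, with the first code in the lowest bits."""
    per_word = word_bits // bits
    if len(codes) % per_word:
        raise ValueError("row length must be a multiple of the codes per word")
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for slot, code in enumerate(codes[start:start + per_word]):
            if not 0 <= code <= mask:
                raise ValueError(f"code {code} does not fit in {bits} bits")
            word |= code << (slot * bits)
        # Kept as an unsigned Python int; an I32 tensor would reinterpret the bits.
        words.append(word)
    return words


def unpack_lsb_first(words, bits=2, word_bits=32):
    """Inverse of pack_lsb_first."""
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    return [(word >> (slot * bits)) & mask for word in words for slot in range(per_word)]


def dequantize(codes, scale, offset=2):
    """Assumed mapping: stored code c stands for level (c - offset) * scale."""
    return [(code - offset) * scale for code in codes]


# One row of 8192 two-bit codes, as in the gate_proj example.
row = [0, 1, 2, 3] * 2048
exact = pack_lsb_first(row, bits=2)   # exact-width: 16 codes per word
padded = pack_lsb_first(row, bits=4)  # each code padded to 4 bits: 8 codes per word
print(len(exact), len(padded))
```

Under these assumptions the padded row needs twice as many 32-bit words as the exact-width row, which is the difference between an `in_features / 8` and an `in_features / 16` column count.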
97
  ## Serving with vLLM
98
 
99
  ```bash