soroushtabesh committed on
Commit
405a1c8
·
verified ·
1 Parent(s): 4cb52aa

Add storage layout details

Files changed (1)
  1. README.md +56 -3
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
14
  - moe
15
  - kimi
16
  - vllm
17
- - compressed-tensors
18
  - arxiv:2604.18556
19
  ---
20
 
@@ -36,9 +36,62 @@ group-wise scalar format that drops into existing INT inference kernels.
36
  - **Bits / weight (effective):** ≈2.13 bpp
37
  - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
38
  - **Group size:** 128
39
- - **Format:** `compressed-tensors` (auto-detected by vLLM)
40
  - **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
41
- - **Attention projections:** kept in FP (only experts / MLPs quantized)
 
42
 
43
  ## Serving with vLLM
44
 
 
14
  - moe
15
  - kimi
16
  - vllm
17
+ - humming
18
  - arxiv:2604.18556
19
  ---
20
 
 
36
  - **Bits / weight (effective):** ≈2.13 bpp
37
  - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
38
  - **Group size:** 128
39
+ - **Format:** [Humming](https://github.com/IST-DASLab/humming) (`quant_method: "humming"`, `b_dtype: "uint2"`)
40
  - **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
41
+ - **What's quantized:** routed-expert MLPs from layer 1 onward (`gate_proj`, `up_proj`, `down_proj`). Attention (`self_attn`), layernorms, embeddings, LM head, vision tower, MM projector, MoE routing `gate`, shared experts, and the first dense MLP layer (`layers.0.mlp.*`) are kept in BF16.
42
+
43
+ ### Storage layout (why the HF UI shows I32 + BF16)
44
+
45
+ The Hugging Face "Tensor types" widget reports the **container dtype** of each
46
+ tensor stored in the safetensors files, not the effective precision of the underlying weights.
47
+ This checkpoint uses the **Humming** on-disk layout (exact-width packing: no
48
+ sub-byte values are padded into a wider container). For every quantized
49
+ expert-MLP `Linear` with original weight shape `[out_features, in_features]`,
50
+ the following tensors are stored:
51
+
52
+ | Tensor | Dtype | Shape on disk | Meaning |
53
+ |-----------------------------------------|-------|-------------------------------------|-------------------------------------------------------------------------------|
54
+ | `<layer>.weight` | I32 | `[out_features, in_features × 2 / 32]` = `[out_features, in_features / 16]` | 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. |
55
+ | `<layer>.weight_scale` | BF16 | `[out_features, in_features / 128]` | One symmetric scale per group of `group_size = 128` weights along the input dim. |
56
+ | Attention / norms / embed / LM-head / vision / MM-projector / MoE `gate` / shared experts / `layers.0.mlp.*` | BF16 | unchanged | Not quantized; copied from the base checkpoint. |
57
+
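The packing described in the table can be sketched in NumPy. This is only an illustration of "2-bit values, LSB-first, 16 per INT32 word" plus one scale per group of 128; it is not the Humming kernel's code. The helper names are hypothetical, `float32` stands in for BF16 (NumPy has no BF16 dtype), and the mapping from stored `uint2` code `c` to level `c - 2` is an assumption chosen to reproduce the `{-2, -1, 0, +1}` codebook. Any additional kernel-specific interleaving is not modelled here.

```python
import numpy as np

BITS = 2
PER_WORD = 32 // BITS   # 16 two-bit codes per INT32 word
GROUP_SIZE = 128
CODE_OFFSET = 2         # ASSUMPTION: stored code c in {0,1,2,3} means level c - 2


def pack_uint2(codes):
    """Pack codes [out, in] (values 0..3) into int32 words [out, in / 16], LSB-first."""
    out_f, in_f = codes.shape
    assert in_f % PER_WORD == 0
    c = codes.astype(np.uint32).reshape(out_f, in_f // PER_WORD, PER_WORD)
    shifts = np.arange(PER_WORD, dtype=np.uint32) * BITS
    # Weight k within a word occupies bits [2k, 2k + 1].
    words = np.bitwise_or.reduce(c << shifts, axis=-1)
    return words.view(np.int32)


def unpack_uint2(words):
    """Inverse of pack_uint2: int32 words [out, n] -> codes [out, n * 16]."""
    w = np.ascontiguousarray(words).view(np.uint32)
    shifts = np.arange(PER_WORD, dtype=np.uint32) * BITS
    codes = (w[..., None] >> shifts) & np.uint32(3)
    return codes.reshape(w.shape[0], -1).astype(np.uint8)


def dequantize(words, scales):
    """Rebuild float weights [out, in] from packed words and per-group scales."""
    levels = unpack_uint2(words).astype(np.float32) - CODE_OFFSET
    # Each scale covers GROUP_SIZE consecutive weights along the input dim.
    return levels * np.repeat(scales.astype(np.float32), GROUP_SIZE, axis=1)
```

For a layer with `in_features = 256`, `pack_uint2` yields 16 words per output row and `weight_scale` has 2 columns, matching the `in_features / 16` and `in_features / 128` shapes in the table.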
58
+ So although the UI says "I32 + BF16", the **effective storage** per quantized
59
+ weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
60
+ `quantization_config` block in `config.json` is:
61
+
62
+ ```json
63
+ {
64
+ "quant_method": "humming",
65
+ "b_dtype": "uint2",
66
+ "weight_scale_group_size": 128,
67
+ "weight_scale_type": "group",
68
+ "has_zero_point": false,
69
+ "ignore": [
70
+ "lm_head",
71
+ "re:.*embed_tokens.*",
72
+ "re:.*self_attn.*",
73
+ "re:.*input_layernorm.*",
74
+ "re:.*post_attention_layernorm.*",
75
+ "re:.*\\.norm$",
76
+ "re:.*vision_tower.*",
77
+ "re:.*mm_projector.*",
78
+ "re:.*mlp\\.gate$",
79
+ "re:.*shared_expert.*",
80
+ "re:.*layers\\.(0)\\.mlp\\.(gate_proj|up_proj|down_proj|gate_up_proj).*"
81
+ ]
82
+ }
83
+ ```
84
+
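To see which modules the `ignore` list above leaves in BF16, the entries can be checked against module names with a small helper. This is a sketch under stated assumptions: entries prefixed with `re:` are treated as regular expressions applied with `re.match`, other entries are compared for exact equality, and the module names in the usage note are hypothetical. The actual loader may resolve these entries differently.

```python
import re

# The "ignore" entries from the quantization_config above, as Python raw strings.
IGNORE = [
    "lm_head",
    r"re:.*embed_tokens.*",
    r"re:.*self_attn.*",
    r"re:.*input_layernorm.*",
    r"re:.*post_attention_layernorm.*",
    r"re:.*\.norm$",
    r"re:.*vision_tower.*",
    r"re:.*mm_projector.*",
    r"re:.*mlp\.gate$",
    r"re:.*shared_expert.*",
    r"re:.*layers\.(0)\.mlp\.(gate_proj|up_proj|down_proj|gate_up_proj).*",
]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name would be left unquantized.

    ASSUMPTION: "re:" marks a regex matched from the start of the name;
    any other entry must equal the module name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False
```

Under these assumptions, a hypothetical routed-expert name such as `model.layers.3.mlp.experts.7.up_proj` is not ignored (so it is quantized), while `model.layers.0.mlp.gate_proj`, `model.layers.3.mlp.gate`, and `model.layers.3.self_attn.q_proj` are.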
85
+ Loading this checkpoint requires a vLLM build with the
86
+ [`humming`](https://github.com/IST-DASLab/humming) MoE kernel installed (see
87
+ the [GSQ repo](https://github.com/IST-DASLab/GSQ) `scripts/setup_env.sh` for
88
+ the exact install line).
89
+
90
+ > Note: GSQ training first writes shards in `compressed-tensors`
91
+ > `pack-quantized` format (where each 2-bit code is padded to a 4-bit slot inside an
92
+ > INT32 container). The published checkpoint here has been re-packed via
93
+ > `convert_to_humming.py` into exact-width 2-bit Humming storage, hence the
94
+ > `2 / 32` shape factor on `weight`.
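The effect of the re-packing on tensor shapes and on effective bits per weight can be worked out with a few lines of arithmetic. This sketch assumes 4-bit slots for the padded format, as the note describes, and uses a hypothetical `in_features = 7168` purely for illustration; the function names are not from the GSQ or Humming code.

```python
def words_per_row(in_features, slot_bits, word_bits=32):
    """Number of INT32 words needed to store one weight row."""
    assert (in_features * slot_bits) % word_bits == 0
    return in_features * slot_bits // word_bits


def effective_bits_per_weight(slot_bits, scale_bits=16, group_size=128):
    """Bits stored per weight: the packed slot plus its share of the group scale."""
    return slot_bits + scale_bits / group_size


in_features = 7168  # hypothetical layer width, for illustration only

padded = words_per_row(in_features, slot_bits=4)   # 2-bit codes padded to 4-bit slots
exact = words_per_row(in_features, slot_bits=2)    # exact-width 2-bit storage

print(padded, exact)                      # 896 448
print(effective_bits_per_weight(4))       # 4.125
print(effective_bits_per_weight(2))       # 2.125
```

The exact-width layout halves the packed `weight` tensor relative to 4-bit slots, and `2.125` is the figure the model card rounds to ≈2.13 bpp.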
95
 
96
  ## Serving with vLLM
97