kread committed on
Commit
45bd49a
·
verified ·
1 Parent(s): 4ea166c

Initial: research-gated MQ3-Lloyd + MQ4-Lloyd disclaimer

Files changed (1)
  1. README.md +130 -0
README.md ADDED
@@ -0,0 +1,130 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - hipfire
+ - quantization
+ - lloyd-max
+ - research
+ - qwen3.6
+ gated: auto
+ extra_gated_prompt: |
+   These are research-grade quantizations of Qwen3.6 weights using
+   Lloyd-Max codebook quantization, distributed for compute-kernel
+   research and reproducibility of the hipfire WMMA prefill /
+   decode kernel-perf work.
+
+   By accessing these files you acknowledge:
+   - This is NOT a production quant release. The Lloyd format
+     family is research-stage; quality envelope and arch coverage
+     differ from the canonical hipfire quants. MQ4-Lloyd in
+     particular is earlier-stage than MQ3-Lloyd; see "What's in
+     this repo".
+   - The formats are HFQ4-stride-incompatible; misuse via mixed
+     stride dispatch causes silent corruption. Use the published
+     hipfire daemon with the matching `--allow-mq{3,4}-lloyd` flag
+     so dispatch routes through the Lloyd-specific arms.
+   - Canonical non-Lloyd variants for this model size live at
+     `schuttdev/hipfire-qwen3.6-27b` and are the recommended
+     starting point for typical inference.
+ extra_gated_fields:
+   I have read the research disclaimer: checkbox
+ ---
+
+ # Qwen3.6-27B: hipfire research quants (dev)
+
+ > **Research preview.** Lloyd-Max codebook quantization is an
+ > experimental format under active development. Quality envelope and
+ > arch coverage differ from the canonical hipfire quants; see
+ > "What's in this repo" below before downloading. The `-dev` suffix in
+ > the repo name distinguishes these dev-stage variants from any future
+ > production-supported `hipfire-models/qwen3.6-27b` release.
+
+ ## What's in this repo
+
+ | File | Format | Size | Status |
+ |---|---|---:|---|
+ | `qwen3.6-27b.mq3-lloyd` | MQ3-Lloyd-G256 (112 B/group, 8-entry codebook, FWHT-rotated) | 12.6 GB | research; quantize-time-gated by `--allow-mq3-lloyd` |
+ | `qwen3.6-27b.mq4-lloyd` | MQ4-Lloyd-G256 (160 B/group, 16-entry codebook, FWHT-rotated) | 17.4 GB | research; quantize-time-gated by `--allow-mq4-lloyd`; earlier-stage than MQ3-Lloyd |
+
+ ## What is MQ{3,4}-Lloyd?
+
+ Lloyd-Max codebook quantization with a per-group LDS-staged
+ codebook. Each 256-element group carries an N-entry fp16 codebook
+ plus packed indices:
+
+ - **MQ3-Lloyd**: 8-entry codebook (16 B header) + 96 B of 3-bit
+   cross-byte-packed indices = **112 B / group**.
+ - **MQ4-Lloyd**: 16-entry codebook (32 B header) + 128 B of 4-bit
+   nibble-pair indices = **160 B / group**.
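
The per-group byte counts above follow from simple arithmetic: an fp16 codebook of `2**bits` entries plus `256 * bits / 8` bytes of indices. The sketch below reproduces those figures and round-trips a group of indices through a bit-packer. The helper names and the LSB-first bit order are assumptions made for illustration; the README does not specify the on-disk bit layout, so this should not be read as the hipfire format.

```python
# Illustrative sketch only. Sizes follow the README's figures; the
# LSB-first bit order used by the packer is an assumption.
GROUP = 256        # elements per quantization group ("G256")
FP16_BYTES = 2     # each codebook entry is one fp16 value


def group_bytes(bits: int) -> int:
    """Bytes per group: fp16 codebook of 2**bits entries plus packed indices."""
    codebook = (1 << bits) * FP16_BYTES
    indices = GROUP * bits // 8
    return codebook + indices


def pack_indices(indices, bits: int) -> bytes:
    """Pack small integers into bytes, LSB-first (hypothetical bit order)."""
    buf, nbits, out = 0, 0, bytearray()
    for idx in indices:
        if not 0 <= idx < (1 << bits):
            raise ValueError(f"index {idx} does not fit in {bits} bits")
        buf |= idx << nbits
        nbits += bits
        while nbits >= 8:          # flush every completed byte
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:                      # flush a trailing partial byte, if any
        out.append(buf & 0xFF)
    return bytes(out)


def unpack_indices(data: bytes, bits: int, count: int) -> list:
    """Inverse of pack_indices: read `count` indices of `bits` bits each."""
    buf, nbits, out = 0, 0, []
    stream = iter(data)
    for _ in range(count):
        while nbits < bits:        # refill from the next byte when short
            buf |= next(stream) << nbits
            nbits += 8
        out.append(buf & ((1 << bits) - 1))
        buf >>= bits
        nbits -= bits
    return out


if __name__ == "__main__":
    for bits in (3, 4):
        total = group_bytes(bits)
        print(f"MQ{bits}-Lloyd: {total} B/group, {total * 8 / GROUP} bits/weight")
```

With 3-bit indices a value regularly straddles a byte boundary, which is what "cross-byte-packed" refers to; with 4-bit indices two values fit exactly in each byte.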
61
+
62
+ Reconstruction is a *codebook lookup* (`cb[index]`) rather than the
63
+ affine `scale * q + zero_point` of HFQ3 (104 B/group) / HFQ4
64
+ (136 B/group). Group strides differ; **mixing formats in a single
65
+ dispatch is silent corruption** — hence the `--allow-mq*-lloyd`
66
+ quantize-time gate and the matched batched-prefill dispatch arms in
67
+ hipfire.
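
To make the contrast concrete, here is a minimal sketch of the two reconstruction rules and of why a stride mismatch goes unnoticed. The function names and the example codebook are hypothetical; only the formulas and the byte strides are taken from the text above.

```python
# Illustrative sketch only; function names and sample values are hypothetical.
# Byte strides per 256-element group, as quoted in the README.
STRIDES = {"HFQ3": 104, "HFQ4": 136, "MQ3-Lloyd": 112, "MQ4-Lloyd": 160}


def dequant_lloyd(codebook, indices):
    """Lloyd formats: each index selects an arbitrary level, cb[index]."""
    return [codebook[i] for i in indices]


def dequant_affine(scale, zero_point, q):
    """HFQ-style affine formats: levels lie on a uniform grid."""
    return [scale * v + zero_point for v in q]


def group_offset(fmt: str, group: int) -> int:
    """Byte offset of a group when the buffer is walked with fmt's stride."""
    return group * STRIDES[fmt]


if __name__ == "__main__":
    cb = [-1.0, -0.25, 0.25, 1.0]          # non-uniform spacing is allowed
    print(dequant_lloyd(cb, [0, 3, 1]))
    print(dequant_affine(0.5, -1.0, [0, 1, 2, 3]))
    # Walking an MQ4-Lloyd buffer with the HFQ4 stride lands inside group 0:
    print(group_offset("HFQ4", 1), group_offset("MQ4-Lloyd", 1))
```

Because every byte pattern decodes to some value, reading group 1 at offset 136 instead of 160 produces plausible-looking numbers rather than an error, which is why the mismatch is silent.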
+
+ ## Why "research"?
+
+ - **Quality drift on decode** (MQ3-Lloyd): the production GEMV
+   decode kernels carry a documented ~0.9 % PPL drift on the
+   Qwen3.5-9B reference model vs the slow-baseline path, observed
+   on all of gfx1100/1101/1102/1151. The cause is a
+   multi-accumulator reordering whose effect compounds across the
+   inference loop. The same envelope is expected on Qwen3.6-27B
+   but has not yet been measured. See the
+   `feat/mq3-lloyd-gfx1151` follow-up devlog in the hipfire repo
+   for the root cause and measurement. Prefill kernels
+   (PR [#195](https://github.com/Kaden-Schutt/hipfire/pull/195))
+   are single-accumulator and drift-free; the decode-side fix is
+   tracked as a separate follow-up.
+ - **Earlier-stage** (MQ4-Lloyd): wired through batched WMMA
+   prefill in PR [#197](https://github.com/Kaden-Schutt/hipfire/pull/197)
+   (issue #182 Phase 5b). The Phase C ship-gate bench on gfx1100 is
+   pending; current numbers are gfx1151-only.
+ - **Arch coverage**: gfx1100 / 1101 / 1102 / 1151 (RDNA3 + 3.5).
+   gfx1200 / 1201 (RDNA4) ship behind an opt-in env gate
+   (`HIPFIRE_LLOYD_GFX12=1`) pending external CI validation;
+   without it, RDNA4 falls through to the per-token
+   fallback. gfx10 / gfx906 / gfx94x are not supported.
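
The decode-drift item above comes down to floating-point addition not being associative: splitting a dot product across several accumulators changes the order of additions and therefore the rounding. The toy below shows the effect in Python's float64 with deliberately extreme values. It is not the hipfire kernel, and the function names are hypothetical; real kernels accumulate in lower precision, where much smaller magnitudes are enough to produce a difference.

```python
# Toy illustration of accumulation-order sensitivity; not hipfire code.

def dot_single(a, b):
    """Dot product with one running accumulator, in element order."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_multi(a, b, lanes=2):
    """Dot product with `lanes` interleaved accumulators, combined at the end."""
    accs = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        accs[i % lanes] += x * y
    total = 0.0
    for acc in accs:               # explicit loop: plain left-to-right adds
        total += acc
    return total


if __name__ == "__main__":
    a = [1e16, 1.0, -1e16, 1.0]    # exaggerated magnitudes, chosen on purpose
    b = [1.0, 1.0, 1.0, 1.0]
    print(dot_single(a, b))        # the first 1.0 is absorbed by 1e16
    print(dot_multi(a, b, lanes=2))  # the large terms cancel in their own lane
```

Both results are valid roundings of the same mathematical sum; they simply differ, and in an autoregressive decode loop such differences feed into every subsequent step.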
+
+ ## Usage with hipfire
+
+ ```bash
+ # Pull a Lloyd quant into the local hipfire model cache:
+ hf download hipfire-models/qwen3.6-27b-dev qwen3.6-27b.mq3-lloyd \
+   --local-dir ~/.hipfire/models
+
+ # Or, for the MQ4-Lloyd variant:
+ hf download hipfire-models/qwen3.6-27b-dev qwen3.6-27b.mq4-lloyd \
+   --local-dir ~/.hipfire/models
+
+ # Run via the daemon (engine auto-detects the dtype from the file).
+ # Note: the shell does not expand `~` inside the single-quoted JSON,
+ # so the daemon receives the path exactly as written.
+ ./target/release/examples/daemon < <(echo \
+   '{"type":"load","model":"~/.hipfire/models/qwen3.6-27b.mq3-lloyd","params":{"max_seq":4096}}')
+ ```
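
The shell example above passes the model path inside single-quoted JSON, where the shell leaves `~` unexpanded. Whether the daemon expands it is not stated here, so one cautious option is to build the message with the path already resolved. The sketch below does that; `build_load_message` is a hypothetical helper, and the message fields (`type`, `model`, `params.max_seq`) are copied from the shell example rather than from any documented schema.

```python
# Illustrative sketch only. The JSON shape mirrors the shell example in
# this README; the helper name is hypothetical and no other daemon
# message types are assumed.
import json
import os


def build_load_message(model_path: str, max_seq: int = 4096) -> str:
    """Return a one-line JSON 'load' request with `~` expanded client-side."""
    message = {
        "type": "load",
        "model": os.path.expanduser(model_path),
        "params": {"max_seq": max_seq},
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_load_message("~/.hipfire/models/qwen3.6-27b.mq3-lloyd"))
```

The printed line can then be fed to the daemon's standard input in place of the hand-written `echo` payload.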
+
+ ## Provenance
+
+ - Quantization: post-training Lloyd-Max codebook fit on the
+   FWHT-rotated upstream Qwen3.6-27B weights via the hipfire quantizer
+   (`hipfire-quantize` with `--allow-mq3-lloyd` / `--allow-mq4-lloyd`).
+ - Research PRs:
+   - [#195](https://github.com/Kaden-Schutt/hipfire/pull/195): WMMA prefill kernels for MQ3-Lloyd (issue #116 Phase 5).
+   - [#197](https://github.com/Kaden-Schutt/hipfire/pull/197): WMMA prefill kernels for MQ4-Lloyd (issue #182 Phase 5b).
+ - Format details: `docs/plans/mq3-lloyd-wmma-prefill.md` and
+   `docs/plans/mq4-lloyd-wmma-prefill.md` in the hipfire repo.
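
For readers unfamiliar with the two techniques named in the quantization item, the following is a textbook-style sketch: an orthonormal fast Walsh-Hadamard transform (FWHT) followed by a 1-D Lloyd-Max fit on one 256-element group. Everything here is an assumption for illustration: the uniform initialisation, the fixed iteration count, the absence of fp16 rounding, and the function names. The actual `hipfire-quantize` implementation may differ in all of these, and the format documents listed above are the authority.

```python
# Textbook-style sketch of FWHT rotation + Lloyd-Max codebook fitting.
# Not the hipfire quantizer; initialisation and iteration count are assumptions.
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)     # 1/sqrt(n) makes the transform self-inverse
    return [v * scale for v in out]


def nearest(codebook, v):
    """Index of the codebook level closest to v."""
    return min(range(len(codebook)), key=lambda k: (codebook[k] - v) ** 2)


def lloyd_max(values, levels, iters=20):
    """Alternate nearest-level assignment and centroid update (1-D k-means)."""
    lo, hi = min(values), max(values)
    # Assumed initialisation: level midpoints of a uniform grid over [lo, hi].
    codebook = [lo + (hi - lo) * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iters):
        indices = [nearest(codebook, v) for v in values]
        for k in range(levels):
            members = [v for v, i in zip(values, indices) if i == k]
            if members:            # empty cells keep their previous level
                codebook[k] = sum(members) / len(members)
    indices = [nearest(codebook, v) for v in values]
    return codebook, indices


def mse(values, codebook, indices):
    """Mean squared reconstruction error of a codebook/index pair."""
    return sum((v - codebook[i]) ** 2 for v, i in zip(values, indices)) / len(values)


if __name__ == "__main__":
    random.seed(0)
    group = [random.gauss(0.0, 1.0) for _ in range(256)]   # stand-in weights
    rotated = fwht(group)
    codebook, indices = lloyd_max(rotated, levels=8)        # 8 levels as in MQ3-Lloyd
    print(len(codebook), len(indices), mse(rotated, codebook, indices))
```

Each Lloyd iteration cannot increase the mean squared error, so the fitted levels are at least as good as the uniform grid they start from; that is the basic appeal of a fitted codebook over an affine grid at the same bit width.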
+
+ ## Looking for the canonical (non-research) quants?
+
+ Production-grade MQ3 / MQ4 / DFlash-draft variants for Qwen3.6-27B
+ live at
+ [schuttdev/hipfire-qwen3.6-27b](https://huggingface.co/schuttdev/hipfire-qwen3.6-27b)
+ until those repos move under this org.
+
+ ## License
+
+ Inherits the upstream Qwen3.6 license terms (Apache 2.0). The
+ quantization metadata and codebooks are derived from the upstream
+ weights.