Initial: research-gated MQ3-Lloyd + MQ4-Lloyd disclaimer
README.md
ADDED
@@ -0,0 +1,130 @@
---
license: apache-2.0
language:
- en
tags:
- hipfire
- quantization
- lloyd-max
- research
- qwen3.6
gated: auto
extra_gated_prompt: |
  These are research-grade quantizations of Qwen3.6 weights using
  Lloyd-Max codebook quantization, distributed for compute-kernel
  research and reproducibility of the hipfire WMMA prefill /
  decode kernel-perf work.

  By accessing these files you acknowledge:
  - These are NOT a production quant release. The Lloyd format
    family is research-stage; quality envelope and arch coverage
    differ from the canonical hipfire quants. MQ4-Lloyd in
    particular is earlier-stage than MQ3-Lloyd; see "What's in
    this repo".
  - The formats are HFQ4-stride-incompatible; misuse via mixed
    stride dispatch causes silent corruption. Use the published
    hipfire daemon with the matching `--allow-mq{3,4}-lloyd` flag
    so dispatch routes through the Lloyd-specific arms.
  - Canonical non-Lloyd variants for this model size live at
    `schuttdev/hipfire-qwen3.6-27b` and are the recommended
    starting point for typical inference.
extra_gated_fields:
  I have read the research disclaimer: checkbox
---

# Qwen3.6-27B: hipfire research quants (dev)

> **Research preview.** Lloyd-Max codebook quantization is an
> experimental format under active development. Quality envelope and
> arch coverage differ from the canonical hipfire quants; see
> "What's in this repo" below before downloading. The `-dev` repo
> name distinguishes these dev-stage variants from any future
> production-supported `hipfire-models/qwen3.6-27b` release.

## What's in this repo

| File | Format | Size | Status |
|---|---|---:|---|
| `qwen3.6-27b.mq3-lloyd` | MQ3-Lloyd-G256 (112 B/group, 8-entry codebook, FWHT-rotated) | 12.6 GB | research; quantize-time-gated by `--allow-mq3-lloyd` |
| `qwen3.6-27b.mq4-lloyd` | MQ4-Lloyd-G256 (160 B/group, 16-entry codebook, FWHT-rotated) | 17.4 GB | research; quantize-time-gated by `--allow-mq4-lloyd`; earlier-stage than MQ3-Lloyd |

## What is MQ{3,4}-Lloyd?

Lloyd-Max codebook quantization with a per-group LDS-staged
codebook. Each 256-element group carries an N-entry fp16 codebook
plus packed indices:

- **MQ3-Lloyd**: 8-entry codebook (16 B header) + 96 B 3-bit
  cross-byte-packed indices = **112 B / group**.
- **MQ4-Lloyd**: 16-entry codebook (32 B header) + 128 B 4-bit
  nibble-pair indices = **160 B / group**.
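
The per-group byte counts above follow directly from the codebook size and the index width. A minimal sketch of that arithmetic, assuming 2-byte fp16 codebook entries and tightly packed indices as described; `group_layout` is a hypothetical helper name, not part of hipfire:

```python
def group_layout(index_bits: int, group_size: int = 256, codebook_entry_bytes: int = 2):
    """Return (header_bytes, index_bytes, total_bytes) for one quantization group."""
    entries = 1 << index_bits                    # 3 bits -> 8 entries, 4 bits -> 16
    header = entries * codebook_entry_bytes      # fp16 codebook stored per group
    index_bytes = group_size * index_bits // 8   # tightly packed indices
    return header, index_bytes, header + index_bytes


for name, bits in (("MQ3-Lloyd", 3), ("MQ4-Lloyd", 4)):
    header, index_bytes, total = group_layout(bits)
    bits_per_weight = total * 8 / 256
    print(f"{name}: {header} B + {index_bytes} B = {total} B/group, "
          f"{bits_per_weight} bits/weight")
```

Including the codebook overhead, this works out to 3.5 bits per weight for MQ3-Lloyd and 5.0 for MQ4-Lloyd within a group.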

Reconstruction is a *codebook lookup* (`cb[index]`) rather than the
affine `scale * q + zero_point` of HFQ3 (104 B/group) / HFQ4
(136 B/group). Group strides differ, so **mixing formats in a single
dispatch causes silent corruption**. That is the reason for the
`--allow-mq*-lloyd` quantize-time gate and the matched
batched-prefill dispatch arms in hipfire.
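
To make the lookup and the stride hazard concrete, here is a small sketch for one MQ4-Lloyd group. It is illustrative only: the nibble order (even element in the low nibble), little-endian fp16, and the function names are assumptions, and the inverse FWHT rotation that a real decoder would need is left out. Only the 160 B and 136 B strides come from the text above.

```python
import struct

GROUP = 256        # elements per group
MQ4_STRIDE = 160   # 32 B codebook + 128 B indices
HFQ4_STRIDE = 136  # HFQ4 group size quoted above; its internal layout is not modelled


def pack_mq4_group(codebook, indices):
    """Pack 16 fp16 codebook entries followed by 256 4-bit indices."""
    assert len(codebook) == 16 and len(indices) == GROUP
    header = struct.pack("<16e", *codebook)
    # Assumption: even element in the low nibble, odd element in the high nibble.
    body = bytes(indices[i] | (indices[i + 1] << 4) for i in range(0, GROUP, 2))
    return header + body


def unpack_mq4_group(buf, offset=0):
    """Reconstruct 256 values by codebook lookup: value = cb[index]."""
    cb = struct.unpack_from("<16e", buf, offset)
    packed = buf[offset + 32 : offset + MQ4_STRIDE]
    out = []
    for byte in packed:
        out.append(cb[byte & 0x0F])
        out.append(cb[byte >> 4])
    return out


def affine_dequant(q, scale, zero_point):
    """The affine form, for contrast: levels are always evenly spaced."""
    return scale * q + zero_point


codebook = [-2.0 + 0.25 * i for i in range(16)]  # placeholder values, exact in fp16
indices = [(7 * i) % 16 for i in range(GROUP)]
blob = pack_mq4_group(codebook, indices) * 2     # two identical groups back to back

decoded = unpack_mq4_group(blob, offset=MQ4_STRIDE)   # second group, correct stride
misread = unpack_mq4_group(blob, offset=HFQ4_STRIDE)  # second group, wrong stride
print(decoded[:4])
print(misread[:4])
```

The read at the wrong stride raises no error; it simply interprets index bytes as codebook entries and returns different numbers, which is the silent failure mode the gate is meant to prevent.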

## Why "research"?

- **Quality drift on decode** (MQ3-Lloyd): the production GEMV
  decode kernels carry a documented ~0.9 % PPL drift on the
  Qwen3.5-9B reference model vs the slow-baseline path (consistent
  across gfx1100/1101/1102/1151), caused by a multi-accumulator
  reordering that compounds across the inference loop. The same
  envelope is assumed for Qwen3.6-27B until it is measured. See the
  `feat/mq3-lloyd-gfx1151` follow-up devlog in the hipfire repo
  for the root cause and measurement. Prefill kernels
  (PR [#195](https://github.com/Kaden-Schutt/hipfire/pull/195))
  are single-acc and drift-free; the decode-side fix is tracked
  as a separate follow-up.
- **Earlier-stage** (MQ4-Lloyd): wired through batched WMMA
  prefill in PR [#197](https://github.com/Kaden-Schutt/hipfire/pull/197)
  (issue #182 Phase 5b). Phase C ship-gate bench on gfx1100 is
  pending; current numbers are gfx1151-only.
- **Arch coverage**: gfx1100 / 1101 / 1102 / 1151 (RDNA3 + 3.5).
  gfx1200 / 1201 (RDNA4) ship behind an opt-in env gate
  (`HIPFIRE_LLOYD_GFX12=1`) pending external CI validation;
  default behaviour on RDNA4 falls through to the per-token
  fallback. gfx10 / gfx906 / gfx94x are not supported.
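
The decode drift described in the first bullet comes down to floating-point addition not being associative: splitting one running sum across several accumulators changes the rounding order. The sketch below is a generic illustration of that effect, not the hipfire kernel's actual reduction; the function names are hypothetical and the input magnitudes are contrived so the difference is visible in four terms.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_single_acc(a, b):
    """One running fp32 accumulator, strictly left to right."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot_multi_acc(a, b, lanes=2):
    """`lanes` interleaved fp32 accumulators, reduced at the end."""
    accs = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        accs[i % lanes] = f32(accs[i % lanes] + f32(x * y))
    total = 0.0
    for acc in accs:
        total = f32(total + acc)
    return total


a = [1e8, 1.0, -1e8, 1.0]
b = [1.0, 1.0, 1.0, 1.0]
print(dot_single_acc(a, b))  # 1.0: the first +1.0 is absorbed by 1e8 in fp32
print(dot_multi_acc(a, b))   # 2.0: the large terms cancel in their own lane
```

Neither order is wrong in general; the point is only that two kernels with different accumulation orders will not agree bit for bit, and small differences can compound over many decode steps.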

## Usage with hipfire

```bash
# Pull a Lloyd quant into the local hipfire model cache:
hf download hipfire-models/qwen3.6-27b-dev qwen3.6-27b.mq3-lloyd \
  --local-dir ~/.hipfire/models

# Or, for the MQ4-Lloyd variant:
hf download hipfire-models/qwen3.6-27b-dev qwen3.6-27b.mq4-lloyd \
  --local-dir ~/.hipfire/models

# Run via the daemon (engine auto-detects the dtype from the file):
./target/release/examples/daemon < <(echo \
  '{"type":"load","model":"~/.hipfire/models/qwen3.6-27b.mq3-lloyd","params":{"max_seq":4096}}')
```
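
When driving the daemon from a script rather than a shell one-liner, the load message can be built programmatically. A minimal sketch, assuming only the message shape shown in the command above; `build_load_request` is a hypothetical helper. It expands `~` on the client side because the shell does not expand it inside single quotes, and this README does not say whether the daemon does.

```python
import json
import os


def build_load_request(model_path: str, max_seq: int = 4096) -> str:
    """Return one JSON line in the shape of the load message shown above."""
    message = {
        "type": "load",
        # Expand "~" here so the request does not rely on the daemon doing it.
        "model": os.path.expanduser(model_path),
        "params": {"max_seq": max_seq},
    }
    return json.dumps(message)


print(build_load_request("~/.hipfire/models/qwen3.6-27b.mq3-lloyd"))
```

The resulting line can be written to the daemon's standard input in place of the `echo` in the shell example.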

## Provenance

- Quantization: post-training Lloyd-Max codebook fit on the
  FWHT-rotated upstream Qwen3.6-27B weights via the hipfire quantizer
  (`hipfire-quantize` with `--allow-mq3-lloyd` / `--allow-mq4-lloyd`).
- Research PRs:
  - [#195](https://github.com/Kaden-Schutt/hipfire/pull/195): WMMA prefill kernels for MQ3-Lloyd (issue #116 Phase 5).
  - [#197](https://github.com/Kaden-Schutt/hipfire/pull/197): WMMA prefill kernels for MQ4-Lloyd (issue #182 Phase 5b).
- Format details: `docs/plans/mq3-lloyd-wmma-prefill.md` and
  `docs/plans/mq4-lloyd-wmma-prefill.md` in the hipfire repo.
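
For readers unfamiliar with the two steps named in the first bullet, the sketch below shows a textbook fast Walsh-Hadamard transform followed by a 1-D Lloyd iteration on one 256-element group. It is not the hipfire quantizer: the orthonormal scaling, the quantile initialisation, the iteration count, and both function names are assumptions made for illustration, and the input is placeholder data rather than real weights.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    assert n and n & (n - 1) == 0
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = out[i], out[i + h]
                out[i], out[i + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def lloyd_max(samples, levels, iterations=25):
    """1-D Lloyd iteration: nearest-centroid assignment, then centroid = cell mean."""
    ordered = sorted(samples)
    # Assumption: initialise centroids at evenly spaced quantiles.
    codebook = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)] for k in range(levels)]
    for _ in range(iterations):
        sums = [0.0] * levels
        counts = [0] * levels
        for s in samples:
            k = min(range(levels), key=lambda j: abs(s - codebook[j]))
            sums[k] += s
            counts[k] += 1
        # Empty cells keep their previous centroid.
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k] for k in range(levels)]
    return sorted(codebook)


group = [math.sin(0.37 * i) for i in range(256)]  # placeholder data, not real weights
rotated = fwht(group)
codebook = lloyd_max(rotated, levels=8)           # 8 entries, as in MQ3-Lloyd
print(codebook)
```

With the orthonormal scaling used here the transform is its own inverse, so applying `fwht` again to the rotated values recovers the original group up to rounding.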

## Looking for the canonical (non-research) quants?

Production-grade MQ3 / MQ4 / DFlash-draft variants for Qwen3.6-27B
live at
[schuttdev/hipfire-qwen3.6-27b](https://huggingface.co/schuttdev/hipfire-qwen3.6-27b)
until those repos move under this org.

## License

Inherits the upstream Qwen3.6 license terms (Apache 2.0). The
quantization metadata and codebooks are derived from the upstream
weights.