+---
+license: apache-2.0
+language:
+- en
+tags:
+- hipfire
+- quantization
+- lloyd-max
+- research
+- qwen3.6
+gated: auto
+extra_gated_prompt: |
+  These are research-grade quantizations of Qwen3.6 weights using
+  Lloyd-Max codebook quantization, distributed for compute-kernel
+  research and reproducibility of the hipfire WMMA prefill /
+  decode kernel-perf work.
+  By accessing these files you acknowledge:
+    - These are NOT a production quant release. The Lloyd format
+      family is research-stage; quality envelope and arch coverage
+      differ from the canonical hipfire quants. MQ4-Lloyd in
+      particular is earlier-stage than MQ3-Lloyd — see "What's in
+      this repo".
+    - The formats are HFQ4-stride-incompatible; misuse via mixed
+      stride dispatch causes silent corruption. Use the published
+      hipfire daemon with the matching `--allow-mq{3,4}-lloyd` flag
+      so dispatch routes through the Lloyd-specific arms.
+    - Canonical non-Lloyd variants for this model size live at
+      `schuttdev/hipfire-qwen3.6-27b` and are the recommended
+      starting point for typical inference.
+extra_gated_fields:
+  I have read the research disclaimer: checkbox
+---
+# Qwen3.6-27B — hipfire research quants (dev)
+> **Research preview.** Lloyd-Max codebook quantization is an
+> experimental format under active development. Quality envelope and
+> arch coverage differ from the canonical hipfire quants — see
+> "What's in this repo" below before downloading. The `-dev` repo
+> name distinguishes these dev-stage variants from any future
+> production-supported `hipfire-models/qwen3.6-27b` release.
+## What's in this repo
+| File | Format | Size | Status |
+|---|---|---:|---|
+| `qwen3.6-27b.mq3-lloyd` | MQ3-Lloyd-G256 (112 B/group, 8-entry codebook, FWHT-rotated) | 12.6 GB | research; quantize-time-gated by `--allow-mq3-lloyd` |
+| `qwen3.6-27b.mq4-lloyd` | MQ4-Lloyd-G256 (160 B/group, 16-entry codebook, FWHT-rotated) | 17.4 GB | research; quantize-time-gated by `--allow-mq4-lloyd`; earlier-stage than MQ3-Lloyd |
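The per-group byte counts in the table translate directly into an effective storage cost per weight. The sketch below is plain arithmetic on the figures above; the helper name `bits_per_weight` is illustrative and not part of hipfire.

```python
GROUP_SIZE = 256  # weights per group (the "G256" in the format names)

def bits_per_weight(bytes_per_group: int, group_size: int = GROUP_SIZE) -> float:
    """Effective storage cost per weight, codebook header included."""
    return bytes_per_group * 8 / group_size

for name, nbytes in {"MQ3-Lloyd": 112, "MQ4-Lloyd": 160}.items():
    print(f"{name}: {bits_per_weight(nbytes):.2f} bits/weight")
```

This gives 3.50 bits/weight for MQ3-Lloyd and 5.00 bits/weight for MQ4-Lloyd, so the per-group codebook adds 0.5 and 1.0 bit per weight on top of the raw 3-bit and 4-bit indices.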
+## What is MQ{3,4}-Lloyd?
+Lloyd-Max codebook quantization with a per-group codebook that the
+kernels stage in LDS (the GPU's on-chip local data share). Each
+256-element group carries an N-entry fp16 codebook plus packed indices:
+- **MQ3-Lloyd**: 8-entry codebook (16 B header) + 96 B 3-bit
+  cross-byte-packed indices = **112 B / group**.
+- **MQ4-Lloyd**: 16-entry codebook (32 B header) + 128 B 4-bit
+  nibble-pair indices = **160 B / group**.
+Reconstruction is a *codebook lookup* (`cb[index]`) rather than the
+affine `scale * q + zero_point` of HFQ3 (104 B/group) and HFQ4
+(136 B/group). Because the group strides differ, **mixing formats in
+a single dispatch silently corrupts the output**. This is why the
+formats sit behind the `--allow-mq*-lloyd` quantize-time gate and
+have their own matched batched-prefill dispatch arms in hipfire.
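The group layout and the lookup-based reconstruction described above can be sketched in a few lines of Python. This is a minimal illustration, not the hipfire kernel: the function names are hypothetical, the codebook is assumed to come first (as the "header" wording suggests), and the LSB-first little-endian bit order is an assumption, since the exact "cross-byte" and "nibble-pair" layouts are not specified here. The reconstructed values are still in the FWHT-rotated domain.

```python
import struct

GROUP = 256  # elements per quantization group (G256)

def group_bytes(bits: int, group: int = GROUP) -> int:
    """Bytes per group: fp16 codebook header plus densely packed indices."""
    entries = 1 << bits          # 8 entries for 3-bit, 16 for 4-bit
    header = entries * 2         # fp16 = 2 bytes per codebook entry
    indices = group * bits // 8  # index payload in bytes
    return header + indices

def unpack_indices(packed: bytes, bits: int, count: int = GROUP) -> list:
    """Unpack `count` indices of `bits` bits each.

    ASSUMPTION: indices form an LSB-first little-endian bit stream; the
    real hipfire packing order may differ.
    """
    word = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

def dequant_group(blob: bytes, bits: int) -> list:
    """Reconstruct one group by codebook lookup: value = cb[index]."""
    entries = 1 << bits
    header = entries * 2
    cb = struct.unpack(f"<{entries}e", blob[:header])  # "e" = IEEE fp16
    idx = unpack_indices(blob[header:header + GROUP * bits // 8], bits)
    return [cb[i] for i in idx]

print(group_bytes(3), group_bytes(4))  # 112 160
```

Note that no scale or zero point appears anywhere: the only per-group metadata is the codebook itself, which is why the stride differs from the affine HFQ formats.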
+## Why "research"?
+- **Quality drift on decode** (MQ3-Lloyd): the production GEMV
+  decode kernels carry a documented ~0.9 % PPL drift on the
+  Qwen3.5-9B reference model vs the slow-baseline path (seen on
+  all of gfx1100/1101/1102/1151), caused by a multi-accumulator
+  reordering that compounds across the inference loop. The same
+  envelope is expected on Qwen3.6-27B but has not yet been measured
+  there. See the `feat/mq3-lloyd-gfx1151` follow-up devlog in the
+  hipfire repo for the root cause and measurement. Prefill kernels
+  (PR [#195](https://github.com/Kaden-Schutt/hipfire/pull/195))
+  are single-accumulator and drift-free; the decode-side fix is
+  tracked as a separate follow-up.
+- **Earlier-stage** (MQ4-Lloyd): wired through batched WMMA
+  prefill in PR [#197](https://github.com/Kaden-Schutt/hipfire/pull/197)
+  (issue #182 Phase 5b). Phase C ship-gate bench on gfx1100 is
+  pending — current numbers are gfx1151-only.
+- **Arch coverage**: gfx1100 / 1101 / 1102 / 1151 (RDNA3 and RDNA3.5).
+  gfx1200 / 1201 (RDNA4) ship behind an opt-in env gate
+  (`HIPFIRE_LLOYD_GFX12=1`) pending external CI validation; by
+  default, RDNA4 falls through to the per-token fallback.
+  gfx10 / gfx906 / gfx94x are not supported.
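The decode-drift item above attributes the difference to a multi-accumulator reordering. The general mechanism is that floating-point addition is not associative, so splitting a dot product across several accumulators can change the rounded result. The sketch below is a deliberately exaggerated, generic illustration in Python floats; it is not the hipfire kernel, and the function names are hypothetical.

```python
def dot_single_acc(a, b):
    """Dot product with one running accumulator (strict left-to-right order)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_multi_acc(a, b, lanes=2):
    """Dot product with `lanes` interleaved accumulators, combined at the end."""
    accs = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        accs[i % lanes] += x * y
    total = 0.0
    for acc in accs:
        total += acc
    return total

# Inputs chosen so that rounding depends on summation order.
a = [1e16, 1.0, -1e16, 1.0]
b = [1.0, 1.0, 1.0, 1.0]
print(dot_single_acc(a, b))  # 1.0 (the first 1.0 is absorbed by 1e16)
print(dot_multi_acc(a, b))   # 2.0 (the large terms cancel in their own lane)
```

Real kernels see far smaller per-operation differences than this toy input; the point is only that two mathematically equivalent summation orders need not agree bit for bit.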
+## Usage with hipfire
+```bash
+# Pull a Lloyd quant into the local hipfire model cache:
+hf download hipfire-models/qwen3.6-27b-dev qwen3.6-27b.mq3-lloyd \
+  --local-dir ~/.hipfire/models
+# Or, for the MQ4-Lloyd variant:
+hf download hipfire-models/qwen3.6-27b-dev qwen3.6-27b.mq4-lloyd \
+  --local-dir ~/.hipfire/models
+# Run via the daemon (engine auto-detects the dtype from the file):
+./target/release/examples/daemon < <(echo \
+  '{"type":"load","model":"~/.hipfire/models/qwen3.6-27b.mq3-lloyd","params":{"max_seq":4096}}')
+```
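When driving the daemon from a script, the same `load` message can be built programmatically. The sketch below only reproduces the JSON fields shown in the shell example; the helper name `load_request` is hypothetical. It expands a leading `~` before serializing, since the shell does not expand `~` inside single quotes and whether the daemon expands it itself is not stated here.

```python
import json
import os

def load_request(model_path: str, max_seq: int = 4096) -> str:
    """Build the one-line JSON `load` message from the shell example."""
    message = {
        "type": "load",
        "model": os.path.expanduser(model_path),  # resolve a leading "~"
        "params": {"max_seq": max_seq},
    }
    return json.dumps(message)

print(load_request("~/.hipfire/models/qwen3.6-27b.mq3-lloyd"))
```

The resulting line can be piped to the daemon's stdin in place of the `echo` in the shell example.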
+## Provenance
+- Quantization: post-training Lloyd-Max codebook fit on the
+  FWHT-rotated upstream Qwen3.6-27B weights via the hipfire quantizer
+  (`hipfire-quantize` with `--allow-mq3-lloyd` / `--allow-mq4-lloyd`).
+- Research PRs:
+  - [#195](https://github.com/Kaden-Schutt/hipfire/pull/195) — WMMA prefill kernels for MQ3-Lloyd (issue #116 Phase 5).
+  - [#197](https://github.com/Kaden-Schutt/hipfire/pull/197) — WMMA prefill kernels for MQ4-Lloyd (issue #182 Phase 5b).
+- Format details: `docs/plans/mq3-lloyd-wmma-prefill.md` and
+  `docs/plans/mq4-lloyd-wmma-prefill.md` in the hipfire repo.
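For readers unfamiliar with the two building blocks named under Quantization, here is a textbook sketch of a fast Walsh-Hadamard transform (FWHT) and a 1-D Lloyd-Max codebook fit. It is not the `hipfire-quantize` implementation: the function names, the unnormalized transform, the quantile initialization, and the iteration count are all assumptions made for illustration.

```python
def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return x

def lloyd_max(values, k, iters=50):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update."""
    vals = sorted(values)
    n = len(vals)
    # ASSUMPTION: initialize centroids at evenly spaced quantiles of the data.
    cb = [vals[min(n - 1, (2 * j + 1) * n // (2 * k))] for j in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for v in vals:
            j = min(range(k), key=lambda c: (v - cb[c]) ** 2)
            sums[j] += v
            counts[j] += 1
        # Keep the old centroid for any cell that received no samples.
        new_cb = [sums[j] / counts[j] if counts[j] else cb[j] for j in range(k)]
        if new_cb == cb:
            break
        cb = new_cb
    return cb

print(fwht([1, 0, 0, 0]))                   # [1, 1, 1, 1]
print(lloyd_max([1, 2, 3, 11, 12, 13], 2))  # [2, 12]
```

In the formats described in this card, a fit of this kind would run per 256-element group with `k = 8` (MQ3-Lloyd) or `k = 16` (MQ4-Lloyd), and the resulting centroids would be stored as the group's fp16 codebook.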
+## Looking for the canonical (non-research) quants?
+Production-grade MQ3 / MQ4 / DFlash-draft variants for Qwen3.6-27B
+live at
+[schuttdev/hipfire-qwen3.6-27b](https://huggingface.co/schuttdev/hipfire-qwen3.6-27b)
+until those repos move under this org.
+## License
+Inherits the upstream Qwen3.6 license terms (Apache 2.0). The
+quantization metadata and codebooks are derived from the upstream
+weights.