Qwen3.5-9B — hipfire research quants

Research preview. Lloyd-Max codebook quantization is an experimental format under active development. Quality envelope and arch coverage differ from the canonical hipfire quants — see "What's in this repo" below before downloading.

What's in this repo

| File | Format | Size | Status |
| --- | --- | --- | --- |
| `qwen3.5-9b.mq3-lloyd` | MQ3-Lloyd-G256 (112 B/group, 8-entry codebook, FWHT-rotated) | 4.3 GB | research; quantize-time-gated by `--allow-mq3-lloyd` |
| `qwen3.5-9b.mq4-lloyd` | MQ4-Lloyd-G256 (160 B/group, 16-entry codebook, FWHT-rotated) | 5.6 GB | research; quantize-time-gated by `--allow-mq4-lloyd`; earlier-stage than MQ3-Lloyd |

What is MQ{3,4}-Lloyd?

Lloyd-Max codebook quantization with a per-group LDS-staged codebook. Each 256-element group carries an N-entry fp16 codebook plus packed indices:

  • MQ3-Lloyd: 8-entry codebook (16 B header) + 96 B of 3-bit cross-byte-packed indices = 112 B/group.
  • MQ4-Lloyd: 16-entry codebook (32 B header) + 128 B of 4-bit nibble-pair indices = 160 B/group.
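
The per-group byte budgets above can be sanity-checked with a few lines. This is a sketch derived from the figures quoted in this card (fp16 codebook entries at 2 bytes each); the helper name is illustrative, not part of hipfire:

```python
# Byte budget for a Lloyd group: fp16 codebook header + packed index payload.
def lloyd_group_bytes(codebook_entries: int, bits_per_index: int, group: int = 256) -> int:
    header = codebook_entries * 2          # fp16 = 2 bytes per codebook entry
    indices = group * bits_per_index // 8  # bit-packed indices, whole bytes
    return header + indices

print(lloyd_group_bytes(8, 3))   # MQ3-Lloyd: 16 + 96  = 112
print(lloyd_group_bytes(16, 4))  # MQ4-Lloyd: 32 + 128 = 160
```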

Reconstruction is a codebook lookup (cb[index]) rather than the affine scale * q + zero_point of HFQ3 (104 B/group) / HFQ4 (136 B/group). Group strides differ, so mixing formats in a single dispatch silently corrupts output — hence the --allow-mq*-lloyd quantize-time gate and the matched batched-prefill dispatch arms in hipfire.
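
To make the two reconstruction rules concrete, here is a hedged sketch contrasting them (function names and the example codebook are illustrative, not the hipfire API):

```python
# Lloyd formats: each index selects a learned fp16 codebook entry.
def dequant_lloyd(indices, codebook):
    return [codebook[i] for i in indices]

# HFQ formats: each raw quantized value goes through an affine map.
def dequant_affine(q, scale, zero_point):
    return [scale * v + zero_point for v in q]

cb = [-1.5, -0.7, -0.2, 0.0, 0.2, 0.7, 1.1, 1.5]  # hypothetical 8-entry (MQ3-style) codebook
print(dequant_lloyd([0, 3, 7], cb))          # [-1.5, 0.0, 1.5]
print(dequant_affine([0, 3, 7], 0.5, -1.0))  # [-1.0, 0.5, 2.5]
```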

Why "research"?

  • Quality drift on decode (MQ3-Lloyd): the production GEMV decode kernels carry a documented ~0.9 % PPL drift on Qwen3.5-9B vs the slow-baseline path (universal across gfx1100/1101/1102/1151), caused by a multi-accumulator reordering that compounds across the inference loop. See feat/mq3-lloyd-gfx1151 follow-up devlog in the hipfire repo for the root cause + measurement. Prefill kernels (PR #195) are single-acc and drift-free; the decode-side fix is tracked as a separate follow-up.
  • Earlier-stage (MQ4-Lloyd): wired through batched WMMA prefill in PR #197 (issue #182 Phase 5b). Phase C ship-gate bench on gfx1100 is pending — current numbers are gfx1151-only.
  • Arch coverage: gfx1100 / 1101 / 1102 / 1151 (RDNA3 + 3.5). gfx1200 / 1201 (RDNA4) ship behind an opt-in env gate (HIPFIRE_LLOYD_GFX12=1) pending external CI validation — default behaviour on RDNA4 falls through to the per-token fallback. gfx10 / gfx906 / gfx94x are not supported.
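
The accumulator-reordering drift mentioned above comes down to floating-point addition being non-associative: splitting a reduction across several accumulators and combining them at the end rounds differently than one sequential accumulator. A minimal standalone illustration (not hipfire code):

```python
import random

random.seed(0)
# Mixed-magnitude values, roughly like the partial products of a dot product.
xs = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-3, 3) for _ in range(10_000)]

single = 0.0                 # one sequential accumulator
for v in xs:
    single += v

accs = [0.0, 0.0, 0.0, 0.0]  # 4-way interleaved accumulation
for i, v in enumerate(xs):
    accs[i % 4] += v
multi = accs[0] + accs[1] + accs[2] + accs[3]

# The two sums agree to many digits but generally not bit-exactly; over an
# inference loop such tiny per-op deltas can compound into measurable drift.
print(single, multi, single - multi)
```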

Usage with hipfire

```shell
# Pull a Lloyd quant into the local hipfire model cache:
hf download hipfire-models/qwen3.5-9b qwen3.5-9b.mq3-lloyd \
  --local-dir ~/.hipfire/models

# Or, for the MQ4-Lloyd variant:
hf download hipfire-models/qwen3.5-9b qwen3.5-9b.mq4-lloyd \
  --local-dir ~/.hipfire/models

# Run via the daemon (engine auto-detects the dtype from the file):
echo '{"type":"load","model":"~/.hipfire/models/qwen3.5-9b.mq3-lloyd","params":{"max_seq":4096}}' \
  | ./target/release/examples/daemon
```

Provenance

  • Quantization: post-training Lloyd-Max codebook fit on the FWHT-rotated upstream Qwen3.5-9B weights via the hipfire quantizer (hipfire-quantize with --allow-mq3-lloyd / --allow-mq4-lloyd).
  • Research PRs:
    • #195 — WMMA prefill kernels for MQ3-Lloyd (issue #116 Phase 5).
    • #197 — WMMA prefill kernels for MQ4-Lloyd (issue #182 Phase 5b).
  • Format details: docs/plans/mq3-lloyd-wmma-prefill.md and docs/plans/mq4-lloyd-wmma-prefill.md in the hipfire repo.

Looking for the canonical (non-research) quants?

Production-grade MQ3 / MQ4 / MQ6 / MQ8 / HFQ4 / HFQ6 / DFlash-draft variants for Qwen3.5-9B live at schuttdev/hipfire-qwen3.5-9b until those repos move under this org.

License

Inherits the upstream Qwen3.5 license terms (Apache 2.0). The quantization metadata + codebooks are derived from the upstream weights.
