# Qwen3.5-4B — hipfire research quants (dev)
**Research preview.** Lloyd-Max codebook quantization is an experimental format under active development. Quality envelope and arch coverage differ from the canonical hipfire quants — see "What's in this repo" below before downloading. The `-dev` repo name distinguishes these dev-stage variants from any future production-supported `hipfire-models/qwen3.5-4b` release.
## What's in this repo
| File | Format | Size | Status |
|---|---|---|---|
| `qwen3.5-4b.mq3-lloyd` | MQ3-Lloyd-G256 (112 B/group, 8-entry codebook, FWHT-rotated) | 2.1 GB | research; quantize-time-gated by `--allow-mq3-lloyd` |
| `qwen3.5-4b.mq4-lloyd` | MQ4-Lloyd-G256 (160 B/group, 16-entry codebook, FWHT-rotated) | 2.7 GB | research; quantize-time-gated by `--allow-mq4-lloyd`; earlier-stage than MQ3-Lloyd |
## What is MQ{3,4}-Lloyd?
Lloyd-Max codebook quantization with a per-group LDS-staged codebook. Each 256-element group carries an N-entry fp16 codebook plus packed indices:
- MQ3-Lloyd: 8-entry codebook (16 B header) + 96 B 3-bit cross-byte-packed indices = 112 B / group.
- MQ4-Lloyd: 16-entry codebook (32 B header) + 128 B 4-bit nibble-pair indices = 160 B / group.
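As a sanity check, the per-group byte budgets above follow directly from the layout. A minimal sketch (`group_bytes` is illustrative, not a hipfire API):

```python
# Each 256-element group stores an fp16 codebook header plus
# densely bit-packed indices; the totals below match the table.

GROUP = 256  # elements per group

def group_bytes(codebook_entries: int, bits_per_index: int) -> int:
    header = codebook_entries * 2          # fp16 codebook: 2 bytes/entry
    indices = GROUP * bits_per_index // 8  # packed index bits, whole bytes
    return header + indices

print(group_bytes(8, 3))   # MQ3-Lloyd: 16 + 96  = 112 B/group
print(group_bytes(16, 4))  # MQ4-Lloyd: 32 + 128 = 160 B/group
```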
Reconstruction is a codebook lookup (`cb[index]`) rather than the affine `scale * q + zero_point` of HFQ3 (104 B/group) / HFQ4 (136 B/group). Group strides differ; mixing formats in a single dispatch is silent corruption — hence the `--allow-mq*-lloyd` quantize-time gate and the matched batched-prefill dispatch arms in hipfire.
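To make the lookup-vs-affine distinction concrete, here is a hedged sketch of both reconstruction paths. The little-endian bitstream order for the packed indices is an assumption for illustration; the actual MQ3-Lloyd cross-byte packing may differ.

```python
import numpy as np

def unpack_indices(packed: bytes, bits: int, n: int) -> np.ndarray:
    # Treat the packed bytes as a little-endian bitstream and read n
    # fields of `bits` bits each (bit order is an assumption here).
    stream = np.unpackbits(np.frombuffer(packed, dtype=np.uint8), bitorder="little")
    fields = stream[: n * bits].reshape(n, bits)
    return np.packbits(fields, axis=1, bitorder="little")[:, 0]

def dequant_lloyd(codebook: np.ndarray, packed: bytes, bits: int) -> np.ndarray:
    # MQ{3,4}-Lloyd path: reconstruction is a pure table lookup.
    return codebook[unpack_indices(packed, bits, 256)]

def dequant_affine(scale: float, zero_point: float, q: np.ndarray) -> np.ndarray:
    # HFQ3/HFQ4-style path: affine reconstruction from integer levels.
    return scale * q + zero_point
```

For a 3-bit group, 256 indices pack into exactly 96 bytes, which together with the 16 B fp16 codebook gives the 112 B/group stride quoted above.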
Why "research"?
- Quality drift on decode (MQ3-Lloyd): the production GEMV decode kernels carry a documented ~0.9 % PPL drift on Qwen3.5-9B vs the slow-baseline path (universal across gfx1100/1101/1102/1151), caused by a multi-accumulator reordering that compounds across the inference loop. See the `feat/mq3-lloyd-gfx1151` follow-up devlog in the hipfire repo for the root cause and measurement. Prefill kernels (PR #195) are single-acc and drift-free; the decode-side fix is tracked as a separate follow-up.
- Earlier-stage (MQ4-Lloyd): wired through batched WMMA prefill in PR #197 (issue #182 Phase 5b). The Phase C ship-gate bench on gfx1100 is pending — current numbers are gfx1151-only.
- Arch coverage: gfx1100 / 1101 / 1102 / 1151 (RDNA3 + 3.5). gfx1200 / 1201 (RDNA4) ship behind an opt-in env gate (`HIPFIRE_LLOYD_GFX12=1`) pending external CI validation — default behaviour on RDNA4 falls through to the per-token fallback. gfx10 / gfx906 / gfx94x are not supported.
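The gating policy above can be sketched as follows. This is an illustrative model of the documented behaviour, not hipfire's actual dispatch code; the function and set names are hypothetical.

```python
import os

RDNA3_ARCHES = {"gfx1100", "gfx1101", "gfx1102", "gfx1151"}  # supported by default
RDNA4_ARCHES = {"gfx1200", "gfx1201"}                        # opt-in only

def lloyd_kernel_enabled(arch: str) -> bool:
    if arch in RDNA3_ARCHES:
        return True
    if arch in RDNA4_ARCHES:
        # RDNA4 is gated pending external CI validation; without the
        # env var the engine falls through to the per-token fallback.
        return os.environ.get("HIPFIRE_LLOYD_GFX12") == "1"
    return False  # gfx10 / gfx906 / gfx94x: unsupported
```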
## Usage with hipfire
```bash
# Pull a Lloyd quant into the local hipfire model cache:
hf download hipfire-models/qwen3.5-4b-dev qwen3.5-4b.mq3-lloyd \
  --local-dir ~/.hipfire/models

# Or, for the MQ4-Lloyd variant:
hf download hipfire-models/qwen3.5-4b-dev qwen3.5-4b.mq4-lloyd \
  --local-dir ~/.hipfire/models

# Run via the daemon (engine auto-detects the dtype from the file):
./target/release/examples/daemon < <(echo \
  '{"type":"load","model":"~/.hipfire/models/qwen3.5-4b.mq3-lloyd","params":{"max_seq":4096}}')
```
## Provenance
- Quantization: post-training Lloyd-Max codebook fit on the FWHT-rotated upstream Qwen3.5-4B weights via the hipfire quantizer (`hipfire-quantize` with `--allow-mq3-lloyd` / `--allow-mq4-lloyd`).
- Research PRs: see PR #195 (MQ3-Lloyd prefill) and PR #197 (MQ4-Lloyd batched WMMA prefill) in the hipfire repo.
- Format details: `docs/plans/mq3-lloyd-wmma-prefill.md` and `docs/plans/mq4-lloyd-wmma-prefill.md` in the hipfire repo.
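For intuition, a Lloyd-Max fit is an alternating optimization over a 1-D codebook: assign each weight to its nearest codebook entry, then move each entry to the mean of its assigned weights. A hedged sketch for one 256-element group (the actual hipfire quantizer differs in details such as the FWHT rotation step, initialization, and convergence criteria):

```python
import numpy as np

def lloyd_max_fit(w: np.ndarray, entries: int, iters: int = 25) -> np.ndarray:
    # Initialize centroids from evenly spaced quantiles of the group.
    cb = np.quantile(w, np.linspace(0.0, 1.0, entries))
    for _ in range(iters):
        # Assignment step: nearest centroid per weight.
        idx = np.abs(w[:, None] - cb[None, :]).argmin(axis=1)
        # Update step: each centroid becomes the conditional mean.
        for k in range(entries):
            if np.any(idx == k):
                cb[k] = w[idx == k].mean()
    return cb.astype(np.float16)  # stored as the fp16 group header

# Fit an 8-entry codebook (MQ3-Lloyd-sized) to one synthetic group.
group = np.random.default_rng(0).normal(size=256)
cb = lloyd_max_fit(group, entries=8)
idx = np.abs(group[:, None] - cb[None, :].astype(np.float64)).argmin(axis=1)
```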
## Looking for the canonical (non-research) quants?
Production-grade MQ3 / MQ4 / MQ6 / HFQ4 / HFQ6 / DFlash-draft variants for Qwen3.5-4B live at `schuttdev/hipfire-qwen3.5-4b` until those repos move under this org.
## License
Inherits the upstream Qwen3.5 license terms (Apache 2.0). The quantization metadata + codebooks are derived from the upstream weights.