Dimensional Matrixing: B1p KnowledgeCorridor

Non-uniform quantization of Qwen3.5-27B using per-layer precision allocation based on attention head behavior.

                  Uniform Q4   B1p (this model)   Delta
Avg Weight Bits     4.00         3.19             -20%
Model Size         14.0 GB      15.2 GB            --
MMLU-Pro           52.2%        49.4%             -2.8
ARC-Challenge      90.4%        92.3%             +1.9
GSM8K              90.9%        90.4%             -0.5

Beats uniform Q4 on reasoning (+1.9 ARC). Matches on math (-0.5 GSM8K). Trades 2.8 points of MMLU-Pro for 20% fewer weight bits on average.

What is Dimensional Matrixing?

Qwen3.5-27B has 64 layers: 48 GatedDeltaNet (linear attention, no KV cache) and 16 standard attention layers. Uniform quantization treats them all the same. DM doesn't.

We profile every layer along three axes:

  1. Weight sensitivity -- per-layer quantization sweep at 2/3/4/6 bits
  2. KV cache sensitivity -- per-layer KV perturbation tests
  3. Attention head classification -- sink heads, retrieval heads, mixing heads

The result: a precision map that allocates 2-bit to layers that don't care and 8-bit to layers that do.
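A minimal sketch of how risk scores could drive key-cache bit allocation. The thresholds and function name here are illustrative only; the actual DM pipeline also weighs the weight- and KV-sensitivity sweeps, so a single risk cutoff does not reproduce every row of the precision map below.

```python
def bits_for_risk(risk: float) -> int:
    """Map a layer's measured risk score to a key-cache bit width.
    Thresholds are illustrative, not the published allocation rule."""
    if risk < 2.0:
        return 2   # pure mixers: safe to compress hard
    if risk < 10.0:
        return 6   # transition zone
    return 8       # sink-heavy layers keep high-precision keys

# Risk scores for a few layers taken from the published precision map.
risks = {15: 24.6, 55: 0.7, 11: 7.2, 23: 19.4}
precision_map = {layer: bits_for_risk(r) for layer, r in risks.items()}
print(precision_map)  # {15: 8, 55: 2, 11: 6, 23: 8}
```

For these four layers the simple thresholding matches the published K-bit column exactly.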

Key insight: Sink heads demand precision

Layer 15 has 15 out of 16 heads dedicated to position-0 attention (risk=24.6). It gets 8-bit keys. Layer 55 has zero sink heads (risk=0.7). It gets 2-bit keys. The 4x spread in precision across layers is driven by measurable head behavior.
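Sink heads can be detected by measuring how much attention mass each head places on key position 0. A hedged sketch: the 0.5 threshold and the toy data are assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np

def count_sink_heads(attn: np.ndarray, threshold: float = 0.5) -> int:
    """attn: [heads, queries, keys] softmaxed attention weights.
    A head counts as a sink head if, averaged over queries, more than
    `threshold` of its attention mass lands on key position 0."""
    mass_on_pos0 = attn[:, :, 0].mean(axis=1)  # [heads]
    return int((mass_on_pos0 > threshold).sum())

# Toy example: 4 heads, two of which dump nearly all mass on position 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
logits[:2, :, 0] += 8.0  # make heads 0 and 1 strongly sink-like
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(count_sink_heads(attn))  # 2
```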

K/V Asymmetry

Keys participate in every query-key dot product, so key error perturbs the attention logits for all positions. Values are gated by attention weights, which attenuate their error. B1p therefore allocates 5.69-bit keys vs 3.63-bit values on average.
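The asymmetry has a simple mechanical basis: attention output is a convex combination of values, so per-coordinate output error can never exceed the largest per-value error, while key error shifts logits before the softmax. A numpy sketch of the value-side bound (illustrative, not from the DM codebase):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 16, 8
attn = rng.random((T, T))
attn /= attn.sum(-1, keepdims=True)  # rows are nonneg and sum to 1, like softmax
V = rng.normal(size=(T, d))
noise = 0.05 * rng.normal(size=(T, d))  # simulated value-quantization error

out_err = attn @ (V + noise) - attn @ V  # error that reaches the output
# Convex combination: per-coordinate output error is bounded by the
# largest per-value error, so V tolerates coarser quantization than K.
assert np.all(np.abs(out_err).max(0) <= np.abs(noise).max(0) + 1e-12)
print(float(np.abs(out_err).max()), "<=", float(np.abs(noise).max()))
```

No analogous bound holds for keys: a key perturbation changes the logits of every query that attends over it.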

Knowledge Corridor

GDN layers flanking high-risk attention layers need protection. Promoting 4 GDN layers around L15 from 2-bit to 4-bit boosted ARC by +1.2 points.
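The corridor can be sketched as a promotion pass over a weight-bit map. The function name, the radius of 2, and the toy layer indices are assumptions for illustration; they happen to promote exactly 4 neighbours around L15, matching the description above.

```python
def protect_corridor(weight_bits: dict, high_risk: list, radius: int = 2,
                     floor: int = 4) -> dict:
    """Promote layers within `radius` of a high-risk attention layer
    to at least `floor` weight bits."""
    out = dict(weight_bits)
    for center in high_risk:
        for layer in range(center - radius, center + radius + 1):
            if layer in out and layer != center:
                out[layer] = max(out[layer], floor)
    return out

# Toy map around L15: flanking GDN layers start at 2-bit weights.
bits = {13: 2, 14: 2, 15: 4, 16: 2, 17: 2, 18: 2}
promoted = protect_corridor(bits, high_risk=[15], radius=2)
print(promoted)  # layers 13, 14, 16, 17 promoted to 4-bit; 18 untouched
```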

Precision Map

Layer  Wt  K   V   Risk  Sinks  Role
  3     3   2   2   0.8    0    Pure mixer (safe to compress)
  7     3   2   2   1.1    0    Pure mixer
 11     3   6   4   7.2    4    Transition zone
 15     4   8   4  24.6   15    CRITICAL: model's primary sink nexus
 19     3   8   4  13.6    8    High sink concentration
 23     3   8   4  19.4   12    Secondary sink nexus
 27     3   8   4  17.1   10    High sink concentration
 31     3   8   4  13.8    8    High sink concentration
 35     4   8   6  11.9    7    High risk (V needs extra precision)
 39     3   8   6  10.0    6    High risk (V needs extra precision)
 43     4   4   3   7.0    4    Transition zone
 47     4   8   4  23.6   15    CRITICAL: twin sink nexus with L15
 51     4   6   4   7.5    4    Transition zone
 55     4   2   2   0.7    0    Pure mixer (most compressible)
 59     4   2   2   1.2    0    Pure mixer
 63     4   3   3   2.9    0    Final layer, specialized behavior

Weight bits across all 64 layers: 7 at 2-bit, 38 at 3-bit, 19 at 4-bit (the 3.19-bit average). The 48 GDN layers carry no KV cache.
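Taking the bit counts above as a histogram over all 64 layers (which is what makes the arithmetic close), the headline 3.19-bit average checks out:

```python
histogram = {2: 7, 3: 38, 4: 19}  # weight bits -> layer count
layers = sum(histogram.values())
avg_bits = sum(b * n for b, n in histogram.items()) / layers
print(layers, round(avg_bits, 2))  # 64 3.19
```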

Usage

MLX (Apple Silicon)

import mlx_lm

model, tokenizer = mlx_lm.load("Funkylazer/dm-qwen3.5-27b-b1p-knowledgecorridor")
response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement.", max_tokens=512)
print(response)

Hardware Requirements

  • Minimum: 16 GB unified memory (Apple Silicon) or 16 GB VRAM
  • Comfortable: 24 GB+
  • Tested on: Mac mini M4 Pro, 64 GB unified memory

Files

  • model-0000{1-4}-of-00004.safetensors -- Quantized model weights (15.2 GB total)
  • dm_precision_map.json -- Full per-layer precision map with risk scores, sink counts, entropy
  • config.json -- MLX-compatible model configuration
  • tokenizer.json + tokenizer_config.json -- Tokenizer files
  • chat_template.jinja -- Chat template

Branch History

This model is the result of 16 iterative branches (B1a-B1p), each testing a specific quantization hypothesis:

Branch  Hypothesis                           MMLU-Pro   ARC     GSM8K
B1e     Fix knowledge-critical GDN layers     51.4%     85.5%   91.9%
B1k     Ultra-aggressive compression          47.0%     50.2%   81.9%
B1o     Composite merge                       49.6%     91.1%   91.3%
B1p     Knowledge corridor protection         49.4%     92.3%   90.4%

Community Testing Needed

We validated on 3 benchmarks on Apple Silicon. This model needs evaluation on:

  • Code generation (HumanEval, MBPP)
  • Long-context retrieval (NIAH at 32K-128K)
  • Instruction following (MT-Bench, AlpacaEval)
  • CUDA GPU performance
  • Inference speed vs uniform Q4

Please open an issue on GitHub with results.

Paper & Code

Citation

@misc{zhivelev2026dm,
  title={Dimensional Matrixing: Non-Uniform Quantization for Hybrid Attention-SSM Architectures},
  author={Zhivelev, Leon},
  year={2026},
  url={https://github.com/funkylazer/dimensional-matrixing}
}

License

Apache 2.0. Base model (Qwen3.5-27B) is Apache 2.0 by Alibaba/Qwen.
