Dimensional Matrixing: B1p KnowledgeCorridor

Non-uniform quantization of Qwen3.5-27B using per-layer precision allocation based on attention head behavior.

                  Uniform Q4   B1p (this model)   Delta
Avg Weight Bits     4.00         3.19             -20%
Model Size         14.0 GB      15.2 GB            --
MMLU-Pro           52.2%        49.4%             -2.8
ARC-Challenge      90.4%        92.3%             +1.9
GSM8K              90.9%        90.4%             -0.5

Beats uniform Q4 on reasoning (+1.9 ARC). Matches on math (-0.5 GSM8K). Trades 2.8 points of MMLU-Pro for 20% fewer weight bits on average.

What is Dimensional Matrixing?

Qwen3.5-27B has 64 layers: 48 GatedDeltaNet (linear attention, no KV cache) and 16 standard attention layers. Uniform quantization treats them all the same. DM doesn't.

We profile every layer along three axes:

  1. Weight sensitivity -- per-layer quantization sweep at 2/3/4/6 bits
  2. KV cache sensitivity -- per-layer KV perturbation tests
  3. Attention head classification -- sink heads, retrieval heads, mixing heads

The result: a precision map that allocates 2-bit to layers that don't care and 8-bit to layers that do.
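A minimal sketch of how risk scores could drive key-cache bit allocation. The thresholds and function name here are illustrative only; the actual DM pipeline also weighs the weight- and KV-sensitivity sweeps, so a single risk cutoff does not reproduce every row of the precision map below.

```python
def bits_for_risk(risk: float) -> int:
    """Map a layer's measured risk score to a key-cache bit width.
    Thresholds are illustrative, not the published allocation rule."""
    if risk < 2.0:
        return 2   # pure mixers: safe to compress hard
    if risk < 10.0:
        return 6   # transition zone
    return 8       # sink-heavy layers keep high-precision keys

# Risk scores for a few layers taken from the published precision map.
risks = {15: 24.6, 55: 0.7, 11: 7.2, 23: 19.4}
precision_map = {layer: bits_for_risk(r) for layer, r in risks.items()}
print(precision_map)  # {15: 8, 55: 2, 11: 6, 23: 8}
```

For these four layers the simple thresholding matches the published K-bit column exactly.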

Key insight: Sink heads demand precision

Layer 15 has 15 out of 16 heads dedicated to position-0 attention (risk=24.6). It gets 8-bit keys. Layer 55 has zero sink heads (risk=0.7). It gets 2-bit keys. The 4x spread in precision across layers is driven by measurable head behavior.
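Sink heads can be detected by measuring how much attention mass each head places on key position 0. A hedged sketch: the 0.5 threshold and the toy data are assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np

def count_sink_heads(attn: np.ndarray, threshold: float = 0.5) -> int:
    """attn: [heads, queries, keys] softmaxed attention weights.
    A head counts as a sink head if, averaged over queries, more than
    `threshold` of its attention mass lands on key position 0."""
    mass_on_pos0 = attn[:, :, 0].mean(axis=1)  # [heads]
    return int((mass_on_pos0 > threshold).sum())

# Toy example: 4 heads, two of which dump nearly all mass on position 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
logits[:2, :, 0] += 8.0  # make heads 0 and 1 strongly sink-like
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print(count_sink_heads(attn))  # 2
```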

K/V Asymmetry

Keys participate in every query-key dot product, so key error perturbs the attention logits for all positions. Values are gated by attention weights, which attenuate their error. B1p therefore allocates 5.69-bit keys vs 3.63-bit values on average.
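The asymmetry has a simple mechanical basis: attention output is a convex combination of values, so per-coordinate output error can never exceed the largest per-value error, while key error shifts logits before the softmax. A numpy sketch of the value-side bound (illustrative, not from the DM codebase):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 16, 8
attn = rng.random((T, T))
attn /= attn.sum(-1, keepdims=True)  # rows are nonneg and sum to 1, like softmax
V = rng.normal(size=(T, d))
noise = 0.05 * rng.normal(size=(T, d))  # simulated value-quantization error

out_err = attn @ (V + noise) - attn @ V  # error that reaches the output
# Convex combination: per-coordinate output error is bounded by the
# largest per-value error, so V tolerates coarser quantization than K.
assert np.all(np.abs(out_err).max(0) <= np.abs(noise).max(0) + 1e-12)
print(float(np.abs(out_err).max()), "<=", float(np.abs(noise).max()))
```

No analogous bound holds for keys: a key perturbation changes the logits of every query that attends over it.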

Knowledge Corridor

GDN layers flanking high-risk attention layers need protection. Promoting 4 GDN layers around L15 from 2-bit to 4-bit boosted ARC by +1.2 points.
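The corridor can be sketched as a promotion pass over a weight-bit map. The function name, the radius of 2, and the toy layer indices are assumptions for illustration; they happen to promote exactly 4 neighbours around L15, matching the description above.

```python
def protect_corridor(weight_bits: dict, high_risk: list, radius: int = 2,
                     floor: int = 4) -> dict:
    """Promote layers within `radius` of a high-risk attention layer
    to at least `floor` weight bits."""
    out = dict(weight_bits)
    for center in high_risk:
        for layer in range(center - radius, center + radius + 1):
            if layer in out and layer != center:
                out[layer] = max(out[layer], floor)
    return out

# Toy map around L15: flanking GDN layers start at 2-bit weights.
bits = {13: 2, 14: 2, 15: 4, 16: 2, 17: 2, 18: 2}
promoted = protect_corridor(bits, high_risk=[15], radius=2)
print(promoted)  # layers 13, 14, 16, 17 promoted to 4-bit; 18 untouched
```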

Precision Map

Layer  Wt  K   V   Risk  Sinks  Role
  3     3   2   2   0.8    0    Pure mixer (safe to compress)
  7     3   2   2   1.1    0    Pure mixer
 11     3   6   4   7.2    4    Transition zone
 15     4   8   4  24.6   15    CRITICAL: model's primary sink nexus
 19     3   8   4  13.6    8    High sink concentration
 23     3   8   4  19.4   12    Secondary sink nexus
 27     3   8   4  17.1   10    High sink concentration
 31     3   8   4  13.8    8    High sink concentration
 35     4   8   6  11.9    7    High risk (V needs extra precision)
 39     3   8   6  10.0    6    High risk (V needs extra precision)
 43     4   4   3   7.0    4    Transition zone
 47     4   8   4  23.6   15    CRITICAL: twin sink nexus with L15
 51     4   6   4   7.5    4    Transition zone
 55     4   2   2   0.7    0    Pure mixer (most compressible)
 59     4   2   2   1.2    0    Pure mixer
 63     4   3   3   2.9    0    Final layer, specialized behavior

Weight bits across all 64 layers: 7 at 2-bit, 38 at 3-bit, 19 at 4-bit (the 3.19-bit average). The 48 GDN layers carry no KV cache.
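Taking the bit counts above as a histogram over all 64 layers (which is what makes the arithmetic close), the headline 3.19-bit average checks out:

```python
histogram = {2: 7, 3: 38, 4: 19}  # weight bits -> layer count
layers = sum(histogram.values())
avg_bits = sum(b * n for b, n in histogram.items()) / layers
print(layers, round(avg_bits, 2))  # 64 3.19
```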

Usage

MLX (Apple Silicon)

import mlx_lm

model, tokenizer = mlx_lm.load("Funkylazer/dm-qwen3.5-27b-b1p-knowledgecorridor")
response = mlx_lm.generate(model, tokenizer, prompt="Explain quantum entanglement.", max_tokens=512)
print(response)

Hardware Requirements

  • Minimum: 16 GB unified memory (Apple Silicon) or 16 GB VRAM
  • Comfortable: 24 GB+
  • Tested on: Mac mini M4 Pro, 64 GB unified memory

Files

  • model-0000{1-4}-of-00004.safetensors -- Quantized model weights (15.2 GB total)
  • dm_precision_map.json -- Full per-layer precision map with risk scores, sink counts, entropy
  • config.json -- MLX-compatible model configuration
  • tokenizer.json + tokenizer_config.json -- Tokenizer files
  • chat_template.jinja -- Chat template

Branch History

This model is the result of 16 iterative branches (B1a-B1p), each testing a specific quantization hypothesis:

Branch  Hypothesis                           MMLU-Pro   ARC     GSM8K
B1e     Fix knowledge-critical GDN layers     51.4%     85.5%   91.9%
B1k     Ultra-aggressive compression          47.0%     50.2%   81.9%
B1o     Composite merge                       49.6%     91.1%   91.3%
B1p     Knowledge corridor protection         49.4%     92.3%   90.4%

Community Testing Needed

We validated on 3 benchmarks on Apple Silicon. This model needs evaluation on:

  • Code generation (HumanEval, MBPP)
  • Long-context retrieval (NIAH at 32K-128K)
  • Instruction following (MT-Bench, AlpacaEval)
  • CUDA GPU performance
  • Inference speed vs uniform Q4

Please open an issue on GitHub with results.

Paper & Code

Citation

@misc{zhivelev2026dm,
  title={Dimensional Matrixing: Non-Uniform Quantization for Hybrid Attention-SSM Architectures},
  author={Zhivelev, Leon},
  year={2026},
  url={https://github.com/funkylazer/dimensional-matrixing}
}

License

Apache 2.0. Base model (Qwen3.5-27B) is Apache 2.0 by Alibaba/Qwen.
