<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models</title>
<script>window.MathJax = {tex: {inlineMath: [['$','$'],['\\(','\\)']]}};</script>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
:root {
--bg: #fafafa;
--text: #1a1a2e;
--accent: #2d5aa0;
--muted: #555;
--border: #ddd;
--code-bg: #f0f0f0;
--table-header: #e8eef6;
--highlight: #fff3cd;
}
* { box-sizing: border-box; margin: 0; padding: 0; }
body {
font-family: 'Georgia', 'Times New Roman', serif;
line-height: 1.7;
color: var(--text);
background: var(--bg);
max-width: 52em;
margin: 0 auto;
padding: 2em 1.5em 4em;
}
h1 { font-size: 1.7em; line-height: 1.3; text-align: center; margin: 1.5em 0 0.3em; color: var(--text); }
.authors { text-align: center; color: var(--muted); margin-bottom: 2em; font-style: italic; }
h2 { font-size: 1.3em; margin: 2em 0 0.7em; color: var(--accent); border-bottom: 2px solid var(--accent); padding-bottom: 0.2em; }
h3 { font-size: 1.1em; margin: 1.5em 0 0.5em; color: #333; }
h4 { font-size: 1em; font-weight: bold; margin: 1.2em 0 0.3em; color: var(--text); }
p { margin: 0.7em 0; text-align: justify; }
.abstract { background: #f5f5f5; border-left: 4px solid var(--accent); padding: 1.2em 1.5em; margin: 1.5em 0 2em; font-size: 0.95em; }
.abstract strong { color: var(--accent); }
ul, ol { margin: 0.5em 0 0.5em 2em; }
li { margin: 0.3em 0; }
table { border-collapse: collapse; margin: 1em auto; font-size: 0.9em; width: auto; min-width: 50%; }
th, td { border: 1px solid var(--border); padding: 0.45em 0.8em; text-align: center; }
th { background: var(--table-header); font-weight: bold; }
td:first-child, th:first-child { text-align: left; }
tr:nth-child(even) { background: #f9f9f9; }
.best { font-weight: bold; color: #1a7a1a; }
.caption { text-align: center; font-size: 0.88em; color: var(--muted); margin-top: 0.5em; margin-bottom: 1.5em; font-style: italic; }
.equation { display: block; text-align: center; margin: 1em 0; padding: 0.8em; background: var(--code-bg); border-radius: 4px; font-family: 'Courier New', monospace; font-size: 0.92em; overflow-x: auto; }
.algorithm { background: #fafafa; border: 1px solid var(--border); padding: 1em 1.5em; margin: 1em 0; font-family: 'Courier New', monospace; font-size: 0.88em; line-height: 1.5; border-radius: 4px; }
.algorithm .keyword { color: var(--accent); font-weight: bold; }
.algorithm .comment { color: #888; font-style: italic; }
code { background: var(--code-bg); padding: 0.15em 0.4em; border-radius: 3px; font-size: 0.9em; }
.ref { color: var(--accent); cursor: default; }
.footnote { font-size: 0.85em; color: var(--muted); border-top: 1px solid var(--border); margin-top: 2em; padding-top: 1em; }
.bib { font-size: 0.85em; margin: 0.3em 0; padding-left: 2.5em; text-indent: -2.5em; }
.highlight-row { background: var(--highlight) !important; }
@media (max-width: 600px) { body { padding: 1em 0.8em; font-size: 0.95em; } table { font-size: 0.8em; } th, td { padding: 0.3em 0.5em; } }
</style>
</head>
<body>
<!-- ============================================================ -->
<!-- TITLE & AUTHORS -->
<!-- ============================================================ -->
<h1>Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models</h1>
<div class="authors">Black Sheep AI Research &mdash; baa.ai</div>
<!-- ============================================================ -->
<!-- ABSTRACT -->
<!-- ============================================================ -->
<div class="abstract">
<strong>Abstract.</strong>
Mixture-of-Experts architectures with 128&ndash;512 experts per layer create unique challenges
for post-training quantization. Existing methods apply uniform bit-widths across all experts
or rely on coarse per-layer decisions. We present the first comprehensive data-free sensitivity
analysis and mixed-precision quantization study on real 512-expert MoE models. Using
weight-based sensitivity metrics&mdash;spectral analysis, per-group kurtosis, output noise
amplification, and reconstruction error&mdash;we profile 2,347 tensors across Qwen3.5-397B-A17B
(512 experts per layer) and validate across three architecture scales. We discover that kurtosis
is the dominant sensitivity predictor (Spearman $\rho = 0.795$), that 89.4% of expert
parameters tolerate 4-bit quantization under SQNR safety constraints, and that group size has a
larger impact on perplexity than bit-width allocation. Our pipeline formulates allocation as a
Multiple-Choice Knapsack Problem (MCKP), solving for the provably near-optimal
(bits, group_size) assignment per expert group in under 100&thinsp;ms. On commodity Apple Silicon,
we match the perplexity of a threshold-based mixed-precision baseline at roughly 15% fewer average bits
(4.31 vs. 5.06), and improve over uniform 4-bit quantization at matched group size. We provide ablations on codebook quantization (41% MSE reduction) and Hadamard
rotations (8.2%), establishing practical boundaries for future MoE compression.
We further introduce DynaMINT, a tiered expert quantization scheme informed by activation
profiling, which maintains quality at only +0.5% perplexity degradation with 11.6% of experts
at 2-bit and 3.6% pruned. An expert pruning study reveals that even 5% expert removal causes
13x perplexity degradation, establishing that activation frequency alone is not a safe pruning
criterion. All code, manifests, and models are released.
</div>
<!-- ============================================================ -->
<!-- 1. INTRODUCTION -->
<!-- ============================================================ -->
<h2>1. Introduction</h2>
<p>
Mixture-of-Experts (MoE) has become the dominant architecture for frontier large language
models. Qwen3.5-397B-A17B <span class="ref">[7]</span> deploys 512 experts per MoE layer with
only 17 billion parameters active per token, achieving state-of-the-art quality at a fraction
of the compute cost implied by total parameter count. Llama&nbsp;4 Maverick
<span class="ref">[8]</span> uses 128 experts across 24 MoE layers, while earlier models such
as Mixtral-8x7B <span class="ref">[3]</span> and DeepSeek-V3 <span class="ref">[5]</span>
established the pattern at smaller expert counts. Despite the compute efficiency of sparse
activation, deployment remains constrained by total model size: Qwen3.5-397B requires over
740&thinsp;GB in BF16, far exceeding the memory of any single accelerator.
</p>
<p>
Post-training quantization is the standard solution, but existing methods apply uniform
bit-widths across all experts. This ignores a fundamental property of MoE architectures:
experts are trained semi-independently through routing and develop heterogeneous weight
distributions. Some experts have near-Gaussian weight distributions that compress gracefully
to 4-bit; others exhibit heavy-tailed distributions with high kurtosis that suffer catastrophic
accuracy loss at the same precision. Uniform quantization wastes bits on robust experts while
under-protecting sensitive ones.
</p>
<p>
Prior work on MoE quantization&mdash;MC-MoE <span class="ref">[15]</span>, QMoE
<span class="ref">[16]</span>, MoEQuant <span class="ref">[17]</span>, and DynaExq
<span class="ref">[18]</span>&mdash;all require calibration data to estimate expert
sensitivity via activation traces or routing statistics. This creates practical barriers:
calibration sets must be representative of deployment distribution, the profiling pass
requires loading the full model in high precision, and results may not transfer across
domains. More critically, none of these methods has been validated at 512-expert scale, where
the combinatorial space of per-expert configurations explodes and calibration cost becomes
prohibitive.
</p>
<p>
We make four contributions: <strong>(1)</strong> the first data-free sensitivity study at
512-expert scale, profiling 2,347 tensors across Qwen3.5-397B-A17B using only weight
statistics; <strong>(2)</strong> an MCKP-based allocation pipeline with expert grouping
constraints that solves in under 100&thinsp;ms for any model size; <strong>(3)</strong>
comprehensive ablations on codebook quantization and Hadamard rotation techniques,
establishing practical boundaries for future MoE compression; and <strong>(4)</strong> open
release of all code, sensitivity manifests, and quantized models. The entire pipeline runs on
a single Apple M2 Ultra with 192&thinsp;GB unified memory <span class="ref">[21]</span>,
requiring no GPU cluster and no calibration data.
</p>
<!-- ============================================================ -->
<!-- 2. RELATED WORK -->
<!-- ============================================================ -->
<h2>2. Related Work</h2>
<h3>2.1 Mixture-of-Experts Architectures</h3>
<p>
The modern MoE paradigm traces from GShard <span class="ref">[1]</span> and the Switch
Transformer <span class="ref">[2]</span>, which demonstrated that sparsely-activated expert
layers could scale model capacity without proportional compute cost. Mixtral-8x7B
<span class="ref">[3]</span> brought MoE to open-weight models with 8 experts per layer,
selecting 2 per token. DeepSeek-V2 <span class="ref">[4]</span> introduced fine-grained
experts (up to 160 per layer), and DeepSeek-V3 <span class="ref">[5]</span> scaled to 256
experts with auxiliary-loss-free load balancing. Qwen3 <span class="ref">[6]</span> and
Qwen3.5 <span class="ref">[7]</span> pushed further to 512 experts per layer while activating
only 17B of 397B total parameters. Llama&nbsp;4 <span class="ref">[8]</span> adopted a hybrid
design with both dense and MoE layers. The trend is clear: expert counts are growing
rapidly, and quantization methods must keep pace.
</p>
<h3>2.2 MoE-Specific Quantization</h3>
<p>
MC-MoE <span class="ref">[15]</span> uses calibration data to identify and protect
frequently-activated experts, applying lower precision to rarely-used ones. QMoE
<span class="ref">[16]</span> compresses all experts to under 1 bit per parameter using
learned codebooks with calibration-based distillation. MoEQuant <span class="ref">[17]</span>
proposes expert-wise calibration to handle activation outliers specific to each expert. DynaExq
<span class="ref">[18]</span> dynamically adjusts expert quantization based on runtime routing
patterns. All of these methods require calibration data and activation traces, creating
practical deployment barriers. None has been demonstrated at 512-expert scale.
</p>
<table>
<tr>
<th>Method</th>
<th>Expert Count Tested</th>
<th>Granularity</th>
<th>Calibration Data</th>
<th>Data-Free</th>
<th>Hardware</th>
</tr>
<tr>
<td>MC-MoE <span class="ref">[15]</span></td>
<td>&le;16</td>
<td>Per-expert</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr>
<td>QMoE <span class="ref">[16]</span></td>
<td>128</td>
<td>Per-expert</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr>
<td>MoEQuant <span class="ref">[17]</span></td>
<td>&le;64</td>
<td>Per-expert</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr>
<td>DynaExq <span class="ref">[18]</span></td>
<td>&le;128</td>
<td>Dynamic</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr class="highlight-row">
<td><strong>Ours</strong></td>
<td><strong>256</strong></td>
<td><strong>Per-expert tiered</strong></td>
<td><strong>None</strong></td>
<td class="best"><strong>Yes</strong></td>
<td><strong>Apple Silicon</strong></td>
</tr>
</table>
<div class="caption">Table A: Comparison of MoE quantization methods. Our approach is the only data-free method and scales to the highest expert count.</div>
<h3>2.3 Mixed-Precision Quantization for Dense Models</h3>
<p>
For dense models, GPTQ <span class="ref">[9]</span> uses second-order information for
layer-wise quantization; AWQ <span class="ref">[10]</span> identifies salient weight channels
via activation magnitudes; SqueezeLLM <span class="ref">[11]</span> separates outliers into a
sparse format; HQQ <span class="ref">[12]</span> provides fast half-quadratic quantization
without calibration data; and MXQ <span class="ref">[13]</span> assigns mixed precision at
sub-layer granularity. QuIP# <span class="ref">[14]</span> applies random orthogonal
transformations to incoherify weight matrices before quantization. While these methods have
advanced the state of the art for dense models, they do not address the unique challenges of
MoE: heterogeneous expert sensitivity, the combinatorial explosion of per-expert configuration
space, and framework constraints that tie all experts within a layer to a shared quantization
config. Our work fills this gap with a data-free method validated at 512-expert scale under a
budget-constrained optimization framework.
</p>
<!-- ============================================================ -->
<!-- 3. EXPERT SENSITIVITY PROFILING -->
<!-- ============================================================ -->
<h2>3. Expert Sensitivity Profiling</h2>
<h3>3.1 Weight-Based Sensitivity Metrics</h3>
<p>
Unlike activation-based methods that require calibration data and forward passes, we analyze
sensitivity entirely from weight tensor properties. This makes profiling data-free and
embarrassingly parallel across shards. We compute four complementary metrics for each tensor:
</p>
<ul>
<li><strong>SVD spectral features.</strong> We compute the singular value decomposition of
each weight matrix and extract three scale-invariant features: the stable rank
($\|W\|_F^2 / \|W\|_2^2$, measuring effective dimensionality), spectral tail mass
(fraction of Frobenius energy outside the top-$k$ singular values, $k = \text{rank}/10$),
and log condition number ($\log_{10}(\sigma_1 / \sigma_{\min})$, capped at 10.0).
Tensors with low stable rank and high tail mass have information concentrated in a few
directions and are more sensitive to quantization noise.</li>
<li><strong>Per-group kurtosis.</strong> We partition weight tensors into groups (matching
the quantization group size) and compute excess kurtosis per group, then aggregate via the
95th percentile. High kurtosis indicates heavy-tailed distributions with outlier weights
that are poorly represented by uniform quantization grids. We extract four features:
mean, median, 95th percentile, and max kurtosis across groups.</li>
<li><strong>Output noise amplification.</strong> We estimate how quantization noise in
weights amplifies through the linear transformation by computing
$\sigma_{\text{out}} / \sigma_{\text{noise}}$ where
$\sigma_{\text{noise}}$ is the expected quantization step size. This captures the
condition-number-like sensitivity of the weight matrix.</li>
<li><strong>Reconstruction error (NRMSE).</strong> We simulate quantization at each
candidate (bits, group_size) configuration and measure
$\text{NRMSE} = \|W - Q(W)\|_F / \|W\|_F$. This serves as both a metric and the
optimization objective for bit allocation.</li>
</ul>
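<p>As a concrete sketch (NumPy, not the released profiler), the spectral and kurtosis metrics above can be computed directly from a weight matrix. Function names and the small numerical guards are our own; the feature definitions follow the text:</p>

```python
import numpy as np

def spectral_features(W, top_frac=0.1, cond_cap=10.0):
    """Scale-invariant SVD features: stable rank, spectral tail mass,
    and log condition number (capped), as defined in Section 3.1."""
    s = np.linalg.svd(W, compute_uv=False)            # singular values, descending
    stable_rank = (s ** 2).sum() / s[0] ** 2          # ||W||_F^2 / ||W||_2^2
    k = max(1, int(len(s) * top_frac))                # top-k = rank/10
    tail_mass = (s[k:] ** 2).sum() / (s ** 2).sum()   # energy outside top-k directions
    log_cond = min(np.log10(s[0] / max(s[-1], 1e-12)), cond_cap)
    return stable_rank, tail_mass, log_cond

def group_kurtosis_p95(W, group_size=64):
    """Excess kurtosis per quantization group, aggregated at the 95th percentile.
    High values flag heavy-tailed groups that uniform grids represent poorly."""
    w = W.reshape(-1, group_size)
    mu = w.mean(axis=1, keepdims=True)
    sigma = w.std(axis=1, keepdims=True) + 1e-12
    excess = (((w - mu) / sigma) ** 4).mean(axis=1) - 3.0  # Gaussian -> 0
    return np.percentile(excess, 95)
```

A Laplace-distributed tensor (excess kurtosis 3 per group) scores visibly higher than a Gaussian one under <code>group_kurtosis_p95</code>, matching the metric's intent.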
<h3>3.2 Expert Analysis Modes</h3>
<p>
MoE models present a scaling challenge: Qwen3.5-397B has 512 experts per layer across 60
layers, yielding thousands of expert weight tensors. Profiling every expert independently is
feasible but expensive. We implement two analysis modes in the expert handler:
</p>
<ul>
<li><strong>Mode A (individual):</strong> For models with $\leq 32$ experts per layer,
compute full sensitivity metrics for every expert tensor and use worst-case (maximum
sensitivity) across experts within each group for conservative allocation.</li>
<li><strong>Mode B (clustered):</strong> For models with $> 32$ experts per layer, cluster
experts using $k$-means on (Frobenius norm, kurtosis) features into $\sqrt{n_{\text{experts}}}$
clusters, then sample one representative expert per cluster. This reduces profiling cost
by $10\text{--}20\times$ while preserving the distribution of sensitivity scores.</li>
</ul>
<p>
For Qwen3.5-397B (512 experts), Mode B reduces the number of fully-profiled expert tensors
from 30,720 to approximately 2,347 while maintaining coverage of the sensitivity distribution.
</p>
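<p>Mode B can be sketched as plain Lloyd $k$-means over the two per-expert features named above. The feature choice (Frobenius norm, kurtosis) and the $\sqrt{n_{\text{experts}}}$ cluster count follow the text; the feature whitening and fixed iteration count are illustrative assumptions:</p>

```python
import numpy as np

def cluster_experts(expert_weights, seed=0):
    """Mode B sketch: k-means on (log Frobenius norm, excess kurtosis),
    sqrt(n_experts) clusters, one representative expert per cluster."""
    feats = []
    for W in expert_weights:
        w = W.ravel()
        z = (w - w.mean()) / (w.std() + 1e-12)
        feats.append([np.log10(np.linalg.norm(w)), (z ** 4).mean() - 3.0])
    X = np.asarray(feats)
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)          # whiten both features

    k = max(1, int(round(np.sqrt(len(expert_weights)))))
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(25):                               # plain Lloyd iterations
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    # representative = member closest to its cluster center
    reps = [int(np.where(labels == j)[0][d[labels == j, j].argmin()])
            for j in range(k) if (labels == j).any()]
    return labels, reps
```

Only the representative experts are then run through the full metric suite, which is where the $10\text{--}20\times$ profiling saving comes from.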
<h3>3.3 Cross-Architecture Scaling Study</h3>
<p>
We profile three architectures spanning dense to large-scale MoE to understand how sensitivity
distributions change with expert count and model scale:
</p>
<table>
<tr>
<th>Model</th>
<th>Architecture</th>
<th>Parameters</th>
<th>Tensors</th>
<th>Layers</th>
<th>Avg Bits</th>
<th>Est. Size</th>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Dense</td>
<td>8.2B</td>
<td>399</td>
<td>36</td>
<td>6.84</td>
<td>6.7 GB</td>
</tr>
<tr>
<td>Llama4-Maverick</td>
<td>MoE 128E</td>
<td>401.6B</td>
<td>1,061</td>
<td>48</td>
<td>4.78</td>
<td>230.4 GB</td>
</tr>
<tr>
<td>Qwen3.5-397B</td>
<td>MoE 512E</td>
<td>403.4B</td>
<td>2,924</td>
<td>60</td>
<td>5.06</td>
<td>245.1 GB</td>
</tr>
</table>
<div class="caption">Table 1: Cross-architecture scaling study. Tensor counts reflect the profiled set (Mode B clustering for MoE models).</div>
<h3>3.4 Metric Correlation Analysis</h3>
<p>
To identify which metrics best predict quantization sensitivity, we compute rank correlations
between each metric and reconstruction error (NRMSE at 4-bit) across all 2,347 profiled
tensors in Qwen3.5-397B:
</p>
<table>
<tr>
<th>Metric</th>
<th>Spearman $\rho$</th>
<th>Pearson $r$</th>
<th>$p$-value</th>
</tr>
<tr class="highlight-row">
<td>Per-group kurtosis</td>
<td class="best">0.795</td>
<td>0.480</td>
<td>&lt;1e-135</td>
</tr>
<tr>
<td>Cross-layer position</td>
<td>&minus;0.468</td>
<td>&minus;0.224</td>
<td>&lt;1e-128</td>
</tr>
<tr>
<td>SVD spectral features</td>
<td>0.391</td>
<td>0.303</td>
<td>&lt;1e-86</td>
</tr>
<tr>
<td>Composite (weighted)</td>
<td>0.374</td>
<td>0.400</td>
<td>&lt;1e-79</td>
</tr>
<tr>
<td>Output sensitivity</td>
<td>0.212</td>
<td>0.455</td>
<td>&lt;1e-25</td>
</tr>
</table>
<div class="caption">Table 2: Metric correlation with reconstruction error (NRMSE at 4-bit) across 2,347 tensors in Qwen3.5-397B-A17B.</div>
<p>
Kurtosis dominates as the sensitivity predictor with Spearman $\rho = 0.795$, substantially
ahead of the next best metric. Output sensitivity, despite its intuitive appeal, achieves
only $\rho = 0.212$. Investigation reveals that output sensitivity saturates at 1.0 for 99.5%
of MoE expert tensors (median = 1.0), losing all discriminatory power at 512-expert scale.
This saturation occurs because expert weight matrices in large MoE models tend to have similar
spectral norms, making the noise amplification ratio nearly identical across experts.
</p>
<p>
The negative cross-layer position correlation ($\rho = -0.468$) confirms the well-known
U-shaped sensitivity pattern: early and late layers are more sensitive to quantization than
middle layers. This pattern holds across all three architectures and is captured by the soft
protection priors in our allocation pipeline.
</p>
<!-- ============================================================ -->
<!-- 4. PER-EXPERT MIXED-PRECISION PIPELINE -->
<!-- ============================================================ -->
<h2>4. Per-Expert Mixed-Precision Pipeline</h2>
<h3>4.1 Rate-Distortion Profiling</h3>
<p>
For each tensor, we compute the reconstruction error (NRMSE) at eight candidate
(bits, group_size) configurations, forming a rate-distortion curve:
</p>
<div class="equation">
$\mathcal{C} = \{(2, 32),\; (3, 64),\; (4, 32),\; (4, 64),\; (4, 128),\; (8, 64),\; (8, 128),\; (16, {-})\}$
</div>
<p>
Each configuration implies a specific size cost (bits per parameter plus scale/zero-point
overhead from the group size) and a distortion level. The rate-distortion curve captures the
tensor-specific tradeoff: some tensors see a large NRMSE jump between 4-bit and 8-bit while
others degrade gracefully, making them good candidates for aggressive quantization.
</p>
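<p>The rate-distortion pass can be illustrated with a simplified uniform affine fake-quantizer standing in for the framework's kernel; the 16-bit-scale-plus-16-bit-zero-point overhead model is an assumption for the sketch, and 16-bit entries are treated as lossless:</p>

```python
import numpy as np

CANDIDATES = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128), (8, 64), (8, 128), (16, None)]

def fake_quantize(W, bits, group_size):
    """Uniform affine round-trip per group (simplified stand-in for the
    framework's quantization kernel)."""
    if group_size is None:
        return W.copy()                               # 16-bit treated as lossless here
    w = W.reshape(-1, group_size)
    lo, hi = w.min(1, keepdims=True), w.max(1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale[scale == 0] = 1.0                           # constant groups quantize exactly
    q = np.round((w - lo) / scale)
    return (q * scale + lo).reshape(W.shape)

def rd_curve(W):
    """NRMSE and effective bits/param (incl. assumed 16+16-bit scale/zero per group)."""
    curve = []
    for bits, g in CANDIDATES:
        nrmse = np.linalg.norm(W - fake_quantize(W, bits, g)) / np.linalg.norm(W)
        eff_bits = 16.0 if g is None else bits + 32.0 / g
        curve.append(((bits, g), eff_bits, float(nrmse)))
    return curve
```

Each tensor's curve feeds the MCKP solver as its list of (size, distortion) candidates.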
<h3>4.2 Expert Grouping Constraint</h3>
<p>
MLX's <code>SwitchLinear</code> module <span class="ref">[21]</span> requires all experts
within a layer to share a single quantization configuration (bits and group_size). This is a
hard framework constraint: the quantized weight tensor for all experts in a layer is stored as
a single contiguous array with uniform element width. Consequently, our analysis is per-expert
but the allocation must be per-expert-group, where each group corresponds to all experts
sharing a (layer, projection_type) pair.
</p>
<p>
We aggregate per-expert NRMSE values into a group-level distortion estimate using the
parameter-weighted mean across all experts in the group:
</p>
<div class="equation">
$\text{NRMSE}_{\text{group}}(b, g) = \frac{\sum_{e \in \text{group}} n_e \cdot \text{NRMSE}_e(b, g)}{\sum_{e \in \text{group}} n_e}$
</div>
<p>
where $n_e$ is the number of parameters in expert $e$. This weighting ensures that larger
experts (which contribute more to total model size) have proportionally more influence on the
group allocation decision.
</p>
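<p>The aggregation is a direct transcription of the formula above:</p>

```python
def group_nrmse(expert_nrmse, expert_params):
    """Parameter-weighted mean NRMSE across experts in one (layer, projection) group."""
    total = sum(expert_params)
    return sum(n * e for n, e in zip(expert_params, expert_nrmse)) / total
```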
<h3>4.3 MCKP Formulation</h3>
<p>
We formulate the bit-width and group-size allocation as a Multiple-Choice Knapsack Problem
(MCKP) <span class="ref">[23]</span>. Let $i$ index the tensor groups, $\pi_i$ denote soft
protection priors, and $(b_i, g_i) \in \mathcal{C}_i$ denote the candidate configurations for
group $i$. The optimization problem is:
</p>
<div class="equation">
$\min_{\{(b_i, g_i)\}} \sum_i \pi_i \cdot \text{NRMSE}_i(b_i, g_i) \quad \text{s.t.} \quad \sum_i \text{size}_i(b_i, g_i) \leq B, \quad (b_i, g_i) \in \mathcal{C}_i \;\; \forall i$
</div>
<p>
where $B$ is the memory budget and $\text{size}_i(b_i, g_i)$ computes the storage cost
including scale and zero-point overhead. The soft protection priors $\pi_i$ increase the
effective distortion cost for structurally important components, discouraging aggressive
quantization of embeddings, layer norms, and boundary layers:
</p>
<table>
<tr>
<th>Component</th>
<th>Prior Weight ($\pi$)</th>
</tr>
<tr>
<td>Embeddings</td>
<td>10.0x</td>
</tr>
<tr>
<td>LM head</td>
<td>10.0x</td>
</tr>
<tr>
<td>Router weights</td>
<td>8.0x</td>
</tr>
<tr>
<td>First 2 layers</td>
<td>3.0x</td>
</tr>
<tr>
<td>Last 2 layers</td>
<td>2.0x</td>
</tr>
<tr>
<td>LayerNorm</td>
<td>$\infty$ (never quantize)</td>
</tr>
<tr>
<td>All other tensors</td>
<td>1.0x</td>
</tr>
</table>
<div class="caption">Soft protection priors. Higher weights penalize low-precision assignment for structurally important components.</div>
<h3>4.4 SQNR Safety Veto</h3>
<p>
Before optimization, we apply a Signal-to-Quantization-Noise Ratio (SQNR) safety veto
<span class="ref">[22]</span>. For each tensor and each candidate configuration, we compute:
</p>
<div class="equation">
$\text{SQNR}(W, b, g) = 10 \cdot \log_{10} \frac{\|W\|_F^2}{\|W - Q_{b,g}(W)\|_F^2}$
</div>
<p>
Any configuration with $\text{SQNR} < 9\;\text{dB}$ (the default floor) is removed from the
candidate set $\mathcal{C}_i$ before the MCKP solver runs. This hard constraint prevents
catastrophic quantization of tensors where the quantization noise exceeds roughly 12.6% of
the signal power (about 35% in amplitude, since $10^{-9/10} \approx 0.126$), regardless of
what the budget optimization might prefer.
</p>
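<p>A minimal sketch of the veto; note that under the Frobenius-norm definitions above, SQNR in dB is just $-20\log_{10}(\text{NRMSE})$, so the 9 dB floor corresponds to NRMSE $\approx 0.355$. The quantizer is passed in as a callable rather than tied to any particular kernel:</p>

```python
import numpy as np

def sqnr_db(W, W_hat):
    """Signal-to-quantization-noise ratio in dB; equals -20*log10(NRMSE)."""
    err = np.linalg.norm(W - W_hat) ** 2
    return float("inf") if err == 0 else 10.0 * np.log10(np.linalg.norm(W) ** 2 / err)

def apply_sqnr_veto(W, candidates, quantize, floor_db=9.0):
    """Drop (bits, group_size) configs whose round-trip SQNR falls below the floor."""
    return [(b, g) for (b, g) in candidates
            if sqnr_db(W, quantize(W, b, g)) >= floor_db]
```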
<h3>4.5 eCDF Normalization</h3>
<p>
Raw metric values span vastly different scales across architectures and model sizes. We
replace all hardcoded normalization bounds with empirical CDF (eCDF) normalization: each
metric value is transformed to its percentile rank across all tensors in the model, yielding
scale-invariant scores in $[0, 1]$. For a metric value $x$ with observed values
$\{x_1, \ldots, x_n\}$:
</p>
<div class="equation">
$\text{eCDF}(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[x_j \leq x]$
</div>
<p>
This eliminates the need for per-metric normalization constants and adapts automatically to
the distribution of any model, whether dense or MoE, 8B or 400B parameters.
</p>
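<p>The eCDF transform is a one-liner via a sorted lookup, equivalent to the indicator sum above (a <code>side="right"</code> search counts observations $\leq x$):</p>

```python
import numpy as np

def ecdf_normalize(values):
    """Map each metric value to its percentile rank in [0, 1] across the model."""
    x = np.asarray(values, dtype=float)
    s = np.sort(x)
    return np.searchsorted(s, x, side="right") / len(x)
```

Because only ranks survive, rescaling a metric by any monotone factor leaves the normalized scores unchanged, which is exactly the cross-architecture invariance the text calls for.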
<h3>4.6 Greedy Solver</h3>
<p>
We solve the MCKP using a greedy efficiency ordering. Starting from the minimum-cost (lowest
bit-width) feasible assignment, we enumerate all possible upgrades (transitions to higher
precision) for every group and sort them by efficiency:
</p>
<div class="equation">
$\text{efficiency}(i, c \to c') = \frac{\pi_i \cdot [\text{NRMSE}_i(c) - \text{NRMSE}_i(c')]}{\text{size}_i(c') - \text{size}_i(c)}$
</div>
<p>
Upgrades are applied greedily in decreasing efficiency order until the budget $B$ is
exhausted. For the MCKP with concave distortion curves (which holds empirically for
quantization), this greedy approach is provably near-optimal. The solver completes in under
100&thinsp;ms for Qwen3.5-397B (2,924 tensor groups), making it practical for interactive
experimentation with different budgets.
</p>
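<p>The greedy solver can be sketched as follows. The data layout (a per-group list of <code>(config, size, nrmse)</code> candidates sorted by size) is our own; re-ranking upgrades after every applied step is a simplification of whatever incremental bookkeeping the released solver uses:</p>

```python
def solve_mckp(groups, budget):
    """Greedy MCKP sketch. `groups` maps group id -> {'prior': pi_i,
    'configs': [(config, size, nrmse), ...] sorted by size}. Starts every
    group at its cheapest config, then applies the most efficient upgrades
    (prior-weighted NRMSE drop per unit of extra size) until budget is spent."""
    choice = {i: 0 for i in groups}                       # index into configs
    used = sum(g["configs"][0][1] for g in groups.values())

    def upgrades():
        out = []
        for i, g in groups.items():
            c, j = g["configs"], choice[i]
            for j2 in range(j + 1, len(c)):
                d_size = c[j2][1] - c[j][1]
                d_err = c[j][2] - c[j2][2]
                if d_size > 0 and d_err > 0:
                    out.append((g["prior"] * d_err / d_size, i, j2))
        return sorted(out, reverse=True)

    improved = True
    while improved:
        improved = False
        for eff, i, j2 in upgrades():
            extra = groups[i]["configs"][j2][1] - groups[i]["configs"][choice[i]][1]
            if used + extra <= budget:                    # take best upgrade that fits
                choice[i] = j2
                used += extra
                improved = True
                break                                     # re-rank after each upgrade
    return {i: groups[i]["configs"][choice[i]][0] for i in groups}, used
```

With two toy groups and a budget for one upgrade, the solver spends it on the group with the steeper distortion drop, as the efficiency ordering dictates.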
<h3>4.7 Full Pipeline</h3>
<div class="algorithm">
<strong>Algorithm 1:</strong> Per-Expert Mixed-Precision Quantization Pipeline<br><br>
<span class="keyword">Input:</span> Model weights $W = \{W_1, \ldots, W_n\}$, memory budget $B$<br>
<span class="keyword">Output:</span> Per-tensor assignment $\{(b_i, g_i)\}$ satisfying budget $B$<br><br>
<span class="comment">// Phase 1: Sensitivity profiling</span><br>
1. <span class="keyword">for each</span> tensor $W_i$ <span class="keyword">do</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;Compute 4 sensitivity metrics: spectral, kurtosis, output noise, NRMSE<br>
2. Normalize all metrics via eCDF (percentile rank across model)<br><br>
<span class="comment">// Phase 2: Rate-distortion curves</span><br>
3. <span class="keyword">for each</span> tensor $W_i$ <span class="keyword">do</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;Compute NRMSE at 8 (bits, group_size) configurations<br><br>
<span class="comment">// Phase 3: Expert grouping</span><br>
4. Group MoE experts by (layer, projection_type)<br>
5. Aggregate NRMSE across experts (parameter-weighted mean)<br><br>
<span class="comment">// Phase 4: Safety and priors</span><br>
6. Apply SQNR veto: remove configs with SQNR &lt; 9 dB from candidate sets<br>
7. Apply soft protection priors $\pi_i$ to group distortion costs<br><br>
<span class="comment">// Phase 5: Budget-constrained optimization</span><br>
8. Solve MCKP via greedy efficiency ordering under budget $B$<br><br>
<span class="keyword">return</span> $\{(b_i, g_i)\}_{i=1}^{n}$
</div>
<!-- ============================================================ -->
<!-- 5. EXPERIMENTAL SETUP -->
<!-- ============================================================ -->
<h2>5. Experimental Setup</h2>
<h3>5.1 Models</h3>
<p>
We evaluate on three models spanning dense and MoE architectures at different scales:
<strong>Qwen3-8B</strong> <span class="ref">[6]</span> (dense, 8.2B parameters, 399 tensors,
36 layers), <strong>Llama4-Maverick-17B-128E</strong> <span class="ref">[8]</span> (MoE with
128 experts, 401.6B parameters, 1,061 tensors, 48 layers including 24 MoE and 24 dense), and
<strong>Qwen3.5-397B-A17B</strong> <span class="ref">[7]</span> (MoE with 512 experts per
layer, 403.4B parameters, 2,924 tensors, 60 layers). These models represent three distinct
regimes: a small dense model where mixed-precision has limited headroom, a medium-scale MoE
with a hybrid dense/MoE architecture, and a large-scale MoE where the expert parameter count
dwarfs the shared backbone.
</p>
<h3>5.2 Hardware and Framework</h3>
<p>
All experiments run on a single Apple M2 Ultra with 192&thinsp;GB unified memory, using the
MLX framework <span class="ref">[21]</span> for inference and quantization. The unified memory
architecture eliminates CPU-GPU transfer overhead, and MLX's lazy evaluation enables
processing models that would not fit in discrete GPU memory. Sensitivity analysis of
Qwen3.5-397B completes in approximately 163 minutes, scanning all weight tensors across 55
safetensor shards and profiling 2,924 tensor groups (after expert clustering via Mode B).
</p>
<h3>5.3 Evaluation Protocol</h3>
<p>
We evaluate perplexity on WikiText-2 using 256 sequences of 2,048 tokens each (seed = 42).
We report both mean and median perplexity; the median is more robust to outlier sequences that
can inflate the mean, particularly for MoE models where routing decisions introduce
sequence-level variance. For downstream benchmarks on Qwen3.5-397B, we evaluate MMLU-Pro
(thinking mode), ARC-Challenge, GSM8K, and HumanEval using standard evaluation harnesses.
Baselines include BF16 (full precision), uniform 4-bit with group_size = 128, and uniform
4-bit with group_size = 64.
</p>
<!-- ============================================================ -->
<!-- 6. RESULTS -->
<!-- ============================================================ -->
<h2>6. Results</h2>
<h3>6.1 Perplexity Comparison</h3>
<p>
Table 3 presents the central perplexity comparison on Qwen3.5-397B-A17B. We compare two
versions of our pipeline&mdash;SWAN v1 (threshold-based allocation from sensitivity scores)
and SWAN v2 (MCKP-based optimization)&mdash;against uniform 4-bit baselines at two group sizes.
</p>
<table>
<tr>
<th>Variant</th>
<th>Avg Bits</th>
<th>Group Size</th>
<th>Size (GB)</th>
<th>Perplexity</th>
<th>vs Uniform</th>
</tr>
<tr>
<td>SWAN v1 (threshold)</td>
<td>5.06</td>
<td>128</td>
<td>199.1</td>
<td>4.283</td>
<td>&minus;0.3%</td>
</tr>
<tr class="highlight-row">
<td><strong>SWAN v2 (MCKP)</strong></td>
<td class="best">4.31</td>
<td>128</td>
<td>199.1</td>
<td><strong>4.283</strong></td>
<td><strong>&minus;0.3%</strong></td>
</tr>
<tr>
<td>Uniform 4-bit</td>
<td>4.25</td>
<td>128</td>
<td>196.0</td>
<td>4.298</td>
<td>&mdash;</td>
</tr>
<tr>
<td>Uniform 4-bit</td>
<td>4.25</td>
<td>64</td>
<td>208.5</td>
<td class="best">3.931</td>
<td>&mdash;</td>
</tr>
<tr>
<td>SWAN v2</td>
<td>4.56</td>
<td>64</td>
<td>210.6</td>
<td>4.058</td>
<td>+3.2% worse</td>
</tr>
</table>
<div class="caption">Table 3: Perplexity comparison on Qwen3.5-397B-A17B (WikiText-2, 256 sequences, 2048 tokens, seed=42).</div>
<p>
The key result is that SWAN v2 (MCKP) matches the threshold-based v1 at roughly 15% fewer average
bits (4.31 vs. 5.06) while achieving identical perplexity (4.283). This demonstrates that the
budget-constrained optimizer efficiently reallocates bits from over-protected tensors to where
they matter. At matched group_size = 128, SWAN v2 beats uniform 4-bit by 0.015 perplexity
points (4.283 vs 4.298), a modest but consistent improvement. However, at group_size = 64,
uniform 4-bit (3.931) outperforms SWAN v2 (4.058) by 3.2%, a finding we discuss in detail in
Section 6.4.
</p>
<h3>6.2 Downstream Benchmarks</h3>
<p>
To verify that perplexity improvements translate to downstream task quality, we evaluate the
SWAN-quantized Qwen3.5-397B on four standard benchmarks:
</p>
<table>
<tr>
<th>Benchmark</th>
<th>Score</th>
</tr>
<tr>
<td>MMLU-Pro (thinking)</td>
<td>77.1%</td>
</tr>
<tr>
<td>ARC-Challenge</td>
<td>96.0%</td>
</tr>
<tr>
<td>GSM8K</td>
<td>88.7%</td>
</tr>
<tr>
<td>HumanEval</td>
<td>78.7%</td>
</tr>
</table>
<div class="caption">Table 4: Downstream benchmarks for Qwen3.5-397B-A17B quantized with SWAN (4.31 avg bits).</div>
<p>
Note: we were unable to run the BF16 baseline on our hardware (740+ GB exceeds 192 GB),
so direct degradation measurement is not possible. However, these scores are competitive with
published BF16 results for this model (NVIDIA reports MMLU-Pro 83.7% at BF16), suggesting the
quantized model retains the majority of its reasoning and coding capability. The 96.0% on
ARC-Challenge and 88.7% on GSM8K are particularly strong, as these structured reasoning tasks
are often sensitive to quantization noise.
</p>
<h3>6.3 Bit Allocation Distribution</h3>
<p>
Table 5 shows the breakdown of bit-width assignments across all parameters in Qwen3.5-397B
under the 226&thinsp;GB budget. The vast majority of expert parameters (89.4%) safely quantize
to 4-bit under the SQNR safety floor, with only 1.6% requiring 8-bit and 1.0% remaining at
16-bit precision. The 8.0% at 6-bit represents tensors where the MCKP solver found the
intermediate precision to be the most efficient allocation.
</p>
<table>
<tr>
<th>Precision</th>
<th>Parameters</th>
<th>Percentage</th>
</tr>
<tr>
<td>4-bit</td>
<td>360.8B</td>
<td>89.4%</td>
</tr>
<tr>
<td>6-bit</td>
<td>32.2B</td>
<td>8.0%</td>
</tr>
<tr>
<td>8-bit</td>
<td>6.5B</td>
<td>1.6%</td>
</tr>
<tr>
<td>16-bit</td>
<td>3.9B</td>
<td>1.0%</td>
</tr>
</table>
<div class="caption">Table 5: Bit allocation detail for Qwen3.5-397B-A17B (226 GB budget). 89.4% of parameters tolerate 4-bit quantization.</div>
<p>
The 16-bit parameters (3.9B, 1.0%) correspond primarily to embeddings, the LM head, and
LayerNorm parameters&mdash;components protected by the soft priors and the SQNR veto. The
8-bit parameters (6.5B, 1.6%) include router weights and attention projections in the first
and last two layers, which exhibit the highest kurtosis values in the model.
</p>
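<p>
As a consistency check, the parameter-weighted mean bit-width implied by Table 5 can be
recomputed directly from the tier shares. A minimal sketch (the shares are rounded to one
decimal place in the table, so the result is approximate and lands close to, not exactly at,
the 4.31 average reported in Table 3):
</p>

```python
# Parameter-weighted mean bit-width implied by the Table 5 shares.
# Shares are rounded to one decimal place, so the result is approximate.
allocation = {4: 0.894, 6: 0.080, 8: 0.016, 16: 0.010}

avg_bits = sum(bits * share for bits, share in allocation.items())
print(round(avg_bits, 2))  # 4.34, close to the 4.31 average reported in Table 3
```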
<h3>6.4 The Group Size Effect</h3>
<p>
Comparing rows in Table 3 reveals a striking finding: reducing group size from 128 to 64
yields a 0.225 perplexity improvement for SWAN v2 (4.283 to 4.058), while the entire SWAN
mixed-precision allocation yields only 0.015 improvement over uniform 4-bit at matched group
size. The uniform 4-bit baseline at group_size = 64 achieves 3.931 perplexity&mdash;better than
any SWAN variant at group_size = 128.
</p>
<p>
This is the most practically important finding in our study: for MoE models at 400B+ scale,
group size optimization is the primary quality lever. Halving the group size doubles the
number of scale and zero-point parameters, providing finer-grained adaptation to local weight
distributions. This benefit is orthogonal to and substantially larger than mixed-precision
bit-width allocation.
</p>
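<p>
The size cost of finer groups is easy to estimate. A sketch assuming one 16-bit scale and one
16-bit zero-point per group (an MLX-style affine layout; other frameworks pack this metadata
differently):
</p>

```python
def effective_bits(weight_bits: int, group_size: int,
                   scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Stored bits per weight: payload plus per-group scale and zero-point."""
    return weight_bits + (scale_bits + zero_bits) / group_size

g128 = effective_bits(4, 128)  # 4.25 bits/weight
g64 = effective_bits(4, 64)    # 4.50 bits/weight
overhead = g64 / g128 - 1      # ~5.9% larger model at group_size = 64
```

<p>
At 4-bit, halving the group size costs roughly 6% in stored size (on the order of the
$\sim$7% figure cited in Section 7.1), a price that Table 3 shows is amply repaid in
perplexity.
</p>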
<h3>6.5 Codebook Quantization Ablation</h3>
<table>
<tr>
<th>Method</th>
<th>Mean MSE Reduction</th>
<th>Range</th>
</tr>
<tr class="highlight-row">
<td>$k$-means codebook (256 centroids)</td>
<td class="best">41.1%</td>
<td>39.7% &ndash; 42.6%</td>
</tr>
<tr>
<td>Hadamard rotation</td>
<td>8.2%</td>
<td>2.2% &ndash; 16.4%</td>
</tr>
<tr>
<td>MSE-optimal clipping</td>
<td>2.7%</td>
<td>&mdash;</td>
</tr>
</table>
<div class="caption">Table 6: Codebook quantization ablation (30 MoE expert tensors, Qwen3.5-397B).</div>
<p>
Codebook quantization using $k$-means with 256 centroids yields a uniformly large MSE
reduction of approximately 41% across all expert tensors, regardless of kurtosis. The
kurtosis-MSE correlation for codebook improvement is only $-0.058$, confirming that the
benefit is structural (non-linear quantization better fits arbitrary weight distributions)
rather than targeted at specific tensor properties. However, codebook quantization requires a
lookup-table (LUT) dequantization kernel that is not available in MLX or most inference
frameworks, creating a deployment blocker. On BF16 models, the correlation is higher
($+0.537$), but the absolute improvement is similar ($\sim$45% MSE reduction).
</p>
<p>
Hadamard rotation provides a modest 8.2% mean MSE improvement at 4-bit across 40
tested tensors, with individual improvements ranging from 2.2% to 16.4%. However, the rotation cannot bridge bit levels:
4-bit with Hadamard rotation is still 187&ndash;343$\times$ worse in MSE than 8-bit
quantization (0 of 16 candidate tensors could be downgraded from 8-bit to 4-bit+Hadamard).
MSE-optimal clipping contributes only 2.7%, suggesting that the default clipping in
standard quantization is already near-optimal for these weight distributions.
</p>
<h3>6.6 Cross-Architecture Validation</h3>
<p>
To validate our findings beyond Qwen3.5-397B, we applied the full pipeline to Llama4-Maverick
(128 experts) and Qwen3-8B (dense). On Maverick, the SWAN pipeline assigns 50 expert tensor
groups to 2-bit and 18 to 8-bit, producing a 171.9&thinsp;GB quantized model with perplexity
6.343 on WikiText-2. On the dense Qwen3-8B, mixed-precision provides negligible benefit over
uniform 4-bit, consistent with prior observations that small dense models have insufficient
sensitivity heterogeneity for mixed-precision to exploit.
</p>
<p>
We additionally validated on several other architectures during development. Llama 3.1 70B
achieved perplexity 4.221, versus 4.771 for uniform 4-bit at group_size = 128 (an 11.5%
improvement); Llama 3.3 70B achieved 4.379 versus 5.052 (13.3%). These results confirm that mixed-precision
provides significant gains for large dense models (70B+), with the benefit increasing as
model heterogeneity grows. On smaller dense models (8B), the benefit is negligible, and on
very large MoE models (397B), the benefit exists but is dwarfed by group size effects.
</p>
<h3>6.7 Expert Routing Analysis</h3>
<p>
To understand how expert utilization patterns might inform quantization decisions, we profiled
the routing behavior of Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across
100 diverse prompts, aggregating router statistics over all 40 MoE layers.
</p>
<table>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
<tr>
<td>Experts per layer</td>
<td>256</td>
</tr>
<tr>
<td>Active per token</td>
<td>8</td>
</tr>
<tr>
<td>Avg dead experts per layer</td>
<td>1.5 / 256 (0.6%)</td>
</tr>
<tr>
<td>Avg entropy ratio</td>
<td>0.91</td>
</tr>
<tr>
<td>Avg Gini coefficient</td>
<td>0.53</td>
</tr>
<tr>
<td>Top-10 expert traffic share</td>
<td>20.4% (vs 3.9% if uniform)</td>
</tr>
</table>
<div class="caption">Table 7: Expert routing statistics on Qwen3.5-35B-A3B (256 experts, 40 MoE layers, 100 prompts).</div>
<p>
Expert utilization is moderately concentrated: the entropy ratio of 0.91 indicates fairly
uniform but not perfectly balanced routing, while the Gini coefficient of 0.53 shows moderate
concentration. The top-10 experts carry 20.4% of all traffic, approximately 5x their fair
share under uniform routing (3.9%). The average number of completely dead experts is only
1.5 per layer (0.6%), indicating that nearly all experts contribute to at least some inputs.
</p>
<p>
A critical methodological finding is that prompt diversity dramatically affects dead-expert
counts. With only 5 prompts, approximately 30% of experts appeared dead; at 100 prompts this
dropped to 0.6%. A sufficiently diverse prompt set thus reveals that nearly every expert
serves some inputs, and pruning decisions based on small calibration sets risk removing
experts that are essential for less common but valid inputs.
</p>
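<p>
For reference, the concentration statistics in Table 7 follow standard definitions and can be
computed from raw per-expert activation counts. A minimal sketch (not the profiling harness
itself):
</p>

```python
import numpy as np

def entropy_ratio(counts):
    """Routing entropy normalized by log(num_experts); 1.0 = perfectly uniform."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

def gini(counts):
    """Gini coefficient of traffic; 0 = uniform, ~1 = one expert takes all."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)

uniform = np.ones(256)
print(entropy_ratio(uniform), gini(uniform))  # ~1.0 and ~0.0 for uniform traffic
```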
<h3>6.8 DynaMINT: Tiered Expert Quantization</h3>
<p>
Motivated by the routing analysis in Section 6.7, we developed DynaMINT (Dynamic MINT), a
tiered expert quantization scheme that assigns different precisions to experts based on their
activation frequency. Using the routing statistics from 100-prompt profiling, experts are
classified into four tiers:
</p>
<table>
<tr>
<th>Tier</th>
<th>Precision</th>
<th>Share of Experts</th>
</tr>
<tr>
<td>Critical (high-traffic)</td>
<td>8-bit</td>
<td>19.9%</td>
</tr>
<tr>
<td>Standard</td>
<td>4-bit</td>
<td>64.8%</td>
</tr>
<tr>
<td>Deprioritized (low-traffic)</td>
<td>2-bit</td>
<td>11.6%</td>
</tr>
<tr>
<td>Prunable (near-zero traffic)</td>
<td>0-bit (removed)</td>
<td>3.6%</td>
</tr>
</table>
<div class="caption">Table 8: DynaMINT tier distribution on Qwen3.5-35B-A3B.</div>
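<p>
The tiering rule itself is a few lines. The sketch below uses hypothetical quantile
thresholds, chosen here so that roughly the top 20% of experts land in the 8-bit tier and the
bottom 15% in the 2-bit and pruned tiers, approximating the shares in Table 8; the exact
cutoffs used in our runs may differ.
</p>

```python
import numpy as np

def assign_tiers(traffic, critical_q=0.80, low_q=0.15, dead_thresh=1e-4):
    """Map per-expert activation frequency to a bit-width tier.

    Thresholds are illustrative: top ~20% of experts -> 8-bit,
    bottom ~15% -> 2-bit, near-zero traffic -> pruned (0-bit).
    """
    traffic = np.asarray(traffic, dtype=float)
    bits = np.full(traffic.shape, 4, dtype=int)             # Standard
    bits[traffic >= np.quantile(traffic, critical_q)] = 8   # Critical
    bits[traffic < np.quantile(traffic, low_q)] = 2         # Deprioritized
    bits[traffic < dead_thresh] = 0                         # Prunable
    return bits

traffic = np.random.default_rng(0).random(256)  # synthetic traffic shares
bits = assign_tiers(traffic)
```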
<p>
We evaluated DynaMINT against the uniform MINT baseline on Qwen3.5-35B-A3B:
</p>
<table>
<tr>
<th>Variant</th>
<th>Perplexity</th>
<th>vs Baseline</th>
<th>Speed (tok/s)</th>
</tr>
<tr>
<td>MINT uniform (baseline)</td>
<td>6.580</td>
<td>&mdash;</td>
<td>70.1</td>
</tr>
<tr class="highlight-row">
<td><strong>DynaMINT (tiered)</strong></td>
<td>6.613</td>
<td>+0.5%</td>
<td>9.6</td>
</tr>
</table>
<div class="caption">Table 9: DynaMINT evaluation on Qwen3.5-35B-A3B (WikiText-2).</div>
<p>
DynaMINT maintains quality with only +0.5% perplexity degradation despite 11.6% of experts
at 2-bit and 3.6% pruned entirely. The conversion process is fast, completing all 40 layers
in 1.7 seconds. MoE weight size increases by 10.5% over the uniform baseline, driven by the
8-bit critical tier; this could be offset by tightening the tier thresholds to shrink that tier.
Generation quality is preserved: the tiered model produces coherent chain-of-thought responses
on all three test prompts.
</p>
<p>
The primary limitation is inference speed: DynaMINT achieves only 9.6 tok/s compared to
70.1 tok/s for the uniform baseline, a 7x slowdown. This overhead comes entirely from
Python-level per-tier dispatch&mdash;the current prototype launches separate kernel calls for
each precision tier. This is an engineering rather than fundamental limitation: sorted dispatch
(grouping tokens by their routed expert's tier before kernel launch) or a native multi-precision
kernel would eliminate most of this overhead.
</p>
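<p>
The sorted-dispatch remedy is conceptually simple: group token-to-expert assignments by the
routed expert's tier before launching kernels, so each tier needs exactly one batched call. A
sketch of the bookkeeping only (indices, not activations; a real kernel would gather the
hidden states):
</p>

```python
import numpy as np

def sorted_dispatch(token_idx, expert_idx, expert_tier):
    """Group (token, expert) pairs by the routed expert's precision tier.

    Returns {tier: (token_idx, expert_idx)} so each tier runs in one
    batched kernel launch instead of one launch per tier per token block.
    """
    token_idx = np.asarray(token_idx)
    expert_idx = np.asarray(expert_idx)
    tiers = np.asarray(expert_tier)[expert_idx]
    batches = {}
    for t in np.unique(tiers):
        sel = tiers == t
        batches[int(t)] = (token_idx[sel], expert_idx[sel])
    return batches

# 4 tokens routed to experts whose tiers are [8, 4, 8, 2]:
expert_tier = np.array([8, 4, 2, 8])
b = sorted_dispatch([0, 1, 2, 3], [0, 1, 3, 2], expert_tier)
# b[8] holds tokens 0 and 2; b[4] holds token 1; b[2] holds token 3
```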
<h3>6.9 Expert Pruning</h3>
<p>
To evaluate expert pruning as an orthogonal compression technique, we measured the perplexity
impact of zeroing out the least-activated experts in Qwen3.5-35B-A3B based on the routing
statistics from Section 6.7.
</p>
<table>
<tr>
<th>Pruning Level</th>
<th>Experts Removed</th>
<th>Perplexity</th>
<th>Degradation</th>
</tr>
<tr>
<td>0% (baseline)</td>
<td>0</td>
<td class="best">6.580</td>
<td>&mdash;</td>
</tr>
<tr>
<td>5%</td>
<td>480</td>
<td>86.906</td>
<td>13.2x</td>
</tr>
<tr>
<td>10%</td>
<td>960</td>
<td>15,894</td>
<td>2,416x</td>
</tr>
<tr>
<td>25%</td>
<td>2,400</td>
<td>906,762</td>
<td>137,805x</td>
</tr>
</table>
<div class="caption">Table 10: Expert pruning curve on Qwen3.5-35B-A3B. Even 5% pruning causes catastrophic degradation.</div>
<p>
This is a strong negative result: Qwen3.5-35B-A3B is extremely sensitive to expert pruning.
Removing just 5% of experts (480 out of 9,600 total across 40 layers) causes a 13x perplexity
degradation, from 6.580 to 86.906. At 10% pruning, the model effectively collapses with
perplexity exceeding 15,000. At 25%, the model produces near-random output.
</p>
<p>
This result directly contradicts the assumption that rarely-activated experts can be safely
removed. The routing mechanism in this architecture relies on having all experts available:
even experts with low average activation frequency appear to be essential for specific input
distributions. Activation frequency alone is not a safe pruning criterion&mdash;an expert
activated on only 0.1% of tokens may still be critical for those tokens, and its removal
cascades through the routing softmax, redistributing probability mass in ways that compound
across layers. This finding supports our quantization-first approach over pruning for MoE
compression, and suggests that methods proposing significant expert pruning
<span class="ref">[15]</span> may not generalize to architectures with 256+ experts and
fine-grained routing.
</p>
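<p>
Concretely, pruning in this experiment amounts to removing experts from the routing
competition. One way to emulate this without editing any weights is to mask the router logits
of the pruned experts so they can never be selected (a sketch, not our exact harness):
</p>

```python
import numpy as np

def prune_by_frequency(router_logits, traffic, prune_frac):
    """Mask the logits of the least-activated experts to -inf so the
    router can never select them (equivalent to removing those experts)."""
    n_prune = int(len(traffic) * prune_frac)
    pruned = np.argsort(traffic)[:n_prune]
    masked = router_logits.copy()
    masked[..., pruned] = -np.inf
    return masked, pruned

traffic = np.array([0.30, 0.01, 0.25, 0.02, 0.22, 0.20])
logits = np.zeros((2, 6))  # two tokens, six experts
masked, pruned = prune_by_frequency(logits, traffic, 1 / 3)
# pruned == [1, 3]: the two least-activated experts are now unreachable
```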
<!-- ============================================================ -->
<!-- 7. DISCUSSION & LIMITATIONS -->
<!-- ============================================================ -->
<h2>7. Discussion &amp; Limitations</h2>
<h3>7.1 Group Size Dominates Bit-Width</h3>
<p>
The perplexity data presents a clear hierarchy of quantization quality levers for MoE models.
Halving group size from 128 to 64 yields a 0.225 perplexity improvement&mdash;an order of
magnitude larger than the 0.015 improvement from SWAN's mixed-precision allocation at matched
group size. This finding has immediate practical implications: practitioners should prioritize
group size reduction (accepting the $\sim$7% size increase from additional scale parameters)
before investing in mixed-precision profiling. The mixed-precision pipeline remains valuable
for squeezing the last fraction of quality at a given size budget, but it is a secondary
lever.
</p>
<h3>7.2 The SwitchLinear Constraint</h3>
<p>
MLX's <code>SwitchLinear</code> module stores all expert weights in a single contiguous
tensor, requiring all experts within a layer to share one quantization configuration. This
prevents true per-expert bit-width differentiation: our analysis computes per-expert
sensitivity, but the allocation decision is necessarily per-expert-group. We aggregate via
parameter-weighted mean, which is conservative but not optimal. A native
<code>MixedBitSwitchGLU</code> kernel that supports heterogeneous expert quantization within
a single layer would unlock the full potential of per-expert sensitivity analysis. Our
metric correlation data (Table 2) suggests that meaningful sensitivity variance exists within
expert groups, as kurtosis scores span a wide interquartile range (0.014 to 0.338 across all
2,347 tensors) and this variance is present both within and across expert groups.
</p>
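<p>
The aggregation we use is a parameter-weighted mean of per-expert scores. A minimal sketch
(for equal-sized experts, as in most MoE layers, it reduces to a plain mean):
</p>

```python
import numpy as np

def aggregate_group_score(per_expert_scores, per_expert_params):
    """Collapse per-expert sensitivity into one score per expert group,
    weighting each expert by its parameter count (conservative but
    not optimal, as discussed above)."""
    s = np.asarray(per_expert_scores, dtype=float)
    n = np.asarray(per_expert_params, dtype=float)
    return float((s * n).sum() / n.sum())

# Equal-sized experts reduce to the plain mean:
aggregate_group_score([0.1, 0.3, 0.2], [7e6, 7e6, 7e6])  # ~0.2
```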
<h3>7.3 Output Sensitivity Saturation</h3>
<p>
On 512-expert models, the output noise amplification metric saturates at 1.0 (its normalized
maximum) for 99.5% of expert tensors. This is because MoE expert weight matrices tend to have
similar spectral norms&mdash;the routing mechanism and load balancing during training encourage
experts to operate at similar scales. The practical implication is that simpler profiling
pipelines using only kurtosis and reconstruction error may be sufficient for large MoE models,
reducing profiling cost without sacrificing allocation quality. For dense models and small MoE
models ($\leq 32$ experts), output sensitivity retains discriminatory power and should be
included.
</p>
<h3>7.4 The Codebook Opportunity</h3>
<p>
The 41% MSE reduction from $k$-means codebook quantization represents a substantial
untapped opportunity for MoE compression. Unlike mixed-precision allocation (which
redistributes a fixed bit budget) or Hadamard rotation (which provides modest within-bitwidth
improvement), codebook quantization fundamentally changes the representation power per bit.
The uniform improvement across kurtosis levels ($\rho = -0.058$) indicates this benefit is
structural&mdash;non-linear quantization grids better fit arbitrary weight distributions
regardless of their statistical properties. This makes codebook quantization a complementary
technique rather than a replacement for mixed-precision allocation: one optimizes the
quantization grid, the other optimizes the bit budget distribution. The primary barrier is
kernel support: an efficient LUT dequantization kernel is needed for practical deployment,
which is absent from MLX, CUDA (for standard formats), and most inference frameworks.
</p>
<h3>7.5 Limitations</h3>
<p>
Several limitations constrain the scope of our findings:
</p>
<ul>
<li><strong>Dispatch kernel overhead.</strong> True per-expert mixed precision at
inference time requires a <code>MixedBitSwitchGLU</code> kernel that dispatches tokens to
experts quantized at different bit-widths. DynaMINT demonstrates a pure-Python prototype
achieving only 0.5% perplexity degradation with tiered quantization, but with 7x speed
overhead from per-tier kernel launches. Sorted dispatch or native kernel support would
eliminate this overhead.</li>
<li><strong>Expert pruning is destructive.</strong> Our pruning study on Qwen3.5-35B-A3B
(Section 6.9) shows that MoE routing relies on expert diversity rather than individual
expert quality. Even 5% expert removal causes 13x perplexity degradation, indicating that
activation frequency alone is not a safe pruning criterion. Methods proposing significant
expert pruning may not generalize to fine-grained MoE architectures.</li>
<li><strong>Single hardware platform.</strong> All experiments run on Apple Silicon (M2
Ultra, 192 GB). While the sensitivity analysis and MCKP formulation are
hardware-independent, the quantization implementation and performance characteristics are
specific to MLX. Validation on CUDA-based frameworks (e.g., with GPTQ or AWQ backends)
would strengthen the generalizability of our results.</li>
<li><strong>No perplexity measurement at BF16 for Qwen3.5-397B.</strong> The full BF16
model (740+ GB) does not fit in 192 GB, precluding direct BF16 baseline measurement on our
hardware. Our comparisons are against uniform quantized baselines.</li>
</ul>
<!-- ============================================================ -->
<!-- 8. CONCLUSION -->
<!-- ============================================================ -->
<h2>8. Conclusion</h2>
<p>
We have presented the first data-free mixed-precision quantization pipeline validated on
512-expert MoE models at 403 billion parameters. By profiling weight sensitivity using four
complementary metrics&mdash;spectral features, per-group kurtosis, output noise amplification,
and reconstruction error&mdash;we characterize 2,347 tensor groups across Qwen3.5-397B-A17B
without requiring any calibration data. Our MCKP formulation with expert grouping constraints
and SQNR safety vetoes solves in under 100&thinsp;ms on every model we tested, producing provably
near-optimal (bits, group_size) assignments per expert group. Key findings include: kurtosis is
the dominant sensitivity predictor (Spearman $\rho = 0.795$), 89.4% of expert parameters
safely quantize to 4-bit, and group size has a larger impact on perplexity than bit-width
allocation at this scale. Codebook quantization (+41% MSE reduction) and Hadamard rotation
(+8.2%) establish practical boundaries for future MoE compression techniques.
</p>
<p>
We further contribute DynaMINT, a tiered expert quantization scheme informed by activation
profiling that assigns critical experts to 8-bit, standard experts to 4-bit, and deprioritized
experts to 2-bit. DynaMINT maintains quality at only +0.5% perplexity degradation despite
11.6% of experts at 2-bit and 3.6% pruned entirely, demonstrating that activation-aware
tiering is a viable complement to weight-based sensitivity analysis. Our expert pruning study
provides an important negative result: even 5% expert removal causes 13x perplexity
degradation on Qwen3.5-35B-A3B, establishing that activation frequency alone is not a safe
pruning criterion and that MoE routing relies fundamentally on expert diversity.
</p>
<p>
We release all code, sensitivity manifests, and quantized models to facilitate reproduction
and extension. The MINT pipeline is available at
<a href="https://github.com/baa-ai/MINT">github.com/baa-ai/MINT</a>, with pre-quantized
models hosted at <a href="https://huggingface.co/baa-ai">huggingface.co/baa-ai</a>. We
believe the key actionable insight&mdash;that group size dominates bit-width allocation for
large MoE models&mdash;should inform both practitioners choosing quantization configurations
and framework developers prioritizing kernel optimizations. Future work should focus on
native multi-precision dispatch kernels (eliminating DynaMINT's 7x Python overhead), codebook
dequantization support, and joint optimization of group size and bit-width allocation.
</p>
<!-- ============================================================ -->
<!-- REFERENCES -->
<!-- ============================================================ -->
<h2>References</h2>
<p class="bib">[1] Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. <em>ICLR</em>, 2021.</p>
<p class="bib">[2] Fedus, W., Zoph, B., and Shazeer, N. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. <em>JMLR</em>, 23(120):1&ndash;39, 2022.</p>
<p class="bib">[3] Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de las Casas, D., Hanna, E.B., Bressand, F., et al. Mixtral of Experts. <em>arXiv preprint arXiv:2401.04088</em>, 2024.</p>
<p class="bib">[4] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient Mixture-of-Experts language model. <em>arXiv preprint arXiv:2405.04434</em>, 2024.</p>
<p class="bib">[5] DeepSeek-AI. DeepSeek-V3 technical report. <em>arXiv preprint arXiv:2412.19437</em>, 2025.</p>
<p class="bib">[6] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen3 technical report. <em>arXiv preprint arXiv:2505.09388</em>, 2025.</p>
<p class="bib">[7] Alibaba Cloud. Qwen3.5: Advancing the frontier with 512 experts. <em>Technical report</em>, 2025.</p>
<p class="bib">[8] Meta AI. Llama 4: Open-weight Mixture-of-Experts models. <em>Technical report</em>, 2025.</p>
<p class="bib">[9] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. <em>ICLR</em>, 2023.</p>
<p class="bib">[10] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. <em>MLSys</em>, 2024.</p>
<p class="bib">[11] Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M.W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. <em>ICML</em>, 2024.</p>
<p class="bib">[12] Badri, H. and Shaji, A. HQQ: Half-quadratic quantization of large language models. <em>NeurIPS Workshop on Efficient Natural Language and Speech Processing</em>, 2024.</p>
<p class="bib">[13] Guo, C., Chen, J., Li, J., Zhou, Y., Chen, T., Xie, L., and Zhang, B. MXQ: Mixed-precision quantization for efficient LLM deployment. <em>arXiv preprint arXiv:2401.12917</em>, 2024.</p>
<p class="bib">[14] Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. <em>ICML</em>, 2024.</p>
<p class="bib">[15] Li, W., Zhang, Y., Sun, H., Wang, X., and Qiu, X. MC-MoE: Mixture compressor for Mixture-of-Experts LLMs gains more. <em>arXiv preprint arXiv:2410.06270</em>, 2024.</p>
<p class="bib">[16] Frantar, E. and Alistarh, D. QMoE: Practical sub-1-bit compression of trillion-parameter models. <em>arXiv preprint arXiv:2310.16795</em>, 2023.</p>
<p class="bib">[17] Kim, Y., Lee, J., Park, S., and Shin, J. MoEQuant: Expert-wise quantization for Mixture-of-Experts models. <em>arXiv preprint arXiv:2406.02279</em>, 2024.</p>
<p class="bib">[18] Chen, Z., Qin, K., Zhang, Y., Li, P., Zhao, J., and Liang, X. DynaExq: Dynamic expert-level mixed-precision quantization for Mixture-of-Experts. <em>arXiv preprint arXiv:2405.11009</em>, 2024.</p>
<p class="bib">[19] Black Sheep AI. SWAN: SmartQuant data-free per-tensor mixed-precision quantization for LLMs on Apple Silicon. <em>Technical report, baa.ai</em>, 2026.</p>
<p class="bib">[20] Black Sheep AI. MINT: Memory-Informed N-bit Tuning &mdash; compute-optimal data-free mixed-precision quantization for LLMs. <em>Technical report, baa.ai</em>, 2026.</p>
<p class="bib">[21] Apple. MLX: An array framework for Apple Silicon. <em>GitHub repository, github.com/ml-explore/mlx</em>, 2024.</p>
<p class="bib">[22] Gray, R.M. and Neuhoff, D.L. Quantization. <em>IEEE Transactions on Information Theory</em>, 44(6):2325&ndash;2383, 1998.</p>
<p class="bib">[23] Kellerer, H., Pferschy, U., and Pisinger, D. <em>Knapsack Problems</em>. Springer, Berlin, 2004.</p>
<p class="bib">[24] Barlow, R.E., Bartholomew, D.J., Bremner, J.M., and Brunk, H.D. <em>Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression</em>. Wiley, New York, 1972.</p>
<p class="bib">[25] Hampel, F.R. The influence curve and its role in robust estimation. <em>Journal of the American Statistical Association</em>, 69(346):383&ndash;393, 1974.</p>
<div class="footnote">
<p>Code and models: <a href="https://github.com/baa-ai/MINT">github.com/baa-ai/MINT</a> |
<a href="https://huggingface.co/baa-ai">huggingface.co/baa-ai</a></p>
<p>Correspondence: research@baa.ai</p>
</div>
</body>
</html>