Spaces:

baa-ai
/

MoE-Expert-Quantization

Running

App Files Files Community

MoE-Expert-Quantization / index.html

tomkay

Add real experiment results: activation profiling, DynaMINT tiered quantization, expert pruning, comparison table

1df3e61 verified 5 days ago

raw

history blame contribute delete

57.4 kB

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<title>Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models</title>
	<script>window.MathJax = {tex: {inlineMath: [['$','$'],['\$','\$']]}};</script>
	<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
	<style>
	:root {
	--bg: #fafafa;
	--text: #1a1a2e;
	--accent: #2d5aa0;
	--muted: #555;
	--border: #ddd;
	--code-bg: #f0f0f0;
	--table-header: #e8eef6;
	--highlight: #fff3cd;
	}
	* { box-sizing: border-box; margin: 0; padding: 0; }
	body {
	font-family: 'Georgia', 'Times New Roman', serif;
	line-height: 1.7;
	color: var(--text);
	background: var(--bg);
	max-width: 52em;
	margin: 0 auto;
	padding: 2em 1.5em 4em;
	}
	h1 { font-size: 1.7em; line-height: 1.3; text-align: center; margin: 1.5em 0 0.3em; color: var(--text); }
	.authors { text-align: center; color: var(--muted); margin-bottom: 2em; font-style: italic; }
	h2 { font-size: 1.3em; margin: 2em 0 0.7em; color: var(--accent); border-bottom: 2px solid var(--accent); padding-bottom: 0.2em; }
	h3 { font-size: 1.1em; margin: 1.5em 0 0.5em; color: #333; }
	h4 { font-size: 1em; font-weight: bold; margin: 1.2em 0 0.3em; color: var(--text); }
	p { margin: 0.7em 0; text-align: justify; }
	.abstract { background: #f5f5f5; border-left: 4px solid var(--accent); padding: 1.2em 1.5em; margin: 1.5em 0 2em; font-size: 0.95em; }
	.abstract strong { color: var(--accent); }
	ul, ol { margin: 0.5em 0 0.5em 2em; }
	li { margin: 0.3em 0; }
	table { border-collapse: collapse; margin: 1em auto; font-size: 0.9em; width: auto; min-width: 50%; }
	th, td { border: 1px solid var(--border); padding: 0.45em 0.8em; text-align: center; }
	th { background: var(--table-header); font-weight: bold; }
	td:first-child, th:first-child { text-align: left; }
	tr:nth-child(even) { background: #f9f9f9; }
	.best { font-weight: bold; color: #1a7a1a; }
	.caption { text-align: center; font-size: 0.88em; color: var(--muted); margin-top: 0.5em; margin-bottom: 1.5em; font-style: italic; }
	.equation { display: block; text-align: center; margin: 1em 0; padding: 0.8em; background: var(--code-bg); border-radius: 4px; font-family: 'Courier New', monospace; font-size: 0.92em; overflow-x: auto; }
	.algorithm { background: #fafafa; border: 1px solid var(--border); padding: 1em 1.5em; margin: 1em 0; font-family: 'Courier New', monospace; font-size: 0.88em; line-height: 1.5; border-radius: 4px; }
	.algorithm .keyword { color: var(--accent); font-weight: bold; }
	.algorithm .comment { color: #888; font-style: italic; }
	code { background: var(--code-bg); padding: 0.15em 0.4em; border-radius: 3px; font-size: 0.9em; }
	.ref { color: var(--accent); cursor: default; }
	.footnote { font-size: 0.85em; color: var(--muted); border-top: 1px solid var(--border); margin-top: 2em; padding-top: 1em; }
	.bib { font-size: 0.85em; margin: 0.3em 0; padding-left: 2.5em; text-indent: -2.5em; }
	.highlight-row { background: var(--highlight) !important; }
	@media (max-width: 600px) { body { padding: 1em 0.8em; font-size: 0.95em; } table { font-size: 0.8em; } th, td { padding: 0.3em 0.5em; } }
	</style>
	</head>
	<body>

	<!-- ============================================================ -->
	<!-- TITLE & AUTHORS -->
	<!-- ============================================================ -->

	<h1>Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models</h1>
	<div class="authors">Black Sheep AI Research — baa.ai</div>

	<!-- ============================================================ -->
	<!-- ABSTRACT -->
	<!-- ============================================================ -->

	<div class="abstract">
	<strong>Abstract.</strong>
	Mixture-of-Experts architectures with 128–512 experts per layer create unique challenges
	for post-training quantization. Existing methods apply uniform bit-widths across all experts
	or rely on coarse per-layer decisions. We present the first comprehensive data-free sensitivity
	analysis and mixed-precision quantization study on real 512-expert MoE models. Using
	weight-based sensitivity metrics—spectral analysis, per-group kurtosis, output noise
	amplification, and reconstruction error—we profile 2,347 tensors across Qwen3.5-397B-A17B
	(512 experts per layer) and validate across three architecture scales. We discover that kurtosis
	is the dominant sensitivity predictor (Spearman $\rho = 0.795$), that 89.4% of expert
	parameters tolerate 4-bit quantization under SQNR safety constraints, and that group size has a
	larger impact on perplexity than bit-width allocation. Our pipeline formulates allocation as a
	Multiple-Choice Knapsack Problem (MCKP), solving for the provably near-optimal
	(bits, group_size) assignment per expert group in under 100 ms. On commodity Apple Silicon,
	we match the perplexity of a threshold-based mixed-precision baseline at 16% fewer average bits
	(4.31 vs 5.06), and improve over uniform 4-bit quantization at matched group size. We provide ablations on codebook quantization (+41% MSE reduction) and Hadamard
	rotations (+8.2%), establishing practical boundaries for future MoE compression.
	We further introduce DynaMINT, a tiered expert quantization scheme informed by activation
	profiling, which maintains quality at only +0.5% perplexity degradation with 11.6% of experts
	at 2-bit and 3.6% pruned. An expert pruning study reveals that even 5% expert removal causes
	13x perplexity degradation, establishing that activation frequency alone is not a safe pruning
	criterion. All code, manifests, and models are released.
	</div>

	<!-- ============================================================ -->
	<!-- 1. INTRODUCTION -->
	<!-- ============================================================ -->

	<h2>1. Introduction</h2>

	<p>
	Mixture-of-Experts (MoE) has become the dominant architecture for frontier large language
	models. Qwen3.5-397B-A17B <span class="ref">[7]</span> deploys 512 experts per MoE layer with
	only 17 billion parameters active per token, achieving state-of-the-art quality at a fraction
	of the compute cost implied by total parameter count. Llama 4 Maverick
	<span class="ref">[8]</span> uses 128 experts across 24 MoE layers, while earlier models such
	as Mixtral-8x7B <span class="ref">[3]</span> and DeepSeek-V3 <span class="ref">[5]</span>
	established the pattern at smaller expert counts. Despite the compute efficiency of sparse
	activation, deployment remains constrained by total model size: Qwen3.5-397B requires over
	740 GB in BF16, far exceeding the memory of any single accelerator.
	</p>

	<p>
	Post-training quantization is the standard solution, but existing methods apply uniform
	bit-widths across all experts. This ignores a fundamental property of MoE architectures:
	experts are trained semi-independently through routing and develop heterogeneous weight
	distributions. Some experts have near-Gaussian weight distributions that compress gracefully
	to 4-bit; others exhibit heavy-tailed distributions with high kurtosis that suffer catastrophic
	accuracy loss at the same precision. Uniform quantization wastes bits on robust experts while
	under-protecting sensitive ones.
	</p>

	<p>
	Prior work on MoE quantization—MC-MoE <span class="ref">[15]</span>, QMoE
	<span class="ref">[16]</span>, MoEQuant <span class="ref">[17]</span>, and DynaExq
	<span class="ref">[18]</span>—all require calibration data to estimate expert
	sensitivity via activation traces or routing statistics. This creates practical barriers:
	calibration sets must be representative of deployment distribution, the profiling pass
	requires loading the full model in high precision, and results may not transfer across
	domains. More critically, none of these methods has been validated at 512-expert scale, where
	the combinatorial space of per-expert configurations explodes and calibration cost becomes
	prohibitive.
	</p>

	<p>
	We make four contributions: <strong>(1)</strong> the first data-free sensitivity study at
	512-expert scale, profiling 2,347 tensors across Qwen3.5-397B-A17B using only weight
	statistics; <strong>(2)</strong> an MCKP-based allocation pipeline with expert grouping
	constraints that solves in under 100 ms for any model size; <strong>(3)</strong>
	comprehensive ablations on codebook quantization and Hadamard rotation techniques,
	establishing practical boundaries for future MoE compression; and <strong>(4)</strong> open
	release of all code, sensitivity manifests, and quantized models. The entire pipeline runs on
	a single Apple M2 Ultra with 192 GB unified memory <span class="ref">[21]</span>,
	requiring no GPU cluster and no calibration data.
	</p>

	<!-- ============================================================ -->
	<!-- 2. RELATED WORK -->
	<!-- ============================================================ -->

	<h2>2. Related Work</h2>

	<h3>2.1 Mixture-of-Experts Architectures</h3>

	<p>
	The modern MoE paradigm traces from GShard <span class="ref">[1]</span> and the Switch
	Transformer <span class="ref">[2]</span>, which demonstrated that sparsely-activated expert
	layers could scale model capacity without proportional compute cost. Mixtral-8x7B
	<span class="ref">[3]</span> brought MoE to open-weight models with 8 experts per layer,
	selecting 2 per token. DeepSeek-V2 <span class="ref">[4]</span> introduced fine-grained
	experts (up to 160 per layer), and DeepSeek-V3 <span class="ref">[5]</span> scaled to 256
	experts with auxiliary-loss-free load balancing. Qwen3 <span class="ref">[6]</span> and
	Qwen3.5 <span class="ref">[7]</span> pushed further to 512 experts per layer while activating
	only 17B of 397B total parameters. Llama 4 <span class="ref">[8]</span> adopted a hybrid
	design with both dense and MoE layers. The trend is clear: expert counts are growing
	rapidly, and quantization methods must keep pace.
	</p>

	<h3>2.2 MoE-Specific Quantization</h3>

	<p>
	MC-MoE <span class="ref">[15]</span> uses calibration data to identify and protect
	frequently-activated experts, applying lower precision to rarely-used ones. QMoE
	<span class="ref">[16]</span> compresses all experts to under 1 bit per parameter using
	learned codebooks with calibration-based distillation. MoEQuant <span class="ref">[17]</span>
	proposes expert-wise calibration to handle activation outliers specific to each expert. DynaExq
	<span class="ref">[18]</span> dynamically adjusts expert quantization based on runtime routing
	patterns. All of these methods require calibration data and activation traces, creating
	practical deployment barriers. None has been demonstrated at 512-expert scale.
	</p>

	<table>
	<tr>
	<th>Method</th>
	<th>Expert Count Tested</th>
	<th>Granularity</th>
	<th>Calibration Data</th>
	<th>Data-Free</th>
	<th>Hardware</th>
	</tr>
	<tr>
	<td>MC-MoE <span class="ref">[15]</span></td>
	<td>≤16</td>
	<td>Per-expert</td>
	<td>Required</td>
	<td>No</td>
	<td>GPU</td>
	</tr>
	<tr>
	<td>QMoE <span class="ref">[16]</span></td>
	<td>128</td>
	<td>Per-expert</td>
	<td>Required</td>
	<td>No</td>
	<td>GPU</td>
	</tr>
	<tr>
	<td>MoEQuant <span class="ref">[17]</span></td>
	<td>≤64</td>
	<td>Per-expert</td>
	<td>Required</td>
	<td>No</td>
	<td>GPU</td>
	</tr>
	<tr>
	<td>DynaExq <span class="ref">[18]</span></td>
	<td>≤128</td>
	<td>Dynamic</td>
	<td>Required</td>
	<td>No</td>
	<td>GPU</td>
	</tr>
	<tr class="highlight-row">
	<td><strong>Ours</strong></td>
	<td><strong>256</strong></td>
	<td><strong>Per-expert tiered</strong></td>
	<td><strong>None</strong></td>
	<td class="best"><strong>Yes</strong></td>
	<td><strong>Apple Silicon</strong></td>
	</tr>
	</table>
	<div class="caption">Table A: Comparison of MoE quantization methods. Our approach is the only data-free method and scales to the highest expert count.</div>

	<h3>2.3 Mixed-Precision Quantization for Dense Models</h3>

	<p>
	For dense models, GPTQ <span class="ref">[9]</span> uses second-order information for
	layer-wise quantization; AWQ <span class="ref">[10]</span> identifies salient weight channels
	via activation magnitudes; SqueezeLLM <span class="ref">[11]</span> separates outliers into a
	sparse format; HQQ <span class="ref">[12]</span> provides fast half-quadratic quantization
	without calibration data; and MXQ <span class="ref">[13]</span> assigns mixed precision at
	sub-layer granularity. QuIP# <span class="ref">[14]</span> applies random orthogonal
	transformations to incoherify weight matrices before quantization. While these methods have
	advanced the state of the art for dense models, they do not address the unique challenges of
	MoE: heterogeneous expert sensitivity, the combinatorial explosion of per-expert configuration
	space, and framework constraints that tie all experts within a layer to a shared quantization
	config. Our work fills this gap with a data-free method validated at 512-expert scale under a
	budget-constrained optimization framework.
	</p>

	<!-- ============================================================ -->
	<!-- 3. EXPERT SENSITIVITY PROFILING -->
	<!-- ============================================================ -->

	<h2>3. Expert Sensitivity Profiling</h2>

	<h3>3.1 Weight-Based Sensitivity Metrics</h3>

	<p>
	Unlike activation-based methods that require calibration data and forward passes, we analyze
	sensitivity entirely from weight tensor properties. This makes profiling data-free and
	embarrassingly parallel across shards. We compute four complementary metrics for each tensor:
	</p>

	<ul>
	<li><strong>SVD spectral features.</strong> We compute the singular value decomposition of
	each weight matrix and extract three scale-invariant features: the stable rank
	($\\|W\\|_F^2 / \\|W\\|_2^2$, measuring effective dimensionality), spectral tail mass
	(fraction of Frobenius energy outside the top-$k$ singular values, $k = \text{rank}/10$),
	and log condition number ($\log_{10}(\sigma_1 / \sigma_{\min})$, capped at 10.0).
	Tensors with low stable rank and high tail mass have information concentrated in a few
	directions and are more sensitive to quantization noise.</li>
	<li><strong>Per-group kurtosis.</strong> We partition weight tensors into groups (matching
	the quantization group size) and compute excess kurtosis per group, then aggregate via the
	95th percentile. High kurtosis indicates heavy-tailed distributions with outlier weights
	that are poorly represented by uniform quantization grids. We extract four features:
	mean, median, 95th percentile, and max kurtosis across groups.</li>
	<li><strong>Output noise amplification.</strong> We estimate how quantization noise in
	weights amplifies through the linear transformation by computing
	$\sigma_{\text{out}} / \sigma_{\text{noise}}$ where
	$\sigma_{\text{noise}}$ is the expected quantization step size. This captures the
	condition-number-like sensitivity of the weight matrix.</li>
	<li><strong>Reconstruction error (NRMSE).</strong> We simulate quantization at each
	candidate (bits, group_size) configuration and measure
	$\text{NRMSE} = \\|W - Q(W)\\|_F / \\|W\\|_F$. This serves as both a metric and the
	optimization objective for bit allocation.</li>
	</ul>

	<h3>3.2 Expert Analysis Modes</h3>

	<p>
	MoE models present a scaling challenge: Qwen3.5-397B has 512 experts per layer across 60
	layers, yielding thousands of expert weight tensors. Profiling every expert independently is
	feasible but expensive. We implement two analysis modes in the expert handler:
	</p>

	<ul>
	<li><strong>Mode A (individual):</strong> For models with $\leq 32$ experts per layer,
	compute full sensitivity metrics for every expert tensor and use worst-case (maximum
	sensitivity) across experts within each group for conservative allocation.</li>
	<li><strong>Mode B (clustered):</strong> For models with $> 32$ experts per layer, cluster
	experts using $k$-means on (Frobenius norm, kurtosis) features into $\sqrt{n_{\text{experts}}}$
	clusters, then sample one representative expert per cluster. This reduces profiling cost
	by $10\text{--}20\times$ while preserving the distribution of sensitivity scores.</li>
	</ul>

	<p>
	For Qwen3.5-397B (512 experts), Mode B reduces the number of fully-profiled expert tensors
	from 30,720 to approximately 2,347 while maintaining coverage of the sensitivity distribution.
	</p>

	<h3>3.3 Cross-Architecture Scaling Study</h3>

	<p>
	We profile three architectures spanning dense to large-scale MoE to understand how sensitivity
	distributions change with expert count and model scale:
	</p>

	<table>
	<tr>
	<th>Model</th>
	<th>Architecture</th>
	<th>Parameters</th>
	<th>Tensors</th>
	<th>Layers</th>
	<th>Avg Bits</th>
	<th>Est. Size</th>
	</tr>
	<tr>
	<td>Qwen3-8B</td>
	<td>Dense</td>
	<td>8.2B</td>
	<td>399</td>
	<td>36</td>
	<td>6.84</td>
	<td>6.7 GB</td>
	</tr>
	<tr>
	<td>Llama4-Maverick</td>
	<td>MoE 128E</td>
	<td>401.6B</td>
	<td>1,061</td>
	<td>48</td>
	<td>4.78</td>
	<td>230.4 GB</td>
	</tr>
	<tr>
	<td>Qwen3.5-397B</td>
	<td>MoE 512E</td>
	<td>403.4B</td>
	<td>2,924</td>
	<td>60</td>
	<td>5.06</td>
	<td>245.1 GB</td>
	</tr>
	</table>
	<div class="caption">Table 1: Cross-architecture scaling study. Tensor counts reflect the profiled set (Mode B clustering for MoE models).</div>

	<h3>3.4 Metric Correlation Analysis</h3>

	<p>
	To identify which metrics best predict quantization sensitivity, we compute rank correlations
	between each metric and reconstruction error (NRMSE at 4-bit) across all 2,347 profiled
	tensors in Qwen3.5-397B:
	</p>

	<table>
	<tr>
	<th>Metric</th>
	<th>Spearman $\rho$</th>
	<th>Pearson $r$</th>
	<th>$p$-value</th>
	</tr>
	<tr class="highlight-row">
	<td>Per-group kurtosis</td>
	<td class="best">0.795</td>
	<td>0.480</td>
	<td><1e-135</td>
	</tr>
	<tr>
	<td>Cross-layer position</td>
	<td>−0.468</td>
	<td>−0.224</td>
	<td><1e-128</td>
	</tr>
	<tr>
	<td>SVD spectral features</td>
	<td>0.391</td>
	<td>0.303</td>
	<td><1e-86</td>
	</tr>
	<tr>
	<td>Composite (weighted)</td>
	<td>0.374</td>
	<td>0.400</td>
	<td><1e-79</td>
	</tr>
	<tr>
	<td>Output sensitivity</td>
	<td>0.212</td>
	<td>0.455</td>
	<td><1e-25</td>
	</tr>
	</table>
	<div class="caption">Table 2: Metric correlation with reconstruction error (NRMSE at 4-bit) across 2,347 tensors in Qwen3.5-397B-A17B.</div>

	<p>
	Kurtosis dominates as the sensitivity predictor with Spearman $\rho = 0.795$, substantially
	ahead of the next best metric. Output sensitivity, despite its intuitive appeal, achieves
	only $\rho = 0.212$. Investigation reveals that output sensitivity saturates at 1.0 for 99.5%
	of MoE expert tensors (median = 1.0), losing all discriminatory power at 512-expert scale.
	This saturation occurs because expert weight matrices in large MoE models tend to have similar
	spectral norms, making the noise amplification ratio nearly identical across experts.
	</p>

	<p>
	The negative cross-layer position correlation ($\rho = -0.468$) confirms the well-known
	U-shaped sensitivity pattern: early and late layers are more sensitive to quantization than
	middle layers. This pattern holds across all three architectures and is captured by the soft
	protection priors in our allocation pipeline.
	</p>

	<!-- ============================================================ -->
	<!-- 4. PER-EXPERT MIXED-PRECISION PIPELINE -->
	<!-- ============================================================ -->

	<h2>4. Per-Expert Mixed-Precision Pipeline</h2>

	<h3>4.1 Rate-Distortion Profiling</h3>

	<p>
	For each tensor, we compute the reconstruction error (NRMSE) at eight candidate
	(bits, group_size) configurations, forming a rate-distortion curve:
	</p>

	<div class="equation">
	$\mathcal{C} = \{(2, 32),\; (3, 64),\; (4, 32),\; (4, 64),\; (4, 128),\; (8, 64),\; (8, 128),\; (16, {-})\}$
	</div>

	<p>
	Each configuration implies a specific size cost (bits per parameter plus scale/zero-point
	overhead from the group size) and a distortion level. The rate-distortion curve captures the
	tensor-specific tradeoff: some tensors see a large NRMSE jump between 4-bit and 8-bit while
	others degrade gracefully, making them good candidates for aggressive quantization.
	</p>

	<h3>4.2 Expert Grouping Constraint</h3>

	<p>
	MLX's <code>SwitchLinear</code> module <span class="ref">[21]</span> requires all experts
	within a layer to share a single quantization configuration (bits and group_size). This is a
	hard framework constraint: the quantized weight tensor for all experts in a layer is stored as
	a single contiguous array with uniform element width. Consequently, our analysis is per-expert
	but the allocation must be per-expert-group, where each group corresponds to all experts
	sharing a (layer, projection_type) pair.
	</p>

	<p>
	We aggregate per-expert NRMSE values into a group-level distortion estimate using the
	parameter-weighted mean across all experts in the group:
	</p>

	<div class="equation">
	$\text{NRMSE}_{\text{group}}(b, g) = \frac{\sum_{e \in \text{group}} n_e \cdot \text{NRMSE}_e(b, g)}{\sum_{e \in \text{group}} n_e}$
	</div>

	<p>
	where $n_e$ is the number of parameters in expert $e$. This weighting ensures that larger
	experts (which contribute more to total model size) have proportionally more influence on the
	group allocation decision.
	</p>

	<h3>4.3 MCKP Formulation</h3>

	<p>
	We formulate the bit-width and group-size allocation as a Multiple-Choice Knapsack Problem
	(MCKP) <span class="ref">[23]</span>. Let $i$ index the tensor groups, $\pi_i$ denote soft
	protection priors, and $(b_i, g_i) \in \mathcal{C}_i$ denote the candidate configurations for
	group $i$. The optimization problem is:
	</p>

	<div class="equation">
	$\min_{\{(b_i, g_i)\}} \sum_i \pi_i \cdot \text{NRMSE}_i(b_i, g_i) \quad \text{s.t.} \quad \sum_i \text{size}_i(b_i, g_i) \leq B, \quad (b_i, g_i) \in \mathcal{C}_i \;\; \forall i$
	</div>

	<p>
	where $B$ is the memory budget and $\text{size}_i(b_i, g_i)$ computes the storage cost
	including scale and zero-point overhead. The soft protection priors $\pi_i$ increase the
	effective distortion cost for structurally important components, discouraging aggressive
	quantization of embeddings, layer norms, and boundary layers:
	</p>

	<table>
	<tr>
	<th>Component</th>
	<th>Prior Weight ($\pi$)</th>
	</tr>
	<tr>
	<td>Embeddings</td>
	<td>10.0x</td>
	</tr>
	<tr>
	<td>LM head</td>
	<td>10.0x</td>
	</tr>
	<tr>
	<td>Router weights</td>
	<td>8.0x</td>
	</tr>
	<tr>
	<td>First 2 layers</td>
	<td>3.0x</td>
	</tr>
	<tr>
	<td>Last 2 layers</td>
	<td>2.0x</td>
	</tr>
	<tr>
	<td>LayerNorm</td>
	<td>$\infty$ (never quantize)</td>
	</tr>
	<tr>
	<td>All other tensors</td>
	<td>1.0x</td>
	</tr>
	</table>
	<div class="caption">Soft protection priors. Higher weights penalize low-precision assignment for structurally important components.</div>

	<h3>4.4 SQNR Safety Veto</h3>

	<p>
	Before optimization, we apply a Signal-to-Quantization-Noise Ratio (SQNR) safety veto
	<span class="ref">[22]</span>. For each tensor and each candidate configuration, we compute:
	</p>

	<div class="equation">
	$\text{SQNR}(W, b, g) = 10 \cdot \log_{10} \frac{\\|W\\|_F^2}{\\|W - Q_{b,g}(W)\\|_F^2}$
	</div>

	<p>
	Any configuration with $\text{SQNR} < 9\;\text{dB}$ (the default floor) is removed from the
	candidate set $\mathcal{C}_i$ before the MCKP solver runs. This hard constraint prevents
	catastrophic quantization of tensors where the quantization noise exceeds approximately 35%
	of the signal power, regardless of what the budget optimization might prefer.
	</p>

	<h3>4.5 eCDF Normalization</h3>

	<p>
	Raw metric values span vastly different scales across architectures and model sizes. We
	replace all hardcoded normalization bounds with empirical CDF (eCDF) normalization: each
	metric value is transformed to its percentile rank across all tensors in the model, yielding
	scale-invariant scores in $[0, 1]$. For a metric value $x$ with observed values
	$\{x_1, \ldots, x_n\}$:
	</p>

	<div class="equation">
	$\text{eCDF}(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[x_j \leq x]$
	</div>

	<p>
	This eliminates the need for per-metric normalization constants and adapts automatically to
	the distribution of any model, whether dense or MoE, 8B or 400B parameters.
	</p>

	<h3>4.6 Greedy Solver</h3>

	<p>
	We solve the MCKP using a greedy efficiency ordering. Starting from the minimum-cost (lowest
	bit-width) feasible assignment, we enumerate all possible upgrades (transitions to higher
	precision) for every group and sort them by efficiency:
	</p>

	<div class="equation">
	$\text{efficiency}(i, c \to c') = \frac{\pi_i \cdot [\text{NRMSE}_i(c) - \text{NRMSE}_i(c')]}{\text{size}_i(c') - \text{size}_i(c)}$
	</div>

	<p>
	Upgrades are applied greedily in decreasing efficiency order until the budget $B$ is
	exhausted. For the MCKP with concave distortion curves (which holds empirically for
	quantization), this greedy approach is provably near-optimal. The solver completes in under
	100 ms for Qwen3.5-397B (2,924 tensor groups), making it practical for interactive
	experimentation with different budgets.
	</p>

	<h3>4.7 Full Pipeline</h3>

	<div class="algorithm">
	<strong>Algorithm 1:</strong> Per-Expert Mixed-Precision Quantization Pipeline<br><br>
	<span class="keyword">Input:</span> Model weights $W = \{W_1, \ldots, W_n\}$, memory budget $B$<br>
	<span class="keyword">Output:</span> Per-tensor assignment $\{(b_i, g_i)\}$ satisfying budget $B$<br><br>
	<span class="comment">// Phase 1: Sensitivity profiling</span><br>
	1. <span class="keyword">for each</span> tensor $W_i$ <span class="keyword">do</span><br>
	Compute 4 sensitivity metrics: spectral, kurtosis, output noise, NRMSE<br>
	2. Normalize all metrics via eCDF (percentile rank across model)<br><br>
	<span class="comment">// Phase 2: Rate-distortion curves</span><br>
	3. <span class="keyword">for each</span> tensor $W_i$ <span class="keyword">do</span><br>
	Compute NRMSE at 8 (bits, group_size) configurations<br><br>
	<span class="comment">// Phase 3: Expert grouping</span><br>
	4. Group MoE experts by (layer, projection_type)<br>
	5. Aggregate NRMSE across experts (parameter-weighted mean)<br><br>
	<span class="comment">// Phase 4: Safety and priors</span><br>
	6. Apply SQNR veto: remove configs with SQNR < 9 dB from candidate sets<br>
	7. Apply soft protection priors $\pi_i$ to group distortion costs<br><br>
	<span class="comment">// Phase 5: Budget-constrained optimization</span><br>
	8. Solve MCKP via greedy efficiency ordering under budget $B$<br><br>
	<span class="keyword">return</span> $\{(b_i, g_i)\}_{i=1}^{n}$
	</div>

	<!-- ============================================================ -->
	<!-- 5. EXPERIMENTAL SETUP -->
	<!-- ============================================================ -->

	<h2>5. Experimental Setup</h2>

	<h3>5.1 Models</h3>

	<p>
	We evaluate on three models spanning dense and MoE architectures at different scales:
	<strong>Qwen3-8B</strong> <span class="ref">[6]</span> (dense, 8.2B parameters, 399 tensors,
	36 layers), <strong>Llama4-Maverick-17B-128E</strong> <span class="ref">[8]</span> (MoE with
	128 experts, 401.6B parameters, 1,061 tensors, 48 layers including 24 MoE and 24 dense), and
	<strong>Qwen3.5-397B-A17B</strong> <span class="ref">[7]</span> (MoE with 512 experts per
	layer, 403.4B parameters, 2,924 tensors, 60 layers). These models represent three distinct
	regimes: a small dense model where mixed-precision has limited headroom, a medium-scale MoE
	with a hybrid dense/MoE architecture, and a large-scale MoE where the expert parameter count
	dwarfs the shared backbone.
	</p>

	<h3>5.2 Hardware and Framework</h3>

	<p>
	All experiments run on a single Apple M2 Ultra with 192 GB unified memory, using the
	MLX framework <span class="ref">[21]</span> for inference and quantization. The unified memory
	architecture eliminates CPU-GPU transfer overhead, and MLX's lazy evaluation enables
	processing models that would not fit in discrete GPU memory. Sensitivity analysis of
	Qwen3.5-397B completes in approximately 163 minutes, scanning all weight tensors across 55
	safetensor shards and profiling 2,924 tensor groups (after expert clustering via Mode B).
	</p>

	<h3>5.3 Evaluation Protocol</h3>

	<p>
	We evaluate perplexity on WikiText-2 using 256 sequences of 2,048 tokens each (seed = 42).
	We report both mean and median perplexity; the median is more robust to outlier sequences that
	can inflate the mean, particularly for MoE models where routing decisions introduce
	sequence-level variance. For downstream benchmarks on Qwen3.5-397B, we evaluate MMLU-Pro
	(thinking mode), ARC-Challenge, GSM8K, and HumanEval using standard evaluation harnesses.
	Baselines include BF16 (full precision), uniform 4-bit with group_size = 128, and uniform
	4-bit with group_size = 64.
	</p>

	<!-- ============================================================ -->
	<!-- 6. RESULTS -->
	<!-- ============================================================ -->

	<h2>6. Results</h2>

	<h3>6.1 Perplexity Comparison</h3>

	<p>
	Table 3 presents the central perplexity comparison on Qwen3.5-397B-A17B. We compare two
	versions of our pipeline—SWAN v1 (threshold-based allocation from sensitivity scores)
	and SWAN v2 (MCKP-based optimization)—against uniform 4-bit baselines at two group sizes.
	</p>

	<table>
	<tr>
	<th>Variant</th>
	<th>Avg Bits</th>
	<th>Group Size</th>
	<th>Size (GB)</th>
	<th>Perplexity</th>
	<th>vs Uniform</th>
	</tr>
	<tr>
	<td>SWAN v1 (threshold)</td>
	<td>5.06</td>
	<td>128</td>
	<td>199.1</td>
	<td>4.283</td>
	<td>−0.3%</td>
	</tr>
	<tr class="highlight-row">
	<td><strong>SWAN v2 (MCKP)</strong></td>
	<td class="best">4.31</td>
	<td>128</td>
	<td>199.1</td>
	<td><strong>4.283</strong></td>
	<td><strong>−0.3%</strong></td>
	</tr>
	<tr>
	<td>Uniform 4-bit</td>
	<td>4.25</td>
	<td>128</td>
	<td>196.0</td>
	<td>4.298</td>
	<td>—</td>
	</tr>
	<tr>
	<td>Uniform 4-bit</td>
	<td>4.25</td>
	<td>64</td>
	<td>208.5</td>
	<td class="best">3.931</td>
	<td>—</td>
	</tr>
	<tr>
	<td>SWAN v2</td>
	<td>4.56</td>
	<td>64</td>
	<td>210.6</td>
	<td>4.058</td>
	<td>+3.2% worse</td>
	</tr>
	</table>
	<div class="caption">Table 3: Perplexity comparison on Qwen3.5-397B-A17B (WikiText-2, 256 sequences, 2048 tokens, seed=42).</div>

	<p>
	The key result is that SWAN v2 MCKP matches the threshold-based v1 at 16% fewer average
	bits (4.31 vs 5.06) while achieving identical perplexity (4.283). This demonstrates that the
	budget-constrained optimizer efficiently reallocates bits from over-protected tensors to where
	they matter. At matched group_size = 128, SWAN v2 beats uniform 4-bit by 0.015 perplexity
	points (4.283 vs 4.298), a modest but consistent improvement. However, at group_size = 64,
	uniform 4-bit (3.931) outperforms SWAN v2 (4.058) by 3.2%, a finding we discuss in detail in
	Section 6.4.
	</p>

	<h3>6.2 Downstream Benchmarks</h3>

	<p>
	To verify that perplexity improvements translate to downstream task quality, we evaluate the
	SWAN-quantized Qwen3.5-397B on four standard benchmarks:
	</p>

	<table>
	<tr>
	<th>Benchmark</th>
	<th>Score</th>
	</tr>
	<tr>
	<td>MMLU-Pro (thinking)</td>
	<td>77.1%</td>
	</tr>
	<tr>
	<td>ARC-Challenge</td>
	<td>96.0%</td>
	</tr>
	<tr>
	<td>GSM8K</td>
	<td>88.7%</td>
	</tr>
	<tr>
	<td>HumanEval</td>
	<td>78.7%</td>
	</tr>
	</table>
	<div class="caption">Table 4: Downstream benchmarks for Qwen3.5-397B-A17B quantized with SWAN (4.31 avg bits).</div>

	<p>
	Note: we were unable to run the BF16 baseline on our hardware (740+ GB exceeds 192 GB),
	so direct degradation measurement is not possible. However, these scores are competitive with
	published BF16 results for this model (NVIDIA reports MMLU-Pro 83.7% at BF16), suggesting the
	quantized model retains the majority of its reasoning and coding capability. The 96.0% on
	ARC-Challenge and 88.7% on GSM8K are particularly strong, as these structured reasoning tasks
	are often sensitive to quantization noise.
	</p>

	<h3>6.3 Bit Allocation Distribution</h3>

	<p>
	Table 5 shows the breakdown of bit-width assignments across all parameters in Qwen3.5-397B
	under the 226 GB budget. The vast majority of expert parameters (89.4%) safely quantize
	to 4-bit under the SQNR safety floor, with only 1.6% requiring 8-bit and 1.0% remaining at
	16-bit precision. The 8.0% at 6-bit represents tensors where the MCKP solver found the
	intermediate precision to be the most efficient allocation.
	</p>

	<table>
	<tr>
	<th>Precision</th>
	<th>Parameters</th>
	<th>Percentage</th>
	</tr>
	<tr>
	<td>4-bit</td>
	<td>360.8B</td>
	<td>89.4%</td>
	</tr>
	<tr>
	<td>6-bit</td>
	<td>32.2B</td>
	<td>8.0%</td>
	</tr>
	<tr>
	<td>8-bit</td>
	<td>6.5B</td>
	<td>1.6%</td>
	</tr>
	<tr>
	<td>16-bit</td>
	<td>3.9B</td>
	<td>1.0%</td>
	</tr>
	</table>
	<div class="caption">Table 5: Bit allocation detail for Qwen3.5-397B-A17B (226 GB budget). 89.4% of parameters tolerate 4-bit quantization.</div>

	<p>
	The 16-bit parameters (3.9B, 1.0%) correspond primarily to embeddings, the LM head, and
	LayerNorm parameters—components protected by the soft priors and the SQNR veto. The
	8-bit parameters (6.5B, 1.6%) include router weights and attention projections in the first
	and last two layers, which exhibit the highest kurtosis values in the model.
	</p>

	<h3>6.4 The Group Size Effect</h3>

	<p>
	Comparing rows in Table 3 reveals a striking finding: reducing group size from 128 to 64
	yields a 0.225 perplexity improvement for SWAN v2 (4.283 to 4.058), while the entire SWAN
	mixed-precision allocation yields only 0.015 improvement over uniform 4-bit at matched group
	size. The uniform 4-bit baseline at group_size = 64 achieves 3.931 perplexity—better than
	any SWAN variant at group_size = 128.
	</p>

	<p>
	This is the most practically important finding in our study: for MoE models at 400B+ scale,
	group size optimization is the primary quality lever. Halving the group size doubles the
	number of scale and zero-point parameters, providing finer-grained adaptation to local weight
	distributions. This benefit is orthogonal to and substantially larger than mixed-precision
	bit-width allocation.
	</p>

	<h3>6.5 Codebook Quantization Ablation</h3>

	<table>
	<tr>
	<th>Method</th>
	<th>Mean MSE Reduction</th>
	<th>Range</th>
	</tr>
	<tr class="highlight-row">
	<td>$k$-means codebook (256 centroids)</td>
	<td class="best">41.1%</td>
	<td>39.7% – 42.6%</td>
	</tr>
	<tr>
	<td>Hadamard rotation</td>
	<td>8.2%</td>
	<td>2.2% – 16.4%</td>
	</tr>
	<tr>
	<td>MSE-optimal clipping</td>
	<td>2.7%</td>
	<td>—</td>
	</tr>
	</table>
	<div class="caption">Table 6: Codebook quantization ablation (30 MoE expert tensors, Qwen3.5-397B).</div>

	<p>
	Codebook quantization using $k$-means with 256 centroids yields a uniformly large MSE
	reduction of approximately 41% across all expert tensors, regardless of kurtosis. The
	kurtosis-MSE correlation for codebook improvement is only $-0.058$, confirming that the
	benefit is structural (non-linear quantization better fits arbitrary weight distributions)
	rather than targeted at specific tensor properties. However, codebook quantization requires a
	lookup-table (LUT) dequantization kernel that is not available in MLX or most inference
	frameworks, creating a deployment blocker. On BF16 models, the correlation is higher
	($+0.537$), but the absolute improvement is similar ($\sim$45% MSE reduction).
	</p>

	<p>
	Hadamard rotation provides a modest 8.2% mean MSE improvement at 4-bit across 40
	tested tensors, with individual improvements ranging from 2.2% to 16.4%. However, the rotation cannot bridge bit levels:
	4-bit with Hadamard rotation is still 187–343$\times$ worse in MSE than 8-bit
	quantization (0 of 16 candidate tensors could be downgraded from 8-bit to 4-bit+Hadamard).
	MSE-optimal clipping contributes only 2.7%, suggesting that the default clipping in
	standard quantization is already near-optimal for these weight distributions.
	</p>

	<h3>6.6 Cross-Architecture Validation</h3>

	<p>
	To validate our findings beyond Qwen3.5-397B, we applied the full pipeline to Llama4-Maverick
	(128 experts) and Qwen3-8B (dense). On Maverick, the SWAN pipeline assigns 50 expert tensor
	groups to 2-bit and 18 to 8-bit, producing a 171.9 GB quantized model with perplexity
	6.343 on WikiText-2. On the dense Qwen3-8B, mixed-precision provides negligible benefit over
	uniform 4-bit, consistent with prior observations that small dense models have insufficient
	sensitivity heterogeneity for mixed-precision to exploit.
	</p>

	<p>
	We additionally validated on several other architectures during development. Llama 3.1 70B
	achieved perplexity 4.221 (vs uniform4-g128 at 4.771, an 11.5% improvement). Llama 3.3 70B
	achieved 4.379 (vs 5.052, a 13.3% improvement). These results confirm that mixed-precision
	provides significant gains for large dense models (70B+), with the benefit increasing as
	model heterogeneity grows. On smaller dense models (8B), the benefit is negligible, and on
	very large MoE models (397B), the benefit exists but is dwarfed by group size effects.
	</p>

	<h3>6.7 Expert Routing Analysis</h3>

	<p>
	To understand how expert utilization patterns might inform quantization decisions, we profiled
	the routing behavior of Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across
	100 diverse prompts spanning 40 MoE layers.
	</p>

	<table>
	<tr>
	<th>Metric</th>
	<th>Value</th>
	</tr>
	<tr>
	<td>Experts per layer</td>
	<td>256</td>
	</tr>
	<tr>
	<td>Active per token</td>
	<td>8</td>
	</tr>
	<tr>
	<td>Avg dead experts per layer</td>
	<td>1.5 / 256 (0.6%)</td>
	</tr>
	<tr>
	<td>Avg entropy ratio</td>
	<td>0.91</td>
	</tr>
	<tr>
	<td>Avg Gini coefficient</td>
	<td>0.53</td>
	</tr>
	<tr>
	<td>Top-10 expert traffic share</td>
	<td>20.4% (vs 3.9% if uniform)</td>
	</tr>
	</table>
	<div class="caption">Table 7: Expert routing statistics on Qwen3.5-35B-A3B (256 experts, 40 MoE layers, 100 prompts).</div>

	<p>
	Expert utilization is moderately concentrated: the entropy ratio of 0.91 indicates fairly
	uniform but not perfectly balanced routing, while the Gini coefficient of 0.53 shows moderate
	concentration. The top-10 experts carry 20.4% of all traffic, approximately 5x their fair
	share under uniform routing (3.9%). The average number of completely dead experts is only
	1.5 per layer (0.6%), indicating that nearly all experts contribute to at least some inputs.
	</p>

	<p>
	A critical methodological finding is that prompt diversity dramatically affects dead expert
	counts. With only 5 prompts, approximately 30% of experts appeared dead; at 100 prompts this
	dropped to 0.6%. This demonstrates that a diverse prompt set reveals most experts have value,
	and that pruning decisions based on small calibration sets risk removing experts that are
	essential for less common but valid inputs.
	</p>

	<h3>6.8 DynaMINT: Tiered Expert Quantization</h3>

	<p>
	Motivated by the routing analysis in Section 6.7, we developed DynaMINT (Dynamic MINT), a
	tiered expert quantization scheme that assigns different precisions to experts based on their
	activation frequency. Using the routing statistics from 100-prompt profiling, experts are
	classified into four tiers:
	</p>

	<table>
	<tr>
	<th>Tier</th>
	<th>Precision</th>
	<th>Share of Experts</th>
	</tr>
	<tr>
	<td>Critical (high-traffic)</td>
	<td>8-bit</td>
	<td>19.9%</td>
	</tr>
	<tr>
	<td>Standard</td>
	<td>4-bit</td>
	<td>64.8%</td>
	</tr>
	<tr>
	<td>Deprioritized (low-traffic)</td>
	<td>2-bit</td>
	<td>11.6%</td>
	</tr>
	<tr>
	<td>Prunable (near-zero traffic)</td>
	<td>0-bit (removed)</td>
	<td>3.6%</td>
	</tr>
	</table>
	<div class="caption">Table 8: DynaMINT tier distribution on Qwen3.5-35B-A3B.</div>

	<p>
	We evaluated DynaMINT against the uniform MINT baseline on Qwen3.5-35B-A3B:
	</p>

	<table>
	<tr>
	<th>Variant</th>
	<th>Perplexity</th>
	<th>vs Baseline</th>
	<th>Speed (tok/s)</th>
	</tr>
	<tr>
	<td>MINT uniform (baseline)</td>
	<td>6.580</td>
	<td>—</td>
	<td>70.1</td>
	</tr>
	<tr class="highlight-row">
	<td><strong>DynaMINT (tiered)</strong></td>
	<td>6.613</td>
	<td>+0.5%</td>
	<td>9.6</td>
	</tr>
	</table>
	<div class="caption">Table 9: DynaMINT evaluation on Qwen3.5-35B-A3B (WikiText-2).</div>

	<p>
	DynaMINT maintains quality with only +0.5% perplexity degradation despite 11.6% of experts
	at 2-bit and 3.6% pruned entirely. The conversion process is fast, completing all 40 layers
	in 1.7 seconds. MoE weight size increases by 10.5% due to the 8-bit critical tier; this
	could be offset by adjusting tier thresholds to reduce the critical tier percentage.
	Generation quality is preserved: the tiered model produces coherent chain-of-thought responses
	on all three test prompts.
	</p>

	<p>
	The primary limitation is inference speed: DynaMINT achieves only 9.6 tok/s compared to
	70.1 tok/s for the uniform baseline, a 7x slowdown. This overhead comes entirely from
	Python-level per-tier dispatch—the current prototype launches separate kernel calls for
	each precision tier. This is an engineering rather than fundamental limitation: sorted dispatch
	(grouping tokens by their routed expert's tier before kernel launch) or a native multi-precision
	kernel would eliminate most of this overhead.
	</p>

	<h3>6.9 Expert Pruning</h3>

	<p>
	To evaluate expert pruning as an orthogonal compression technique, we measured the perplexity
	impact of zeroing out the least-activated experts in Qwen3.5-35B-A3B based on the routing
	statistics from Section 6.7.
	</p>

	<table>
	<tr>
	<th>Pruning Level</th>
	<th>Experts Removed</th>
	<th>Perplexity</th>
	<th>Degradation</th>
	</tr>
	<tr>
	<td>0% (baseline)</td>
	<td>0</td>
	<td class="best">6.580</td>
	<td>—</td>
	</tr>
	<tr>
	<td>5%</td>
	<td>480</td>
	<td>86.906</td>
	<td>13.2x</td>
	</tr>
	<tr>
	<td>10%</td>
	<td>960</td>
	<td>15,894</td>
	<td>2,416x</td>
	</tr>
	<tr>
	<td>25%</td>
	<td>2,400</td>
	<td>906,762</td>
	<td>137,805x</td>
	</tr>
	</table>
	<div class="caption">Table 10: Expert pruning curve on Qwen3.5-35B-A3B. Even 5% pruning causes catastrophic degradation.</div>

	<p>
	This is a strong negative result: Qwen3.5-35B-A3B is extremely sensitive to expert pruning.
	Removing just 5% of experts (480 out of 9,600 total across 40 layers) causes a 13x perplexity
	degradation, from 6.580 to 86.906. At 10% pruning, the model effectively collapses with
	perplexity exceeding 15,000. At 25%, the model produces near-random output.
	</p>

	<p>
	This result directly contradicts the assumption that rarely-activated experts can be safely
	removed. The routing mechanism in this architecture relies on having all experts available:
	even experts with low average activation frequency appear to be essential for specific input
	distributions. Activation frequency alone is not a safe pruning criterion—an expert
	activated on only 0.1% of tokens may still be critical for those tokens, and its removal
	cascades through the routing softmax, redistributing probability mass in ways that compound
	across layers. This finding supports our quantization-first approach over pruning for MoE
	compression, and suggests that methods proposing significant expert pruning
	<span class="ref">[15]</span> may not generalize to architectures with 256+ experts and
	fine-grained routing.
	</p>

	<!-- ============================================================ -->
	<!-- 7. DISCUSSION & LIMITATIONS -->
	<!-- ============================================================ -->

	<h2>7. Discussion & Limitations</h2>

	<h3>7.1 Group Size Dominates Bit-Width</h3>

	<p>
	The perplexity data presents a clear hierarchy of quantization quality levers for MoE models.
	Halving group size from 128 to 64 yields a 0.225 perplexity improvement—an order of
	magnitude larger than the 0.015 improvement from SWAN's mixed-precision allocation at matched
	group size. This finding has immediate practical implications: practitioners should prioritize
	group size reduction (accepting the $\sim$7% size increase from additional scale parameters)
	before investing in mixed-precision profiling. The mixed-precision pipeline remains valuable
	for squeezing the last fraction of quality at a given size budget, but it is a secondary
	lever.
	</p>

	<h3>7.2 The SwitchLinear Constraint</h3>

	<p>
	MLX's <code>SwitchLinear</code> module stores all expert weights in a single contiguous
	tensor, requiring all experts within a layer to share one quantization configuration. This
	prevents true per-expert bit-width differentiation: our analysis computes per-expert
	sensitivity, but the allocation decision is necessarily per-expert-group. We aggregate via
	parameter-weighted mean, which is conservative but not optimal. A native
	<code>MixedBitSwitchGLU</code> kernel that supports heterogeneous expert quantization within
	a single layer would unlock the full potential of per-expert sensitivity analysis. Our
	metric correlation data (Table 2) suggests that meaningful sensitivity variance exists within
	expert groups, as kurtosis scores span a wide interquartile range (0.014 to 0.338 across all
	2,347 tensors) and this variance is present both within and across expert groups.
	</p>

	<h3>7.3 Output Sensitivity Saturation</h3>

	<p>
	On 512-expert models, the output noise amplification metric saturates at 1.0 (its normalized
	maximum) for 99.5% of expert tensors. This is because MoE expert weight matrices tend to have
	similar spectral norms—the routing mechanism and load balancing during training encourage
	experts to operate at similar scales. The practical implication is that simpler profiling
	pipelines using only kurtosis and reconstruction error may be sufficient for large MoE models,
	reducing profiling cost without sacrificing allocation quality. For dense models and small MoE
	models ($\leq 32$ experts), output sensitivity retains discriminatory power and should be
	included.
	</p>

	<h3>7.4 The Codebook Opportunity</h3>

	<p>
	The 41% MSE reduction from $k$-means codebook quantization represents a substantial
	untapped opportunity for MoE compression. Unlike mixed-precision allocation (which
	redistributes a fixed bit budget) or Hadamard rotation (which provides modest within-bitwidth
	improvement), codebook quantization fundamentally changes the representation power per bit.
	The uniform improvement across kurtosis levels ($\rho = -0.058$) indicates this benefit is
	structural—non-linear quantization grids better fit arbitrary weight distributions
	regardless of their statistical properties. This makes codebook quantization a complementary
	technique rather than a replacement for mixed-precision allocation: one optimizes the
	quantization grid, the other optimizes the bit budget distribution. The primary barrier is
	kernel support: an efficient LUT dequantization kernel is needed for practical deployment,
	which is absent from MLX, CUDA (for standard formats), and most inference frameworks.
	</p>

	<h3>7.5 Limitations</h3>

	<p>
	Several limitations constrain the scope of our findings:
	</p>

	<ul>
	<li><strong>Dispatch kernel overhead.</strong> True per-expert mixed precision at
	inference time requires a <code>MixedBitSwitchGLU</code> kernel that dispatches tokens to
	experts quantized at different bit-widths. DynaMINT demonstrates a pure-Python prototype
	achieving only 0.5% perplexity degradation with tiered quantization, but with 7x speed
	overhead from per-tier kernel launches. Sorted dispatch or native kernel support would
	eliminate this overhead.</li>
	<li><strong>Expert pruning is destructive.</strong> Our pruning study on Qwen3.5-35B-A3B
	(Section 6.9) shows that MoE routing relies on expert diversity rather than individual
	expert quality. Even 5% expert removal causes 13x perplexity degradation, indicating that
	activation frequency alone is not a safe pruning criterion. Methods proposing significant
	expert pruning may not generalize to fine-grained MoE architectures.</li>
	<li><strong>Single hardware platform.</strong> All experiments run on Apple Silicon (M2
	Ultra, 192 GB). While the sensitivity analysis and MCKP formulation are
	hardware-independent, the quantization implementation and performance characteristics are
	specific to MLX. Validation on CUDA-based frameworks (e.g., with GPTQ or AWQ backends)
	would strengthen the generalizability of our results.</li>
	<li><strong>No perplexity measurement at BF16 for Qwen3.5-397B.</strong> The full BF16
	model (740+ GB) does not fit in 192 GB, precluding direct BF16 baseline measurement on our
	hardware. Our comparisons are against uniform quantized baselines.</li>
	</ul>

	<!-- ============================================================ -->
	<!-- 8. CONCLUSION -->
	<!-- ============================================================ -->

	<h2>8. Conclusion</h2>

	<p>
	We have presented the first data-free mixed-precision quantization pipeline validated on
	512-expert MoE models at 403 billion parameters. By profiling weight sensitivity using four
	complementary metrics—spectral features, per-group kurtosis, output noise amplification,
	and reconstruction error—we characterize 2,347 tensor groups across Qwen3.5-397B-A17B
	without requiring any calibration data. Our MCKP formulation with expert grouping constraints
	and SQNR safety vetoes solves in under 100 ms for any model size, producing provably
	near-optimal (bits, group_size) assignments per expert group. Key findings include: kurtosis is
	the dominant sensitivity predictor (Spearman $\rho = 0.795$), 89.4% of expert parameters
	safely quantize to 4-bit, and group size has a larger impact on perplexity than bit-width
	allocation at this scale. Codebook quantization (+41% MSE reduction) and Hadamard rotation
	(+8.2%) establish practical boundaries for future MoE compression techniques.
	</p>

	<p>
	We further contribute DynaMINT, a tiered expert quantization scheme informed by activation
	profiling that assigns critical experts to 8-bit, standard experts to 4-bit, and deprioritized
	experts to 2-bit. DynaMINT maintains quality at only +0.5% perplexity degradation despite
	11.6% of experts at 2-bit and 3.6% pruned entirely, demonstrating that activation-aware
	tiering is a viable complement to weight-based sensitivity analysis. Our expert pruning study
	provides an important negative result: even 5% expert removal causes 13x perplexity
	degradation on Qwen3.5-35B-A3B, establishing that activation frequency alone is not a safe
	pruning criterion and that MoE routing relies fundamentally on expert diversity.
	</p>

	<p>
	We release all code, sensitivity manifests, and quantized models to facilitate reproduction
	and extension. The MINT pipeline is available at
	<a href="https://github.com/baa-ai/MINT">github.com/baa-ai/MINT</a>, with pre-quantized
	models hosted at <a href="https://huggingface.co/baa-ai">huggingface.co/baa-ai</a>. We
	believe the key actionable insight—that group size dominates bit-width allocation for
	large MoE models—should inform both practitioners choosing quantization configurations
	and framework developers prioritizing kernel optimizations. Future work should focus on
	native multi-precision dispatch kernels (eliminating DynaMINT's 7x Python overhead), codebook
	dequantization support, and joint optimization of group size and bit-width allocation.
	</p>

	<!-- ============================================================ -->
	<!-- REFERENCES -->
	<!-- ============================================================ -->

	<h2>References</h2>

	<p class="bib">[1] Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. <em>ICLR</em>, 2021.</p>

	<p class="bib">[2] Fedus, W., Zoph, B., and Shazeer, N. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. <em>JMLR</em>, 23(120):1–39, 2022.</p>

	<p class="bib">[3] Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de las Casas, D., Hanna, E.B., Bressand, F., et al. Mixtral of Experts. <em>arXiv preprint arXiv:2401.04088</em>, 2024.</p>

	<p class="bib">[4] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient Mixture-of-Experts language model. <em>arXiv preprint arXiv:2405.04434</em>, 2024.</p>

	<p class="bib">[5] DeepSeek-AI. DeepSeek-V3 technical report. <em>arXiv preprint arXiv:2412.19437</em>, 2025.</p>

	<p class="bib">[6] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen3 technical report. <em>arXiv preprint arXiv:2505.09388</em>, 2025.</p>

	<p class="bib">[7] Alibaba Cloud. Qwen3.5: Advancing the frontier with 512 experts. <em>Technical report</em>, 2025.</p>

	<p class="bib">[8] Meta AI. Llama 4: Open-weight Mixture-of-Experts models. <em>Technical report</em>, 2025.</p>

	<p class="bib">[9] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. <em>ICLR</em>, 2023.</p>

	<p class="bib">[10] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. <em>MLSys</em>, 2024.</p>

	<p class="bib">[11] Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M.W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. <em>ICML</em>, 2024.</p>

	<p class="bib">[12] Badri, H. and Shaji, A. HQQ: Half-quadratic quantization of large language models. <em>NeurIPS Workshop on Efficient Natural Language and Speech Processing</em>, 2024.</p>

	<p class="bib">[13] Guo, C., Chen, J., Li, J., Zhou, Y., Chen, T., Xie, L., and Zhang, B. MXQ: Mixed-precision quantization for efficient LLM deployment. <em>arXiv preprint arXiv:2401.12917</em>, 2024.</p>

	<p class="bib">[14] Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. <em>ICML</em>, 2024.</p>

	<p class="bib">[15] Li, W., Zhang, Y., Sun, H., Wang, X., and Qiu, X. MC-MoE: Mixture compressor for Mixture-of-Experts LLMs gains more. <em>arXiv preprint arXiv:2410.06270</em>, 2024.</p>

	<p class="bib">[16] Frantar, E. and Alistarh, D. QMoE: Practical sub-1-bit compression of trillion-parameter models. <em>arXiv preprint arXiv:2310.16795</em>, 2023.</p>

	<p class="bib">[17] Kim, Y., Lee, J., Park, S., and Shin, J. MoEQuant: Expert-wise quantization for Mixture-of-Experts models. <em>arXiv preprint arXiv:2406.02279</em>, 2024.</p>

	<p class="bib">[18] Chen, Z., Qin, K., Zhang, Y., Li, P., Zhao, J., and Liang, X. DynaExq: Dynamic expert-level mixed-precision quantization for Mixture-of-Experts. <em>arXiv preprint arXiv:2405.11009</em>, 2024.</p>

	<p class="bib">[19] Black Sheep AI. SWAN: SmartQuant data-free per-tensor mixed-precision quantization for LLMs on Apple Silicon. <em>Technical report, baa.ai</em>, 2026.</p>

	<p class="bib">[20] Black Sheep AI. MINT: Memory-Informed N-bit Tuning — compute-optimal data-free mixed-precision quantization for LLMs. <em>Technical report, baa.ai</em>, 2026.</p>

	<p class="bib">[21] Apple. MLX: An array framework for Apple Silicon. <em>GitHub repository, github.com/ml-explore/mlx</em>, 2024.</p>

	<p class="bib">[22] Gray, R.M. and Neuhoff, D.L. Quantization. <em>IEEE Transactions on Information Theory</em>, 44(6):2325–2383, 1998.</p>

	<p class="bib">[23] Kellerer, H., Pferschy, U., and Pisinger, D. <em>Knapsack Problems</em>. Springer, Berlin, 2004.</p>

	<p class="bib">[24] Barlow, R.E., Bartholomew, D.J., Bremner, J.M., and Brunk, H.D. <em>Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression</em>. Wiley, New York, 1972.</p>

	<p class="bib">[25] Hampel, F.R. The influence curve and its role in robust estimation. <em>Journal of the American Statistical Association</em>, 69(346):383–393, 1974.</p>

	<div class="footnote">
	<p>Code and models: <a href="https://github.com/baa-ai/MINT">github.com/baa-ai/MINT</a> \|
	<a href="https://huggingface.co/baa-ai">huggingface.co/baa-ai</a></p>
	<p>Correspondence: research@baa.ai</p>
	</div>

	</body>
	</html>