<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models</title>
<script>window.MathJax = {tex: {inlineMath: [['$','$'],['\\(','\\)']]}};</script>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
:root {
--bg: #fafafa;
--text: #1a1a2e;
--accent: #2d5aa0;
--muted: #555;
--border: #ddd;
--code-bg: #f0f0f0;
--table-header: #e8eef6;
--highlight: #fff3cd;
}
* { box-sizing: border-box; margin: 0; padding: 0; }
body {
font-family: 'Georgia', 'Times New Roman', serif;
line-height: 1.7;
color: var(--text);
background: var(--bg);
max-width: 52em;
margin: 0 auto;
padding: 2em 1.5em 4em;
}
h1 { font-size: 1.7em; line-height: 1.3; text-align: center; margin: 1.5em 0 0.3em; color: var(--text); }
.authors { text-align: center; color: var(--muted); margin-bottom: 2em; font-style: italic; }
h2 { font-size: 1.3em; margin: 2em 0 0.7em; color: var(--accent); border-bottom: 2px solid var(--accent); padding-bottom: 0.2em; }
h3 { font-size: 1.1em; margin: 1.5em 0 0.5em; color: #333; }
h4 { font-size: 1em; font-weight: bold; margin: 1.2em 0 0.3em; color: var(--text); }
p { margin: 0.7em 0; text-align: justify; }
.abstract { background: #f5f5f5; border-left: 4px solid var(--accent); padding: 1.2em 1.5em; margin: 1.5em 0 2em; font-size: 0.95em; }
.abstract strong { color: var(--accent); }
ul, ol { margin: 0.5em 0 0.5em 2em; }
li { margin: 0.3em 0; }
table { border-collapse: collapse; margin: 1em auto; font-size: 0.9em; width: auto; min-width: 50%; }
th, td { border: 1px solid var(--border); padding: 0.45em 0.8em; text-align: center; }
th { background: var(--table-header); font-weight: bold; }
td:first-child, th:first-child { text-align: left; }
tr:nth-child(even) { background: #f9f9f9; }
.best { font-weight: bold; color: #1a7a1a; }
.caption { text-align: center; font-size: 0.88em; color: var(--muted); margin-top: 0.5em; margin-bottom: 1.5em; font-style: italic; }
.equation { display: block; text-align: center; margin: 1em 0; padding: 0.8em; background: var(--code-bg); border-radius: 4px; font-family: 'Courier New', monospace; font-size: 0.92em; overflow-x: auto; }
.algorithm { background: #fafafa; border: 1px solid var(--border); padding: 1em 1.5em; margin: 1em 0; font-family: 'Courier New', monospace; font-size: 0.88em; line-height: 1.5; border-radius: 4px; }
.algorithm .keyword { color: var(--accent); font-weight: bold; }
.algorithm .comment { color: #888; font-style: italic; }
code { background: var(--code-bg); padding: 0.15em 0.4em; border-radius: 3px; font-size: 0.9em; }
.ref { color: var(--accent); cursor: default; }
.footnote { font-size: 0.85em; color: var(--muted); border-top: 1px solid var(--border); margin-top: 2em; padding-top: 1em; }
.bib { font-size: 0.85em; margin: 0.3em 0; padding-left: 2.5em; text-indent: -2.5em; }
.highlight-row { background: var(--highlight); }
@media (max-width: 600px) { body { padding: 1em 0.8em; font-size: 0.95em; } table { font-size: 0.8em; } th, td { padding: 0.3em 0.5em; } }
</style>
</head>
<body>
<!-- ============================================================ -->
<!-- TITLE & AUTHORS -->
<!-- ============================================================ -->
<h1>Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models</h1>
<div class="authors">Black Sheep AI Research — baa.ai</div>
<!-- ============================================================ -->
<!-- ABSTRACT -->
<!-- ============================================================ -->
<div class="abstract">
<strong>Abstract.</strong>
Mixture-of-Experts architectures with 128–512 experts per layer create unique challenges
for post-training quantization. Existing methods apply uniform bit-widths across all experts
or rely on coarse per-layer decisions. We present the first comprehensive data-free sensitivity
analysis and mixed-precision quantization study on real 512-expert MoE models. Using
weight-based sensitivity metrics—spectral analysis, per-group kurtosis, output noise
amplification, and reconstruction error—we profile 2,347 tensors across Qwen3.5-397B-A17B
(512 experts per layer) and validate across three architecture scales. We discover that kurtosis
is the dominant sensitivity predictor (Spearman $\rho = 0.795$), that 89.4% of expert
parameters tolerate 4-bit quantization under SQNR safety constraints, and that group size has a
larger impact on perplexity than bit-width allocation. Our pipeline formulates allocation as a
Multiple-Choice Knapsack Problem (MCKP), solving for the provably near-optimal
(bits, group_size) assignment per expert group in under 100 ms. On commodity Apple Silicon,
we match the perplexity of a threshold-based mixed-precision baseline with 15% fewer average bits
(4.31 vs 5.06), and improve over uniform 4-bit quantization at matched group size. We provide ablations on codebook quantization (41% MSE reduction) and Hadamard
rotations (8.2%), establishing practical boundaries for future MoE compression.
We further introduce DynaMINT, a tiered expert quantization scheme informed by activation
profiling, which maintains quality with only 0.5% perplexity degradation while placing 11.6% of
experts at 2-bit and pruning 3.6%. An expert pruning study reveals that even 5% expert removal
causes a 13× perplexity increase, establishing that activation frequency alone is not a safe
pruning criterion. All code, manifests, and models are released.
</div>
<!-- ============================================================ -->
<!-- 1. INTRODUCTION -->
<!-- ============================================================ -->
<h2>1. Introduction</h2>
<p>
Mixture-of-Experts (MoE) has become the dominant architecture for frontier large language
models. Qwen3.5-397B-A17B <span class="ref">[7]</span> deploys 512 experts per MoE layer with
only 17 billion parameters active per token, achieving state-of-the-art quality at a fraction
of the compute cost implied by total parameter count. Llama 4 Maverick
<span class="ref">[8]</span> uses 128 experts across 24 MoE layers, while earlier models such
as Mixtral-8x7B <span class="ref">[3]</span> and DeepSeek-V3 <span class="ref">[5]</span>
established the pattern at smaller expert counts. Despite the compute efficiency of sparse
activation, deployment remains constrained by total model size: Qwen3.5-397B requires over
740 GB in BF16, far exceeding the memory of any single accelerator.
</p>
<p>
Post-training quantization is the standard solution, but existing methods apply uniform
bit-widths across all experts. This ignores a fundamental property of MoE architectures:
experts are trained semi-independently through routing and develop heterogeneous weight
distributions. Some experts have near-Gaussian weight distributions that compress gracefully
to 4-bit; others exhibit heavy-tailed distributions with high kurtosis that suffer catastrophic
accuracy loss at the same precision. Uniform quantization wastes bits on robust experts while
under-protecting sensitive ones.
</p>
<p>
Prior MoE quantization methods—MC-MoE <span class="ref">[15]</span>, QMoE
<span class="ref">[16]</span>, MoEQuant <span class="ref">[17]</span>, and DynaExq
<span class="ref">[18]</span>—all require calibration data to estimate expert
sensitivity via activation traces or routing statistics. This creates practical barriers:
calibration sets must be representative of the deployment distribution, the profiling pass
requires loading the full model in high precision, and results may not transfer across
domains. More critically, none of these methods has been validated at 512-expert scale, where
the combinatorial space of per-expert configurations explodes and calibration cost becomes
prohibitive.
</p>
<p>
We make four contributions: <strong>(1)</strong> the first data-free sensitivity study at
512-expert scale, profiling 2,347 tensors across Qwen3.5-397B-A17B using only weight
statistics; <strong>(2)</strong> an MCKP-based allocation pipeline with expert grouping
constraints that solves in under 100 ms for any model size; <strong>(3)</strong>
comprehensive ablations on codebook quantization and Hadamard rotation techniques,
establishing practical boundaries for future MoE compression; and <strong>(4)</strong> open
release of all code, sensitivity manifests, and quantized models. The entire pipeline runs on
a single Apple M2 Ultra with 192 GB unified memory <span class="ref">[21]</span>,
requiring no GPU cluster and no calibration data.
</p>
<!-- ============================================================ -->
<!-- 2. RELATED WORK -->
<!-- ============================================================ -->
<h2>2. Related Work</h2>
<h3>2.1 Mixture-of-Experts Architectures</h3>
<p>
The modern MoE paradigm traces from GShard <span class="ref">[1]</span> and the Switch
Transformer <span class="ref">[2]</span>, which demonstrated that sparsely-activated expert
layers could scale model capacity without proportional compute cost. Mixtral-8x7B
<span class="ref">[3]</span> brought MoE to open-weight models with 8 experts per layer,
selecting 2 per token. DeepSeek-V2 <span class="ref">[4]</span> introduced fine-grained
experts (up to 160 per layer), and DeepSeek-V3 <span class="ref">[5]</span> scaled to 256
experts with auxiliary-loss-free load balancing. Qwen3 <span class="ref">[6]</span> and
Qwen3.5 <span class="ref">[7]</span> pushed further to 512 experts per layer while activating
only 17B of 397B total parameters. Llama 4 <span class="ref">[8]</span> adopted a hybrid
design with both dense and MoE layers. The trend is clear: expert counts are growing
rapidly, and quantization methods must keep pace.
</p>
<h3>2.2 MoE-Specific Quantization</h3>
<p>
MC-MoE <span class="ref">[15]</span> uses calibration data to identify and protect
frequently-activated experts, applying lower precision to rarely-used ones. QMoE
<span class="ref">[16]</span> compresses all experts to under 1 bit per parameter using
learned codebooks with calibration-based distillation. MoEQuant <span class="ref">[17]</span>
proposes expert-wise calibration to handle activation outliers specific to each expert. DynaExq
<span class="ref">[18]</span> dynamically adjusts expert quantization based on runtime routing
patterns. All of these methods require calibration data and activation traces, creating
practical deployment barriers. None has been demonstrated at 512-expert scale.
</p>
<table>
<tr>
<th>Method</th>
<th>Expert Count Tested</th>
<th>Granularity</th>
<th>Calibration Data</th>
<th>Data-Free</th>
<th>Hardware</th>
</tr>
<tr>
<td>MC-MoE <span class="ref">[15]</span></td>
<td>≤16</td>
<td>Per-expert</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr>
<td>QMoE <span class="ref">[16]</span></td>
<td>128</td>
<td>Per-expert</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr>
<td>MoEQuant <span class="ref">[17]</span></td>
<td>≤64</td>
<td>Per-expert</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr>
<td>DynaExq <span class="ref">[18]</span></td>
<td>≤128</td>
<td>Dynamic</td>
<td>Required</td>
<td>No</td>
<td>GPU</td>
</tr>
<tr class="highlight-row">
<td><strong>Ours</strong></td>
<td><strong>512</strong></td>
<td><strong>Per-expert tiered</strong></td>
<td><strong>None</strong></td>
<td class="best"><strong>Yes</strong></td>
<td><strong>Apple Silicon</strong></td>
</tr>
</table>
<div class="caption">Table A: Comparison of MoE quantization methods. Our approach is the only data-free method and scales to the highest expert count.</div>
<h3>2.3 Mixed-Precision Quantization for Dense Models</h3>
<p>
For dense models, GPTQ <span class="ref">[9]</span> uses second-order information for
layer-wise quantization; AWQ <span class="ref">[10]</span> identifies salient weight channels
via activation magnitudes; SqueezeLLM <span class="ref">[11]</span> separates outliers into a
sparse format; HQQ <span class="ref">[12]</span> provides fast half-quadratic quantization
without calibration data; and MXQ <span class="ref">[13]</span> assigns mixed precision at
sub-layer granularity. QuIP# <span class="ref">[14]</span> applies random orthogonal
transformations to incoherify weight matrices before quantization. While these methods have
advanced the state of the art for dense models, they do not address the unique challenges of
MoE: heterogeneous expert sensitivity, the combinatorial explosion of per-expert configuration
space, and framework constraints that tie all experts within a layer to a shared quantization
config. Our work fills this gap with a data-free method validated at 512-expert scale under a
budget-constrained optimization framework.
</p>
<!-- ============================================================ -->
<!-- 3. EXPERT SENSITIVITY PROFILING -->
<!-- ============================================================ -->
<h2>3. Expert Sensitivity Profiling</h2>
<h3>3.1 Weight-Based Sensitivity Metrics</h3>
<p>
Unlike activation-based methods that require calibration data and forward passes, we analyze
sensitivity entirely from weight tensor properties. This makes profiling data-free and
embarrassingly parallel across shards. We compute four complementary metrics for each tensor:
</p>
<ul>
<li><strong>SVD spectral features.</strong> We compute the singular value decomposition of
each weight matrix and extract three scale-invariant features: the stable rank
($\|W\|_F^2 / \|W\|_2^2$, measuring effective dimensionality), spectral tail mass
(fraction of Frobenius energy outside the top-$k$ singular values, $k = \text{rank}/10$),
and log condition number ($\log_{10}(\sigma_1 / \sigma_{\min})$, capped at 10.0).
Tensors with low stable rank and high tail mass have information concentrated in a few
directions and are more sensitive to quantization noise.</li>
<li><strong>Per-group kurtosis.</strong> We partition weight tensors into groups (matching
the quantization group size) and compute excess kurtosis per group, then aggregate via the
95th percentile. High kurtosis indicates heavy-tailed distributions with outlier weights
that are poorly represented by uniform quantization grids. We extract four features:
mean, median, 95th percentile, and max kurtosis across groups.</li>
<li><strong>Output noise amplification.</strong> We estimate how quantization noise in
weights amplifies through the linear transformation by computing
$\sigma_{\text{out}} / \sigma_{\text{noise}}$ where
$\sigma_{\text{noise}}$ is the expected quantization step size. This captures the
condition-number-like sensitivity of the weight matrix.</li>
<li><strong>Reconstruction error (NRMSE).</strong> We simulate quantization at each
candidate (bits, group_size) configuration and measure
$\text{NRMSE} = \|W - Q(W)\|_F / \|W\|_F$. This serves as both a metric and the
optimization objective for bit allocation.</li>
</ul>
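<p>To make the four metrics concrete, the sketch below computes them from a raw weight matrix in NumPy. Function names and the simulated per-group min/max uniform quantizer are our own illustrative assumptions, not the paper's actual implementation.</p>

```python
import numpy as np

def spectral_features(W, top_frac=0.1):
    """Stable rank, spectral tail mass, and capped log condition number."""
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    stable_rank = (s**2).sum() / s[0]**2          # ||W||_F^2 / ||W||_2^2
    k = max(1, int(len(s) * top_frac))            # top-k with k = rank/10
    tail_mass = (s[k:]**2).sum() / (s**2).sum()   # energy outside top-k
    log_cond = min(np.log10(s[0] / s[-1]), 10.0)  # capped at 10.0
    return stable_rank, tail_mass, log_cond

def group_kurtosis(W, group_size=64):
    """Excess kurtosis per quantization group, with summary statistics."""
    g = W.reshape(-1, group_size)
    z = (g - g.mean(axis=1, keepdims=True)) / (g.std(axis=1, keepdims=True) + 1e-12)
    kurt = (z**4).mean(axis=1) - 3.0              # excess kurtosis per group
    return kurt.mean(), np.median(kurt), np.percentile(kurt, 95), kurt.max()

def nrmse(W, bits=4, group_size=64):
    """Reconstruction error of simulated per-group uniform quantization."""
    g = W.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1) + 1e-12
    q = np.round((g - lo) / scale) * scale + lo
    return float(np.linalg.norm(g - q) / np.linalg.norm(g))
```

<p>All three helpers read only the weights, so they parallelize trivially across safetensor shards, matching the data-free framing above.</p>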
<h3>3.2 Expert Analysis Modes</h3>
<p>
MoE models present a scaling challenge: Qwen3.5-397B has 512 experts per layer across 60
layers, yielding thousands of expert weight tensors. Profiling every expert independently is
feasible but expensive. We implement two analysis modes in the expert handler:
</p>
<ul>
<li><strong>Mode A (individual):</strong> For models with $\leq 32$ experts per layer,
compute full sensitivity metrics for every expert tensor and use worst-case (maximum
sensitivity) across experts within each group for conservative allocation.</li>
<li><strong>Mode B (clustered):</strong> For models with $> 32$ experts per layer, cluster
experts using $k$-means on (Frobenius norm, kurtosis) features into $\sqrt{n_{\text{experts}}}$
clusters, then sample one representative expert per cluster. This reduces profiling cost
by $10\text{--}20\times$ while preserving the distribution of sensitivity scores.</li>
</ul>
<p>
For Qwen3.5-397B (512 experts), Mode B reduces the number of fully-profiled expert tensors
from 30,720 to approximately 2,347 while maintaining coverage of the sensitivity distribution.
</p>
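<p>Mode B can be sketched as follows. This is an illustrative reconstruction, not the paper's expert handler: we use a plain Lloyd's $k$-means on the 2-D (Frobenius norm, kurtosis) feature space and pick the expert nearest each cluster center as its representative.</p>

```python
import numpy as np

def cluster_representatives(expert_feats, n_experts, seed=0, iters=20):
    """expert_feats: (n_experts, 2) array of (frobenius_norm, kurtosis).
    Returns indices of one representative expert per k-means cluster,
    with k = round(sqrt(n_experts)) as in Mode B."""
    k = max(1, int(round(np.sqrt(n_experts))))
    rng = np.random.default_rng(seed)
    centers = expert_feats[rng.choice(n_experts, k, replace=False)]
    for _ in range(iters):                                  # Lloyd's iterations
        d = np.linalg.norm(expert_feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = expert_feats[labels == c].mean(axis=0)
    reps = []
    for c in range(k):                                      # nearest-to-center expert
        members = np.flatnonzero(labels == c)
        if members.size:
            dc = np.linalg.norm(expert_feats[members] - centers[c], axis=1)
            reps.append(int(members[dc.argmin()]))
    return sorted(set(reps))
```

<p>For 512 experts this yields at most $\lceil\sqrt{512}\rceil \approx 23$ representatives per (layer, projection) group, consistent with the $10\text{--}20\times$ profiling reduction quoted above.</p>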
<h3>3.3 Cross-Architecture Scaling Study</h3>
<p>
We profile three architectures spanning dense to large-scale MoE to understand how sensitivity
distributions change with expert count and model scale:
</p>
<table>
<tr>
<th>Model</th>
<th>Architecture</th>
<th>Parameters</th>
<th>Tensors</th>
<th>Layers</th>
<th>Avg Bits</th>
<th>Est. Size</th>
</tr>
<tr>
<td>Qwen3-8B</td>
<td>Dense</td>
<td>8.2B</td>
<td>399</td>
<td>36</td>
<td>6.84</td>
<td>6.7 GB</td>
</tr>
<tr>
<td>Llama4-Maverick</td>
<td>MoE 128E</td>
<td>401.6B</td>
<td>1,061</td>
<td>48</td>
<td>4.78</td>
<td>230.4 GB</td>
</tr>
<tr>
<td>Qwen3.5-397B</td>
<td>MoE 512E</td>
<td>403.4B</td>
<td>2,924</td>
<td>60</td>
<td>5.06</td>
<td>245.1 GB</td>
</tr>
</table>
<div class="caption">Table 1: Cross-architecture scaling study. Tensor counts reflect the profiled set (Mode B clustering for MoE models).</div>
<h3>3.4 Metric Correlation Analysis</h3>
<p>
To identify which metrics best predict quantization sensitivity, we compute rank correlations
between each metric and reconstruction error (NRMSE at 4-bit) across all 2,347 profiled
tensors in Qwen3.5-397B:
</p>
<table>
<tr>
<th>Metric</th>
<th>Spearman $\rho$</th>
<th>Pearson $r$</th>
<th>$p$-value</th>
</tr>
<tr class="highlight-row">
<td>Per-group kurtosis</td>
<td class="best">0.795</td>
<td>0.480</td>
<td>&lt;1e-135</td>
</tr>
<tr>
<td>Cross-layer position</td>
<td>−0.468</td>
<td>−0.224</td>
<td>&lt;1e-128</td>
</tr>
<tr>
<td>SVD spectral features</td>
<td>0.391</td>
<td>0.303</td>
<td>&lt;1e-86</td>
</tr>
<tr>
<td>Composite (weighted)</td>
<td>0.374</td>
<td>0.400</td>
<td>&lt;1e-79</td>
</tr>
<tr>
<td>Output sensitivity</td>
<td>0.212</td>
<td>0.455</td>
<td>&lt;1e-25</td>
</tr>
</table>
<div class="caption">Table 2: Metric correlation with reconstruction error (NRMSE at 4-bit) across 2,347 tensors in Qwen3.5-397B-A17B.</div>
<p>
Kurtosis dominates as the sensitivity predictor with Spearman $\rho = 0.795$, substantially
ahead of the next best metric. Output sensitivity, despite its intuitive appeal, achieves
only $\rho = 0.212$. Investigation reveals that output sensitivity saturates at 1.0 for 99.5%
of MoE expert tensors (median = 1.0), losing all discriminatory power at 512-expert scale.
This saturation occurs because expert weight matrices in large MoE models tend to have similar
spectral norms, making the noise amplification ratio nearly identical across experts.
</p>
<p>
The negative cross-layer position correlation ($\rho = -0.468$) confirms the well-known
U-shaped sensitivity pattern: early and late layers are more sensitive to quantization than
middle layers. This pattern holds across all three architectures and is captured by the soft
protection priors in our allocation pipeline.
</p>
<!-- ============================================================ -->
<!-- 4. PER-EXPERT MIXED-PRECISION PIPELINE -->
<!-- ============================================================ -->
<h2>4. Per-Expert Mixed-Precision Pipeline</h2>
<h3>4.1 Rate-Distortion Profiling</h3>
<p>
For each tensor, we compute the reconstruction error (NRMSE) at eight candidate
(bits, group_size) configurations, forming a rate-distortion curve:
</p>
<div class="equation">
$\mathcal{C} = \{(2, 32),\; (3, 64),\; (4, 32),\; (4, 64),\; (4, 128),\; (8, 64),\; (8, 128),\; (16, {-})\}$
</div>
<p>
Each configuration implies a specific size cost (bits per parameter plus scale/zero-point
overhead from the group size) and a distortion level. The rate-distortion curve captures the
tensor-specific tradeoff: some tensors see a large NRMSE jump between 4-bit and 8-bit while
others degrade gracefully, making them good candidates for aggressive quantization.
</p>
<h3>4.2 Expert Grouping Constraint</h3>
<p>
MLX's <code>SwitchLinear</code> module <span class="ref">[21]</span> requires all experts
within a layer to share a single quantization configuration (bits and group_size). This is a
hard framework constraint: the quantized weight tensor for all experts in a layer is stored as
a single contiguous array with uniform element width. Consequently, our analysis is per-expert
but the allocation must be per-expert-group, where each group corresponds to all experts
sharing a (layer, projection_type) pair.
</p>
<p>
We aggregate per-expert NRMSE values into a group-level distortion estimate using the
parameter-weighted mean across all experts in the group:
</p>
<div class="equation">
$\text{NRMSE}_{\text{group}}(b, g) = \frac{\sum_{e \in \text{group}} n_e \cdot \text{NRMSE}_e(b, g)}{\sum_{e \in \text{group}} n_e}$
</div>
<p>
where $n_e$ is the number of parameters in expert $e$. This weighting ensures that larger
experts (which contribute more to total model size) have proportionally more influence on the
group allocation decision.
</p>
<h3>4.3 MCKP Formulation</h3>
<p>
We formulate the bit-width and group-size allocation as a Multiple-Choice Knapsack Problem
(MCKP) <span class="ref">[23]</span>. Let $i$ index the tensor groups, $\pi_i$ denote soft
protection priors, and $(b_i, g_i) \in \mathcal{C}_i$ denote the candidate configurations for
group $i$. The optimization problem is:
</p>
<div class="equation">
$\min_{\{(b_i, g_i)\}} \sum_i \pi_i \cdot \text{NRMSE}_i(b_i, g_i) \quad \text{s.t.} \quad \sum_i \text{size}_i(b_i, g_i) \leq B, \quad (b_i, g_i) \in \mathcal{C}_i \;\; \forall i$
</div>
<p>
where $B$ is the memory budget and $\text{size}_i(b_i, g_i)$ computes the storage cost
including scale and zero-point overhead. The soft protection priors $\pi_i$ increase the
effective distortion cost for structurally important components, discouraging aggressive
quantization of embeddings, layer norms, and boundary layers:
</p>
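<p>A rate-distortion curve can be built with a few lines of NumPy. The sketch below is illustrative: it reuses a simulated per-group uniform quantizer, omits the 16-bit entry (a passthrough with zero distortion), and assumes 16-bit scales and zero-points per group for the size accounting.</p>

```python
import numpy as np

CANDIDATES = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128), (8, 64), (8, 128)]

def nrmse(W, bits, group_size):
    """Distortion of simulated per-group min/max uniform quantization."""
    g = W.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1) + 1e-12
    q = np.round((g - lo) / scale) * scale + lo
    return float(np.linalg.norm(g - q) / np.linalg.norm(g))

def bits_per_param(bits, group_size, scale_bits=16, zero_bits=16):
    """Effective storage cost: payload bits plus per-group metadata."""
    return bits + (scale_bits + zero_bits) / group_size

def rd_curve(W):
    """[(config, bits/param, NRMSE)] sorted by size cost."""
    pts = [((b, g), bits_per_param(b, g), nrmse(W, b, g)) for b, g in CANDIDATES]
    return sorted(pts, key=lambda p: p[1])
```

<p>Under these assumptions, uniform 4-bit with group size 128 costs $4 + 32/128 = 4.25$ bits per parameter, which matches the 4.25 average bits reported for the uniform baseline in Table 3.</p>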
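<p>The aggregation above reduces to a one-line weighted mean; a minimal sketch (hypothetical helper name):</p>

```python
def group_nrmse(expert_nrmse, expert_params):
    """Parameter-weighted mean NRMSE over experts in a (layer, projection) group.
    expert_nrmse[e] is NRMSE_e(b, g); expert_params[e] is n_e."""
    total = sum(expert_params)
    return sum(n * e for n, e in zip(expert_params, expert_nrmse)) / total
```

<p>For experts of equal shape (the common case within one projection type) this degenerates to a plain mean; the weighting matters only when expert sizes differ.</p>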
<table>
<tr>
<th>Component</th>
<th>Prior Weight ($\pi$)</th>
</tr>
<tr>
<td>Embeddings</td>
<td>10.0x</td>
</tr>
<tr>
<td>LM head</td>
<td>10.0x</td>
</tr>
<tr>
<td>Router weights</td>
<td>8.0x</td>
</tr>
<tr>
<td>First 2 layers</td>
<td>3.0x</td>
</tr>
<tr>
<td>Last 2 layers</td>
<td>2.0x</td>
</tr>
<tr>
<td>LayerNorm</td>
<td>$\infty$ (never quantize)</td>
</tr>
<tr>
<td>All other tensors</td>
<td>1.0x</td>
</tr>
</table>
<div class="caption">Soft protection priors. Higher weights penalize low-precision assignment for structurally important components.</div>
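<p>The prior table can be realized as a first-match pattern lookup over tensor names. The regex patterns and key names below are illustrative assumptions about a Qwen-style checkpoint layout, not the paper's actual matcher; the infinite prior for norms effectively vetoes quantizing them.</p>

```python
import math
import re

# First match wins: norms are checked first so e.g. "input_layernorm" in an
# early layer still gets the infinite (never-quantize) prior.
PRIORS = [
    (r"norm", math.inf),           # LayerNorm / RMSNorm: never quantize
    (r"embed", 10.0),              # embeddings
    (r"lm_head", 10.0),            # LM head
    (r"router|\.gate\.", 8.0),     # router weights (Qwen names them "gate")
    (r"layers\.(0|1)\.", 3.0),     # first 2 layers
]

def protection_prior(name, n_layers=60):
    """Soft protection prior pi for a tensor, from its checkpoint key."""
    for pat, w in PRIORS:
        if re.search(pat, name):
            return w
    m = re.search(r"layers\.(\d+)\.", name)
    if m and int(m.group(1)) >= n_layers - 2:
        return 2.0                 # last 2 layers
    return 1.0                     # everything else
```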
<h3>4.4 SQNR Safety Veto</h3>
<p>
Before optimization, we apply a Signal-to-Quantization-Noise Ratio (SQNR) safety veto
<span class="ref">[22]</span>. For each tensor and each candidate configuration, we compute:
</p>
<div class="equation">
$\text{SQNR}(W, b, g) = 10 \cdot \log_{10} \frac{\|W\|_F^2}{\|W - Q_{b,g}(W)\|_F^2}$
</div>
<p>
Any configuration with $\text{SQNR} < 9\;\text{dB}$ (the default floor) is removed from the
candidate set $\mathcal{C}_i$ before the MCKP solver runs. This hard constraint prevents
catastrophic quantization of tensors where the quantization noise exceeds roughly 13% of the
signal power (about 35% of its RMS amplitude), regardless of what the budget optimization
might prefer.
</p>
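<p>The veto is a direct filter over each tensor's candidate set. A minimal sketch, again using a simulated per-group uniform quantizer in place of the production kernel:</p>

```python
import numpy as np

def sqnr_db(W, bits, group_size):
    """SQNR in dB of simulated per-group uniform quantization, per the
    equation above: 10 * log10(||W||_F^2 / ||W - Q(W)||_F^2)."""
    g = W.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1) + 1e-12
    q = np.round((g - lo) / scale) * scale + lo
    noise = np.linalg.norm(g - q)**2 + 1e-30      # guard exact reconstruction
    return 10.0 * np.log10(np.linalg.norm(g)**2 / noise)

def veto(W, candidates, floor_db=9.0):
    """Drop (bits, group_size) configs below the SQNR safety floor."""
    return [(b, g) for b, g in candidates if sqnr_db(W, b, g) >= floor_db]
```

<p>Note that $9\;\text{dB}$ corresponds to a noise-to-signal power ratio of $10^{-0.9} \approx 0.126$, hence the ~13%-of-power threshold quoted above.</p>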
<h3>4.5 eCDF Normalization</h3>
<p>
Raw metric values span vastly different scales across architectures and model sizes. We
replace all hardcoded normalization bounds with empirical CDF (eCDF) normalization: each
metric value is transformed to its percentile rank across all tensors in the model, yielding
scale-invariant scores in $[0, 1]$. For a metric value $x$ with observed values
$\{x_1, \ldots, x_n\}$:
</p>
<div class="equation">
$\text{eCDF}(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[x_j \leq x]$
</div>
<p>
This eliminates the need for per-metric normalization constants and adapts automatically to
the distribution of any model, whether dense or MoE, 8B or 400B parameters.
</p>
<h3>4.6 Greedy Solver</h3>
<p>
We solve the MCKP using a greedy efficiency ordering. Starting from the minimum-cost (lowest
bit-width) feasible assignment, we enumerate all possible upgrades (transitions to higher
precision) for every group and sort them by efficiency:
</p>
<div class="equation">
$\text{efficiency}(i, c \to c') = \frac{\pi_i \cdot [\text{NRMSE}_i(c) - \text{NRMSE}_i(c')]}{\text{size}_i(c') - \text{size}_i(c)}$
</div>
<p>
Upgrades are applied greedily in decreasing efficiency order until the budget $B$ is
exhausted. For the MCKP with concave distortion curves (which holds empirically for
quantization), this greedy approach is provably near-optimal. The solver completes in under
100 ms for Qwen3.5-397B (2,924 tensor groups), making it practical for interactive
experimentation with different budgets. 
</p>
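<p>The eCDF transform is one call to a sorted-array search; a minimal sketch:</p>

```python
import numpy as np

def ecdf_normalize(values):
    """Map each raw metric value to its percentile rank in (0, 1],
    implementing eCDF(x) = (1/n) * #{x_j <= x} from the equation above."""
    v = np.asarray(values, dtype=float)
    order = np.sort(v)
    # side="right" counts observations <= x, matching the <= indicator.
    return np.searchsorted(order, v, side="right") / len(v)
```

<p>Because the transform depends only on ranks, multiplying a metric by any positive constant leaves its normalized scores unchanged, which is exactly the scale-invariance the pipeline needs across 8B and 400B models.</p>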
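<p>The greedy solver can be sketched with a max-heap of upgrade efficiencies. This is an illustrative reconstruction under two stated assumptions: each group's candidate list is pre-sorted by size, and upgrades step to the next-larger config (valid when the rate-distortion curve is concave, as the text argues). Distortions here are already prior-weighted, so $\pi_i$ does not appear explicitly.</p>

```python
import heapq

def solve_mckp(groups, budget):
    """groups: per-group lists of (size, weighted_nrmse), sorted ascending
    by size. Returns (chosen config index per group, total size used)."""
    choice = [0] * len(groups)                   # start at cheapest configs
    used = sum(g[0][0] for g in groups)
    heap = []

    def push(i):
        """Queue group i's next upgrade, keyed by efficiency (max-heap)."""
        c = choice[i]
        if c + 1 < len(groups[i]):
            s0, d0 = groups[i][c]
            s1, d1 = groups[i][c + 1]
            eff = (d0 - d1) / (s1 - s0)          # distortion saved per byte
            heapq.heappush(heap, (-eff, i, c + 1))

    for i in range(len(groups)):
        push(i)
    while heap:
        _, i, nxt = heapq.heappop(heap)
        if nxt != choice[i] + 1:
            continue                             # stale entry, skip
        extra = groups[i][nxt][0] - groups[i][choice[i]][0]
        if used + extra <= budget:               # apply upgrade if it fits
            used += extra
            choice[i] = nxt
            push(i)
    return choice, used
```

<p>With one heap operation per applied upgrade, the run time is $O(u \log n)$ for $u$ upgrades over $n$ groups, consistent with sub-100 ms solves at 2,924 groups.</p>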
<h3>4.7 Full Pipeline</h3>
<div class="algorithm">
<strong>Algorithm 1:</strong> Per-Expert Mixed-Precision Quantization Pipeline<br><br>
<span class="keyword">Input:</span> Model weights $W = \{W_1, \ldots, W_n\}$, memory budget $B$<br>
<span class="keyword">Output:</span> Per-tensor assignment $\{(b_i, g_i)\}$ satisfying budget $B$<br><br>
<span class="comment">// Phase 1: Sensitivity profiling</span><br>
1. <span class="keyword">for each</span> tensor $W_i$ <span class="keyword">do</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;Compute 4 sensitivity metrics: spectral, kurtosis, output noise, NRMSE<br>
2. Normalize all metrics via eCDF (percentile rank across model)<br><br>
<span class="comment">// Phase 2: Rate-distortion curves</span><br>
3. <span class="keyword">for each</span> tensor $W_i$ <span class="keyword">do</span><br>
&nbsp;&nbsp;&nbsp;&nbsp;Compute NRMSE at 8 (bits, group_size) configurations<br><br>
<span class="comment">// Phase 3: Expert grouping</span><br>
4. Group MoE experts by (layer, projection_type)<br>
5. Aggregate NRMSE across experts (parameter-weighted mean)<br><br>
<span class="comment">// Phase 4: Safety and priors</span><br>
6. Apply SQNR veto: remove configs with SQNR &lt; 9 dB from candidate sets<br>
7. Apply soft protection priors $\pi_i$ to group distortion costs<br><br>
<span class="comment">// Phase 5: Budget-constrained optimization</span><br>
8. Solve MCKP via greedy efficiency ordering under budget $B$<br><br>
<span class="keyword">return</span> $\{(b_i, g_i)\}_{i=1}^{n}$
</div>
<!-- ============================================================ -->
<!-- 5. EXPERIMENTAL SETUP -->
<!-- ============================================================ -->
<h2>5. Experimental Setup</h2>
<h3>5.1 Models</h3>
<p>
We evaluate on three models spanning dense and MoE architectures at different scales:
<strong>Qwen3-8B</strong> <span class="ref">[6]</span> (dense, 8.2B parameters, 399 tensors,
36 layers), <strong>Llama4-Maverick-17B-128E</strong> <span class="ref">[8]</span> (MoE with
128 experts, 401.6B parameters, 1,061 tensors, 48 layers including 24 MoE and 24 dense), and
<strong>Qwen3.5-397B-A17B</strong> <span class="ref">[7]</span> (MoE with 512 experts per
layer, 403.4B parameters, 2,924 tensors, 60 layers). These models represent three distinct
regimes: a small dense model where mixed-precision has limited headroom, a medium-scale MoE
with a hybrid dense/MoE architecture, and a large-scale MoE where the expert parameter count
dwarfs the shared backbone.
</p>
<h3>5.2 Hardware and Framework</h3>
<p>
All experiments run on a single Apple M2 Ultra with 192 GB unified memory, using the
MLX framework <span class="ref">[21]</span> for inference and quantization. The unified memory
architecture eliminates CPU-GPU transfer overhead, and MLX's lazy evaluation enables
processing models that would not fit in discrete GPU memory. Sensitivity analysis of
Qwen3.5-397B completes in approximately 163 minutes, scanning all weight tensors across 55
safetensor shards and profiling 2,924 tensor groups (after expert clustering via Mode B).
</p>
<h3>5.3 Evaluation Protocol</h3>
<p>
We evaluate perplexity on WikiText-2 using 256 sequences of 2,048 tokens each (seed = 42).
We report both mean and median perplexity; the median is more robust to outlier sequences that
can inflate the mean, particularly for MoE models where routing decisions introduce
sequence-level variance. For downstream benchmarks on Qwen3.5-397B, we evaluate MMLU-Pro
(thinking mode), ARC-Challenge, GSM8K, and HumanEval using standard evaluation harnesses.
Baselines include BF16 (full precision), uniform 4-bit with group_size = 128, and uniform
4-bit with group_size = 64.
</p>
<!-- ============================================================ -->
<!-- 6. RESULTS -->
<!-- ============================================================ -->
<h2>6. Results</h2>
<h3>6.1 Perplexity Comparison</h3>
<p>
Table 3 presents the central perplexity comparison on Qwen3.5-397B-A17B. We compare two
versions of our pipeline—SWAN v1 (threshold-based allocation from sensitivity scores)
and SWAN v2 (MCKP-based optimization)—against uniform 4-bit baselines at two group sizes.
</p>
<table>
<tr>
<th>Variant</th>
<th>Avg Bits</th>
<th>Group Size</th>
<th>Size (GB)</th>
<th>Perplexity</th>
<th>vs Uniform</th>
</tr>
<tr>
<td>SWAN v1 (threshold)</td>
<td>5.06</td>
<td>128</td>
<td>199.1</td>
<td>4.283</td>
<td>−0.3%</td>
</tr>
<tr class="highlight-row">
<td><strong>SWAN v2 (MCKP)</strong></td>
<td class="best">4.31</td>
<td>128</td>
<td>199.1</td>
<td><strong>4.283</strong></td>
<td><strong>−0.3%</strong></td>
</tr>
<tr>
<td>Uniform 4-bit</td>
<td>4.25</td>
<td>128</td>
<td>196.0</td>
<td>4.298</td>
<td>—</td>
</tr>
<tr>
<td>Uniform 4-bit</td>
<td>4.25</td>
<td>64</td>
<td>208.5</td>
<td class="best">3.931</td>
<td>—</td>
</tr>
<tr>
<td>SWAN v2</td>
<td>4.56</td>
<td>64</td>
<td>210.6</td>
<td>4.058</td>
<td>+3.2% worse</td>
</tr>
</table>
<div class="caption">Table 3: Perplexity comparison on Qwen3.5-397B-A17B (WikiText-2, 256 sequences, 2048 tokens, seed=42).</div>
| <p> | |
The key result is that SWAN v2 MCKP matches the threshold-based v1 at 15% fewer average
bits (4.31 vs 5.06) while achieving identical perplexity (4.283). This demonstrates that the
| budget-constrained optimizer efficiently reallocates bits from over-protected tensors to where | |
| they matter. At matched group_size = 128, SWAN v2 beats uniform 4-bit by 0.015 perplexity | |
| points (4.283 vs 4.298), a modest but consistent improvement. However, at group_size = 64, | |
| uniform 4-bit (3.931) outperforms SWAN v2 (4.058) by 3.2%, a finding we discuss in detail in | |
| Section 6.4. | |
| </p> | |
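<p>
To make the budget-constrained reallocation concrete, the following is a greedy
incremental-efficiency heuristic for the MCKP, a common approximation in this setting and
not necessarily the paper's exact solver; the tensor sizes and error values are invented
for the demo:
</p>

```python
def allocate_bits(tensors, budget_bytes, choices=(4, 6, 8, 16)):
    """Greedy MCKP heuristic: start every tensor at the lowest precision and
    repeatedly upgrade whichever tensor buys the largest expected-error
    reduction per extra byte, until the byte budget is exhausted."""
    assign = {i: choices[0] for i in range(len(tensors))}
    used = sum(t["params"] * choices[0] // 8 for t in tensors)
    while True:
        best = None
        for i, t in enumerate(tensors):
            step = choices.index(assign[i]) + 1
            if step == len(choices):
                continue  # already at maximum precision
            nb = choices[step]
            extra = t["params"] * (nb - assign[i]) // 8
            gain = t["err"][assign[i]] - t["err"][nb]
            if used + extra <= budget_bytes and gain > 0:
                if best is None or gain / extra > best[0]:
                    best = (gain / extra, i, nb, extra)
        if best is None:
            return assign
        _, i, nb, extra = best
        assign[i] = nb
        used += extra

# Two same-sized tensors, one far more quantization-sensitive; the budget
# (values invented for the demo) admits exactly one upgrade from 4-bit.
tensors = [
    {"params": 800, "err": {4: 10.0, 6: 2.0, 8: 1.0, 16: 0.9}},   # sensitive
    {"params": 800, "err": {4: 1.0, 6: 0.9, 8: 0.85, 16: 0.84}},  # robust
]
assign = allocate_bits(tensors, budget_bytes=1000)
```

<p>
The budget goes to the sensitive tensor, which is the reallocation behavior the
comparison above attributes to SWAN v2.
</p>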
| <h3>6.2 Downstream Benchmarks</h3> | |
| <p> | |
| To verify that perplexity improvements translate to downstream task quality, we evaluate the | |
| SWAN-quantized Qwen3.5-397B on four standard benchmarks: | |
| </p> | |
| <table> | |
| <tr> | |
| <th>Benchmark</th> | |
| <th>Score</th> | |
| </tr> | |
| <tr> | |
| <td>MMLU-Pro (thinking)</td> | |
| <td>77.1%</td> | |
| </tr> | |
| <tr> | |
| <td>ARC-Challenge</td> | |
| <td>96.0%</td> | |
| </tr> | |
| <tr> | |
| <td>GSM8K</td> | |
| <td>88.7%</td> | |
| </tr> | |
| <tr> | |
| <td>HumanEval</td> | |
| <td>78.7%</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 4: Downstream benchmarks for Qwen3.5-397B-A17B quantized with SWAN (4.31 avg bits).</div> | |
| <p> | |
Note: we were unable to run the BF16 baseline on our hardware (the model's 740+ GB footprint exceeds our 192 GB of unified memory),
| so direct degradation measurement is not possible. However, these scores are competitive with | |
| published BF16 results for this model (NVIDIA reports MMLU-Pro 83.7% at BF16), suggesting the | |
| quantized model retains the majority of its reasoning and coding capability. The 96.0% on | |
| ARC-Challenge and 88.7% on GSM8K are particularly strong, as these structured reasoning tasks | |
| are often sensitive to quantization noise. | |
| </p> | |
| <h3>6.3 Bit Allocation Distribution</h3> | |
| <p> | |
| Table 5 shows the breakdown of bit-width assignments across all parameters in Qwen3.5-397B | |
| under the 226 GB budget. The vast majority of expert parameters (89.4%) safely quantize | |
| to 4-bit under the SQNR safety floor, with only 1.6% requiring 8-bit and 1.0% remaining at | |
| 16-bit precision. The 8.0% at 6-bit represents tensors where the MCKP solver found the | |
| intermediate precision to be the most efficient allocation. | |
| </p> | |
| <table> | |
| <tr> | |
| <th>Precision</th> | |
| <th>Parameters</th> | |
| <th>Percentage</th> | |
| </tr> | |
| <tr> | |
| <td>4-bit</td> | |
| <td>360.8B</td> | |
| <td>89.4%</td> | |
| </tr> | |
| <tr> | |
| <td>6-bit</td> | |
| <td>32.2B</td> | |
| <td>8.0%</td> | |
| </tr> | |
| <tr> | |
| <td>8-bit</td> | |
| <td>6.5B</td> | |
| <td>1.6%</td> | |
| </tr> | |
| <tr> | |
| <td>16-bit</td> | |
| <td>3.9B</td> | |
| <td>1.0%</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 5: Bit allocation detail for Qwen3.5-397B-A17B (226 GB budget). 89.4% of parameters tolerate 4-bit quantization.</div> | |
| <p> | |
| The 16-bit parameters (3.9B, 1.0%) correspond primarily to embeddings, the LM head, and | |
| LayerNorm parameters—components protected by the soft priors and the SQNR veto. The | |
| 8-bit parameters (6.5B, 1.6%) include router weights and attention projections in the first | |
| and last two layers, which exhibit the highest kurtosis values in the model. | |
| </p> | |
| <h3>6.4 The Group Size Effect</h3> | |
| <p> | |
| Comparing rows in Table 3 reveals a striking finding: reducing group size from 128 to 64 | |
| yields a 0.225 perplexity improvement for SWAN v2 (4.283 to 4.058), while the entire SWAN | |
| mixed-precision allocation yields only 0.015 improvement over uniform 4-bit at matched group | |
| size. The uniform 4-bit baseline at group_size = 64 achieves 3.931 perplexity—better than | |
| any SWAN variant at group_size = 128. | |
| </p> | |
| <p> | |
| This is the most practically important finding in our study: for MoE models at 400B+ scale, | |
| group size optimization is the primary quality lever. Halving the group size doubles the | |
| number of scale and zero-point parameters, providing finer-grained adaptation to local weight | |
| distributions. This benefit is orthogonal to and substantially larger than mixed-precision | |
| bit-width allocation. | |
| </p> | |
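<p>
The metadata accounting behind this trade-off can be made explicit. Assuming one fp16
scale and one fp16 zero-point per group (32 metadata bits per group, consistent with the
4.25 average bits reported for uniform 4-bit at group_size = 128 in Table 3):
</p>

```python
def effective_bits(bits, group_size, meta_bits=32):
    """Bits per weight including per-group metadata (one fp16 scale plus one
    fp16 zero-point per group is assumed, i.e. 32 bits)."""
    return bits + meta_bits / group_size

g128 = effective_bits(4, 128)   # 4.25 bits/weight, matching Table 3's uniform row
g64 = effective_bits(4, 64)     # 4.50 bits/weight
growth = g64 / g128 - 1         # ~5.9% weight-storage growth from halving groups
```

<p>
The resulting ~5.9% growth is close to the 6.4% increase between the two uniform 4-bit
rows of Table 3 (196.0 GB to 208.5 GB); non-grouped tensors account for the remainder.
</p>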
| <h3>6.5 Codebook Quantization Ablation</h3> | |
| <table> | |
| <tr> | |
| <th>Method</th> | |
| <th>Mean MSE Reduction</th> | |
| <th>Range</th> | |
| </tr> | |
| <tr class="highlight-row"> | |
| <td>$k$-means codebook (256 centroids)</td> | |
| <td class="best">41.1%</td> | |
| <td>39.7% – 42.6%</td> | |
| </tr> | |
| <tr> | |
| <td>Hadamard rotation</td> | |
| <td>8.2%</td> | |
| <td>2.2% – 16.4%</td> | |
| </tr> | |
| <tr> | |
| <td>MSE-optimal clipping</td> | |
| <td>2.7%</td> | |
| <td>—</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 6: Codebook quantization ablation (30 MoE expert tensors, Qwen3.5-397B).</div> | |
| <p> | |
| Codebook quantization using $k$-means with 256 centroids yields a uniformly large MSE | |
| reduction of approximately 41% across all expert tensors, regardless of kurtosis. The | |
| kurtosis-MSE correlation for codebook improvement is only $-0.058$, confirming that the | |
| benefit is structural (non-linear quantization better fits arbitrary weight distributions) | |
| rather than targeted at specific tensor properties. However, codebook quantization requires a | |
| lookup-table (LUT) dequantization kernel that is not available in MLX or most inference | |
| frameworks, creating a deployment blocker. On BF16 models, the correlation is higher | |
| ($+0.537$), but the absolute improvement is similar ($\sim$45% MSE reduction). | |
| </p> | |
| <p> | |
| Hadamard rotation provides a modest 8.2% mean MSE improvement at 4-bit across 40 | |
| tested tensors, with individual improvements ranging from 2.2% to 16.4%. However, the rotation cannot bridge bit levels: | |
| 4-bit with Hadamard rotation is still 187–343$\times$ worse in MSE than 8-bit | |
| quantization (0 of 16 candidate tensors could be downgraded from 8-bit to 4-bit+Hadamard). | |
| MSE-optimal clipping contributes only 2.7%, suggesting that the default clipping in | |
| standard quantization is already near-optimal for these weight distributions. | |
| </p> | |
| <h3>6.6 Cross-Architecture Validation</h3> | |
| <p> | |
| To validate our findings beyond Qwen3.5-397B, we applied the full pipeline to Llama4-Maverick | |
| (128 experts) and Qwen3-8B (dense). On Maverick, the SWAN pipeline assigns 50 expert tensor | |
| groups to 2-bit and 18 to 8-bit, producing a 171.9 GB quantized model with perplexity | |
| 6.343 on WikiText-2. On the dense Qwen3-8B, mixed-precision provides negligible benefit over | |
| uniform 4-bit, consistent with prior observations that small dense models have insufficient | |
| sensitivity heterogeneity for mixed-precision to exploit. | |
| </p> | |
| <p> | |
We additionally validated on several other architectures during development. Llama 3.1 70B
achieved perplexity 4.221 (vs 4.771 for uniform 4-bit at group_size = 128, an 11.5% improvement). Llama 3.3 70B
| achieved 4.379 (vs 5.052, a 13.3% improvement). These results confirm that mixed-precision | |
| provides significant gains for large dense models (70B+), with the benefit increasing as | |
| model heterogeneity grows. On smaller dense models (8B), the benefit is negligible, and on | |
| very large MoE models (397B), the benefit exists but is dwarfed by group size effects. | |
| </p> | |
| <h3>6.7 Expert Routing Analysis</h3> | |
| <p> | |
| To understand how expert utilization patterns might inform quantization decisions, we profiled | |
| the routing behavior of Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across | |
| 100 diverse prompts spanning 40 MoE layers. | |
| </p> | |
| <table> | |
| <tr> | |
| <th>Metric</th> | |
| <th>Value</th> | |
| </tr> | |
| <tr> | |
| <td>Experts per layer</td> | |
| <td>256</td> | |
| </tr> | |
| <tr> | |
| <td>Active per token</td> | |
| <td>8</td> | |
| </tr> | |
| <tr> | |
| <td>Avg dead experts per layer</td> | |
| <td>1.5 / 256 (0.6%)</td> | |
| </tr> | |
| <tr> | |
| <td>Avg entropy ratio</td> | |
| <td>0.91</td> | |
| </tr> | |
| <tr> | |
| <td>Avg Gini coefficient</td> | |
| <td>0.53</td> | |
| </tr> | |
| <tr> | |
| <td>Top-10 expert traffic share</td> | |
| <td>20.4% (vs 3.9% if uniform)</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 7: Expert routing statistics on Qwen3.5-35B-A3B (256 experts, 40 MoE layers, 100 prompts).</div> | |
| <p> | |
| Expert utilization is moderately concentrated: the entropy ratio of 0.91 indicates fairly | |
| uniform but not perfectly balanced routing, while the Gini coefficient of 0.53 shows moderate | |
| concentration. The top-10 experts carry 20.4% of all traffic, approximately 5x their fair | |
| share under uniform routing (3.9%). The average number of completely dead experts is only | |
| 1.5 per layer (0.6%), indicating that nearly all experts contribute to at least some inputs. | |
| </p> | |
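<p>
The concentration measures in Table 7 are standard; a sketch of how they can be computed
from per-expert activation counts (the count vectors below are synthetic):
</p>

```python
import numpy as np

def entropy_ratio(counts):
    """Shannon entropy of traffic over experts, normalized by log(n_experts);
    1.0 means perfectly uniform routing."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

def gini(counts):
    """Gini coefficient of expert traffic; 0 = uniform, near 1 = concentrated."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    return float((2 * np.arange(1, n + 1) - n - 1).dot(x) / (n * x.sum()))

def top_k_share(counts, k=10):
    """Traffic fraction carried by the k busiest experts."""
    x = np.sort(np.asarray(counts, dtype=float))[::-1]
    return float(x[:k].sum() / x.sum())

uniform = np.ones(256)   # perfectly balanced routing over 256 experts
skewed = np.arange(1, 257, dtype=float)  # linearly concentrated traffic
```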
| <p> | |
| A critical methodological finding is that prompt diversity dramatically affects dead expert | |
| counts. With only 5 prompts, approximately 30% of experts appeared dead; at 100 prompts this | |
dropped to 0.6%. Only a sufficiently diverse prompt set reveals that nearly every expert
carries traffic, and pruning decisions based on small calibration sets risk removing experts that are
| essential for less common but valid inputs. | |
| </p> | |
| <h3>6.8 DynaMINT: Tiered Expert Quantization</h3> | |
| <p> | |
| Motivated by the routing analysis in Section 6.7, we developed DynaMINT (Dynamic MINT), a | |
| tiered expert quantization scheme that assigns different precisions to experts based on their | |
| activation frequency. Using the routing statistics from 100-prompt profiling, experts are | |
| classified into four tiers: | |
| </p> | |
| <table> | |
| <tr> | |
| <th>Tier</th> | |
| <th>Precision</th> | |
| <th>Share of Experts</th> | |
| </tr> | |
| <tr> | |
| <td>Critical (high-traffic)</td> | |
| <td>8-bit</td> | |
| <td>19.9%</td> | |
| </tr> | |
| <tr> | |
| <td>Standard</td> | |
| <td>4-bit</td> | |
| <td>64.8%</td> | |
| </tr> | |
| <tr> | |
| <td>Deprioritized (low-traffic)</td> | |
| <td>2-bit</td> | |
| <td>11.6%</td> | |
| </tr> | |
| <tr> | |
| <td>Prunable (near-zero traffic)</td> | |
| <td>0-bit (removed)</td> | |
| <td>3.6%</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 8: DynaMINT tier distribution on Qwen3.5-35B-A3B.</div> | |
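<p>
A sketch of frequency-based tier assignment; the quantile thresholds below are
illustrative placeholders, not DynaMINT's calibrated cutoffs:
</p>

```python
import numpy as np

def assign_tiers(freqs, crit_q=0.80, low_q=0.15, dead_tol=0.0):
    """Tier experts by activation frequency. The quantile thresholds are
    illustrative placeholders. Returns per-expert bit-widths:
    8 (critical), 4 (standard), 2 (deprioritized), or 0 (pruned)."""
    freqs = np.asarray(freqs, dtype=float)
    bits = np.full(len(freqs), 4)
    bits[freqs >= np.quantile(freqs, crit_q)] = 8
    bits[freqs <= np.quantile(freqs, low_q)] = 2
    bits[freqs <= dead_tol] = 0   # near-zero traffic -> removed
    return bits

# 4 never-activated experts plus 96 with increasing traffic (synthetic)
freqs = np.concatenate([np.zeros(4), np.linspace(0.01, 1.0, 96)])
bits = assign_tiers(freqs)
```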
| <p> | |
| We evaluated DynaMINT against the uniform MINT baseline on Qwen3.5-35B-A3B: | |
| </p> | |
| <table> | |
| <tr> | |
| <th>Variant</th> | |
| <th>Perplexity</th> | |
| <th>vs Baseline</th> | |
| <th>Speed (tok/s)</th> | |
| </tr> | |
| <tr> | |
| <td>MINT uniform (baseline)</td> | |
| <td>6.580</td> | |
| <td>—</td> | |
| <td>70.1</td> | |
| </tr> | |
| <tr class="highlight-row"> | |
| <td><strong>DynaMINT (tiered)</strong></td> | |
| <td>6.613</td> | |
| <td>+0.5%</td> | |
| <td>9.6</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 9: DynaMINT evaluation on Qwen3.5-35B-A3B (WikiText-2).</div> | |
| <p> | |
| DynaMINT maintains quality with only +0.5% perplexity degradation despite 11.6% of experts | |
| at 2-bit and 3.6% pruned entirely. The conversion process is fast, completing all 40 layers | |
| in 1.7 seconds. MoE weight size increases by 10.5% due to the 8-bit critical tier; this | |
| could be offset by adjusting tier thresholds to reduce the critical tier percentage. | |
| Generation quality is preserved: the tiered model produces coherent chain-of-thought responses | |
| on all three test prompts. | |
| </p> | |
| <p> | |
| The primary limitation is inference speed: DynaMINT achieves only 9.6 tok/s compared to | |
| 70.1 tok/s for the uniform baseline, a 7x slowdown. This overhead comes entirely from | |
| Python-level per-tier dispatch—the current prototype launches separate kernel calls for | |
| each precision tier. This is an engineering rather than fundamental limitation: sorted dispatch | |
| (grouping tokens by their routed expert's tier before kernel launch) or a native multi-precision | |
| kernel would eliminate most of this overhead. | |
| </p> | |
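<p>
The sorted-dispatch remedy can be sketched as follows. For simplicity this assumes top-1
routing (the real model activates 8 experts per token); grouping token indices by tier
lets each precision kernel launch once per step instead of once per token:
</p>

```python
import numpy as np

def sorted_dispatch(routed_expert, expert_tier):
    """Group token indices by the precision tier of their routed expert so that
    each tier's kernel launches once per step instead of per token.
    Returns {tier_bits: array of token indices}."""
    tiers = expert_tier[routed_expert]        # tier of each token's expert
    order = np.argsort(tiers, kind="stable")  # token indices grouped by tier
    return {int(t): order[tiers[order] == t] for t in np.unique(tiers)}

# 4 experts quantized at [8, 4, 4, 2] bits; 8 tokens, top-1 routed (synthetic)
expert_tier = np.array([8, 4, 4, 2])
routed = np.array([0, 1, 1, 2, 3, 0, 3, 1])
groups = sorted_dispatch(routed, expert_tier)
```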
| <h3>6.9 Expert Pruning</h3> | |
| <p> | |
| To evaluate expert pruning as an orthogonal compression technique, we measured the perplexity | |
| impact of zeroing out the least-activated experts in Qwen3.5-35B-A3B based on the routing | |
| statistics from Section 6.7. | |
| </p> | |
| <table> | |
| <tr> | |
| <th>Pruning Level</th> | |
| <th>Experts Removed</th> | |
| <th>Perplexity</th> | |
| <th>Degradation</th> | |
| </tr> | |
| <tr> | |
| <td>0% (baseline)</td> | |
| <td>0</td> | |
| <td class="best">6.580</td> | |
| <td>—</td> | |
| </tr> | |
| <tr> | |
| <td>5%</td> | |
| <td>480</td> | |
| <td>86.906</td> | |
| <td>13.2x</td> | |
| </tr> | |
| <tr> | |
| <td>10%</td> | |
| <td>960</td> | |
| <td>15,894</td> | |
| <td>2,416x</td> | |
| </tr> | |
| <tr> | |
| <td>25%</td> | |
| <td>2,400</td> | |
| <td>906,762</td> | |
| <td>137,805x</td> | |
| </tr> | |
| </table> | |
| <div class="caption">Table 10: Expert pruning curve on Qwen3.5-35B-A3B. Even 5% pruning causes catastrophic degradation.</div> | |
| <p> | |
| This is a strong negative result: Qwen3.5-35B-A3B is extremely sensitive to expert pruning. | |
| Removing just 5% of experts (480 out of 9,600 total across 40 layers) causes a 13x perplexity | |
| degradation, from 6.580 to 86.906. At 10% pruning, the model effectively collapses with | |
| perplexity exceeding 15,000. At 25%, the model produces near-random output. | |
| </p> | |
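<p>
The selection rule whose failure this section documents is deliberately simple, a
frequency-based bottom-$k$ cut per layer (counts below are synthetic):
</p>

```python
import numpy as np

def least_activated(counts, frac):
    """Indices of the bottom-`frac` experts by activation count in one layer;
    these are the experts zeroed out in the pruning experiment."""
    k = int(round(len(counts) * frac))
    return np.argsort(counts, kind="stable")[:k]

counts = np.array([120, 3, 0, 55, 7, 200, 1, 90])
prune = least_activated(counts, 0.25)  # bottom 25% of 8 experts -> 2 indices
```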
| <p> | |
| This result directly contradicts the assumption that rarely-activated experts can be safely | |
| removed. The routing mechanism in this architecture relies on having all experts available: | |
| even experts with low average activation frequency appear to be essential for specific input | |
| distributions. Activation frequency alone is not a safe pruning criterion—an expert | |
| activated on only 0.1% of tokens may still be critical for those tokens, and its removal | |
| cascades through the routing softmax, redistributing probability mass in ways that compound | |
| across layers. This finding supports our quantization-first approach over pruning for MoE | |
| compression, and suggests that methods proposing significant expert pruning | |
| <span class="ref">[15]</span> may not generalize to architectures with 256+ experts and | |
| fine-grained routing. | |
| </p> | |
| <!-- ============================================================ --> | |
| <!-- 7. DISCUSSION & LIMITATIONS --> | |
| <!-- ============================================================ --> | |
| <h2>7. Discussion & Limitations</h2> | |
| <h3>7.1 Group Size Dominates Bit-Width</h3> | |
| <p> | |
| The perplexity data presents a clear hierarchy of quantization quality levers for MoE models. | |
| Halving group size from 128 to 64 yields a 0.225 perplexity improvement—an order of | |
| magnitude larger than the 0.015 improvement from SWAN's mixed-precision allocation at matched | |
| group size. This finding has immediate practical implications: practitioners should prioritize | |
| group size reduction (accepting the $\sim$7% size increase from additional scale parameters) | |
| before investing in mixed-precision profiling. The mixed-precision pipeline remains valuable | |
| for squeezing the last fraction of quality at a given size budget, but it is a secondary | |
| lever. | |
| </p> | |
| <h3>7.2 The SwitchLinear Constraint</h3> | |
| <p> | |
| MLX's <code>SwitchLinear</code> module stores all expert weights in a single contiguous | |
| tensor, requiring all experts within a layer to share one quantization configuration. This | |
| prevents true per-expert bit-width differentiation: our analysis computes per-expert | |
| sensitivity, but the allocation decision is necessarily per-expert-group. We aggregate via | |
| parameter-weighted mean, which is conservative but not optimal. A native | |
| <code>MixedBitSwitchGLU</code> kernel that supports heterogeneous expert quantization within | |
| a single layer would unlock the full potential of per-expert sensitivity analysis. Our | |
| metric correlation data (Table 2) suggests that meaningful sensitivity variance exists within | |
| expert groups, as kurtosis scores span a wide interquartile range (0.014 to 0.338 across all | |
| 2,347 tensors) and this variance is present both within and across expert groups. | |
| </p> | |
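<p>
The parameter-weighted aggregation can be sketched directly from the description above;
the expert scores and sizes are invented for the demo:
</p>

```python
import numpy as np

def group_sensitivity(expert_scores, expert_params):
    """Collapse per-expert sensitivity scores into one score for the shared
    SwitchLinear tensor via a parameter-weighted mean (conservative: large,
    sensitive experts dominate the group's bit-width decision)."""
    s = np.asarray(expert_scores, dtype=float)
    w = np.asarray(expert_params, dtype=float)
    return float((s * w).sum() / w.sum())

score_eq = group_sensitivity([0.1, 0.3], [100, 100])  # equal sizes: plain mean
score_wt = group_sensitivity([0.1, 0.3], [300, 100])  # big robust expert dominates
```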
| <h3>7.3 Output Sensitivity Saturation</h3> | |
| <p> | |
| On 512-expert models, the output noise amplification metric saturates at 1.0 (its normalized | |
| maximum) for 99.5% of expert tensors. This is because MoE expert weight matrices tend to have | |
| similar spectral norms—the routing mechanism and load balancing during training encourage | |
| experts to operate at similar scales. The practical implication is that simpler profiling | |
| pipelines using only kurtosis and reconstruction error may be sufficient for large MoE models, | |
| reducing profiling cost without sacrificing allocation quality. For dense models and small MoE | |
| models ($\leq 32$ experts), output sensitivity retains discriminatory power and should be | |
| included. | |
| </p> | |
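<p>
One plausible reading of the saturation effect treats the metric as a max-normalized
spectral norm per expert matrix; this exact normalization is our assumption for
illustration, not the paper's definition:
</p>

```python
import numpy as np

def normalized_amplification(expert_mats):
    """Largest singular value of each expert matrix, normalized by the group
    maximum (an assumed form of the saturating metric)."""
    norms = np.array([np.linalg.norm(m, 2) for m in expert_mats])
    return norms / norms.max()

rng = np.random.default_rng(0)
# load-balanced experts have similar spectral scales, so the metric pins near 1.0
experts = [rng.normal(size=(64, 64)) for _ in range(8)]
amp = normalized_amplification(experts)
```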
| <h3>7.4 The Codebook Opportunity</h3> | |
| <p> | |
| The 41% MSE reduction from $k$-means codebook quantization represents a substantial | |
| untapped opportunity for MoE compression. Unlike mixed-precision allocation (which | |
| redistributes a fixed bit budget) or Hadamard rotation (which provides modest within-bitwidth | |
| improvement), codebook quantization fundamentally changes the representation power per bit. | |
| The uniform improvement across kurtosis levels ($\rho = -0.058$) indicates this benefit is | |
| structural—non-linear quantization grids better fit arbitrary weight distributions | |
| regardless of their statistical properties. This makes codebook quantization a complementary | |
| technique rather than a replacement for mixed-precision allocation: one optimizes the | |
| quantization grid, the other optimizes the bit budget distribution. The primary barrier is | |
| kernel support: an efficient LUT dequantization kernel is needed for practical deployment, | |
| which is absent from MLX, CUDA (for standard formats), and most inference frameworks. | |
| </p> | |
| <h3>7.5 Limitations</h3> | |
| <p> | |
| Several limitations constrain the scope of our findings: | |
| </p> | |
| <ul> | |
| <li><strong>Dispatch kernel overhead.</strong> True per-expert mixed precision at | |
| inference time requires a <code>MixedBitSwitchGLU</code> kernel that dispatches tokens to | |
| experts quantized at different bit-widths. DynaMINT demonstrates a pure-Python prototype | |
| achieving only 0.5% perplexity degradation with tiered quantization, but with 7x speed | |
| overhead from per-tier kernel launches. Sorted dispatch or native kernel support would | |
| eliminate this overhead.</li> | |
| <li><strong>Expert pruning is destructive.</strong> Our pruning study on Qwen3.5-35B-A3B | |
| (Section 6.9) shows that MoE routing relies on expert diversity rather than individual | |
| expert quality. Even 5% expert removal causes 13x perplexity degradation, indicating that | |
| activation frequency alone is not a safe pruning criterion. Methods proposing significant | |
| expert pruning may not generalize to fine-grained MoE architectures.</li> | |
| <li><strong>Single hardware platform.</strong> All experiments run on Apple Silicon (M2 | |
| Ultra, 192 GB). While the sensitivity analysis and MCKP formulation are | |
| hardware-independent, the quantization implementation and performance characteristics are | |
| specific to MLX. Validation on CUDA-based frameworks (e.g., with GPTQ or AWQ backends) | |
| would strengthen the generalizability of our results.</li> | |
| <li><strong>No perplexity measurement at BF16 for Qwen3.5-397B.</strong> The full BF16 | |
| model (740+ GB) does not fit in 192 GB, precluding direct BF16 baseline measurement on our | |
| hardware. Our comparisons are against uniform quantized baselines.</li> | |
| </ul> | |
| <!-- ============================================================ --> | |
| <!-- 8. CONCLUSION --> | |
| <!-- ============================================================ --> | |
| <h2>8. Conclusion</h2> | |
| <p> | |
| We have presented the first data-free mixed-precision quantization pipeline validated on | |
| 512-expert MoE models at 403 billion parameters. By profiling weight sensitivity using four | |
| complementary metrics—spectral features, per-group kurtosis, output noise amplification, | |
| and reconstruction error—we characterize 2,347 tensor groups across Qwen3.5-397B-A17B | |
| without requiring any calibration data. Our MCKP formulation with expert grouping constraints | |
and SQNR safety vetoes solves in under 100 ms on every model we tested, producing provably
| near-optimal (bits, group_size) assignments per expert group. Key findings include: kurtosis is | |
| the dominant sensitivity predictor (Spearman $\rho = 0.795$), 89.4% of expert parameters | |
| safely quantize to 4-bit, and group size has a larger impact on perplexity than bit-width | |
| allocation at this scale. Codebook quantization (+41% MSE reduction) and Hadamard rotation | |
| (+8.2%) establish practical boundaries for future MoE compression techniques. | |
| </p> | |
| <p> | |
| We further contribute DynaMINT, a tiered expert quantization scheme informed by activation | |
| profiling that assigns critical experts to 8-bit, standard experts to 4-bit, and deprioritized | |
| experts to 2-bit. DynaMINT maintains quality at only +0.5% perplexity degradation despite | |
| 11.6% of experts at 2-bit and 3.6% pruned entirely, demonstrating that activation-aware | |
| tiering is a viable complement to weight-based sensitivity analysis. Our expert pruning study | |
| provides an important negative result: even 5% expert removal causes 13x perplexity | |
| degradation on Qwen3.5-35B-A3B, establishing that activation frequency alone is not a safe | |
| pruning criterion and that MoE routing relies fundamentally on expert diversity. | |
| </p> | |
| <p> | |
| We release all code, sensitivity manifests, and quantized models to facilitate reproduction | |
| and extension. The MINT pipeline is available at | |
| <a href="https://github.com/baa-ai/MINT">github.com/baa-ai/MINT</a>, with pre-quantized | |
| models hosted at <a href="https://huggingface.co/baa-ai">huggingface.co/baa-ai</a>. We | |
| believe the key actionable insight—that group size dominates bit-width allocation for | |
| large MoE models—should inform both practitioners choosing quantization configurations | |
| and framework developers prioritizing kernel optimizations. Future work should focus on | |
| native multi-precision dispatch kernels (eliminating DynaMINT's 7x Python overhead), codebook | |
| dequantization support, and joint optimization of group size and bit-width allocation. | |
| </p> | |
| <!-- ============================================================ --> | |
| <!-- REFERENCES --> | |
| <!-- ============================================================ --> | |
| <h2>References</h2> | |
| <p class="bib">[1] Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. <em>ICLR</em>, 2021.</p> | |
| <p class="bib">[2] Fedus, W., Zoph, B., and Shazeer, N. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. <em>JMLR</em>, 23(120):1–39, 2022.</p> | |
| <p class="bib">[3] Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de las Casas, D., Hanna, E.B., Bressand, F., et al. Mixtral of Experts. <em>arXiv preprint arXiv:2401.04088</em>, 2024.</p> | |
| <p class="bib">[4] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient Mixture-of-Experts language model. <em>arXiv preprint arXiv:2405.04434</em>, 2024.</p> | |
| <p class="bib">[5] DeepSeek-AI. DeepSeek-V3 technical report. <em>arXiv preprint arXiv:2412.19437</em>, 2025.</p> | |
| <p class="bib">[6] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen3 technical report. <em>arXiv preprint arXiv:2505.09388</em>, 2025.</p> | |
| <p class="bib">[7] Alibaba Cloud. Qwen3.5: Advancing the frontier with 512 experts. <em>Technical report</em>, 2025.</p> | |
| <p class="bib">[8] Meta AI. Llama 4: Open-weight Mixture-of-Experts models. <em>Technical report</em>, 2025.</p> | |
| <p class="bib">[9] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. <em>ICLR</em>, 2023.</p> | |
| <p class="bib">[10] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. <em>MLSys</em>, 2024.</p> | |
| <p class="bib">[11] Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M.W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. <em>ICML</em>, 2024.</p> | |
| <p class="bib">[12] Badri, H. and Shaji, A. HQQ: Half-quadratic quantization of large language models. <em>NeurIPS Workshop on Efficient Natural Language and Speech Processing</em>, 2024.</p> | |
| <p class="bib">[13] Guo, C., Chen, J., Li, J., Zhou, Y., Chen, T., Xie, L., and Zhang, B. MXQ: Mixed-precision quantization for efficient LLM deployment. <em>arXiv preprint arXiv:2401.12917</em>, 2024.</p> | |
| <p class="bib">[14] Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. <em>ICML</em>, 2024.</p> | |
| <p class="bib">[15] Li, W., Zhang, Y., Sun, H., Wang, X., and Qiu, X. MC-MoE: Mixture compressor for Mixture-of-Experts LLMs gains more. <em>arXiv preprint arXiv:2410.06270</em>, 2024.</p> | |
| <p class="bib">[16] Frantar, E. and Alistarh, D. QMoE: Practical sub-1-bit compression of trillion-parameter models. <em>arXiv preprint arXiv:2310.16795</em>, 2023.</p> | |
| <p class="bib">[17] Kim, Y., Lee, J., Park, S., and Shin, J. MoEQuant: Expert-wise quantization for Mixture-of-Experts models. <em>arXiv preprint arXiv:2406.02279</em>, 2024.</p> | |
| <p class="bib">[18] Chen, Z., Qin, K., Zhang, Y., Li, P., Zhao, J., and Liang, X. DynaExq: Dynamic expert-level mixed-precision quantization for Mixture-of-Experts. <em>arXiv preprint arXiv:2405.11009</em>, 2024.</p> | |
| <p class="bib">[19] Black Sheep AI. SWAN: SmartQuant data-free per-tensor mixed-precision quantization for LLMs on Apple Silicon. <em>Technical report, baa.ai</em>, 2026.</p> | |
| <p class="bib">[20] Black Sheep AI. MINT: Memory-Informed N-bit Tuning — compute-optimal data-free mixed-precision quantization for LLMs. <em>Technical report, baa.ai</em>, 2026.</p> | |
| <p class="bib">[21] Apple. MLX: An array framework for Apple Silicon. <em>GitHub repository, github.com/ml-explore/mlx</em>, 2024.</p> | |
| <p class="bib">[22] Gray, R.M. and Neuhoff, D.L. Quantization. <em>IEEE Transactions on Information Theory</em>, 44(6):2325–2383, 1998.</p> | |
| <p class="bib">[23] Kellerer, H., Pferschy, U., and Pisinger, D. <em>Knapsack Problems</em>. Springer, Berlin, 2004.</p> | |
| <p class="bib">[24] Barlow, R.E., Bartholomew, D.J., Bremner, J.M., and Brunk, H.D. <em>Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression</em>. Wiley, New York, 1972.</p> | |
| <p class="bib">[25] Hampel, F.R. The influence curve and its role in robust estimation. <em>Journal of the American Statistical Association</em>, 69(346):383–393, 1974.</p> | |
| <div class="footnote"> | |
| <p>Code and models: <a href="https://github.com/baa-ai/MINT">github.com/baa-ai/MINT</a> | | |
| <a href="https://huggingface.co/baa-ai">huggingface.co/baa-ai</a></p> | |
| <p>Correspondence: research@baa.ai</p> | |
| </div> | |
| </body> | |
| </html> | |