All experts calibration?

#4
by mratsim - opened

Can you confirm whether you calibrate all experts or not?

When looking at the ModelOpt DeepSeek example, specific code is needed to ensure that all experts see the calibration samples: https://github.com/NVIDIA/Model-Optimizer/blob/0.40.0/examples/deepseek/ptq.py#L201-L219

    class CalibMoe(deekseep_model.MoE):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._setup()

        def _setup(self):
            self._original_topk = self.gate.topk
            self._original_topk_groups = self.gate.topk_groups

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Forward all tokens to all experts for calibration
            self.gate.topk = self.n_routed_experts
            self.gate.topk_groups = self.gate.n_groups
            super().forward(x)
            # Restore the original topk and topk_groups
            self.gate.topk = self._original_topk
            self.gate.topk_groups = self._original_topk_groups

            return super().forward(x)

Otherwise, I suspect some experts are quantized very naively (with little or no calibration data), significantly impacting quality for use cases not represented in the calibration data.
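
The snippet above temporarily widens top-k so that a throwaway forward pass routes every token to every expert (collecting activation statistics for all of them), then restores the original routing and returns the output of a normal pass. A toolkit-agnostic way to check whether a calibration loop has this problem is to hook the expert modules and count how many calibration tokens each one actually receives. A rough sketch below; the `expert_modules` mapping and `count_expert_hits` helper are hypothetical, and the real attribute path depends on the model implementation:

    # Rough sketch: count how many calibration tokens reach each routed expert
    # using plain PyTorch forward hooks. The expert_modules mapping is an
    # assumption about how the experts are exposed on the model; adjust the
    # attribute path to the actual implementation.
    from collections import Counter

    import torch

    def count_expert_hits(model, calib_batches, expert_modules):
        hits = Counter({name: 0 for name in expert_modules})
        handles = []

        def make_hook(name):
            def hook(module, inputs, output):
                # inputs[0] is the (num_routed_tokens, hidden) slice sent to this expert
                if inputs and inputs[0].numel() > 0:
                    hits[name] += inputs[0].shape[0]
            return hook

        for name, module in expert_modules.items():
            handles.append(module.register_forward_hook(make_hook(name)))

        with torch.no_grad():
            for batch in calib_batches:
                model(batch)

        for handle in handles:
            handle.remove()

        never_hit = [name for name, n in hits.items() if n == 0]
        return hits, never_hit

Any expert that ends up in `never_hit` collected no activation statistics during calibration, so its scales presumably fall back to whatever default the quantizer uses, which is exactly the failure mode I'm worried about.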

I appreciate the question. I will admit that I didn't consider the possibility of the script missing experts. I'm running the MMLU benchmark to confirm expert integrity post-quantization. If I find unusually low scores, I'll recreate the checkpoint with revamped quantization logic. I'll get back to you with results.
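
For anyone who wants to reproduce the evaluation, a minimal sketch with lm_eval's Python API is below. The `hf` backend and `dtype=auto` are placeholders rather than my exact invocation; an NVFP4 checkpoint generally needs a backend that supports it, such as vLLM or TensorRT-LLM.

    # Minimal sketch of a zero-shot MMLU run with lm_eval's Python API.
    # The backend and model_args are placeholders; point them at whatever
    # actually serves the NVFP4 checkpoint.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Salyut1/GLM-4.7-NVFP4,dtype=auto",
        tasks=["mmlu"],
        num_fewshot=0,
    )

    # Per-task and per-group accuracy with stderr (metric keys can vary
    # slightly across lm_eval versions).
    for task, metrics in results["results"].items():
        print(task, metrics.get("acc,none"), metrics.get("acc_stderr,none"))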

I ran the MMLU benchmark using lm_eval. It did pretty well! Better than I expected in all honesty. Here are the results:

MMLU Benchmark Results: Salyut1/GLM-4.7-NVFP4

Summary Table

| Groups | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| MMLU (Total) | 2 | acc ↑ | 0.8348 | ±0.0030 |
| Social Sciences | 2 | acc ↑ | 0.9051 | ±0.0052 |
| Other | 2 | acc ↑ | 0.8684 | ±0.0058 |
| STEM | 2 | acc ↑ | 0.8351 | ±0.0064 |
| Humanities | 2 | acc ↑ | 0.7664 | ±0.0059 |

STEM

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School Biology | 0 | acc ↑ | 0.9516 | ±0.0122 |
| College Biology | 0 | acc ↑ | 0.9514 | ±0.0180 |
| Astronomy | 0 | acc ↑ | 0.9474 | ±0.0182 |
| High School Computer Science | 0 | acc ↑ | 0.9300 | ±0.0256 |
| Conceptual Physics | 0 | acc ↑ | 0.9064 | ±0.0190 |
| Elementary Mathematics | 0 | acc ↑ | 0.8862 | ±0.0164 |
| Electrical Engineering | 0 | acc ↑ | 0.8690 | ±0.0281 |
| High School Statistics | 0 | acc ↑ | 0.8565 | ±0.0239 |
| College Computer Science | 0 | acc ↑ | 0.8400 | ±0.0368 |
| Anatomy | 0 | acc ↑ | 0.8296 | ±0.0325 |
| High School Physics | 0 | acc ↑ | 0.7947 | ±0.0330 |
| High School Chemistry | 0 | acc ↑ | 0.7882 | ±0.0287 |
| Machine Learning | 0 | acc ↑ | 0.7679 | ±0.0401 |
| College Physics | 0 | acc ↑ | 0.7647 | ±0.0422 |
| Abstract Algebra | 0 | acc ↑ | 0.6800 | ±0.0469 |
| College Chemistry | 0 | acc ↑ | 0.6800 | ±0.0469 |
| College Mathematics | 0 | acc ↑ | 0.6800 | ±0.0469 |
| High School Mathematics | 0 | acc ↑ | 0.6481 | ±0.0291 |

Social Sciences

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School Government/Politics | 0 | acc ↑ | 0.9793 | ±0.0103 |
| High School Microeconomics | 0 | acc ↑ | 0.9706 | ±0.0110 |
| High School Psychology | 0 | acc ↑ | 0.9523 | ±0.0091 |
| Human Sexuality | 0 | acc ↑ | 0.9313 | ±0.0222 |
| Sociology | 0 | acc ↑ | 0.9204 | ±0.0191 |
| High School Geography | 0 | acc ↑ | 0.9192 | ±0.0194 |
| High School Macroeconomics | 0 | acc ↑ | 0.9000 | ±0.0152 |
| US Foreign Policy | 0 | acc ↑ | 0.9000 | ±0.0302 |
| Professional Psychology | 0 | acc ↑ | 0.8725 | ±0.0135 |
| Security Studies | 0 | acc ↑ | 0.8653 | ±0.0219 |
| Public Relations | 0 | acc ↑ | 0.7636 | ±0.0407 |
| Econometrics | 0 | acc ↑ | 0.7544 | ±0.0405 |

Humanities

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School US History | 0 | acc ↑ | 0.9461 | ±0.0159 |
| High School World History | 0 | acc ↑ | 0.9367 | ±0.0158 |
| World Religions | 0 | acc ↑ | 0.9064 | ±0.0223 |
| Prehistory | 0 | acc ↑ | 0.8981 | ±0.0168 |
| International Law | 0 | acc ↑ | 0.8926 | ±0.0283 |
| Jurisprudence | 0 | acc ↑ | 0.8889 | ±0.0304 |
| Logical Fallacies | 0 | acc ↑ | 0.8834 | ±0.0252 |
| High School European History | 0 | acc ↑ | 0.8788 | ±0.0255 |
| Moral Disputes | 0 | acc ↑ | 0.8699 | ±0.0181 |
| Philosophy | 0 | acc ↑ | 0.8617 | ±0.0196 |
| Formal Logic | 0 | acc ↑ | 0.7460 | ±0.0389 |
| Professional Law | 0 | acc ↑ | 0.6610 | ±0.0121 |
| Moral Scenarios | 0 | acc ↑ | 0.6425 | ±0.0160 |

Other

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| Medical Genetics | 0 | acc ↑ | 0.9800 | ±0.0141 |
| Marketing | 0 | acc ↑ | 0.9530 | ±0.0139 |
| Miscellaneous | 0 | acc ↑ | 0.9374 | ±0.0087 |
| Professional Medicine | 0 | acc ↑ | 0.9301 | ±0.0155 |
| Clinical Knowledge | 0 | acc ↑ | 0.9057 | ±0.0180 |
| Nutrition | 0 | acc ↑ | 0.9052 | ±0.0168 |
| Management | 0 | acc ↑ | 0.8932 | ±0.0306 |
| Business Ethics | 0 | acc ↑ | 0.8600 | ±0.0349 |
| Computer Security | 0 | acc ↑ | 0.8600 | ±0.0349 |
| Human Aging | 0 | acc ↑ | 0.8161 | ±0.0260 |
| College Medicine | 0 | acc ↑ | 0.7977 | ±0.0306 |
| Professional Accounting | 0 | acc ↑ | 0.7624 | ±0.0254 |
| Global Facts | 0 | acc ↑ | 0.6500 | ±0.0479 |
| Virology | 0 | acc ↑ | 0.5723 | ±0.0385 |
Salyut1 changed discussion status to closed

For reference, here is a discussion about all-experts calibration in another quantization framework: https://github.com/ModelCloud/GPTQModel/pull/2235

And visual benchmarks: https://avtc.github.io/aquarium-side-by-side/

All experts on the right

