All-experts calibration?
Can you confirm whether you calibrate all experts or not?
Looking at the ModelOpt DeepSeek example, a dedicated snippet of code is needed to ensure that all experts see the calibration samples: https://github.com/NVIDIA/Model-Optimizer/blob/0.40.0/examples/deepseek/ptq.py#L201-L219
```python
class CalibMoe(deepseek_model.MoE):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._setup()

    def _setup(self):
        self._original_topk = self.gate.topk
        self._original_topk_groups = self.gate.topk_groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward all tokens to all experts for calibration
        self.gate.topk = self.n_routed_experts
        self.gate.topk_groups = self.gate.n_groups
        super().forward(x)
        # Restore the original topk and topk_groups
        self.gate.topk = self._original_topk
        self.gate.topk_groups = self._original_topk_groups
        return super().forward(x)
```
Otherwise, I suspect some experts are quantized with little or no calibration data, significantly impacting quality for use cases not represented in the calibration set.
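To make the risk concrete, here is a quick torch-free sketch. The expert count, top-k, and router-skew model are all made up for illustration; it only shows how a load-imbalanced router can starve some experts of calibration tokens under normal top-k routing:

```python
import random

random.seed(0)

N_EXPERTS = 32   # made-up routed-expert count
TOP_K = 2        # made-up router top-k
N_TOKENS = 1000  # calibration tokens

# Made-up skewed router: each expert has a fixed logit bias, so a few
# "popular" experts dominate top-k selection, as real routers tend to.
bias = [4.0 / (i + 1) for i in range(N_EXPERTS)]

tokens_per_expert = [0] * N_EXPERTS
for _ in range(N_TOKENS):
    scores = [bias[i] + random.gauss(0, 0.5) for i in range(N_EXPERTS)]
    for i in sorted(range(N_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]:
        tokens_per_expert[i] += 1

starved = sum(1 for c in tokens_per_expert if c < 10)
print(f"{starved}/{N_EXPERTS} experts saw fewer than 10 of {N_TOKENS} tokens")
```

With all-experts calibration, every expert would instead see all N_TOKENS tokens, so its quantization scales are derived from real activation statistics rather than a handful of samples.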
I appreciate the question. I'll admit I hadn't considered the possibility of the script missing experts. I'm running the MMLU benchmark to check expert integrity post-quantization; if I find unusually low scores, I'll recreate the checkpoint with revamped calibration logic. I'll get back to you with results.
I ran the MMLU benchmark using lm_eval. It did quite well, better than I expected in all honesty. Here are the results:
MMLU Benchmark Results: Salyut1/GLM-4.7-NVFP4
Summary Table
| Groups | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| MMLU (Total) | 2 | acc ↑ | 0.8348 | ± 0.0030 |
| Social Sciences | 2 | acc ↑ | 0.9051 | ± 0.0052 |
| Other | 2 | acc ↑ | 0.8684 | ± 0.0058 |
| STEM | 2 | acc ↑ | 0.8351 | ± 0.0064 |
| Humanities | 2 | acc ↑ | 0.7664 | ± 0.0059 |
STEM
| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School Biology | 0 | acc ↑ | 0.9516 | ± 0.0122 |
| College Biology | 0 | acc ↑ | 0.9514 | ± 0.0180 |
| Astronomy | 0 | acc ↑ | 0.9474 | ± 0.0182 |
| High School Computer Science | 0 | acc ↑ | 0.9300 | ± 0.0256 |
| Conceptual Physics | 0 | acc ↑ | 0.9064 | ± 0.0190 |
| Elementary Mathematics | 0 | acc ↑ | 0.8862 | ± 0.0164 |
| Electrical Engineering | 0 | acc ↑ | 0.8690 | ± 0.0281 |
| High School Statistics | 0 | acc ↑ | 0.8565 | ± 0.0239 |
| College Computer Science | 0 | acc ↑ | 0.8400 | ± 0.0368 |
| Anatomy | 0 | acc ↑ | 0.8296 | ± 0.0325 |
| High School Physics | 0 | acc ↑ | 0.7947 | ± 0.0330 |
| High School Chemistry | 0 | acc ↑ | 0.7882 | ± 0.0287 |
| Machine Learning | 0 | acc ↑ | 0.7679 | ± 0.0401 |
| College Physics | 0 | acc ↑ | 0.7647 | ± 0.0422 |
| Abstract Algebra | 0 | acc ↑ | 0.6800 | ± 0.0469 |
| College Chemistry | 0 | acc ↑ | 0.6800 | ± 0.0469 |
| College Mathematics | 0 | acc ↑ | 0.6800 | ± 0.0469 |
| High School Mathematics | 0 | acc ↑ | 0.6481 | ± 0.0291 |
Social Sciences
| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School Government/Politics | 0 | acc ↑ | 0.9793 | ± 0.0103 |
| High School Microeconomics | 0 | acc ↑ | 0.9706 | ± 0.0110 |
| High School Psychology | 0 | acc ↑ | 0.9523 | ± 0.0091 |
| Human Sexuality | 0 | acc ↑ | 0.9313 | ± 0.0222 |
| Sociology | 0 | acc ↑ | 0.9204 | ± 0.0191 |
| High School Geography | 0 | acc ↑ | 0.9192 | ± 0.0194 |
| High School Macroeconomics | 0 | acc ↑ | 0.9000 | ± 0.0152 |
| US Foreign Policy | 0 | acc ↑ | 0.9000 | ± 0.0302 |
| Professional Psychology | 0 | acc ↑ | 0.8725 | ± 0.0135 |
| Security Studies | 0 | acc ↑ | 0.8653 | ± 0.0219 |
| Public Relations | 0 | acc ↑ | 0.7636 | ± 0.0407 |
| Econometrics | 0 | acc ↑ | 0.7544 | ± 0.0405 |
Humanities
| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School US History | 0 | acc ↑ | 0.9461 | ± 0.0159 |
| High School World History | 0 | acc ↑ | 0.9367 | ± 0.0158 |
| World Religions | 0 | acc ↑ | 0.9064 | ± 0.0223 |
| Prehistory | 0 | acc ↑ | 0.8981 | ± 0.0168 |
| International Law | 0 | acc ↑ | 0.8926 | ± 0.0283 |
| Jurisprudence | 0 | acc ↑ | 0.8889 | ± 0.0304 |
| Logical Fallacies | 0 | acc ↑ | 0.8834 | ± 0.0252 |
| High School European History | 0 | acc ↑ | 0.8788 | ± 0.0255 |
| Moral Disputes | 0 | acc ↑ | 0.8699 | ± 0.0181 |
| Philosophy | 0 | acc ↑ | 0.8617 | ± 0.0196 |
| Formal Logic | 0 | acc ↑ | 0.7460 | ± 0.0389 |
| Professional Law | 0 | acc ↑ | 0.6610 | ± 0.0121 |
| Moral Scenarios | 0 | acc ↑ | 0.6425 | ± 0.0160 |
Other
| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| Medical Genetics | 0 | acc ↑ | 0.9800 | ± 0.0141 |
| Marketing | 0 | acc ↑ | 0.9530 | ± 0.0139 |
| Miscellaneous | 0 | acc ↑ | 0.9374 | ± 0.0087 |
| Professional Medicine | 0 | acc ↑ | 0.9301 | ± 0.0155 |
| Clinical Knowledge | 0 | acc ↑ | 0.9057 | ± 0.0180 |
| Nutrition | 0 | acc ↑ | 0.9052 | ± 0.0168 |
| Management | 0 | acc ↑ | 0.8932 | ± 0.0306 |
| Business Ethics | 0 | acc ↑ | 0.8600 | ± 0.0349 |
| Computer Security | 0 | acc ↑ | 0.8600 | ± 0.0349 |
| Human Aging | 0 | acc ↑ | 0.8161 | ± 0.0260 |
| College Medicine | 0 | acc ↑ | 0.7977 | ± 0.0306 |
| Professional Accounting | 0 | acc ↑ | 0.7624 | ± 0.0254 |
| Global Facts | 0 | acc ↑ | 0.6500 | ± 0.0479 |
| Virology | 0 | acc ↑ | 0.5723 | ± 0.0385 |
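For anyone wanting to reproduce the numbers above, the invocation looks roughly like this. The backend and model arguments are my assumptions, not the exact command used:

```shell
# Zero-shot MMLU via lm-evaluation-harness. The vllm backend and its
# model_args are assumptions; substitute whatever serving stack you use.
lm_eval \
  --model vllm \
  --model_args pretrained=Salyut1/GLM-4.7-NVFP4 \
  --tasks mmlu \
  --num_fewshot 0 \
  --batch_size auto
```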
For reference, here is a discussion about all-experts calibration in another quantization framework: https://github.com/ModelCloud/GPTQModel/pull/2235
And visual benchmarks: https://avtc.github.io/aquarium-side-by-side/
All-experts calibration is shown on the right.
