All experts calibration?

#4
by mratsim - opened

Can you confirm whether you calibrate all experts or not?

When looking at the ModelOpt DeepSeek example, specific code is needed to ensure that all experts see the calibration samples: https://github.com/NVIDIA/Model-Optimizer/blob/0.40.0/examples/deepseek/ptq.py#L201-L219

    class CalibMoe(deekseep_model.MoE):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._setup()

        def _setup(self):
            self._original_topk = self.gate.topk
            self._original_topk_groups = self.gate.topk_groups

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Forward all tokens to all experts for calibration
            self.gate.topk = self.n_routed_experts
            self.gate.topk_groups = self.gate.n_groups
            super().forward(x)
            # Restore the original topk and topk_groups
            self.gate.topk = self._original_topk
            self.gate.topk_groups = self._original_topk_groups

            return super().forward(x)

Otherwise, I suspect some experts are quantized very naively (with little or no calibration data), significantly impacting quality for use cases not represented in the calibration data.
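
The snippet above temporarily widens top-k so that a throwaway forward pass routes every token to every expert (collecting activation statistics for all of them), then restores the original routing and returns the output of a normal pass. A toolkit-agnostic way to check whether a calibration loop has this problem is to hook the expert modules and count how many calibration tokens each one actually receives. A rough sketch below; the `expert_modules` mapping and `count_expert_hits` helper are hypothetical, and the real attribute path depends on the model implementation:

    # Rough sketch: count how many calibration tokens reach each routed expert
    # using plain PyTorch forward hooks. The expert_modules mapping is an
    # assumption about how the experts are exposed on the model; adjust the
    # attribute path to the actual implementation.
    from collections import Counter

    import torch

    def count_expert_hits(model, calib_batches, expert_modules):
        hits = Counter({name: 0 for name in expert_modules})
        handles = []

        def make_hook(name):
            def hook(module, inputs, output):
                # inputs[0] is the (num_routed_tokens, hidden) slice sent to this expert
                if inputs and inputs[0].numel() > 0:
                    hits[name] += inputs[0].shape[0]
            return hook

        for name, module in expert_modules.items():
            handles.append(module.register_forward_hook(make_hook(name)))

        with torch.no_grad():
            for batch in calib_batches:
                model(batch)

        for handle in handles:
            handle.remove()

        never_hit = [name for name, n in hits.items() if n == 0]
        return hits, never_hit

Any expert that ends up in `never_hit` collected no activation statistics during calibration, so its scales presumably fall back to whatever default the quantizer uses, which is exactly the failure mode I'm worried about.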

I appreciate the question. I will admit that I didn't consider the possibility of the script missing experts. I'm running the MMLU benchmark to confirm expert integrity post-quantization. If I find unusually low scores, I'll recreate the checkpoint with revamped quantization logic. I'll get back to you with results.
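
For anyone who wants to reproduce the evaluation, a minimal sketch with lm_eval's Python API is below. The `hf` backend and `dtype=auto` are placeholders rather than my exact invocation; an NVFP4 checkpoint generally needs a backend that supports it, such as vLLM or TensorRT-LLM.

    # Minimal sketch of a zero-shot MMLU run with lm_eval's Python API.
    # The backend and model_args are placeholders; point them at whatever
    # actually serves the NVFP4 checkpoint.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Salyut1/GLM-4.7-NVFP4,dtype=auto",
        tasks=["mmlu"],
        num_fewshot=0,
    )

    # Per-task and per-group accuracy with stderr (metric keys can vary
    # slightly across lm_eval versions).
    for task, metrics in results["results"].items():
        print(task, metrics.get("acc,none"), metrics.get("acc_stderr,none"))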

I ran the MMLU benchmark using lm_eval. It did pretty well! Better than I expected in all honesty. Here are the results:

MMLU Benchmark Results: Salyut1/GLM-4.7-NVFP4

Summary Table

| Groups | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| MMLU (Total) | 2 | acc ↑ | 0.8348 | ±0.0030 |
| Social Sciences | 2 | acc ↑ | 0.9051 | ±0.0052 |
| Other | 2 | acc ↑ | 0.8684 | ±0.0058 |
| STEM | 2 | acc ↑ | 0.8351 | ±0.0064 |
| Humanities | 2 | acc ↑ | 0.7664 | ±0.0059 |

STEM

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School Biology | 0 | acc ↑ | 0.9516 | ±0.0122 |
| College Biology | 0 | acc ↑ | 0.9514 | ±0.0180 |
| Astronomy | 0 | acc ↑ | 0.9474 | ±0.0182 |
| High School Computer Science | 0 | acc ↑ | 0.9300 | ±0.0256 |
| Conceptual Physics | 0 | acc ↑ | 0.9064 | ±0.0190 |
| Elementary Mathematics | 0 | acc ↑ | 0.8862 | ±0.0164 |
| Electrical Engineering | 0 | acc ↑ | 0.8690 | ±0.0281 |
| High School Statistics | 0 | acc ↑ | 0.8565 | ±0.0239 |
| College Computer Science | 0 | acc ↑ | 0.8400 | ±0.0368 |
| Anatomy | 0 | acc ↑ | 0.8296 | ±0.0325 |
| High School Physics | 0 | acc ↑ | 0.7947 | ±0.0330 |
| High School Chemistry | 0 | acc ↑ | 0.7882 | ±0.0287 |
| Machine Learning | 0 | acc ↑ | 0.7679 | ±0.0401 |
| College Physics | 0 | acc ↑ | 0.7647 | ±0.0422 |
| Abstract Algebra | 0 | acc ↑ | 0.6800 | ±0.0469 |
| College Chemistry | 0 | acc ↑ | 0.6800 | ±0.0469 |
| College Mathematics | 0 | acc ↑ | 0.6800 | ±0.0469 |
| High School Mathematics | 0 | acc ↑ | 0.6481 | ±0.0291 |

Social Sciences

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School Government/Politics | 0 | acc ↑ | 0.9793 | ±0.0103 |
| High School Microeconomics | 0 | acc ↑ | 0.9706 | ±0.0110 |
| High School Psychology | 0 | acc ↑ | 0.9523 | ±0.0091 |
| Human Sexuality | 0 | acc ↑ | 0.9313 | ±0.0222 |
| Sociology | 0 | acc ↑ | 0.9204 | ±0.0191 |
| High School Geography | 0 | acc ↑ | 0.9192 | ±0.0194 |
| High School Macroeconomics | 0 | acc ↑ | 0.9000 | ±0.0152 |
| US Foreign Policy | 0 | acc ↑ | 0.9000 | ±0.0302 |
| Professional Psychology | 0 | acc ↑ | 0.8725 | ±0.0135 |
| Security Studies | 0 | acc ↑ | 0.8653 | ±0.0219 |
| Public Relations | 0 | acc ↑ | 0.7636 | ±0.0407 |
| Econometrics | 0 | acc ↑ | 0.7544 | ±0.0405 |

Humanities

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| High School US History | 0 | acc ↑ | 0.9461 | ±0.0159 |
| High School World History | 0 | acc ↑ | 0.9367 | ±0.0158 |
| World Religions | 0 | acc ↑ | 0.9064 | ±0.0223 |
| Prehistory | 0 | acc ↑ | 0.8981 | ±0.0168 |
| International Law | 0 | acc ↑ | 0.8926 | ±0.0283 |
| Jurisprudence | 0 | acc ↑ | 0.8889 | ±0.0304 |
| Logical Fallacies | 0 | acc ↑ | 0.8834 | ±0.0252 |
| High School European History | 0 | acc ↑ | 0.8788 | ±0.0255 |
| Moral Disputes | 0 | acc ↑ | 0.8699 | ±0.0181 |
| Philosophy | 0 | acc ↑ | 0.8617 | ±0.0196 |
| Formal Logic | 0 | acc ↑ | 0.7460 | ±0.0389 |
| Professional Law | 0 | acc ↑ | 0.6610 | ±0.0121 |
| Moral Scenarios | 0 | acc ↑ | 0.6425 | ±0.0160 |

Other

| Tasks | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|
| Medical Genetics | 0 | acc ↑ | 0.9800 | ±0.0141 |
| Marketing | 0 | acc ↑ | 0.9530 | ±0.0139 |
| Miscellaneous | 0 | acc ↑ | 0.9374 | ±0.0087 |
| Professional Medicine | 0 | acc ↑ | 0.9301 | ±0.0155 |
| Clinical Knowledge | 0 | acc ↑ | 0.9057 | ±0.0180 |
| Nutrition | 0 | acc ↑ | 0.9052 | ±0.0168 |
| Management | 0 | acc ↑ | 0.8932 | ±0.0306 |
| Business Ethics | 0 | acc ↑ | 0.8600 | ±0.0349 |
| Computer Security | 0 | acc ↑ | 0.8600 | ±0.0349 |
| Human Aging | 0 | acc ↑ | 0.8161 | ±0.0260 |
| College Medicine | 0 | acc ↑ | 0.7977 | ±0.0306 |
| Professional Accounting | 0 | acc ↑ | 0.7624 | ±0.0254 |
| Global Facts | 0 | acc ↑ | 0.6500 | ±0.0479 |
| Virology | 0 | acc ↑ | 0.5723 | ±0.0385 |
Salyut1 changed discussion status to closed

For reference, here is a discussion about all-experts calibration in another quantization framework: https://github.com/ModelCloud/GPTQModel/pull/2235

And visual benchmarks: https://avtc.github.io/aquarium-side-by-side/

All experts on the right

