About
LM-Harmony is a multi-task leaderboard designed to evaluate model potential rather than deployment-ready performance. Unlike most existing LLM leaderboards, which assess models in a frozen, zero-shot, or lightly prompted setting, LM-Harmony focuses on the achievable performance of models after task-specific adaptation.
Train-before-test Evaluation
To this end, we adopt a train-before-test evaluation protocol: for each benchmark, every model is fine-tuned on the corresponding training set using a fixed hyper-parameter configuration before being evaluated on the test set. This setting better reflects the intrinsic capacity of models and leads to more stable and meaningful cross-task comparisons.
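The sketch below shows the shape of this protocol. It is illustrative only: `finetune`, `evaluate`, the model id, and the benchmark identifiers are hypothetical placeholders rather than our actual pipeline.

```python
# Hypothetical sketch of the train-before-test loop; finetune/evaluate are
# placeholders for the real fine-tuning and lm-eval-harness evaluation steps.
FIXED_HPARAMS = {"learning_rate": 5e-5}  # shared across all models and tasks

def finetune(model_id: str, train_split: str, hparams: dict) -> str:
    return model_id  # placeholder: fine-tune on the train split, return adapted model

def evaluate(model_id: str, test_split: str) -> float:
    return 0.0  # placeholder: score the adapted model on the test split

results = {}
for bench in ["medmcqa", "nq_open", "winogrande"]:  # illustrative subset
    adapted = finetune("base-model", f"{bench}/train", FIXED_HPARAMS)
    results[bench] = evaluate(adapted, f"{bench}/test")
```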
Ranking Consistency
A key advantage of LM-Harmony is that model rankings are substantially more consistent across tasks compared to direct (zero-shot) evaluation. To quantify this effect, we measure the agreement between model rankings across all pairs of benchmarks using Kendall’s τ, a standard metric for rank correlation.
Train-before-test evaluation significantly increases Kendall’s τ, indicating stronger agreement in relative model ordering across diverse tasks.
(Figures: pairwise Kendall's τ between benchmarks under direct evaluation and under train-before-test.)
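To make the measurement concrete, here is a minimal sketch of the computation, assuming a hypothetical `scores` matrix with one row per model and one column per benchmark:

```python
# Pairwise ranking agreement across benchmarks via Kendall's tau.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores = rng.random((10, 9))  # placeholder (n_models, n_tasks) score matrix

taus = []
for i, j in combinations(range(scores.shape[1]), 2):
    tau, _ = kendalltau(scores[:, i], scores[:, j])  # rank agreement of tasks i, j
    taus.append(tau)

print(f"mean pairwise Kendall's tau: {np.mean(taus):.3f}")
```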
Task Selection
LM-Harmony includes nine benchmarks spanning different domains. All datasets are publicly available on HuggingFace.
| Benchmark | Domain / What it tests | Metric |
|---|---|---|
| MedMCQA | Professional-level medical knowledge and clinical reasoning (multiple-choice, exam-style) | `acc_norm` |
| NQ-Open | Open-domain factual QA with short, generated answers (no retrieval) | `exact_match` (`remove_whitespace` filter) |
| Winogrande | Commonsense pronoun/coreference resolution with reduced annotation artifacts | `acc` |
| HellaSwag | Commonsense inference: selecting the most plausible continuation of a situation | `acc_norm` |
| Social-IQA | Social commonsense: intents, reactions, and interpersonal dynamics | `acc` |
| PIQA | Physical commonsense: everyday object interactions and affordances | `acc_norm` |
| CommonsenseQA | Structured commonsense knowledge grounded in concepts and relations (multiple-choice) | `acc` |
| GSM8K | Multi-step grade-school math word problems | `exact_match` (`flexible-extract` filter) |
| ARC-Challenge | Challenging science questions requiring abstract and multi-hop reasoning | `acc_norm` |
*Note: `acc_norm` is accuracy where each choice's score is normalized by its byte length. See lm-eval-harness for more details.*
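As an illustration, a minimal sketch of this normalization, following lm-eval-harness's convention of dividing each choice's log-likelihood by the UTF-8 byte length of the choice string:

```python
# Byte-length-normalized multiple-choice prediction (acc_norm-style).
def acc_norm_prediction(loglikelihoods: list[float], choices: list[str]) -> int:
    """Return the index of the choice with the highest byte-normalized score."""
    normalized = [
        ll / len(choice.encode("utf-8"))
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(choices)), key=normalized.__getitem__)
```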
Aggregation
Aggregating performance across multiple tasks is inherently challenging. Prior work has shown that any aggregated ranking becomes unreliable when task-wise rankings disagree. Although train-before-test evaluation substantially improves ranking consistency, disagreements across tasks still remain. As a result, different aggregation methods can yield different overall rankings.
LM-Harmony currently provides two aggregation approaches:
- PC1 (First Principal Component). We perform PCA on the full model–task score matrix and use the first principal component as the aggregated score. In our paper, we show that PC1 is highly correlated with pretraining compute, suggesting that it captures a latent notion of overall model capacity. Below, we present the explained variance of the first five principal components; under train-before-test, PC1 explains 89% of the variance.
(Figures: explained variance of the first five principal components under direct evaluation and under train-before-test.)
- Average Score. We also report the simple average of normalized task scores. While this is the most common aggregation method, its validity is questionable because scores from different tasks are not necessarily commensurable. A minimal sketch of both aggregation methods follows this list.
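The sketch below assumes a hypothetical `scores` matrix of shape (n_models, n_tasks); the min-max normalization used for the average is one reasonable choice, not a statement of our exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((10, 9))  # placeholder model-by-task score matrix

# PC1: project the standardized matrix onto its first principal component.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, s, vt = np.linalg.svd(z, full_matrices=False)
pc1 = z @ vt[0]  # aggregated score per model (PCA sign is arbitrary; flip if needed)
explained = s**2 / np.sum(s**2)  # explained variance ratio per component
print(f"PC1 explains {explained[0]:.0%} of the variance")

# Average: mean of min-max-normalized task scores.
norm = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))
avg = norm.mean(axis=1)
```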
Hyperparameters
We observe that while hyper-parameter choices—especially learning rate—can have a large impact on absolute performance, they have minimal impact on model rankings as long as the same configuration is applied consistently across models.
In other words, rankings are stable across reasonable hyper-parameter choices, even though raw scores may shift. To ensure reproducibility and ease of evaluation for newly added models, LM-Harmony therefore adopts a fixed hyper-parameter setup.
Different from the parameter-search–heavy setup in the paper, we adopt a more lightweight approach for the leaderboard. We use a fixed learning rate of 5e-5. For MedMCQA, NQ-Open, Winogrande, HellaSwag, and Social-IQA, we train for one epoch and save a checkpoint every 20% of the steps. For PIQA, CommonsenseQA, GSM8K, and ARC-Challenge, whose training sets are smaller, we train for five epochs and save a checkpoint at the end of each epoch. We cap the training set at 50,000 samples, the validation set at 2,000, and the test set at 10,000. The best checkpoint is selected based on the validation set. Please refer to our code for reproduction.
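For illustration, here is roughly what this configuration looks like with Hugging Face `TrainingArguments`. The output path is a placeholder, and fractional `save_steps`/`eval_steps` (interpreted as a ratio of total training steps) requires a recent transformers version; this is a sketch, not a verbatim copy of our training script.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ckpts/medmcqa",   # placeholder path
    learning_rate=5e-5,           # fixed across all models and tasks
    num_train_epochs=1,           # 5 for PIQA, CommonsenseQA, GSM8K, ARC-Challenge
    save_strategy="steps",        # "epoch" for the five-epoch tasks
    save_steps=0.2,               # float in (0, 1): checkpoint every 20% of steps
    eval_strategy="steps",
    eval_steps=0.2,
    load_best_model_at_end=True,  # select the best checkpoint on validation
)
```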
Scope and Limitations
LM-Harmony is designed to measure model potential under adaptation, not zero-shot or deployment-time performance. Benchmarks without sufficient training data are not suitable for this protocol, and domain-specific tasks (e.g., coding or multimodal reasoning) may require specialized extensions.
We view LM-Harmony as an evolving benchmark suite and welcome future expansions as new datasets and evaluation methodologies become available.
Add Your Model
To add your model to the LM-Harmony leaderboard, please reach out to us by email. We will evaluate your model using the same fixed hyper-parameter configuration described above.
Your model should be publicly available on the Hugging Face Hub and loadable via AutoModel / AutoTokenizer from the transformers library.
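A quick way to verify this, with `your-org/your-model` as a placeholder repo id:

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "your-org/your-model"  # placeholder Hugging Face Hub repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)
```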