
About

LM-Harmony is a multi-task leaderboard designed to evaluate model potential rather than deployment-ready performance. Unlike most existing LLM leaderboards, which assess models in a frozen, zero-shot, or lightly prompted setting, LM-Harmony focuses on the achievable performance of models after task-specific adaptation.

Train-before-test Evaluation

To this end, we adopt a train-before-test evaluation protocol: for each benchmark, every model is fine-tuned on the corresponding training set using a fixed hyper-parameter configuration before being evaluated on the test set. This setting better reflects the intrinsic capacity of models and leads to more stable and meaningful cross-task comparisons.
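The protocol can be sketched as a simple loop over (model, benchmark) pairs. The helper names below (`finetune`, `evaluate`) are hypothetical stand-ins for the actual training and scoring code, not part of the released implementation:

```python
def train_before_test(models, benchmarks, finetune, evaluate):
    """For each (model, benchmark) pair: adapt the model on the training
    split with a fixed hyper-parameter configuration, then score on test."""
    scores = {}
    for model in models:
        for bench in benchmarks:
            adapted = finetune(model, bench["train"])  # fixed hparams inside
            scores[(model, bench["name"])] = evaluate(adapted, bench["test"])
    return scores

# Toy usage with stand-in callables:
benchmarks = [{"name": "toy", "train": [], "test": []}]
result = train_before_test(["model-a"], benchmarks,
                           finetune=lambda model, train: model,
                           evaluate=lambda model, test: 0.5)
```

The key design point is that `finetune` applies the same fixed configuration to every model, so differences in the resulting scores reflect model capacity rather than prompt or hyper-parameter tuning.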

Ranking Consistency

A key advantage of LM-Harmony is that model rankings are substantially more consistent across tasks compared to direct (zero-shot) evaluation. To quantify this effect, we measure the agreement between model rankings across all pairs of benchmarks using Kendall’s τ, a standard metric for rank correlation.

Train-before-test evaluation significantly increases Kendall’s τ, indicating stronger agreement in relative model ordering across diverse tasks.
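As a minimal illustration, Kendall's τ between two benchmarks can be computed from concordant and discordant model pairs. The score vectors below are hypothetical, not leaderboard data:

```python
def kendall_tau(x, y):
    """Kendall's tau-a rank correlation between two score lists.

    +1 means the two benchmarks order the models identically;
    -1 means the orderings are fully reversed.
    """
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Scores of four models on two hypothetical benchmarks:
bench_a = [0.31, 0.45, 0.52, 0.70]
bench_b = [0.28, 0.40, 0.58, 0.66]
print(kendall_tau(bench_a, bench_b))  # 1.0: identical model ordering
```

Averaging this statistic over all benchmark pairs gives the agreement numbers compared in the figure below.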

Figure: agreement comparison (pairwise Kendall’s τ across benchmarks), direct evaluation vs. train-before-test.

Task Selection

LM-Harmony includes nine benchmarks spanning different domains. All datasets are publicly available on HuggingFace.

| Benchmark | Domain / What it tests | Metric |
| --- | --- | --- |
| MedMCQA | Professional-level medical knowledge and clinical reasoning (multiple-choice, exam-style) | acc_norm |
| NQ-Open | Open-domain factual QA with short, generated answers (no retrieval) | exact_match,remove_whitespace |
| Winogrande | Commonsense pronoun/coreference resolution with reduced annotation artifacts | acc |
| HellaSwag | Commonsense inference: selecting the most plausible continuation of a situation | acc_norm |
| Social-IQA | Social commonsense: intents, reactions, and interpersonal dynamics | acc |
| PIQA | Physical commonsense: everyday object interactions and affordances | acc_norm |
| CommonsenseQA | Structured commonsense knowledge grounded in concepts and relations (multiple-choice) | acc |
| GSM8K | Multi-step grade-school math word problems | exact_match,flexible-extract |
| ARC-Challenge | Challenging science questions requiring abstract and multi-hop reasoning | acc_norm |

**Note:** acc_norm normalizes each choice's score by its byte length. See lm-eval-harness for details.
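Concretely, byte-length normalization divides each choice's total log-likelihood by the number of bytes in the choice, so longer answers are not penalized simply for accumulating more negative log-probability. A minimal sketch, with made-up log-likelihoods (the helper name is ours, not from lm-eval-harness):

```python
def acc_norm_pick(choices, loglikelihoods):
    """Select the answer choice with the highest byte-length-normalized
    log-likelihood, as in the acc_norm metric."""
    normed = [ll / len(c.encode("utf-8"))
              for c, ll in zip(choices, loglikelihoods)]
    return max(range(len(choices)), key=normed.__getitem__)

# A longer choice with a lower raw log-likelihood can still win after
# normalization: -4.0 / 10 bytes beats -2.0 / 3 bytes.
print(acc_norm_pick(["yes", "absolutely"], [-2.0, -4.0]))  # 1
```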

Aggregation

Aggregating performance across multiple tasks is inherently challenging. Prior work has shown that any aggregated ranking becomes unreliable when task-wise rankings disagree. Although train-before-test evaluation substantially improves ranking consistency, disagreements across tasks still remain. As a result, different aggregation methods can yield different overall rankings.

LM-Harmony currently provides two aggregation approaches:

  • PC1 (First Principal Component). We perform PCA on the full model–task score matrix and use the first principal component as the aggregated score. In our paper, we show that PC1 is highly correlated with pretraining compute, suggesting that it captures a latent notion of overall model capacity. Below, we show the variance explained by the first five principal components; under train-before-test, PC1 explains 89% of the variance.
Figure: explained variance of the first five principal components, direct evaluation vs. train-before-test.

  • Average Score. We also report the simple average of normalized task scores. While this is the most common aggregation method, its validity is questionable because scores from different tasks are not necessarily commensurable.
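Both aggregation methods can be sketched in a few lines of NumPy. The toy score matrix below is hypothetical; PC1 is computed via SVD of the column-standardized matrix, with its sign oriented so that higher means stronger:

```python
import numpy as np

def aggregate(scores):
    """Aggregate a (models x tasks) score matrix two ways.

    Returns (pc1, avg, var1): the PC1 projection of the standardized
    matrix, the simple average of per-task z-scores, and the fraction
    of variance explained by the first principal component.
    """
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    pc1 = z @ vt[0]
    if np.corrcoef(pc1, scores.mean(axis=1))[0, 1] < 0:
        pc1 = -pc1  # orient so that larger PC1 = stronger model
    var1 = (s ** 2)[0] / (s ** 2).sum()
    return pc1, z.mean(axis=1), var1

# Toy matrix: three models, two tasks, fully consistent ranking.
S = np.array([[0.10, 0.20],
              [0.50, 0.60],
              [0.90, 1.00]])
pc1, avg, var1 = aggregate(S)
```

With a perfectly rank-consistent matrix like this one, PC1 and the average produce the same ordering and PC1 explains essentially all of the variance; the two methods diverge only when task-wise rankings disagree.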

Hyperparameters

We observe that while hyper-parameter choices—especially learning rate—can have a large impact on absolute performance, they have minimal impact on model rankings as long as the same configuration is applied consistently across models.

In other words, rankings are stable across reasonable hyper-parameter choices, even though raw scores may shift. To ensure reproducibility and ease of evaluation for newly added models, LM-Harmony therefore adopts a fixed hyper-parameter setup.

Different from the parameter-search-heavy setup in the paper, we adopt a more lightweight approach for the leaderboard. We use a fixed learning rate of 5e-5. For MedMCQA, NQ-Open, Winogrande, HellaSwag, and Social-IQA, we train for one epoch and save a checkpoint every 20% of the steps. For PIQA, CommonsenseQA, GSM8K, and ARC-Challenge, whose training sets are smaller, we train for five epochs and save a checkpoint at the end of each epoch. We cap the training set at 50,000 samples, the validation set at 2,000, and the test set at 10,000. The best checkpoint is selected on the validation set. Please refer to our code for reproduction.
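The one-epoch checkpoint schedule (a save every 20% of the steps, i.e. five checkpoints per epoch) is plain arithmetic. The function name and the batch size in the example are our assumptions, not values from the released code:

```python
import math

def checkpoint_schedule(num_samples, batch_size, num_checkpoints=5):
    """Global steps at which to save a checkpoint within one epoch:
    evenly spaced, with the final checkpoint at the end of the epoch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return [round(steps_per_epoch * k / num_checkpoints)
            for k in range(1, num_checkpoints + 1)]

# e.g. the 50,000-sample training cap with an assumed batch size of 32:
print(checkpoint_schedule(50_000, 32))  # [313, 625, 938, 1250, 1563]
```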

Scope and Limitations

LM-Harmony is designed to measure model potential under adaptation, not zero-shot or deployment-time performance. Benchmarks without sufficient training data are not suitable for this protocol, and domain-specific tasks (e.g., coding or multimodal reasoning) may require specialized extensions.

We view LM-Harmony as an evolving benchmark suite and welcome future expansions as new datasets and evaluation methodologies become available.

Add Your Model

To add your model to the LM-Harmony leaderboard, please reach out to us by email. We will evaluate your model using the same fixed hyper-parameter configuration described above.

Your model should be publicly available on the Hugging Face Hub and loadable via AutoModel / AutoTokenizer from the transformers library.