About
LM-Harmony is a multi-task leaderboard designed to evaluate model potential rather than deployment-ready performance. Unlike most existing LLM leaderboards, which assess models in a frozen, zero-shot, or lightly prompted setting, LM-Harmony focuses on the achievable performance of models after task-specific adaptation.
Train-before-test Evaluation
To this end, we adopt a train-before-test evaluation protocol: for each benchmark, every model is fine-tuned on the corresponding training set using a fixed hyper-parameter configuration before being evaluated on the test set. This setting better reflects the intrinsic capacity of models and leads to more stable and meaningful cross-task comparisons.
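The sketch below shows the shape of this protocol. It is illustrative only: `finetune`, `evaluate`, the model id, and the benchmark identifiers are hypothetical placeholders rather than our actual pipeline.

```python
# Hypothetical sketch of the train-before-test loop; finetune/evaluate are
# placeholders for the real fine-tuning and lm-eval-harness evaluation steps.
FIXED_HPARAMS = {"learning_rate": 5e-5}  # shared across all models and tasks

def finetune(model_id: str, train_split: str, hparams: dict) -> str:
    return model_id  # placeholder: fine-tune on the train split, return adapted model

def evaluate(model_id: str, test_split: str) -> float:
    return 0.0  # placeholder: score the adapted model on the test split

results = {}
for bench in ["medmcqa", "nq_open", "winogrande"]:  # illustrative subset
    adapted = finetune("base-model", f"{bench}/train", FIXED_HPARAMS)
    results[bench] = evaluate(adapted, f"{bench}/test")
```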
Ranking Consistency
A key advantage of LM-Harmony is that model rankings are substantially more consistent across tasks compared to direct (zero-shot) evaluation. To quantify this effect, we measure the agreement between model rankings across all pairs of benchmarks using Kendall’s τ, a standard metric for rank correlation.
Train-before-test evaluation significantly increases Kendall’s τ, indicating stronger agreement in relative model ordering across diverse tasks.
(Figures: pairwise Kendall's τ between benchmarks under direct evaluation and under train-before-test.)
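To make the measurement concrete, here is a minimal sketch of the computation, assuming a hypothetical `scores` matrix with one row per model and one column per benchmark:

```python
# Pairwise ranking agreement across benchmarks via Kendall's tau.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
scores = rng.random((10, 9))  # placeholder (n_models, n_tasks) score matrix

taus = []
for i, j in combinations(range(scores.shape[1]), 2):
    tau, _ = kendalltau(scores[:, i], scores[:, j])  # rank agreement of tasks i, j
    taus.append(tau)

print(f"mean pairwise Kendall's tau: {np.mean(taus):.3f}")
```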
Task Selection
LM-Harmony includes nine benchmarks spanning different domains. All datasets are publicly available on HuggingFace.
| Benchmark | Domain / What it tests | Metric |
|---|---|---|
| MedMCQA | Professional-level medical knowledge and clinical reasoning (multiple-choice, exam-style) | `acc_norm` |
| NQ-Open | Open-domain factual QA with short, generated answers (no retrieval) | `exact_match` (`remove_whitespace` filter) |
| Winogrande | Commonsense pronoun/coreference resolution with reduced annotation artifacts | `acc` |
| HellaSwag | Commonsense inference: selecting the most plausible continuation of a situation | `acc_norm` |
| Social-IQA | Social commonsense: intents, reactions, and interpersonal dynamics | `acc` |
| PIQA | Physical commonsense: everyday object interactions and affordances | `acc_norm` |
| CommonsenseQA | Structured commonsense knowledge grounded in concepts and relations (multiple-choice) | `acc` |
| GSM8K | Multi-step grade-school math word problems | `exact_match` (`flexible-extract` filter) |
| ARC-Challenge | Challenging science questions requiring abstract and multi-hop reasoning | `acc_norm` |
*Note: `acc_norm` is accuracy where each choice's score is normalized by its byte length. See lm-eval-harness for more details.*
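As an illustration, a minimal sketch of this normalization, following lm-eval-harness's convention of dividing each choice's log-likelihood by the UTF-8 byte length of the choice string:

```python
# Byte-length-normalized multiple-choice prediction (acc_norm-style).
def acc_norm_prediction(loglikelihoods: list[float], choices: list[str]) -> int:
    """Return the index of the choice with the highest byte-normalized score."""
    normalized = [
        ll / len(choice.encode("utf-8"))
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(choices)), key=normalized.__getitem__)
```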
Aggregation
Aggregating performance across multiple tasks is inherently challenging. Prior work has shown that any aggregated ranking becomes unreliable when task-wise rankings disagree. Although train-before-test evaluation substantially improves ranking consistency, disagreements across tasks still remain. As a result, different aggregation methods can yield different overall rankings.
LM-Harmony currently provides two aggregation approaches:
- PC1 (First Principal Component). We perform PCA on the full model–task score matrix and use the first principal component as the aggregated score. In our paper, we show that PC1 is highly correlated with pretraining compute, suggesting that it captures a latent notion of overall model capacity. Below, we present the explained variance of the first five principal components; under train-before-test, PC1 explains 89% of the variance.
(Figures: explained variance of the first five principal components under direct evaluation and under train-before-test.)
- Average Score. We also report the simple average of normalized task scores. While this is the most common aggregation method, its validity is questionable because scores from different tasks are not necessarily commensurable. A minimal sketch of both aggregation methods follows this list.
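The sketch below assumes a hypothetical `scores` matrix of shape (n_models, n_tasks); the min-max normalization used for the average is one reasonable choice, not a statement of our exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((10, 9))  # placeholder model-by-task score matrix

# PC1: project the standardized matrix onto its first principal component.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, s, vt = np.linalg.svd(z, full_matrices=False)
pc1 = z @ vt[0]  # aggregated score per model (PCA sign is arbitrary; flip if needed)
explained = s**2 / np.sum(s**2)  # explained variance ratio per component
print(f"PC1 explains {explained[0]:.0%} of the variance")

# Average: mean of min-max-normalized task scores.
norm = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))
avg = norm.mean(axis=1)
```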
Hyperparameters
We observe that while hyper-parameter choices—especially learning rate—can have a large impact on absolute performance, they have minimal impact on model rankings as long as the same configuration is applied consistently across models.
In other words, rankings are stable across reasonable hyper-parameter choices, even though raw scores may shift. To ensure reproducibility and ease of evaluation for newly added models, LM-Harmony therefore adopts a fixed hyper-parameter setup.
Different from the parameter-search–heavy setup in the paper, we adopt a more lightweight approach for the leaderboard. We use a fixed learning rate of 5e-5. For MedMCQA, NQ-Open, Winogrande, HellaSwag, and Social-IQA, we train for one epoch and save a checkpoint every 20% of the steps. For PIQA, CommonsenseQA, GSM8K, and ARC-Challenge, whose training sets are smaller, we train for five epochs and save a checkpoint at the end of each epoch. We cap the training set at 50,000 samples, the validation set at 2,000, and the test set at 10,000. The best checkpoint is selected based on the validation set. Please refer to our code for reproduction.
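For illustration, here is roughly what this configuration looks like with Hugging Face `TrainingArguments`. The output path is a placeholder, and fractional `save_steps`/`eval_steps` (interpreted as a ratio of total training steps) requires a recent transformers version; this is a sketch, not a verbatim copy of our training script.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ckpts/medmcqa",   # placeholder path
    learning_rate=5e-5,           # fixed across all models and tasks
    num_train_epochs=1,           # 5 for PIQA, CommonsenseQA, GSM8K, ARC-Challenge
    save_strategy="steps",        # "epoch" for the five-epoch tasks
    save_steps=0.2,               # float in (0, 1): checkpoint every 20% of steps
    eval_strategy="steps",
    eval_steps=0.2,
    load_best_model_at_end=True,  # select the best checkpoint on validation
)
```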
Scope and Limitations
LM-Harmony is designed to measure model potential under adaptation, not zero-shot or deployment-time performance. Benchmarks without sufficient training data are not suitable for this protocol, and domain-specific tasks (e.g., coding or multimodal reasoning) may require specialized extensions.
We view LM-Harmony as an evolving benchmark suite and welcome future expansions as new datasets and evaluation methodologies become available.
Add Your Model
To add your model to the LM-Harmony leaderboard, please reach out to us by email. We will evaluate your model using the same fixed hyper-parameter configuration described above.
Your model should be publicly available on the Hugging Face Hub and loadable via AutoModel / AutoTokenizer from the transformers library.
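A quick way to verify this, with `your-org/your-model` as a placeholder repo id:

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "your-org/your-model"  # placeholder Hugging Face Hub repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)
```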