ghzhang committed
Commit 3cf1394 · 1 Parent(s): 27afe1f

Add about documentation and update app to display it; include new assets
.gitignore CHANGED
@@ -11,3 +11,5 @@ eval-results/
  eval-queue-bk/
  eval-results-bk/
  logs/
+
+ *DS_Store
about.md ADDED
@@ -0,0 +1,96 @@
+ # About
+
+ **LM-Harmony** is a multi-task leaderboard designed to evaluate **model potential** rather than deployment-ready performance. Unlike most existing LLM leaderboards, which assess models in a frozen, zero-shot, or lightly prompted setting, LM-Harmony focuses on the performance models can achieve after task-specific adaptation.
+
+ ---
+
+ ## Train-before-test Evaluation
+
+ To this end, we adopt a **train-before-test** evaluation protocol: for each benchmark, every model is fine-tuned on the corresponding training set with a fixed hyper-parameter configuration before being evaluated on the test set. This setting better reflects the intrinsic capacity of models and leads to more stable and meaningful cross-task comparisons.
+
+ ---
+
+ ## Ranking Consistency
+
+ A key advantage of LM-Harmony is that model rankings are **substantially more consistent across tasks** than under direct (zero-shot) evaluation. To quantify this effect, we measure the agreement between model rankings across all pairs of benchmarks using **Kendall’s $\tau$**, a standard metric of rank correlation.
+
+ Train-before-test evaluation significantly increases Kendall’s $\tau$, indicating stronger agreement in relative model ordering across diverse tasks.
+
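To make the consistency measure concrete, here is a minimal, self-contained sketch of pairwise Kendall's $\tau$ over hypothetical benchmark scores. The benchmark names and scores are made up for illustration; in practice a standard implementation such as `scipy.stats.kendalltau` would be used instead of the toy function below.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a over two score lists (assumes no ties)."""
    pairs = list(combinations(range(len(x)), 2))
    concordance = sum(
        1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
        for i, j in pairs
    )
    return concordance / len(pairs)

# Hypothetical per-model scores on three benchmarks (four models each).
scores = {
    "benchmark_a": [0.62, 0.55, 0.71, 0.48],
    "benchmark_b": [0.34, 0.30, 0.40, 0.25],
    "benchmark_c": [0.80, 0.74, 0.85, 0.69],
}

# Average tau over all benchmark pairs: higher means model orderings
# agree more strongly across tasks.
taus = [kendall_tau(scores[a], scores[b]) for a, b in combinations(scores, 2)]
mean_tau = sum(taus) / len(taus)
```

Here the three toy benchmarks rank the four models identically, so every pairwise $\tau$ is 1; disagreeing rankings would pull the average down.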
+ ![Agreement comparison.](assets/agreement_comparison.png)
+
+ <!-- ![Direct evaluation.](assets/direct_eval.png) -->
+
+ <!-- ![Train-before-test.](assets/train_before_test.png) -->
+
+ <div style="display: flex; justify-content: space-around;">
+ <div style="text-align: center;">
+ <img src="assets/direct_eval.png" alt="Direct evaluation." style="width: 300px;"/>
+ <p>Direct evaluation.</p>
+ </div>
+ <div style="text-align: center;">
+ <img src="assets/train_before_test.png" alt="Train-before-test." style="width: 300px;"/>
+ <p>Train-before-test.</p>
+ </div>
+ </div>
+
+ ---
+
+ ## Task Selection
+
+ LM-Harmony includes nine benchmarks spanning different domains. All datasets are publicly available on Hugging Face.
+
+ | Benchmark | Domain / What it tests | Metric |
+ | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------- |
+ | [MedMCQA](https://huggingface.co/datasets/openlifescienceai/medmcqa) | Professional-level medical knowledge and clinical reasoning (multiple-choice, exam-style) | `acc_norm` |
+ | [NQ-Open](https://huggingface.co/datasets/google-research-datasets/nq_open) | Open-domain factual QA with short, generated answers (no retrieval) | `exact_match,remove_whitespace` |
+ | [Winogrande](https://huggingface.co/datasets/allenai/winogrande) | Commonsense pronoun/coreference resolution with reduced annotation artifacts | `acc` |
+ | [HellaSwag](https://huggingface.co/datasets/Rowan/hellaswag) | Commonsense inference: selecting the most plausible continuation of a situation | `acc_norm` |
+ | [Social-IQA](https://huggingface.co/datasets/allenai/social_i_qa) | Social commonsense: intents, reactions, and interpersonal dynamics | `acc` |
+ | [PIQA](https://huggingface.co/datasets/ybisk/piqa) | Physical commonsense: everyday object interactions and affordances | `acc_norm` |
+ | [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) | Structured commonsense knowledge grounded in concepts and relations (multiple-choice) | `acc` |
+ | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) | Multi-step grade-school math word problems | `exact_match,flexible-extract` |
+ | [ARC-Challenge](https://huggingface.co/datasets/allenai/ai2_arc) | Challenging science questions requiring abstract and multi-hop reasoning | `acc_norm` |
+
+ **Note:** `acc_norm` denotes accuracy where the score for each choice is normalized by byte length. See the [lm-eval-harness blog post](https://blog.eleuther.ai/multiple-choice-normalization/) for details.
+
+ ---
+
+ ## Aggregation
+
+ Aggregating performance across multiple tasks is inherently challenging. [Prior work](https://arxiv.org/pdf/2405.01719) has shown that any aggregated ranking becomes unreliable when task-wise rankings disagree. Although train-before-test evaluation substantially improves ranking consistency, disagreements across tasks remain, so different aggregation methods can yield different overall rankings.
+
+ LM-Harmony currently provides two aggregation approaches:
+
+ - **PC1 (First Principal Component).** We perform PCA on the full model–task score matrix and use the first principal component as the aggregated score. In our paper, we show that PC1 is highly correlated with pretraining compute, suggesting that it captures a latent notion of overall model capacity. Below, we show the variance explained by the first five principal components; under train-before-test, PC1 explains 89% of the variance.
+
+ <div style="display: flex; justify-content: space-around;">
+ <div style="text-align: center;">
+ <img src="assets/pca_direct_eval.png" alt="Direct evaluation." style="width: 300px;"/>
+ <p>Direct evaluation.</p>
+ </div>
+ <div style="text-align: center;">
+ <img src="assets/pca_train_before_test.png" alt="Train-before-test." style="width: 300px;"/>
+ <p>Train-before-test.</p>
+ </div>
+ </div>
+
+ - **Average Score.** We also report the simple average of normalized task scores. While this is the most common aggregation method, its validity is questionable because scores from different tasks are not necessarily commensurable.
+
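The two aggregation rules can be sketched as follows. The score matrix is made up for illustration, and the real leaderboard pipeline may normalize scores differently; this is only meant to show the mechanics of PC1 versus a simple average.

```python
import numpy as np

# Hypothetical model-by-task score matrix (rows = models, columns = tasks).
scores = np.array([
    [0.62, 0.34, 0.80],
    [0.55, 0.30, 0.74],
    [0.71, 0.40, 0.85],
    [0.48, 0.25, 0.69],
])

# Standardize each task column so tasks contribute on a comparable scale.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Average score: mean of the normalized task scores per model.
avg_score = z.mean(axis=1)

# PC1: project onto the first right singular vector of the standardized
# matrix, i.e. the first principal component.
_, s, vt = np.linalg.svd(z, full_matrices=False)
pc1 = z @ vt[0]

# Fraction of variance explained by each principal component.
explained = s**2 / np.sum(s**2)
```

One practical detail: the sign of a singular vector is arbitrary, so PC1 would typically be oriented so that larger values correspond to stronger models before ranking.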
+ ---
+
+ ## Hyperparameters
+
+ We observe that while hyper-parameter choices, especially the learning rate, can have a large impact on absolute performance, they have minimal impact on model rankings as long as the same configuration is applied consistently across models.
+
+ In other words, rankings are stable across reasonable hyper-parameter choices, even though raw scores may shift. To ensure reproducibility and ease of evaluation for newly added models, LM-Harmony therefore adopts a fixed hyper-parameter setup.
+
+ We use a fixed learning rate of $5 \times 10^{-5}$. For MedMCQA, NQ-Open, Winogrande, HellaSwag, and Social-IQA, we train for one epoch and save a checkpoint every 20% of the steps. For PIQA, CommonsenseQA, GSM8K, and ARC-Challenge, whose training sets are smaller, we train for five epochs and save a checkpoint at the end of each epoch. We cap the training set at 50,000 samples, the validation set at 2,000, and the test set at 10,000. The best checkpoint is selected on the validation set. Please refer to our [code](https://github.com/socialfoundations/lm-harmony) for reproduction.
+
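The one-epoch schedule amounts to simple step arithmetic. The batch size below is a hypothetical placeholder (the text fixes only the learning rate, epoch counts, and dataset caps; see the linked code for the real configuration):

```python
import math

learning_rate = 5e-5        # fixed across all models and tasks
max_train_samples = 50_000  # training-set cap from the protocol
batch_size = 32             # hypothetical; not specified in the text

# One-epoch datasets: save a checkpoint every 20% of the optimizer steps.
steps_per_epoch = math.ceil(max_train_samples / batch_size)
save_every = steps_per_epoch // 5
checkpoint_steps = list(range(save_every, steps_per_epoch + 1, save_every))
```

With these assumed numbers this yields five checkpoints per run, mirroring the five end-of-epoch checkpoints used for the smaller datasets.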
+ ---
+
+ ## Scope and Limitations
+
+ LM-Harmony is designed to measure **model potential under adaptation**, not zero-shot or deployment-time performance. Benchmarks without sufficient training data are not suitable for this protocol, and domain-specific tasks (e.g., coding or multimodal reasoning) may require specialized extensions.
+
+ We view LM-Harmony as an evolving benchmark suite and welcome future expansions as new datasets and evaluation methodologies become available.
app.py CHANGED
@@ -138,7 +138,8 @@ with demo:
      with gr.TabItem(
          "📝 About", elem_id="llm-benchmark-tab-table-about", interactive=True
      ):
-         gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
+         with open("about.md", "r") as f:
+             gr.Markdown(f.read(), elem_classes="markdown-text")

      with gr.TabItem(
          "🚀 Submit here! ",
@@ -186,7 +187,7 @@ with demo:
      # )
      with gr.Row():
          gr.Markdown(
-             "# ✉️✨ This is still under construction!",
+             "# ✉️✨ This is still under construction! For now, if you would like to add your model (in Huggingface) to the leaderboard, please simply reach out to us through email.",
              elem_classes="markdown-text",
          )
assets/agreement_comparison.png ADDED
assets/direct_eval.png ADDED
assets/pca_direct_eval.png ADDED
assets/pca_train_before_test.png ADDED
assets/train_before_test.png ADDED