ghzhang committed
Commit 3cf1394 · 1 Parent(s): 27afe1f

Add about documentation and update app to display it; include new assets
.gitignore CHANGED
@@ -11,3 +11,5 @@ eval-results/
  eval-queue-bk/
  eval-results-bk/
  logs/
+
+ *DS_Store
about.md ADDED
@@ -0,0 +1,96 @@
+ # About
+
+ **LM-Harmony** is a multi-task leaderboard designed to evaluate **model potential** rather than deployment-ready performance. Unlike most existing LLM leaderboards, which assess models in a frozen, zero-shot, or lightly prompted setting, LM-Harmony focuses on the performance models can achieve after task-specific adaptation.
+
+ ---
+
+ ## Train-before-test Evaluation
+
+ To this end, we adopt a **train-before-test** evaluation protocol: for each benchmark, every model is fine-tuned on the corresponding training set with a fixed hyper-parameter configuration before being evaluated on the test set. This setting better reflects the intrinsic capacity of models and leads to more stable and meaningful cross-task comparisons.
+
+ ---
+
+ ## Ranking Consistency
+
+ A key advantage of LM-Harmony is that model rankings are **substantially more consistent across tasks** than under direct (zero-shot) evaluation. To quantify this effect, we measure the agreement between model rankings across all pairs of benchmarks using **Kendall’s $\tau$**, a standard metric of rank correlation.
+
+ Train-before-test evaluation significantly increases Kendall’s $\tau$, indicating stronger agreement in relative model ordering across diverse tasks.
+
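To make the consistency measure concrete, here is a minimal, self-contained sketch of pairwise Kendall's $\tau$ over hypothetical benchmark scores. The benchmark names and scores are made up for illustration; in practice a standard implementation such as `scipy.stats.kendalltau` would be used instead of the toy function below.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a over two score lists (assumes no ties)."""
    pairs = list(combinations(range(len(x)), 2))
    concordance = sum(
        1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
        for i, j in pairs
    )
    return concordance / len(pairs)

# Hypothetical per-model scores on three benchmarks (four models each).
scores = {
    "benchmark_a": [0.62, 0.55, 0.71, 0.48],
    "benchmark_b": [0.34, 0.30, 0.40, 0.25],
    "benchmark_c": [0.80, 0.74, 0.85, 0.69],
}

# Average tau over all benchmark pairs: higher means model orderings
# agree more strongly across tasks.
taus = [kendall_tau(scores[a], scores[b]) for a, b in combinations(scores, 2)]
mean_tau = sum(taus) / len(taus)
```

Here the three toy benchmarks rank the four models identically, so every pairwise $\tau$ is 1; disagreeing rankings would pull the average down.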
+ ![Agreement comparison.](assets/agreement_comparison.png)
+
+ <!-- ![Direct evaluation.](assets/direct_eval.png) -->
+
+ <!-- ![Train-before-test.](assets/train_before_test.png) -->
+
+ <div style="display: flex; justify-content: space-around;">
+ <div style="text-align: center;">
+ <img src="assets/direct_eval.png" alt="Direct evaluation." style="width: 300px;"/>
+ <p>Direct evaluation.</p>
+ </div>
+ <div style="text-align: center;">
+ <img src="assets/train_before_test.png" alt="Train-before-test." style="width: 300px;"/>
+ <p>Train-before-test.</p>
+ </div>
+ </div>
+
+ ---
+
+ ## Task Selection
+
+ LM-Harmony includes nine benchmarks spanning different domains. All datasets are publicly available on Hugging Face.
+
+ | Benchmark | Domain / What it tests | Metric |
+ | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------- |
+ | [MedMCQA](https://huggingface.co/datasets/openlifescienceai/medmcqa) | Professional-level medical knowledge and clinical reasoning (multiple-choice, exam-style) | `acc_norm` |
+ | [NQ-Open](https://huggingface.co/datasets/google-research-datasets/nq_open) | Open-domain factual QA with short, generated answers (no retrieval) | `exact_match,remove_whitespace` |
+ | [Winogrande](https://huggingface.co/datasets/allenai/winogrande) | Commonsense pronoun/coreference resolution with reduced annotation artifacts | `acc` |
+ | [HellaSwag](https://huggingface.co/datasets/Rowan/hellaswag) | Commonsense inference: selecting the most plausible continuation of a situation | `acc_norm` |
+ | [Social-IQA](https://huggingface.co/datasets/allenai/social_i_qa) | Social commonsense: intents, reactions, and interpersonal dynamics | `acc` |
+ | [PIQA](https://huggingface.co/datasets/ybisk/piqa) | Physical commonsense: everyday object interactions and affordances | `acc_norm` |
+ | [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) | Structured commonsense knowledge grounded in concepts and relations (multiple-choice) | `acc` |
+ | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) | Multi-step grade-school math word problems | `exact_match,flexible-extract` |
+ | [ARC-Challenge](https://huggingface.co/datasets/allenai/ai2_arc) | Challenging science questions requiring abstract and multi-hop reasoning | `acc_norm` |
+
+ **Note:** `acc_norm` denotes accuracy where the score for each choice is normalized by byte length. See the [lm-eval-harness blog post](https://blog.eleuther.ai/multiple-choice-normalization/) for details.
+
+ ---
+
+ ## Aggregation
+
+ Aggregating performance across multiple tasks is inherently challenging. [Prior work](https://arxiv.org/pdf/2405.01719) has shown that any aggregated ranking becomes unreliable when task-wise rankings disagree. Although train-before-test evaluation substantially improves ranking consistency, disagreements across tasks remain, so different aggregation methods can yield different overall rankings.
+
+ LM-Harmony currently provides two aggregation approaches:
+
+ - **PC1 (First Principal Component).** We perform PCA on the full model–task score matrix and use the first principal component as the aggregated score. In our paper, we show that PC1 is highly correlated with pretraining compute, suggesting that it captures a latent notion of overall model capacity. Below, we show the variance explained by the first five principal components; under train-before-test, PC1 explains 89% of the variance.
+
+ <div style="display: flex; justify-content: space-around;">
+ <div style="text-align: center;">
+ <img src="assets/pca_direct_eval.png" alt="Direct evaluation." style="width: 300px;"/>
+ <p>Direct evaluation.</p>
+ </div>
+ <div style="text-align: center;">
+ <img src="assets/pca_train_before_test.png" alt="Train-before-test." style="width: 300px;"/>
+ <p>Train-before-test.</p>
+ </div>
+ </div>
+
+ - **Average Score.** We also report the simple average of normalized task scores. While this is the most common aggregation method, its validity is questionable because scores from different tasks are not necessarily commensurable.
+
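The two aggregation rules can be sketched as follows. The score matrix is made up for illustration, and the real leaderboard pipeline may normalize scores differently; this is only meant to show the mechanics of PC1 versus a simple average.

```python
import numpy as np

# Hypothetical model-by-task score matrix (rows = models, columns = tasks).
scores = np.array([
    [0.62, 0.34, 0.80],
    [0.55, 0.30, 0.74],
    [0.71, 0.40, 0.85],
    [0.48, 0.25, 0.69],
])

# Standardize each task column so tasks contribute on a comparable scale.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Average score: mean of the normalized task scores per model.
avg_score = z.mean(axis=1)

# PC1: project onto the first right singular vector of the standardized
# matrix, i.e. the first principal component.
_, s, vt = np.linalg.svd(z, full_matrices=False)
pc1 = z @ vt[0]

# Fraction of variance explained by each principal component.
explained = s**2 / np.sum(s**2)
```

One practical detail: the sign of a singular vector is arbitrary, so PC1 would typically be oriented so that larger values correspond to stronger models before ranking.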
+ ---
+
+ ## Hyperparameters
+
+ We observe that while hyper-parameter choices, especially the learning rate, can have a large impact on absolute performance, they have minimal impact on model rankings as long as the same configuration is applied consistently across models.
+
+ In other words, rankings are stable across reasonable hyper-parameter choices, even though raw scores may shift. To ensure reproducibility and ease of evaluation for newly added models, LM-Harmony therefore adopts a fixed hyper-parameter setup.
+
+ We use a fixed learning rate of $5 \times 10^{-5}$. For MedMCQA, NQ-Open, Winogrande, HellaSwag, and Social-IQA, we train for one epoch and save a checkpoint every 20% of the steps. For PIQA, CommonsenseQA, GSM8K, and ARC-Challenge, whose training sets are smaller, we train for five epochs and save a checkpoint at the end of each epoch. We cap the training set at 50,000 samples, the validation set at 2,000, and the test set at 10,000. The best checkpoint is selected on the validation set. Please refer to our [code](https://github.com/socialfoundations/lm-harmony) for reproduction.
+
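The one-epoch schedule amounts to simple step arithmetic. The batch size below is a hypothetical placeholder (the text fixes only the learning rate, epoch counts, and dataset caps; see the linked code for the real configuration):

```python
import math

learning_rate = 5e-5        # fixed across all models and tasks
max_train_samples = 50_000  # training-set cap from the protocol
batch_size = 32             # hypothetical; not specified in the text

# One-epoch datasets: save a checkpoint every 20% of the optimizer steps.
steps_per_epoch = math.ceil(max_train_samples / batch_size)
save_every = steps_per_epoch // 5
checkpoint_steps = list(range(save_every, steps_per_epoch + 1, save_every))
```

With these assumed numbers this yields five checkpoints per run, mirroring the five end-of-epoch checkpoints used for the smaller datasets.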
+ ---
+
+ ## Scope and Limitations
+
+ LM-Harmony is designed to measure **model potential under adaptation**, not zero-shot or deployment-time performance. Benchmarks without sufficient training data are not suitable for this protocol, and domain-specific tasks (e.g., coding or multimodal reasoning) may require specialized extensions.
+
+ We view LM-Harmony as an evolving benchmark suite and welcome future expansions as new datasets and evaluation methodologies become available.
app.py CHANGED
@@ -138,7 +138,8 @@ with demo:
      with gr.TabItem(
          "📝 About", elem_id="llm-benchmark-tab-table-about", interactive=True
      ):
-         gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
+         with open("about.md", "r") as f:
+             gr.Markdown(f.read(), elem_classes="markdown-text")

      with gr.TabItem(
          "🚀 Submit here! ",
@@ -186,7 +187,7 @@ with demo:
      # )
      with gr.Row():
          gr.Markdown(
-             "# ✉️✨ This is still under construction!",
+             "# ✉️✨ This is still under construction! For now, if you would like to add your model (in Huggingface) to the leaderboard, please simply reach out to us through email.",
              elem_classes="markdown-text",
          )
assets/agreement_comparison.png ADDED
assets/direct_eval.png ADDED
assets/pca_direct_eval.png ADDED
assets/pca_train_before_test.png ADDED
assets/train_before_test.png ADDED