	# Evaluation Cards for Model Developers

	How your model appears in the corpus, how to read what's documented vs. missing, and how to make your evaluation reporting stronger.

	> Prerequisite: the [Quickstart](quickstart.md) (~6 min) covers the four signals, the five-level hierarchy, and the snapshot model.

	---

	## Why you should care how your model is carded

	Whatever you publish about your model's evaluations, Evaluation Cards re-presents it as a structured record alongside everyone else's, and also shows what you did not disclose. The project treats a published score as a claim and an undisclosed detail as a claim deliberately not made (not an error). That means your card is, in effect, a public read on how legible and verifiable your reporting is.

	A high benchmark number with weak signals reads as a weak claim. Most of those gaps are cheap to close, and this guide is a checklist for doing so.

	> 🖼️ Screenshot — `01-home-overview.png`
	> What to capture: The homepage with the corpus snapshot and the four signals.

	---

	## Step 1: Find your model (and your org)

	- Models view (`/models`): search for your model, filter by parameter range, and open its page.

	> 🖼️ Screenshot — `09-models-index.png`
	> What to capture: The Models index list.

	- Developers view: toggle Models → Developers on the same page to see your organization's footprint across all its models.

	> 🖼️ Screenshot — `22-developers-view.png`
	> What to capture: The `/models` page with the Developers toggle active.

	---

	## Step 2: Read your page honestly

	Open your model's page in Summary View, then switch to Researcher View to see the per-result detail others will scrutinize.

	> 🖼️ Screenshot — `13-card-summary-full.png`
	> What to capture: A full model page in Summary View.

	> 🖼️ Screenshot — `19-card-researcher-full.png`
	> What to capture: The same page after clicking Researcher View.

	Three things to look at first:

	1. The `DOCUMENTED` badge (e.g. "36%, 14 / 39 reported"). This is the headline read on your reporting hygiene: how much of your reported record is fully specified. Treat a low number as a backlog, not a verdict.

	> 🖼️ Screenshot — `14-card-summary-top.png`
	> What to capture: The page header with the DOCUMENTED badge.

	2. §1 Identification. Confirm the basics are right: model name, developer, release date, modalities, system ID. Errors here propagate everywhere.

	> 🖼️ Screenshot — `15-card-identification.png`
	> What to capture: The §1 Identification block.

	3. §2 Benchmark coverage. Check which categories are present and where the split spread is. Gaps here are visible to everyone reading your page.

	> 🖼️ Screenshot — `16-card-coverage.png`
	> What to capture: The §2 Benchmark coverage section.
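
The badge example in item 1 is consistent with a simple ratio: 14 fully documented results out of 39 reported rounds to 36%. Below is a minimal sketch of that arithmetic. It assumes (the site does not confirm this) that the badge is the documented count divided by the reported count, rounded to the nearest whole percent; `documented_share` is a hypothetical helper name.

```python
def documented_share(documented: int, reported: int) -> int:
    """Return the documented share as a whole-number percentage.

    Assumption: the badge is documented / reported, rounded to the
    nearest percent. The real site may compute it differently.
    """
    if reported <= 0:
        raise ValueError("reported must be positive")
    if not 0 <= documented <= reported:
        raise ValueError("documented must be between 0 and reported")
    return round(100 * documented / reported)


# The badge example: 14 of 39 reported results are fully documented.
print(f"{documented_share(14, 39)}%, 14 / 39 reported")
```

Read this way, each additional fully specified result moves the badge by roughly 2.6 points on a 39-result record, which is why a low number is better treated as a backlog.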

	---

	## Step 3: Work the four signals as a reporting checklist

	Each signal maps to concrete, mostly low-cost actions.

	### 🔁 Reproducibility: make it re-runnable
	Disclose, per result: setup variants, prompts, decoding parameters, harness name + version, random seeds, and code/artifacts. Corpus-wide, only ~3% of scores have complete setup documentation, so even modest disclosure stands out.

	- ✅ Publish the exact harness + version (e.g. the eval framework and commit).
	- ✅ State decoding settings (temperature, top-p, max tokens) and seeds.
	- ✅ Link runnable code or a config, not just a number.
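
One way to make the per-result disclosure list concrete is to keep a small record alongside each score and check it for blanks before publishing. The sketch below is illustrative only: `ResultDisclosure`, its field names, and `missing_fields` are hypothetical and are not the Evaluation Cards schema.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ResultDisclosure:
    """Hypothetical per-result record; field names are illustrative only."""
    benchmark: str
    score: float
    setup_variant: Optional[str] = None      # e.g. "5-shot"
    prompt_template: Optional[str] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: Optional[int] = None
    harness_name: Optional[str] = None
    harness_version: Optional[str] = None    # e.g. a release tag or commit hash
    seeds: list[int] = field(default_factory=list)
    code_url: Optional[str] = None


def missing_fields(record: ResultDisclosure) -> list[str]:
    """List setup fields that are still undisclosed (None or an empty list).

    A value of 0 or 0.0 counts as disclosed: temperature=0.0 is a real setting.
    """
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value == []:
            missing.append(f.name)
    return missing


# Placeholder values, not real results.
record = ResultDisclosure(
    benchmark="example-benchmark",
    score=0.71,
    harness_name="example-harness",
    temperature=0.0,
    seeds=[0],
)
print(missing_fields(record))
```

Running a check like this before an announcement turns the disclosure list into something that can fail a release review, rather than a matter of memory.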

	### 📋 Completeness: cover the categories that matter
	Completeness is judged relative to expectations for your model class. A strong capability record that is silent on safety, robustness, or fairness reads as incomplete.

	- ✅ Report beyond the flattering capability benchmarks.
	- ✅ If a category is intentionally out of scope, the absence will still show, so consider documenting why.
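
The completeness check can be thought of as a set difference between the categories expected for a model class and the categories actually reported. The sketch below assumes a fixed expected set taken from the categories named above. The real expectations per model class are not specified here, and `coverage_gaps` is a hypothetical helper.

```python
# Assumption: the expected categories are a plain set of labels.
EXPECTED_CATEGORIES = {"capability", "safety", "robustness", "fairness"}


def coverage_gaps(reported: set[str], out_of_scope: dict[str, str]) -> dict[str, str]:
    """Map each expected-but-unreported category to its status.

    `out_of_scope` maps a category to the documented reason it was skipped.
    Categories without a documented reason are labeled as undocumented gaps.
    """
    gaps = {}
    for category in sorted(EXPECTED_CATEGORIES - reported):
        gaps[category] = out_of_scope.get(category, "undocumented gap")
    return gaps


# Placeholder input: only capability results, with one documented exclusion.
print(coverage_gaps({"capability"}, {"fairness": "out of scope: see model card"}))
```

The absence still shows either way; the difference is whether a reader sees a stated reason or nothing at all.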

	### 👤 Provenance & Risk: invite independent evaluation
	Your page splits results into first-party (your own) vs third-party (independent), and maps results to IBM Risk Atlas risk domains. A category that is entirely first-party is a visible blind spot: impressive, but uncorroborated.

	> 🖼️ Screenshot — `17-card-who-reports.png`
	> What to capture: The §3 "Who reports what" first-party vs third-party breakdown.

	- ✅ Encourage / enable third-party evaluation, especially for safety-relevant claims.
	- ✅ Don't expect self-reported numbers alone to carry weight with careful readers.
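
The blind spot described above can be found mechanically: group results by category and flag any category where every result is first-party. This is a sketch under assumed data shapes; the dict keys `"category"` and `"source"` and the function name are hypothetical, not the site's data model.

```python
from collections import defaultdict


def first_party_only(results: list[dict]) -> list[str]:
    """Return categories whose results are all first-party.

    Each result is a dict with hypothetical keys "category" and "source",
    where "source" is either "first-party" or "third-party".
    """
    sources = defaultdict(set)
    for r in results:
        sources[r["category"]].add(r["source"])
    return sorted(c for c, s in sources.items() if s == {"first-party"})


# Placeholder records, not real results.
results = [
    {"category": "capability", "source": "first-party"},
    {"category": "capability", "source": "third-party"},
    {"category": "safety", "source": "first-party"},
    {"category": "safety", "source": "first-party"},
]
print(first_party_only(results))
```

In this placeholder data, safety is the category that would read as uncorroborated, which is also where independent evaluation matters most.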

	### ⚖️ Comparability: report so your scores can be compared
	The site flags when two scores on the same benchmark used different splits, metric variants, or units, which invalidates direct comparison.

	> 🖼️ Screenshot — `18-card-metrics.png`
	> What to capture: The §4 Reported metrics section with benchmark charts.

	- ✅ Use the standard splits and metric definitions for each benchmark.
	- ✅ State units explicitly; don't silently switch metric variants.
	- ✅ Use the Evaluations page to see how others report the same benchmark family.

	> 🖼️ Screenshot — `08-evals-family-expanded.png`
	> What to capture: A benchmark family detail page (e.g. `/evals?family=artificial-analysis`).
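
The comparability rule amounts to this: two scores on the same benchmark are directly comparable only if they agree on split, metric variant, and unit. Below is a minimal sketch of that check, assuming a simplified score record; `Score` and `comparability_issues` are hypothetical names, and the site's actual flagging logic may consider more than these three attributes.

```python
from typing import NamedTuple


class Score(NamedTuple):
    benchmark: str
    split: str
    metric: str
    unit: str
    value: float


def comparability_issues(a: Score, b: Score) -> list[str]:
    """Return the attributes on which two scores disagree.

    An empty list means the scores can be compared directly under
    this simplified rule.
    """
    if a.benchmark != b.benchmark:
        return ["benchmark"]
    return [name for name in ("split", "metric", "unit")
            if getattr(a, name) != getattr(b, name)]


# Placeholder scores: same benchmark and split, but one is a percent
# and the other a fraction.
ours = Score("example-bench", "test", "accuracy", "percent", 71.0)
theirs = Score("example-bench", "test", "accuracy", "fraction", 0.71)
print(comparability_issues(ours, theirs))
```

Stating the unit explicitly is what lets a reader, or a tool, see that 71.0 and 0.71 here describe the same result rather than a hundredfold gap.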

	---

	## Step 4: Corrections and snapshots

	The corpus follows snapshot discipline: no retroactive edits, only versioned corrections. Practically:

	- A number you published is attributed to its source document and tied to a dated snapshot, so it won't be silently rewritten.
	- If something is genuinely wrong or missing, the path is a correction in a future snapshot, not an edit-in-place. Keep your own source documents stable and clearly dated so they can be cited cleanly.
	- Because nothing is imputed, the fastest way to improve your page is to report more and to report it precisely; the corpus will reflect it on the next snapshot.
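
Snapshot discipline behaves like an append-only log: a correction is a new entry in a later snapshot that points back at the one it supersedes, and the earlier entry stays exactly as it was. The sketch below illustrates the idea with a hypothetical schema (`Entry`, `correct`, and `current_value` are assumed names, not the corpus's actual storage format).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """One attributed value in one dated snapshot (hypothetical schema)."""
    snapshot: date
    benchmark: str
    value: float
    source_doc: str
    corrects: Optional[date] = None  # snapshot date of the entry this supersedes


def current_value(history: list[Entry], benchmark: str) -> Entry:
    """Return the latest entry for a benchmark without touching older ones."""
    matching = [e for e in history if e.benchmark == benchmark]
    if not matching:
        raise KeyError(benchmark)
    return max(matching, key=lambda e: e.snapshot)


def correct(history: list[Entry], old: Entry, value: float,
            source_doc: str, snapshot: date) -> list[Entry]:
    """Append a correction in a later snapshot; never edit in place."""
    if snapshot <= old.snapshot:
        raise ValueError("a correction must land in a later snapshot")
    new = Entry(snapshot, old.benchmark, value, source_doc, corrects=old.snapshot)
    return history + [new]


# Placeholder history: one published value, later corrected.
history = [Entry(date(2025, 1, 1), "example-bench", 0.70, "report-v1")]
updated = correct(history, history[0], 0.68, "report-v2", date(2025, 2, 1))
print(current_value(updated, "example-bench").value)
```

Because the record is frozen and the history is only ever extended, the original 0.70 remains citable against its dated source. This is also why stable, dated source documents matter on the developer's side.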

	---

	## A pre-launch checklist

	Before your next model announcement, check that each headline benchmark claim ships with:

	- [ ] Harness name + version, and a link to runnable code/config
	- [ ] Decoding settings and seeds
	- [ ] The exact split and metric (with units), matching how the benchmark is normally reported
	- [ ] Coverage beyond capability: at least some safety / robustness / fairness results
	- [ ] At least one path to independent (third-party) evaluation for key claims
	- [ ] A stable, dated source document for every number

	Closing these gaps is what moves your `DOCUMENTED` percentage and turns marketing numbers into claims that hold up.

	➡️ Related: [Evaluation researchers](evaluation-researchers.md) (how researchers will scrutinize your page) · [Journalists](journalists.md) (how reporters will source your claims) · [Quickstart](quickstart.md).
