| # Evaluation Cards for Model Developers | |
| *How your model appears in the corpus, how to read what's documented vs. missing, and how to make your evaluation reporting stronger.* | |
| > **Prerequisite:** the [Quickstart](quickstart.md) (~6 min) covers the four signals, the five-level hierarchy, and the snapshot model. | |
| --- | |
| ## Why you should care how your model is carded | |
| Whatever you publish about your model's evaluations, Evaluation Cards re-presents it as a structured record alongside everyone else's, and also shows **what you did not disclose**. The project treats a published score as a **claim** and an undisclosed detail as **a claim deliberately not made** (not an error). That means your card is, in effect, a public read on how *legible* and *verifiable* your reporting is. | |
| A high benchmark number with weak signals reads as a weak claim. Most of those gaps are cheap to close, and this guide is a checklist for doing so. | |
| > 🖼️ **Screenshot: `01-home-overview.png`** | |
| > *What to capture:* The homepage with the corpus snapshot and the four signals. | |
| --- | |
| ## Step 1: Find your model (and your org) | |
| - **Models view** (`/models`): search for your model, filter by parameter range, and open its page. | |
| > 🖼️ **Screenshot: `09-models-index.png`** | |
| > *What to capture:* The Models index list. | |
| - **Developers view**: toggle from **Models** to **Developers** on the same page to see your organization's footprint across all its models. | |
| > 🖼️ **Screenshot: `22-developers-view.png`** | |
| > *What to capture:* The `/models` page with the **Developers** toggle active. | |
| --- | |
| ## Step 2: Read your page honestly | |
| Open your model's page in **Summary View**, then switch to **Researcher View** to see the per-result detail others will scrutinize. | |
| > 🖼️ **Screenshot: `13-card-summary-full.png`** | |
| > *What to capture:* A full model page in Summary View. | |
| > 🖼️ **Screenshot: `19-card-researcher-full.png`** | |
| > *What to capture:* The same page after clicking **Researcher View**. | |
| Three things to look at first: | |
| 1. **The `DOCUMENTED` badge** (e.g. "36%, 14 / 39 reported"). This is the headline read on your reporting hygiene: how much of your reported record is fully specified. Treat a low number as a backlog, not a verdict. | |
| > 🖼️ **Screenshot: `14-card-summary-top.png`** | |
| > *What to capture:* The page header with the DOCUMENTED badge. | |
| 2. **§1 Identification.** Confirm the basics are right: model name, developer, release date, modalities, system ID. Errors here propagate everywhere. | |
| > 🖼️ **Screenshot: `15-card-identification.png`** | |
| > *What to capture:* The §1 Identification block. | |
| 3. **§2 Benchmark coverage.** What categories are present, and where the **split spread** is. Gaps here are visible to everyone reading your page. | |
| > 🖼️ **Screenshot: `16-card-coverage.png`** | |
| > *What to capture:* The §2 Benchmark coverage section. | |
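The badge in item 1 pairs a percentage with a fraction. As a minimal sketch, assuming the percentage is simply the documented count over the reported count, rounded to the nearest whole number (the function name and the rounding rule are assumptions, not the site's confirmed logic), the example "36%, 14 / 39 reported" works out as:

```python
def documented_percent(documented: int, reported: int) -> int:
    """Share of reported results that are fully specified, as a whole percent.

    Hypothetical helper: the corpus's exact rounding rule is an assumption.
    """
    if reported <= 0:
        raise ValueError("reported must be positive")
    if not 0 <= documented <= reported:
        raise ValueError("documented must be between 0 and reported")
    return round(100 * documented / reported)


print(documented_percent(14, 39))  # prints 36
```

Read this way, each additional fully specified result moves the badge by roughly 100 / 39, or about 2.6 points, in this example.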
| --- | |
| ## Step 3: Work the four signals as a reporting checklist | |
| Each signal maps to concrete, mostly low-cost actions. | |
| ### Reproducibility: *make it re-runnable* | |
| Disclose, per result: **setup variants, prompts, decoding parameters, harness name + version, random seeds, and code/artifacts**. Corpus-wide only ~3% of scores have complete setup documentation, so even modest disclosure stands out. | |
| - Publish the exact harness + version (e.g. the eval framework and commit). | |
| - State decoding settings (temperature, top-p, max tokens) and seeds. | |
| - Link runnable code or a config, not just a number. | |
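One way to keep the per-result disclosures above from slipping is to store them next to each score and check for blanks before publishing. The sketch below is illustrative only: `EvalDisclosure`, its field names, and `missing_fields` are hypothetical and are not the corpus's schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvalDisclosure:
    """Hypothetical per-result record covering the items listed above."""
    benchmark: str
    score: float
    setup_variant: Optional[str] = None      # e.g. few-shot vs zero-shot
    prompt_template: Optional[str] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: Optional[int] = None
    harness_name: Optional[str] = None
    harness_version: Optional[str] = None    # release tag or commit
    seeds: tuple[int, ...] = ()
    code_url: Optional[str] = None


def missing_fields(record: EvalDisclosure) -> list[str]:
    """Return the names of fields left undisclosed (None, empty string, or empty tuple)."""
    def is_blank(value: object) -> bool:
        return value is None or value == "" or value == ()
    return [f.name for f in fields(record) if is_blank(getattr(record, f.name))]


# Placeholder values, not real results.
partial = EvalDisclosure(
    benchmark="example-bench",
    score=0.5,
    temperature=0.0,
    top_p=1.0,
    max_tokens=512,
    harness_name="example-harness",
    harness_version="abc1234",
)
print(missing_fields(partial))
# prints ['setup_variant', 'prompt_template', 'seeds', 'code_url']
```

A temperature of `0.0` counts as disclosed here, because the check tests for absence rather than falsiness.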
| ### Completeness: *cover the categories that matter* | |
| Completeness is judged relative to expectations for your model class. A strong **capability** record that is silent on **safety**, **robustness**, or **fairness** reads as incomplete. | |
| - Report beyond the flattering capability benchmarks. | |
| - If a category is intentionally out of scope, the absence will still show, so consider documenting why. | |
| ### Provenance & Risk: *invite independent evaluation* | |
| Your page splits results into **first-party** (your own) vs **third-party** (independent), and maps results to **IBM Risk Atlas** risk domains. A category that is **entirely first-party** is a visible blind spot: impressive, but uncorroborated. | |
| > 🖼️ **Screenshot: `17-card-who-reports.png`** | |
| > *What to capture:* The §3 "Who reports what" first-party vs third-party breakdown. | |
| - Encourage / enable third-party evaluation, especially for safety-relevant claims. | |
| - Don't expect self-reported numbers alone to carry weight with careful readers. | |
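The "entirely first-party" blind spot described above can be checked against your own results list before the corpus does it for you. This is a rough sketch under stated assumptions: the `category` and `source` keys and the `"first-party"` / `"third-party"` labels are hypothetical, chosen only to mirror the wording of this section.

```python
from collections import defaultdict


def first_party_only(results: list[dict]) -> list[str]:
    """Return categories whose results all come from the developer.

    Each result is a dict with hypothetical keys 'category' and 'source'.
    """
    sources_by_category: dict[str, set[str]] = defaultdict(set)
    for result in results:
        sources_by_category[result["category"]].add(result["source"])
    return sorted(
        category
        for category, sources in sources_by_category.items()
        if sources == {"first-party"}
    )


# Placeholder data, not real results.
results = [
    {"category": "capability", "source": "first-party"},
    {"category": "capability", "source": "third-party"},
    {"category": "safety", "source": "first-party"},
]
print(first_party_only(results))  # prints ['safety']
```

Here `safety` is flagged because no independent result corroborates it, while `capability` has at least one third-party entry.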
| ### Comparability: *report so your scores can be compared* | |
| The site flags when two scores on the same benchmark used different splits, metric variants, or units, which invalidates direct comparison. | |
| > 🖼️ **Screenshot: `18-card-metrics.png`** | |
| > *What to capture:* The §4 Reported metrics section with benchmark charts. | |
| - Use the standard splits and metric definitions for each benchmark. | |
| - State units explicitly; don't silently switch metric variants. | |
| - Use the **Evaluations** page to see how others report the same benchmark family. | |
| > 🖼️ **Screenshot: `08-evals-family-expanded.png`** | |
| > *What to capture:* A benchmark family detail page (e.g. `/evals?family=artificial-analysis`). | |
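The comparability rule above (same benchmark, but different split, metric variant, or unit) can be approximated with a small check. This is a sketch of the idea, not the site's actual flagging logic; the `Score` type and its field names are assumptions.

```python
from typing import NamedTuple


class Score(NamedTuple):
    """Hypothetical reported score with the attributes that affect comparability."""
    benchmark: str
    split: str
    metric: str
    unit: str
    value: float


def comparability_issues(a: Score, b: Score) -> list[str]:
    """List the attributes that differ between two scores on the same benchmark.

    An empty list means nothing in this sketch blocks a direct comparison.
    """
    if a.benchmark != b.benchmark:
        return ["benchmark"]
    return [
        name
        for name in ("split", "metric", "unit")
        if getattr(a, name) != getattr(b, name)
    ]


# Placeholder values, not real results.
ours = Score("example-bench", "test", "accuracy", "percent", 71.0)
theirs = Score("example-bench", "validation", "accuracy", "fraction", 0.71)
print(comparability_issues(ours, theirs))  # prints ['split', 'unit']
```

The two values look alike once the unit is converted, but the differing split still rules out a direct comparison.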
| --- | |
| ## Step 4: Corrections and snapshots | |
| The corpus follows **snapshot discipline**: no retroactive edits, only versioned corrections. Practically: | |
| - A number you published is attributed to its source document and tied to a dated snapshot, so it won't be silently rewritten. | |
| - If something is genuinely wrong or missing, the path is a **correction in a future snapshot**, not an edit-in-place. Keep your own source documents stable and clearly dated so they can be cited cleanly. | |
| - Because nothing is imputed, the fastest way to improve your page is simply to **report more, and report it precisely**, and the corpus will reflect it on the next snapshot. | |
| --- | |
| ## A pre-launch checklist | |
| Before your next model announcement, check that each headline benchmark claim ships with: | |
| - [ ] Harness name + version, and a link to runnable code/config | |
| - [ ] Decoding settings and seeds | |
| - [ ] The exact split and metric (with units), matching how the benchmark is normally reported | |
| - [ ] Coverage beyond capability: at least some safety / robustness / fairness results | |
| - [ ] At least one path to independent (third-party) evaluation for key claims | |
| - [ ] A stable, dated source document for every number | |
| Closing these is what moves your `DOCUMENTED` percentage, and turns marketing numbers into claims that hold up. | |
| ➡️ Related: [Evaluation researchers](evaluation-researchers.md) (how researchers will scrutinize your page) · [Journalists](journalists.md) (how reporters will source your claims) · [Quickstart](quickstart.md). | |