# Evaluation Cards for Model Developers *How your model appears in the corpus, how to read what's documented vs. missing, and how to make your evaluation reporting stronger.* > **Prerequisite:** the [Quickstart](quickstart.md) (~6 min) covers the four signals, the five-level hierarchy, and the snapshot model. --- ## Why you should care how your model is carded Whatever you publish about your model's evaluations, Evaluation Cards re-presents it as a structured record alongside everyone else's, and also shows **what you did not disclose**. The project treats a published score as a **claim** and an undisclosed detail as **a claim deliberately not made** (not an error). That means your card is, in effect, a public read on how *legible* and *verifiable* your reporting is. A high benchmark number with weak signals reads as a weak claim. Most of those gaps are cheap to close, and this guide is a checklist for doing so. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `01-home-overview.png`** > *What to capture:* The homepage with the corpus snapshot and the four signals. --- ## Step 1: Find your model (and your org) - **Models view** (`/models`): search for your model, filter by parameter range, and open its page. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `09-models-index.png`** > *What to capture:* The Models index list. - **Developers view**: toggle **Models โ†’ Developers** on the same page to see your organization's footprint across all its models. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `22-developers-view.png`** > *What to capture:* The `/models` page with the **Developers** toggle active. --- ## Step 2: Read your page honestly Open your model's page in **Summary View**, then switch to **Researcher View** to see the per-result detail others will scrutinize. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `13-card-summary-full.png`** > *What to capture:* A full model page in Summary View. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `19-card-researcher-full.png`** > *What to capture:* The same page after clicking **Researcher View**. Three things to look at first: 1. **The `DOCUMENTED` badge** (e.g. "36%, 14 / 39 reported"). This is the headline read on your reporting hygiene: how much of your reported record is fully specified. Treat a low number as a backlog, not a verdict. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `14-card-summary-top.png`** > *What to capture:* The page header with the DOCUMENTED badge. 2. **ยง1 Identification.** Confirm the basics are right: model name, developer, release date, modalities, system ID. Errors here propagate everywhere. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `15-card-identification.png`** > *What to capture:* The ยง1 Identification block. 3. **ยง2 Benchmark coverage.** What categories are present, and where the **split spread** is. Gaps here are visible to everyone reading your page. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `16-card-coverage.png`** > *What to capture:* The ยง2 Benchmark coverage section. --- ## Step 3: Work the four signals as a reporting checklist Each signal maps to concrete, mostly low-cost actions. ### ๐Ÿ” Reproducibility: *make it re-runnable* Disclose, per result: **setup variants, prompts, decoding parameters, harness name + version, random seeds, and code/artifacts**. Corpus-wide only ~3% of scores have complete setup documentation, so even modest disclosure stands out. - โœ… Publish the exact harness + version (e.g. the eval framework and commit). - โœ… State decoding settings (temperature, top-p, max tokens) and seeds. - โœ… Link runnable code or a config, not just a number. ### ๐Ÿ“‹ Completeness: *cover the categories that matter* Completeness is judged relative to expectations for your model class. A strong **capability** record that is silent on **safety**, **robustness**, or **fairness** reads as incomplete. - โœ… Report beyond the flattering capability benchmarks. - โœ… If a category is intentionally out of scope, the absence will still show, so consider documenting why. ### ๐Ÿ‘ค Provenance & Risk: *invite independent evaluation* Your page splits results into **first-party** (your own) vs **third-party** (independent), and maps results to **IBM Risk Atlas** risk domains. A category that is **entirely first-party** is a visible blind spot: impressive, but uncorroborated. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `17-card-who-reports.png`** > *What to capture:* The ยง3 "Who reports what" first-party vs third-party breakdown. - โœ… Encourage / enable third-party evaluation, especially for safety-relevant claims. - โœ… Don't expect self-reported numbers alone to carry weight with careful readers. ### โš–๏ธ Comparability: *report so your scores can be compared* The site flags when two scores on the same benchmark used different splits, metric variants, or units, which invalidates direct comparison. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `18-card-metrics.png`** > *What to capture:* The ยง4 Reported metrics section with benchmark charts. - โœ… Use the standard splits and metric definitions for each benchmark. - โœ… State units explicitly; don't silently switch metric variants. - โœ… Use the **Evaluations** page to see how others report the same benchmark family. > ๐Ÿ–ผ๏ธ **Screenshot โ€” `08-evals-family-expanded.png`** > *What to capture:* A benchmark family detail page (e.g. `/evals?family=artificial-analysis`). --- ## Step 4: Corrections and snapshots The corpus follows **snapshot discipline**: no retroactive edits, only versioned corrections. Practically: - A number you published is attributed to its source document and tied to a dated snapshot, so it won't be silently rewritten. - If something is genuinely wrong or missing, the path is a **correction in a future snapshot**, not an edit-in-place. Keep your own source documents stable and clearly dated so they can be cited cleanly. - Because nothing is imputed, the fastest way to improve your page is simply to **report more, and report it precisely**, and the corpus will reflect it on the next snapshot. --- ## A pre-launch checklist Before your next model announcement, check that each headline benchmark claim ships with: - [ ] Harness name + version, and a link to runnable code/config - [ ] Decoding settings and seeds - [ ] The exact split and metric (with units), matching how the benchmark is normally reported - [ ] Coverage beyond capability: at least some safety / robustness / fairness results - [ ] At least one path to independent (third-party) evaluation for key claims - [ ] A stable, dated source document for every number Closing these is what moves your `DOCUMENTED` percentage, and turns marketing numbers into claims that hold up. โžก๏ธ Related: [Evaluation researchers](evaluation-researchers.md) (how researchers will scrutinize your page) ยท [Journalists](journalists.md) (how reporters will source your claims) ยท [Quickstart](quickstart.md).