---
title: ellamind base-eval
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
  - read-repos
  - gated-repos
---

# ellamind base-eval

Interactive visualization of LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM, so no backend is required.

- **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref)
- **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval)

## Features

- **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics**: `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type**: line charts for training runs (tokens trained on the x-axis), bar charts for single-point comparisons
- **Multi-panel layout**: add multiple independent panels
- **Merge datasets**: append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing**: configurable moving average for line charts
- **Benchmark goodness metrics**: per-task quality indicators below line charts
- **Export**: download charts as PNG or SVG
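The smoothing feature is a plain moving average over each line's scores. A minimal sketch of how such a window could be applied client-side (the function name and edge handling are illustrative, not the app's actual code):

```javascript
// Centered moving average over an array of scores.
// windowSize is assumed odd; near the edges a truncated window is
// used so the output has the same length as the input.
function movingAverage(scores, windowSize) {
  const half = Math.floor(windowSize / 2);
  return scores.map((_, i) => {
    const start = Math.max(0, i - half);
    const end = Math.min(scores.length, i + half + 1);
    const window = scores.slice(start, end);
    return window.reduce((a, b) => a + b, 0) / window.length;
  });
}
```

Truncating the window at the edges (rather than dropping points) keeps the smoothed line aligned with the raw checkpoints, at the cost of less smoothing at the start and end of a run.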

## Merge Datasets

You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.
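Conceptually, a row-append merge like this can be expressed as a single DuckDB statement that reads the extra parquet file over HTTP and inserts its rows. The sketch below only builds such a SQL string; the table name, default file name, and URL layout are assumptions for illustration, not the app's actual code:

```javascript
// Build a DuckDB statement that row-appends a HF dataset's parquet
// file into a local table. Accepts "org/name" or
// "org/name/file.parquet". Table and default file name are hypothetical.
function buildMergeQuery(datasetPath, table = 'scores') {
  const parts = datasetPath.split('/');
  const repo = parts.slice(0, 2).join('/');
  const file = parts[2] || 'data.parquet'; // assumed default file name
  const url = `https://huggingface.co/datasets/${repo}/resolve/main/${file}`;
  return `INSERT INTO ${table} SELECT * FROM read_parquet('${url}')`;
}
```

DuckDB's `read_parquet` can fetch over HTTPS directly, which is what makes a backend-free merge possible in the browser.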

For **private datasets**, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.

## Benchmark Goodness Metrics

Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half).

| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| **Noise** | MAD of consecutive score diffs (robust to data-mix jumps) | n/a | n/a | n/a |
| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| **Discrimination** | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
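To make two of these concrete: Spearman correlation is Pearson correlation computed on ranks, and the noise metric is the median absolute deviation of consecutive score diffs. The sketch below is a simplified illustration (ties in ranks are not handled), not the app's implementation:

```javascript
// Rank of each value (0-based); assumes no ties for simplicity.
function ranks(xs) {
  const order = xs.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const r = new Array(xs.length);
  order.forEach(([, idx], rank) => { r[idx] = rank; });
  return r;
}

// Pearson correlation of two equal-length arrays.
function pearson(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Monotonicity: Spearman correlation = Pearson on ranks.
function spearman(xs, ys) {
  return pearson(ranks(xs), ranks(ys));
}

// Noise: median absolute deviation of consecutive score diffs.
function noiseMAD(scores) {
  const median = (a) => {
    const s = [...a].sort((x, y) => x - y);
    const m = Math.floor(s.length / 2);
    return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
  };
  const diffs = scores.slice(1).map((v, i) => v - scores[i]);
  const m = median(diffs);
  return median(diffs.map((d) => Math.abs(d - m)));
}
```

A benchmark whose score rises at every checkpoint scores a perfect monotonicity of 1, and a run improving at a constant rate has zero noise; real curves land somewhere in between, which is what the color thresholds above discretize.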

## Configuration

Model colors in `config.yaml`:

```yaml
model_colors:
  "Qwen3 1.7B": "#9575CD"
  "Gemma 3 4B": "#00B0FF"
```

## Local Development

```bash
python3 -m http.server 8080
```

OAuth sign-in is only available when the app is deployed as an HF Space; when running locally, the sign-in button is hidden.

## Deployment

```bash
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```

## Project Structure

```
index.html    # Single-file web app (HTML + CSS + JS)
config.yaml   # Model color overrides
README.md     # HF Spaces metadata + docs
```