---
title: ellamind base-eval
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
- read-repos
- gated-repos
---
# ellamind base-eval
Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM — no backend required.
- **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref)
- **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval)
## Features
- **Hierarchical task selection** — eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics** — `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison** — toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type** — line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
- **Multi-panel layout** — add multiple independent panels
- **Merge datasets** — append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing** — configurable moving average for line charts
- **Benchmark goodness metrics** — per-task quality indicators below line charts
- **Export** — download charts as PNG or SVG
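To make the smoothing behavior concrete: a trailing moving average replaces each point with the mean of the last few scores. This is a minimal pure-Python sketch of the idea only — the app itself is client-side JavaScript, and its exact windowing (trailing vs. centered, edge handling) is an assumption here:

```python
def moving_average(scores, window=3):
    """Trailing moving average over up to `window` points.

    The trailing window and shrunken edges are assumptions for
    illustration; the app's actual smoothing may differ.
    """
    out = []
    for i in range(len(scores)):
        lo = max(0, i - window + 1)  # window shrinks at the start
        out.append(sum(scores[lo:i + 1]) / (i + 1 - lo))
    return out
```

With `window=2`, `[1, 2, 3, 4]` becomes `[1.0, 1.5, 2.5, 3.5]` — early points average over fewer values, so the curve starts at the raw score.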
## Merge Datasets
You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.
For **private datasets**, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.
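The two accepted path forms map onto Hugging Face's standard `resolve` download endpoint. The sketch below shows one plausible mapping; the `data.parquet` default filename is a placeholder assumption (the app's real default may differ), and for private datasets the OAuth token would be sent as a `Bearer` authorization header:

```python
def resolve_parquet_url(path, default_file="data.parquet"):
    """Map 'org/dataset-name' or 'org/dataset-name/custom.parquet'
    to an HF resolve URL.

    `default_file` is a hypothetical placeholder; the real app may
    pick a different file when none is given.
    """
    parts = path.split("/")
    repo = "/".join(parts[:2])                      # org/dataset-name
    file = parts[2] if len(parts) > 2 else default_file
    return f"https://huggingface.co/datasets/{repo}/resolve/main/{file}"
```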
## Benchmark Goodness Metrics
Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half).
| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| **Noise** | MAD of consecutive score diffs (robust to data-mix jumps) | — | — | — |
| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| **Discrimination** | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
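The first three row definitions can be sketched in a few lines. This is an illustrative pure-Python version, not the app's JS implementation: the Spearman and MAD formulas are standard, but the exact Signal Strength normalization (here, change relative to the first score) is my assumption:

```python
def _ranks(xs):
    """1-based average ranks; ties share the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Monotonicity: Spearman rank correlation of steps vs. scores."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def _median(xs):
    s, n = sorted(xs), len(xs)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def noise_mad(scores):
    """Noise: median absolute deviation of consecutive score diffs."""
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    med = _median(diffs)
    return _median([abs(d - med) for d in diffs])

def signal_strength(scores):
    """Signal Strength: improvement relative to the initial score
    (my reading of the definition; the app may normalize differently)."""
    return (scores[-1] - scores[0]) / abs(scores[0])
```

A perfectly monotone run scores `spearman == 1.0`, so it lands in the green band; a run that degrades throughout scores −1.0.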
## Configuration
Model colors can be overridden in `config.yaml`:
```yaml
model_colors:
"Qwen3 1.7B": "#9575CD"
"Gemma 3 4B": "#00B0FF"
```
## Local Development
```bash
python3 -m http.server 8080
```
OAuth sign-in is only available when deployed as an HF Space; when running locally, the sign-in button is hidden.
## Deployment
```bash
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```
## Project Structure
```
index.html # Single-file web app (HTML + CSS + JS)
config.yaml # Model color overrides
README.md # HF Spaces metadata + docs
```