---
title: ellamind base-eval
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
  - read-repos
  - gated-repos
---
# ellamind base-eval

Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM; no backend required.

- Default dataset: `ellamind/eval-scores-ref`
- GitHub: `ellamind/base-eval`
## Features

- **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics**: `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type**: line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
- **Multi-panel layout**: add multiple independent panels
- **Merge datasets**: append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing**: configurable moving average for line charts
- **Benchmark goodness metrics**: per-task quality indicators below line charts
- **Export**: download charts as PNG or SVG
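The smoothing feature applies a moving average over consecutive checkpoint scores. As a minimal sketch of one plausible variant, here is a trailing-window moving average in Python (the app's exact windowing behavior is an assumption; this is not the app's client-side code):

```python
from statistics import fmean

def moving_average(scores, window=3):
    """Smooth a score series with a trailing moving average.

    Each point becomes the mean of itself and up to `window - 1`
    preceding points, so the output keeps the input's length.
    """
    return [fmean(scores[max(0, i - window + 1):i + 1])
            for i in range(len(scores))]
```

A trailing window avoids peeking at future checkpoints, at the cost of lagging slightly behind sharp improvements.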
## Merge Datasets

You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.

For private datasets, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.
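Row-appending means the merged result has the union of all rows and the union of all columns, with missing values padded like SQL NULLs. A small Python sketch of that semantics (illustrative only; the app performs the append via DuckDB-WASM, and the helper name below is not from the codebase):

```python
def merge_rows(base, extra):
    """Row-append `extra` onto `base` (both lists of dict records).

    The result carries the union of all column names; records that
    lack a column are padded with None, mirroring SQL NULLs.
    """
    columns = []
    for row in base + extra:
        for key in row:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order
    return [{col: row.get(col) for col in columns} for row in base + extra]
```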
## Benchmark Goodness Metrics

Line charts display quality indicators below the plot, inspired by the FineTasks methodology. Metrics are computed client-side across three stages: Overall, Early (first half of training), and Late (second half).

| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| Monotonicity | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| Signal Strength | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| Noise | MAD of consecutive score diffs (robust to data-mix jumps) | – | – | – |
| Ordering | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| Discrimination | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
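To make the Monotonicity and Noise definitions concrete, here is a dependency-free Python sketch. It assumes no tied steps or scores (ties would need average ranks) and is not the app's actual client-side implementation:

```python
from statistics import median

def spearman(steps, scores):
    """Spearman correlation: Pearson correlation on ranks (no-ties case)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(steps), ranks(scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def noise(scores):
    """Median absolute deviation of consecutive score diffs.

    Using a median rather than a mean keeps a single large jump
    (e.g. from a data-mix change) from dominating the estimate.
    """
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    m = median(diffs)
    return median(abs(d - m) for d in diffs)
```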
## Configuration

Model colors in `config.yaml`:

```yaml
model_colors:
  "Qwen3 1.7B": "#9575CD"
  "Gemma 3 4B": "#00B0FF"
```
## Local Development

```shell
python3 -m http.server 8080
```

OAuth sign-in is only available when deployed as an HF Space; when running locally, the sign-in button is hidden.
## Deployment

```shell
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```
## Project Structure

```
index.html   # Single-file web app (HTML + CSS + JS)
config.yaml  # Model color overrides
README.md    # HF Spaces metadata + docs
```