---
title: ellamind base-eval
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
  - read-repos
  - gated-repos
---

# ellamind base-eval

Interactive visualization of LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM, so no backend is required.

- **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref)
- **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval)

## Features

- **Hierarchical task selection**: eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics**: `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison**: toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type**: line charts for training runs (tokens trained on the x-axis), bar charts for single-point comparisons
- **Multi-panel layout**: add multiple independent panels
- **Merge datasets**: append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing**: configurable moving average for line charts
- **Benchmark goodness metrics**: per-task quality indicators below line charts
- **Export**: download charts as PNG or SVG
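The smoothing feature is a plain moving average over each line's scores. A minimal sketch of how such a window could be applied client-side (the function name and edge handling are illustrative, not the app's actual code):

```javascript
// Centered moving average over an array of scores.
// windowSize is assumed odd; near the edges a truncated window is
// used so the output has the same length as the input.
function movingAverage(scores, windowSize) {
  const half = Math.floor(windowSize / 2);
  return scores.map((_, i) => {
    const start = Math.max(0, i - half);
    const end = Math.min(scores.length, i + half + 1);
    const window = scores.slice(start, end);
    return window.reduce((a, b) => a + b, 0) / window.length;
  });
}
```

Truncating the window at the edges (rather than dropping points) keeps the smoothed line aligned with the raw checkpoints, at the cost of less smoothing at the start and end of a run.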

## Merge Datasets

You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.
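Conceptually, a row-append merge like this can be expressed as a single DuckDB statement that reads the extra parquet file over HTTP and inserts its rows. The sketch below only builds such a SQL string; the table name, default file name, and URL layout are assumptions for illustration, not the app's actual code:

```javascript
// Build a DuckDB statement that row-appends a HF dataset's parquet
// file into a local table. Accepts "org/name" or
// "org/name/file.parquet". Table and default file name are hypothetical.
function buildMergeQuery(datasetPath, table = 'scores') {
  const parts = datasetPath.split('/');
  const repo = parts.slice(0, 2).join('/');
  const file = parts[2] || 'data.parquet'; // assumed default file name
  const url = `https://huggingface.co/datasets/${repo}/resolve/main/${file}`;
  return `INSERT INTO ${table} SELECT * FROM read_parquet('${url}')`;
}
```

DuckDB's `read_parquet` can fetch over HTTPS directly, which is what makes a backend-free merge possible in the browser.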

For **private datasets**, sign in with your HuggingFace account using the OAuth button. The access token is used automatically when fetching.

## Benchmark Goodness Metrics

Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half).

| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| **Noise** | MAD of consecutive score diffs (robust to data-mix jumps) | n/a | n/a | n/a |
| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| **Discrimination** | Std of scores across models at last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
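To make two of these concrete: Spearman correlation is Pearson correlation computed on ranks, and the noise metric is the median absolute deviation of consecutive score diffs. The sketch below is a simplified illustration (ties in ranks are not handled), not the app's implementation:

```javascript
// Rank of each value (0-based); assumes no ties for simplicity.
function ranks(xs) {
  const order = xs.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const r = new Array(xs.length);
  order.forEach(([, idx], rank) => { r[idx] = rank; });
  return r;
}

// Pearson correlation of two equal-length arrays.
function pearson(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Monotonicity: Spearman correlation = Pearson on ranks.
function spearman(xs, ys) {
  return pearson(ranks(xs), ranks(ys));
}

// Noise: median absolute deviation of consecutive score diffs.
function noiseMAD(scores) {
  const median = (a) => {
    const s = [...a].sort((x, y) => x - y);
    const m = Math.floor(s.length / 2);
    return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
  };
  const diffs = scores.slice(1).map((v, i) => v - scores[i]);
  const m = median(diffs);
  return median(diffs.map((d) => Math.abs(d - m)));
}
```

A benchmark whose score rises at every checkpoint scores a perfect monotonicity of 1, and a run improving at a constant rate has zero noise; real curves land somewhere in between, which is what the color thresholds above discretize.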

## Configuration

Model colors in `config.yaml`:

```yaml
model_colors:
  "Qwen3 1.7B": "#9575CD"
  "Gemma 3 4B": "#00B0FF"
```

## Local Development

```bash
python3 -m http.server 8080
```

OAuth sign-in is only available when the app is deployed as an HF Space; when running locally, the sign-in button is hidden.

## Deployment

```bash
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```

## Project Structure

```
index.html    # Single-file web app (HTML + CSS + JS)
config.yaml   # Model color overrides
README.md     # HF Spaces metadata + docs
```