---
title: ellamind base-eval
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
- read-repos
- gated-repos
---

# ellamind base-eval
|
|
Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM – no backend required.
|
|
| - **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref) |
| - **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval) |
|
|
| ## Features |
|
|
- **Hierarchical task selection** – eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics** – `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison** – toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type** – line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
- **Multi-panel layout** – add multiple independent panels
- **Merge datasets** – append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing** – configurable moving average for line charts
- **Benchmark goodness metrics** – per-task quality indicators below line charts
- **Export** – download charts as PNG or SVG
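The smoothing feature is a moving average over chart points. A minimal sketch of the idea (the app's exact windowing and edge handling may differ):

```javascript
// Trailing moving average: each point becomes the mean of the last
// `window` values (fewer at the start of the series).
// Illustrative sketch only; not necessarily the app's implementation.
function movingAverage(values, window) {
  return values.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = values.slice(start, i + 1);
    return slice.reduce((a, b) => a + b, 0) / slice.length;
  });
}
```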
|
|
| ## Merge Datasets |
|
|
| You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset. |
|
|
For **private datasets**, sign in with your HuggingFace account using the OAuth button; the resulting access token is then attached automatically when fetching dataset files.
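Conceptually, a merge input maps to a Hugging Face `resolve` URL for a parquet file. A sketch of that mapping (the default file name here is an illustrative assumption; the app may instead discover parquet files in the repo):

```javascript
// Resolve "org/dataset" or "org/dataset/custom.parquet" to a
// Hugging Face dataset file URL. `defaultFile` is a hypothetical
// fallback name, not necessarily what the app uses.
function resolveParquetUrl(path, defaultFile = "data.parquet") {
  const parts = path.split("/");
  const repo = parts.slice(0, 2).join("/");
  const file = parts.length > 2 ? parts.slice(2).join("/") : defaultFile;
  return `https://huggingface.co/datasets/${repo}/resolve/main/${file}`;
}

// For private datasets, the OAuth token would go in the request header:
// fetch(url, { headers: { Authorization: `Bearer ${token}` } })
```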
|
|
| ## Benchmark Goodness Metrics |
|
|
| Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half). |
|
|
| | Metric | What it measures | Green | Yellow | Red | |
| |---|---|---|---|---| |
| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| **Noise** | Median absolute deviation (MAD) of consecutive score diffs (robust to data-mix jumps) | – | – | – |
| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| **Discrimination** | Std. dev. of scores across models at the last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
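
The Monotonicity indicator is a Spearman rank correlation between checkpoint steps and scores. A self-contained sketch (ties receive average ranks; the app's client-side computation may differ in detail):

```javascript
// Assign ranks 1..n, averaging ranks within tie groups.
function ranks(xs) {
  const sorted = xs.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const r = new Array(xs.length);
  for (let i = 0; i < sorted.length; ) {
    let j = i;
    while (j + 1 < sorted.length && sorted[j + 1][0] === sorted[i][0]) j++;
    const avg = (i + j) / 2 + 1; // average rank of the tie group
    for (let k = i; k <= j; k++) r[sorted[k][1]] = avg;
    i = j + 1;
  }
  return r;
}

// Spearman correlation = Pearson correlation of the rank vectors.
function spearman(x, y) {
  const rx = ranks(x), ry = ranks(y);
  const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(rx), my = mean(ry);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < x.length; i++) {
    num += (rx[i] - mx) * (ry[i] - my);
    dx += (rx[i] - mx) ** 2;
    dy += (ry[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}
```

A strictly improving benchmark yields +1, a strictly degrading one −1, which is why ≥ 0.7 marks a benchmark whose scores track training progress well.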
|
|
| ## Configuration |
|
|
| Model colors in `config.yaml`: |
|
|
| ```yaml |
| model_colors: |
| "Qwen3 1.7B": "#9575CD" |
| "Gemma 3 4B": "#00B0FF" |
| ``` |
|
|
| ## Local Development |
|
|
```bash
# Serve index.html from the repo root
python3 -m http.server 8080
```
|
|
OAuth sign-in is only available when deployed as an HF Space; when running locally, the sign-in button is hidden.
|
|
| ## Deployment |
|
|
| ```bash |
| pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space |
| pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space |
| pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space |
| ``` |
|
|
| ## Project Structure |
|
|
| ``` |
| index.html # Single-file web app (HTML + CSS + JS) |
| config.yaml # Model color overrides |
| README.md # HF Spaces metadata + docs |
| ``` |
|
|