---
title: ellamind base-eval
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
  - read-repos
  - gated-repos
---

# ellamind base-eval

Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM — no backend required.

- **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref)
- **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval)

## Features

- **Hierarchical task selection** — eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics** — `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison** — toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type** — line charts for training runs (tokens trained on the x-axis), bar charts for single-point comparisons
- **Multi-panel layout** — add multiple independent panels
- **Merge datasets** — append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing** — configurable moving average for line charts
- **Benchmark goodness metrics** — per-task quality indicators below line charts
- **Export** — download charts as PNG or SVG

## Merge Datasets

You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset.

For **private datasets**, sign in with your HuggingFace account using the OAuth button. The access token is then used automatically when fetching.

## Benchmark Goodness Metrics

Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology.
Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half).

| Metric | What it measures | Green | Yellow | Red |
|---|---|---|---|---|
| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| **Noise** | MAD of consecutive score diffs (robust to data-mix jumps) | — | — | — |
| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| **Discrimination** | Std of scores across models at the last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |

## Configuration

Model colors can be overridden in `config.yaml`:

```yaml
model_colors:
  "Qwen3 1.7B": "#9575CD"
  "Gemma 3 4B": "#00B0FF"
```

## Local Development

```bash
python3 -m http.server 8080
```

OAuth sign-in is only available when deployed as an HF Space; when running locally, the sign-in button is hidden.

## Deployment

```bash
pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space
pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space
pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space
```

## Project Structure

```
index.html    # Single-file web app (HTML + CSS + JS)
config.yaml   # Model color overrides
README.md     # HF Spaces metadata + docs
```
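## Appendix: Metric Sketches

For reference, the goodness metrics above can be sketched in plain JavaScript. This is a minimal illustration of the definitions in the table, not the app's actual implementation; function names and the example data are made up for this sketch.

```javascript
// Average-ranked ranks (1-based), ties share their mean rank.
function ranks(xs) {
  const idx = xs.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const r = new Array(xs.length);
  let i = 0;
  while (i < idx.length) {
    let j = i;
    while (j + 1 < idx.length && idx[j + 1][0] === idx[i][0]) j++;
    const avg = (i + j + 2) / 2; // mean of 1-based positions i+1..j+1
    for (let k = i; k <= j; k++) r[idx[k][1]] = avg;
    i = j + 1;
  }
  return r;
}

// Monotonicity: Spearman correlation = Pearson correlation of the ranks.
function spearman(steps, scores) {
  const rx = ranks(steps), ry = ranks(scores);
  const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(rx), my = mean(ry);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < rx.length; i++) {
    num += (rx[i] - mx) * (ry[i] - my);
    dx += (rx[i] - mx) ** 2;
    dy += (ry[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Signal strength: relative improvement over the initial score.
function signalStrength(scores) {
  return (scores[scores.length - 1] - scores[0]) / Math.abs(scores[0]);
}

// Noise: median absolute deviation (MAD) of consecutive score diffs,
// which is robust to a few large jumps (e.g. data-mix changes).
function noiseMAD(scores) {
  const median = (a) => {
    const s = [...a].sort((x, y) => x - y);
    const m = Math.floor(s.length / 2);
    return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
  };
  const diffs = scores.slice(1).map((v, i) => v - scores[i]);
  const med = median(diffs);
  return median(diffs.map((d) => Math.abs(d - med)));
}

// Example: a noisy but improving benchmark curve.
const steps = [1000, 2000, 3000, 4000, 5000];
const scores = [0.25, 0.31, 0.30, 0.38, 0.42];
console.log(spearman(steps, scores).toFixed(2));   // prints 0.90 ("green")
console.log(signalStrength(scores).toFixed(2));    // prints 0.68 ("green")
console.log(noiseMAD(scores).toFixed(3));          // prints 0.020
```

The **Early**/**Late** split in the table amounts to running the same functions on the first and second half of each series.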