---
title: ellamind base-eval
emoji: π
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 20160
hf_oauth_scopes:
- read-repos
- gated-repos
---

# ellamind base-eval
|
|
Interactive visualization for LLM evaluation scores during pre-training. Data is loaded from HuggingFace datasets via DuckDB-WASM – no backend required.
|
|
| - **Default dataset**: [ellamind/eval-scores-ref](https://huggingface.co/datasets/ellamind/eval-scores-ref) |
| - **GitHub**: [ellamind/base-eval](https://github.com/ellamind/base-eval) |
|
|
| ## Features |
|
|
- **Hierarchical task selection** – eval suite → task group → individual benchmark, with aggregate views
- **Multiple metrics** – `acc`, `acc_norm`, `bits_per_byte`, `exact_match`, `pass@1`, etc.
- **Model comparison** – toggle models on/off; separate checkpoint runs from baselines
- **Auto chart type** – line charts for training runs (tokens trained on x-axis), bar charts for single-point comparisons
- **Multi-panel layout** – add multiple independent panels
- **Merge datasets** – append rows from additional HF datasets (including private ones via OAuth)
- **Smoothing** – configurable moving average for line charts
- **Benchmark goodness metrics** – per-task quality indicators below line charts
- **Export** – download charts as PNG or SVG
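The smoothing feature is a moving average over chart points. A minimal sketch of the idea (the app's exact windowing and edge handling may differ):

```javascript
// Trailing moving average: each point becomes the mean of the last
// `window` values (fewer at the start of the series).
// Illustrative sketch only; not necessarily the app's implementation.
function movingAverage(values, window) {
  return values.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = values.slice(start, i + 1);
    return slice.reduce((a, b) => a + b, 0) / slice.length;
  });
}
```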
|
|
| ## Merge Datasets |
|
|
| You can merge additional HF datasets into the visualization at runtime. Enter a dataset path (e.g. `org/dataset-name` or `org/dataset-name/custom.parquet`) and click **Merge Dataset**. The additional data is row-appended to the base dataset. |
|
|
For **private datasets**, sign in with your HuggingFace account using the OAuth button; the resulting access token is then attached automatically when fetching dataset files.
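Conceptually, a merge input maps to a Hugging Face `resolve` URL for a parquet file. A sketch of that mapping (the default file name here is an illustrative assumption; the app may instead discover parquet files in the repo):

```javascript
// Resolve "org/dataset" or "org/dataset/custom.parquet" to a
// Hugging Face dataset file URL. `defaultFile` is a hypothetical
// fallback name, not necessarily what the app uses.
function resolveParquetUrl(path, defaultFile = "data.parquet") {
  const parts = path.split("/");
  const repo = parts.slice(0, 2).join("/");
  const file = parts.length > 2 ? parts.slice(2).join("/") : defaultFile;
  return `https://huggingface.co/datasets/${repo}/resolve/main/${file}`;
}

// For private datasets, the OAuth token would go in the request header:
// fetch(url, { headers: { Authorization: `Bearer ${token}` } })
```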
|
|
| ## Benchmark Goodness Metrics |
|
|
| Line charts display quality indicators below the plot, inspired by the [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) methodology. Metrics are computed client-side across three stages: **Overall**, **Early** (first half of training), and **Late** (second half). |
|
|
| | Metric | What it measures | Green | Yellow | Red | |
| |---|---|---|---|---| |
| **Monotonicity** | Spearman correlation between steps and score | ≥ 0.7 | 0.4–0.7 | < 0.4 |
| **Signal Strength** | Relative improvement over initial performance | ≥ 0.10 | 0.03–0.10 | < 0.03 |
| **Noise** | Median absolute deviation (MAD) of consecutive score diffs (robust to data-mix jumps) | – | – | – |
| **Ordering** | Kendall's Tau of model rankings between steps | ≥ 0.6 | 0.3–0.6 | < 0.3 |
| **Discrimination** | Std. dev. of scores across models at the last checkpoint | ≥ 0.03 | 0.01–0.03 | < 0.01 |
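
The Monotonicity indicator is a Spearman rank correlation between checkpoint steps and scores. A self-contained sketch (ties receive average ranks; the app's client-side computation may differ in detail):

```javascript
// Assign ranks 1..n, averaging ranks within tie groups.
function ranks(xs) {
  const sorted = xs.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
  const r = new Array(xs.length);
  for (let i = 0; i < sorted.length; ) {
    let j = i;
    while (j + 1 < sorted.length && sorted[j + 1][0] === sorted[i][0]) j++;
    const avg = (i + j) / 2 + 1; // average rank of the tie group
    for (let k = i; k <= j; k++) r[sorted[k][1]] = avg;
    i = j + 1;
  }
  return r;
}

// Spearman correlation = Pearson correlation of the rank vectors.
function spearman(x, y) {
  const rx = ranks(x), ry = ranks(y);
  const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(rx), my = mean(ry);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < x.length; i++) {
    num += (rx[i] - mx) * (ry[i] - my);
    dx += (rx[i] - mx) ** 2;
    dy += (ry[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}
```

A strictly improving benchmark yields +1, a strictly degrading one −1, which is why ≥ 0.7 marks a benchmark whose scores track training progress well.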
|
|
| ## Configuration |
|
|
| Model colors in `config.yaml`: |
|
|
| ```yaml |
| model_colors: |
| "Qwen3 1.7B": "#9575CD" |
| "Gemma 3 4B": "#00B0FF" |
| ``` |
|
|
| ## Local Development |
|
|
```bash
# Serve index.html from the repo root
python3 -m http.server 8080
```
|
|
OAuth sign-in is only available when deployed as an HF Space; when running locally, the sign-in button is hidden.
|
|
| ## Deployment |
|
|
| ```bash |
| pixi run -- hf upload ellamind/base-eval index.html index.html --repo-type space |
| pixi run -- hf upload ellamind/base-eval config.yaml config.yaml --repo-type space |
| pixi run -- hf upload ellamind/base-eval README.md README.md --repo-type space |
| ``` |
|
|
| ## Project Structure |
|
|
| ``` |
| index.html # Single-file web app (HTML + CSS + JS) |
| config.yaml # Model color overrides |
| README.md # HF Spaces metadata + docs |
| ``` |
|
|