iclr2026-realign-challenge / docs /evaluation_contract.md

siddsuresh97

Initial commit: ICLR 2026 Representational Alignment Challenge

d6c8a4f 2 months ago


	# Evaluation Contract (Draft)

	This document defines how submissions are evaluated using real model forward passes and CKA.
	The embeddings from those forward passes replace the dummy embeddings in `src/hackathon/data.py`.

	## Scope

	- Applies to Blue Team (model selection) and Red Team (stimulus selection) submissions.
	- All scoring uses real model forward passes to compute embeddings, then linear CKA.

	## Entities

	### Stimulus

	A stimulus is identified by:
	- dataset_name: canonical dataset id (e.g., cifar100, imagenet1k)
	- image_identifier: path relative to dataset root (e.g., val/n01440764/ILSVRC2012_val_00000964.JPEG)

	### Model

	A model is identified by:
	- model_name: unique registry key (e.g., resnet50, clip_vit_b32)

	## Model Registry Spec (planned location: configs/model_registry.json)

	Each entry defines how to load a model and extract embeddings.

	Required fields:
	- model_name: string, unique
	- source: string (torchvision, timm, open_clip, custom)
	- weights: string or null (pretrained identifier)
	- layer: string module path or alias (e.g., fc, classifier.4, visual)
	- embedding: string strategy (pool, cls, flatten, mean)
	- input_size: [height, width]
	- preprocess: {mean: [...], std: [...], resize: int, crop: int}
	- output_dim: int (expected embedding dimension)

	Optional fields:
	- model_parameters: object for model constructor
	- forward_args: object for forward call
	- notes: string

	Example:
	```json
	{
	  "model_name": "resnet50",
	  "source": "torchvision",
	  "weights": "IMAGENET1K_V2",
	  "layer": "avgpool",
	  "embedding": "flatten",
	  "input_size": [224, 224],
	  "preprocess": {
	    "mean": [0.485, 0.456, 0.406],
	    "std": [0.229, 0.224, 0.225],
	    "resize": 256,
	    "crop": 224
	  },
	  "output_dim": 2048
	}
	```
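The field list above can be checked mechanically. The following is a minimal sketch of such a check, not the project's validator: the function name `check_registry_entry` and the error messages are hypothetical, and the allowed values for `source` and `embedding` are taken from the parenthesized lists above on the assumption that those lists are exhaustive.

```python
# Required fields and accepted JSON types, copied from the list above.
REQUIRED_FIELDS = {
    "model_name": str,
    "source": str,
    "weights": (str, type(None)),
    "layer": str,
    "embedding": str,
    "input_size": list,
    "preprocess": dict,
    "output_dim": int,
}
# Assumption: the values listed in the spec are the only ones accepted.
ALLOWED_SOURCES = {"torchvision", "timm", "open_clip", "custom"}
ALLOWED_EMBEDDINGS = {"pool", "cls", "flatten", "mean"}


def check_registry_entry(entry):
    """Return a list of problems; an empty list means the entry passed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for field: {field}")
    source = entry.get("source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source}")
    embedding = entry.get("embedding")
    if isinstance(embedding, str) and embedding not in ALLOWED_EMBEDDINGS:
        problems.append(f"unknown embedding strategy: {embedding}")
    size = entry.get("input_size")
    if isinstance(size, list) and len(size) != 2:
        problems.append("input_size must be [height, width]")
    return problems


# A deliberately incomplete entry, to show what the check reports.
incomplete = {"model_name": "resnet50", "source": "torchvision", "weights": None}
for problem in check_registry_entry(incomplete):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a registry author fix every issue in a single pass.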

	## Stimuli Catalog Spec (planned location: configs/stimuli_catalog.jsonl)

	Each line is one stimulus with:
	- dataset_name
	- image_identifier

	Example lines:
	```json
	{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}
	{"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}
	```
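Reading a catalog in this format could look like the sketch below. The function name `load_stimuli_catalog` is hypothetical, and rejecting duplicate lines is an assumption; the contract only requires uniqueness for submissions, not for the catalog itself.

```python
import json


def load_stimuli_catalog(path):
    """Read a JSONL catalog into a list of (dataset_name, image_identifier) tuples."""
    stimuli = []
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            key = (record["dataset_name"], record["image_identifier"])
            if key in seen:
                # Assumption: a catalog should not list the same stimulus twice.
                raise ValueError(f"duplicate stimulus on line {line_number}: {key}")
            seen.add(key)
            stimuli.append(key)
    return stimuli
```

Using the `(dataset_name, image_identifier)` pair as the key matches how a stimulus is identified in the Entities section, and makes existence checks a simple set lookup.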

	## Submission Contract

	### Blue Team

	- `models`: list of model_name strings.
	- Each model_name must exist in the model registry.
	- Minimum 2 models; no duplicates.

	### Red Team

	- `differentiating_images`: list of stimulus objects.
	- Each stimulus must exist in the stimuli catalog.
	- Minimum 2 stimuli; no duplicates.
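The Blue Team and Red Team rules above share the same three checks: minimum count, uniqueness, and existence. A minimal sketch follows; the function names and error strings are hypothetical, and this is not the code in `scripts/validate_submission.py`.

```python
def _check_items(items, known, label):
    """Apply the shared rules: list type, at least 2 entries, no duplicates, all known."""
    if not isinstance(items, list):
        return [f"{label}: expected a list"]
    errors = []
    if len(items) < 2:
        errors.append(f"{label}: need at least 2 entries, got {len(items)}")
    if len(set(items)) != len(items):
        errors.append(f"{label}: duplicate entries")
    unknown = [item for item in items if item not in known]
    if unknown:
        errors.append(f"{label}: not found: {unknown}")
    return errors


def validate_blue(submission, registry_names):
    """Check a Blue Team submission against the set of registry model names."""
    models = submission.get("models")
    if isinstance(models, list) and not all(isinstance(m, str) for m in models):
        return ["models: every entry must be a string"]
    return _check_items(models, registry_names, "models")


def validate_red(submission, catalog_keys):
    """Check a Red Team submission against a set of (dataset_name, image_identifier) keys."""
    stimuli = submission.get("differentiating_images")
    if not isinstance(stimuli, list):
        return ["differentiating_images: expected a list"]
    keys = []
    for stimulus in stimuli:
        if not (
            isinstance(stimulus, dict)
            and isinstance(stimulus.get("dataset_name"), str)
            and isinstance(stimulus.get("image_identifier"), str)
        ):
            return ["differentiating_images: each entry needs dataset_name and image_identifier"]
        keys.append((stimulus["dataset_name"], stimulus["image_identifier"]))
    return _check_items(keys, catalog_keys, "differentiating_images")
```

For example, `validate_blue({"models": ["resnet50"]}, {"resnet50", "clip_vit_b32"})` reports that at least two models are required.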

	## Evaluation Procedure

	### Blue Team scoring

	1. Load the stimuli catalog (full evaluation set).
	2. For each submitted model, run a forward pass on all stimuli and extract embeddings.
	3. Compute the mean pairwise linear CKA across the submitted models.

	### Red Team scoring

	1. Load the model registry (full evaluation model set).
	2. For each model, run a forward pass on the submitted stimuli and extract embeddings.
	3. Compute the mean pairwise linear CKA across all models; the score is 1 - mean CKA.
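The final step of both procedures can be sketched in NumPy as below. This is an illustration under stated assumptions, not the project's scoring code: `linear_cka` is a stand-in for whatever `src/cka/compute.py` provides, all function names are hypothetical, and treating the mean CKA itself as the Blue Team score is an assumption, since the contract only states the Red Team formula explicitly.

```python
import itertools

import numpy as np


def linear_cka(x, y):
    """Linear CKA between two [num_samples, dim] arrays over the same samples."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center each feature column so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_pairwise_cka(embeddings):
    """Average linear CKA over every unordered pair of models.

    `embeddings` maps model_name -> [num_samples, dim] array; rows must refer
    to the same stimuli in the same order for every model.
    """
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    if not pairs:
        raise ValueError("need at least 2 models to form a pair")
    return float(np.mean([linear_cka(embeddings[a], embeddings[b]) for a, b in pairs]))


def blue_team_score(embeddings):
    # Assumption: the Blue Team score is the mean pairwise CKA itself.
    return mean_pairwise_cka(embeddings)


def red_team_score(embeddings):
    # From the contract: score = 1 - mean pairwise CKA.
    return 1.0 - mean_pairwise_cka(embeddings)
```

Embedding dimensions may differ between models, because linear CKA only requires that both arrays have the same number of rows.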

	## Embedding Extraction Requirements

	- Use `model.eval()` and `torch.no_grad()` for all forward passes.
	- Use deterministic settings (fixed seed, dropout disabled).
	- Embeddings must be 2D arrays shaped [num_samples, dim].
	- If a layer produces spatial features, apply the registry's embedding strategy (e.g., global average pool, then flatten).
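Reducing layer outputs to the required 2D shape might look like the sketch below. The contract names the strategies (`pool`, `cls`, `flatten`, `mean`) but does not define them, so the semantics here are assumptions: `pool` averages over the spatial axes of a [N, C, H, W] array, `mean` averages over the token axis of a [N, T, D] array, `cls` takes the first token, and `flatten` reshapes everything after the batch axis. The function name `to_embedding` is hypothetical, and NumPy arrays stand in for framework tensors.

```python
import numpy as np


def to_embedding(features, strategy):
    """Reduce a layer output to a 2D [num_samples, dim] array."""
    features = np.asarray(features)
    if features.ndim < 2:
        raise ValueError("expected a batch dimension plus at least one feature axis")
    if features.ndim == 2:
        return features  # already [num_samples, dim]
    if strategy == "flatten":
        return features.reshape(features.shape[0], -1)
    if strategy == "pool" and features.ndim == 4:
        # Assumed meaning: global average pool over H and W of [N, C, H, W].
        return features.mean(axis=(2, 3))
    if strategy == "mean" and features.ndim == 3:
        # Assumed meaning: average over the token axis of [N, T, D].
        return features.mean(axis=1)
    if strategy == "cls" and features.ndim == 3:
        # Assumed meaning: the first token of [N, T, D] is the class token.
        return features[:, 0, :]
    raise ValueError(f"strategy {strategy!r} does not apply to shape {features.shape}")
```

Raising on a strategy and shape mismatch, rather than silently flattening, makes a wrong registry entry visible before any scores are computed.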

	## CKA Definition

	- Use the linear CKA implementation in `src/cka/compute.py` (biased HSIC estimator by default).
	- Arrays are converted to float64 before CKA is computed.
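For reference, the standard formulation of linear CKA with the biased HSIC estimator is written out below. It assumes `src/cka/compute.py` follows this standard definition; that file is not reproduced here, so treat the symbols as illustrative. $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ are the embeddings of the same $n$ stimuli from two models, and $H$ is the centering matrix.

```latex
\begin{align}
H &= I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \qquad K = X X^\top, \qquad L = Y Y^\top \\
\mathrm{HSIC}_b(K, L) &= \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H) \\
\mathrm{CKA}(X, Y) &= \frac{\mathrm{HSIC}_b(K, L)}{\sqrt{\mathrm{HSIC}_b(K, K)\,\mathrm{HSIC}_b(L, L)}}
 = \frac{\lVert (HY)^\top (HX) \rVert_F^2}{\lVert (HX)^\top (HX) \rVert_F \,\lVert (HY)^\top (HY) \rVert_F}
\end{align}
```

The normalization factor $(n-1)^{-2}$ cancels in the ratio, so the feature-space form on the right needs only the column-centered matrices $HX$ and $HY$.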

	## Storage and Paths

	- Dataset roots come from env vars (see `AGENTS.md` path hygiene).
	- Cache embeddings keyed by model, layer, and dataset version (on a Modal volume).
	- Durable logs and final scores go to `/orcd/data/...`.
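The first two bullets can be illustrated with the sketch below. Everything named here is hypothetical: the actual environment variable names are defined in `AGENTS.md`, which is not shown, so `HACKATHON_DATASET_ROOT_<NAME>` is an invented naming scheme, and the cache file name format is likewise only one possible choice.

```python
import os
from pathlib import Path


def resolve_stimulus_path(dataset_name, image_identifier, environ=None):
    """Join a stimulus identifier onto its dataset root taken from an env var."""
    env = os.environ if environ is None else environ
    # Hypothetical naming scheme; the real variable names live in AGENTS.md.
    variable = f"HACKATHON_DATASET_ROOT_{dataset_name.upper()}"
    root = env.get(variable)
    if not root:
        raise KeyError(f"dataset root not configured: set {variable}")
    root_path = Path(root).resolve()
    candidate = (root_path / image_identifier).resolve()
    # Refuse identifiers such as "../secret.png" that escape the dataset root.
    if root_path not in candidate.parents:
        raise ValueError(f"{image_identifier!r} resolves outside {root_path}")
    return candidate


def embedding_cache_name(model_name, layer, dataset_name, dataset_version):
    """Build a cache file name keyed by model, layer, and dataset version."""
    # Hypothetical format; any scheme that includes all four parts would work.
    return f"{model_name}__{layer}__{dataset_name}__{dataset_version}.npy"
```

Including the layer and dataset version in the cache name means a changed registry entry or a refreshed dataset cannot silently reuse stale embeddings.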

	## Validation Rules (for future validator)

	- JSON schema checks for required fields.
	- Model name and stimulus existence checks.
	- Minimum counts and uniqueness.
	- Dataset path resolution errors are surfaced as submission failures.

	## Validation Script

	- `scripts/validate_submission.py` validates JSON submissions.
	- Optional environment variables: `HACKATHON_MODEL_REGISTRY`, `HACKATHON_STIMULI_CATALOG`.

	## Modal Scoring (optional)

	- Set `HACKATHON_MODAL_ENABLE=true` to route scoring through Modal.
	- Requires `HACKATHON_MODEL_REGISTRY` and `HACKATHON_STIMULI_CATALOG`.
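The gating described above could be expressed as in the sketch below. The function name `scoring_backend` and the `"local"` and `"modal"` return labels are hypothetical, and accepting `true` case-insensitively is an assumption; only the three environment variable names come from this contract.

```python
import os

REQUIRED_FOR_MODAL = ("HACKATHON_MODEL_REGISTRY", "HACKATHON_STIMULI_CATALOG")


def scoring_backend(environ=None):
    """Return "modal" when Modal scoring is enabled and configured, else "local"."""
    env = os.environ if environ is None else environ
    # Assumption: any capitalization of "true" enables Modal scoring.
    enabled = env.get("HACKATHON_MODAL_ENABLE", "").strip().lower() == "true"
    if not enabled:
        return "local"
    missing = [name for name in REQUIRED_FOR_MODAL if not env.get(name)]
    if missing:
        raise RuntimeError("Modal scoring is enabled but these are unset: " + ", ".join(missing))
    return "modal"
```

Failing fast when the registry or catalog path is missing avoids starting a remote job that cannot load its inputs.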

	## Versioning

	- This contract should include a `contract_version` when enforced in code.