iclr2026-realign-challenge / docs /evaluation_contract.md

siddsuresh97

Initial commit: ICLR 2026 Representational Alignment Challenge

d6c8a4f 2 months ago

Evaluation Contract (Draft)

This document defines how submissions are evaluated: embeddings are computed with real model forward passes and compared using linear CKA (centered kernel alignment). This pipeline replaces the dummy embeddings in src/hackathon/data.py.

Scope

Applies to Blue Team (model selection) and Red Team (stimulus selection) submissions.
All scoring uses real model forward passes to compute embeddings, then linear CKA.

Entities

Stimulus

A stimulus is identified by:

dataset_name: canonical dataset id (e.g., cifar100, imagenet1k)
image_identifier: path relative to dataset root (e.g., val/n01440764/ILSVRC2012_val_00000964.JPEG)

Model

A model is identified by:

model_name: unique registry key (e.g., resnet50, clip_vit_b32)

Model Registry Spec (planned location: configs/model_registry.json)

Each entry defines how to load a model and extract embeddings.

Required fields:

model_name: string, unique
source: string (torchvision, timm, open_clip, custom)
weights: string or null (pretrained identifier)
layer: string module path or alias (e.g., fc, classifier.4, visual)
embedding: string strategy (pool, cls, flatten, mean)
input_size: [height, width]
preprocess: {mean: [...], std: [...], resize: int, crop: int}
output_dim: int (expected embedding dimension)

Optional fields:

model_parameters: object for model constructor
forward_args: object for forward call
notes: string

Example:

{
  "model_name": "resnet50",
  "source": "torchvision",
  "weights": "IMAGENET1K_V2",
  "layer": "avgpool",
  "embedding": "flatten",
  "input_size": [224, 224],
  "preprocess": {
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
    "resize": 256,
    "crop": 224
  },
  "output_dim": 2048
}
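
The required fields listed above can be checked mechanically before any model is loaded. The following is a minimal sketch, not the project's validator: `REQUIRED_FIELDS`, `ALLOWED_SOURCES`, `ALLOWED_EMBEDDINGS`, and `check_registry_entry` are hypothetical names, and the strictness of the type checks is an assumption.

```python
# Required fields and their expected Python types, following the spec above.
# `weights` may be a string or null (None once parsed from JSON).
REQUIRED_FIELDS = {
    "model_name": str,
    "source": str,
    "weights": (str, type(None)),
    "layer": str,
    "embedding": str,
    "input_size": list,
    "preprocess": dict,
    "output_dim": int,
}
ALLOWED_SOURCES = {"torchvision", "timm", "open_clip", "custom"}
ALLOWED_EMBEDDINGS = {"pool", "cls", "flatten", "mean"}


def check_registry_entry(entry):
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for {field}")

    source = entry.get("source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source}")

    embedding = entry.get("embedding")
    if isinstance(embedding, str) and embedding not in ALLOWED_EMBEDDINGS:
        problems.append(f"unknown embedding strategy: {embedding}")

    size = entry.get("input_size")
    if isinstance(size, list) and len(size) != 2:
        problems.append("input_size must be [height, width]")
    return problems


# Illustrative entry that deliberately omits output_dim.
incomplete = {
    "model_name": "resnet50",
    "source": "torchvision",
    "weights": "IMAGENET1K_V2",
    "layer": "avgpool",
    "embedding": "flatten",
    "input_size": [224, 224],
    "preprocess": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225],
                   "resize": 256, "crop": 224},
}
print(check_registry_entry(incomplete))
```

Returning a list of problems rather than raising on the first one lets a registry author fix every issue in a single pass.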

Stimuli Catalog Spec (planned location: configs/stimuli_catalog.jsonl)

Each line is one stimulus with:

dataset_name
image_identifier

Example lines:

{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}
{"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}
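
Because the catalog is JSON Lines, it can be read one record at a time. The sketch below parses catalog lines into `(dataset_name, image_identifier)` keys; `load_stimuli_catalog` is a hypothetical helper name, and skipping blank lines is an assumption about how lenient the loader should be.

```python
import json


def load_stimuli_catalog(lines):
    """Parse JSONL lines into a list of (dataset_name, image_identifier) keys.

    `lines` can be any iterable of strings, including an open text file.
    """
    keys = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumed: blank lines are tolerated
        record = json.loads(line)
        try:
            key = (record["dataset_name"], record["image_identifier"])
        except KeyError as exc:
            raise ValueError(f"line {line_number}: missing {exc.args[0]}") from exc
        keys.append(key)
    return keys


example_lines = [
    '{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}',
    '{"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}',
]
catalog = load_stimuli_catalog(example_lines)
print(len(catalog), catalog[0])
```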

Submission Contract

Blue Team

models: list of model_name strings.
Each model_name must exist in the model registry.
Minimum 2 models; no duplicates.

Red Team

differentiating_images: list of stimulus objects.
Each stimulus must exist in the stimuli catalog.
Minimum 2 stimuli; no duplicates.
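
Both teams share the same three rules: minimum count, uniqueness, and existence in the reference set. The sketch below shows one way to express them. It assumes submissions arrive as parsed JSON objects keyed by `models` and `differentiating_images`; the function names are hypothetical and do not describe what scripts/validate_submission.py actually does.

```python
def _check_items(items, known, label):
    """Apply the shared rules: minimum count, uniqueness, existence."""
    errors = []
    if len(items) < 2:
        errors.append(f"need at least 2 {label}, got {len(items)}")
    if len(set(items)) != len(items):
        errors.append(f"duplicate {label}")
    missing = sorted(set(items) - set(known))
    if missing:
        errors.append(f"unknown {label}: {missing}")
    return errors


def validate_blue(submission, registry_names):
    """Check a Blue Team submission against the registered model names."""
    models = submission.get("models")
    if not isinstance(models, list) or not all(isinstance(m, str) for m in models):
        return ["models must be a list of model_name strings"]
    return _check_items(models, registry_names, "models")


def validate_red(submission, catalog_keys):
    """Check a Red Team submission against (dataset_name, image_identifier) keys."""
    stimuli = submission.get("differentiating_images")
    if not isinstance(stimuli, list):
        return ["differentiating_images must be a list of stimulus objects"]
    keys = []
    for stimulus in stimuli:
        if not isinstance(stimulus, dict):
            return ["each stimulus must be an object"]
        dataset = stimulus.get("dataset_name")
        image = stimulus.get("image_identifier")
        if not isinstance(dataset, str) or not isinstance(image, str):
            return ["each stimulus needs string dataset_name and image_identifier"]
        keys.append((dataset, image))
    return _check_items(keys, catalog_keys, "stimuli")


registry_names = {"resnet50", "clip_vit_b32"}
print(validate_blue({"models": ["resnet50", "clip_vit_b32"]}, registry_names))
print(validate_blue({"models": ["resnet50", "resnet50"]}, registry_names))
```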

Evaluation Procedure

Blue Team scoring

Load the stimuli catalog (full evaluation set).
For each submitted model, run forward pass on all stimuli and extract embeddings.
Compute mean pairwise linear CKA across submitted models.

Red Team scoring

Load the model registry (full evaluation model set).
For each model, run forward pass on submitted stimuli and extract embeddings.
Compute mean pairwise linear CKA across all models; the score is 1 - (mean pairwise CKA), so stimuli on which the models disagree score higher.
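
Both scoring procedures reduce to the same computation once embeddings are available. The sketch below uses NumPy and makes three assumptions: "pairwise" means all unordered pairs of distinct models, the Blue Team score is the mean itself, and `linear_cka` here is a stand-in for whatever src/cka/compute.py provides. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def linear_cka(x, y):
    """Linear CKA between two [num_samples, dim] arrays for the same stimuli."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center every feature column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_pairwise_cka(embeddings):
    """Average linear CKA over all unordered pairs of models.

    `embeddings` maps model_name -> [num_samples, dim] array.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    if not pairs:
        raise ValueError("need embeddings from at least two models")
    total = sum(linear_cka(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


def blue_score(embeddings):
    """Assumed Blue Team score: mean pairwise CKA over submitted models."""
    return mean_pairwise_cka(embeddings)


def red_score(embeddings):
    """Red Team score: 1 - mean pairwise CKA over all registry models."""
    return 1.0 - mean_pairwise_cka(embeddings)


# Toy data standing in for real embeddings of 20 stimuli.
rng = np.random.default_rng(0)
toy = {
    "model_a": rng.normal(size=(20, 5)),
    "model_b": rng.normal(size=(20, 7)),
    "model_c": rng.normal(size=(20, 3)),
}
print(blue_score(toy), red_score(toy))
```

Models may have different embedding dimensions; only the number of samples (rows) has to match across models.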

Embedding Extraction Requirements

Call model.eval() and run every forward pass under torch.no_grad().
Use deterministic settings: fix random seeds and keep dropout disabled (model.eval() does this for standard dropout layers).
Embeddings must be 2D arrays shaped [num_samples, dim].
If a layer produces spatial features, apply the registry's embedding strategy (e.g., global average pool then flatten).
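
The last two requirements can be illustrated with a small reduction step that turns a raw layer output into a 2D array. The contract names the strategies (pool, cls, flatten, mean) but does not define them, so the interpretations in this sketch are assumptions, as is the helper name `to_embedding`. It works on NumPy arrays, i.e. on features that have already been moved off the model.

```python
import numpy as np


def to_embedding(features, strategy):
    """Reduce a raw layer output to a 2D [num_samples, dim] array."""
    features = np.asarray(features)
    if strategy == "flatten":
        # Keep the batch axis, collapse everything else.
        out = features.reshape(features.shape[0], -1)
    elif strategy == "pool":
        # Assumed: [N, C, H, W] spatial features, global average over H and W.
        if features.ndim != 4:
            raise ValueError("pool expects [N, C, H, W] features")
        out = features.mean(axis=(2, 3))
    elif strategy == "cls":
        # Assumed: [N, tokens, D] transformer output with the class token first.
        if features.ndim != 3:
            raise ValueError("cls expects [N, tokens, D] features")
        out = features[:, 0, :]
    elif strategy == "mean":
        # Assumed: [N, tokens, D] output averaged over the token axis.
        if features.ndim != 3:
            raise ValueError("mean expects [N, tokens, D] features")
        out = features.mean(axis=1)
    else:
        raise ValueError(f"unknown embedding strategy: {strategy}")
    if out.ndim != 2:
        raise ValueError(f"expected a 2D embedding, got shape {out.shape}")
    return out


conv_features = np.zeros((4, 8, 3, 3))   # e.g. a convolutional feature map
token_features = np.zeros((4, 5, 16))    # e.g. transformer tokens
print(to_embedding(conv_features, "pool").shape)
print(to_embedding(token_features, "cls").shape)
```

Checking the result against the registry's output_dim at this point would catch a mismatched layer or strategy before any CKA is computed.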

CKA Definition

Use the linear CKA implementation in src/cka/compute.py (biased HSIC estimator by default).
Arrays are converted to float64 before CKA.
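
For reference, the standard form of linear CKA with the biased HSIC estimator is given below. This is the textbook definition; whether src/cka/compute.py matches it term for term is an assumption. Here X and Y are embedding matrices of shape [n, d_1] and [n, d_2] for the same n stimuli, H is the centering matrix, and X_c = HX, Y_c = HY are the column-centered embeddings.

```latex
\[
H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \qquad
K = XX^\top, \qquad L = YY^\top
\]
\[
\mathrm{HSIC}_b(K, L) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H)
\]
\[
\mathrm{CKA}(X, Y)
= \frac{\mathrm{HSIC}_b(K, L)}{\sqrt{\mathrm{HSIC}_b(K, K)\,\mathrm{HSIC}_b(L, L)}}
= \frac{\lVert Y_c^\top X_c \rVert_F^2}
       {\lVert X_c^\top X_c \rVert_F \,\lVert Y_c^\top Y_c \rVert_F}
\]
```

The (n-1)^2 factors cancel in the ratio, so the final expression depends only on the centered embeddings.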

Storage and Paths

Dataset roots come from env vars (see AGENTS.md path hygiene).
Cache embeddings per model/layer/dataset version (Modal volume).
Durable logs and final scores go to /orcd/data/....

Validation Rules (for future validator)

JSON schema checks for required fields.
Name and stimulus existence checks.
Minimum counts and uniqueness.
Dataset path resolution errors are surfaced as submission failures.

Validation Script

scripts/validate_submission.py validates JSON submissions.
Optional envs: HACKATHON_MODEL_REGISTRY, HACKATHON_STIMULI_CATALOG.

Modal Scoring (optional)

Set HACKATHON_MODAL_ENABLE=true to route scoring through Modal.
Requires HACKATHON_MODEL_REGISTRY and HACKATHON_STIMULI_CATALOG.
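
The environment variables above could be resolved in one place, as in the sketch below. Two points are assumptions rather than documented behavior: that unset paths fall back to the planned locations under configs/, and that HACKATHON_MODAL_ENABLE is compared case-insensitively against "true". The function name `resolve_scoring_config` is hypothetical.

```python
import os

# Planned locations from the registry and catalog specs (assumed as defaults).
DEFAULT_REGISTRY = "configs/model_registry.json"
DEFAULT_CATALOG = "configs/stimuli_catalog.jsonl"


def resolve_scoring_config(env=None):
    """Read scoring-related settings from environment variables."""
    env = os.environ if env is None else env
    use_modal = env.get("HACKATHON_MODAL_ENABLE", "").strip().lower() == "true"
    registry = env.get("HACKATHON_MODEL_REGISTRY")
    catalog = env.get("HACKATHON_STIMULI_CATALOG")
    if use_modal and not (registry and catalog):
        # Modal scoring requires both paths to be set explicitly.
        raise RuntimeError(
            "HACKATHON_MODAL_ENABLE=true requires HACKATHON_MODEL_REGISTRY "
            "and HACKATHON_STIMULI_CATALOG"
        )
    return {
        "use_modal": use_modal,
        "model_registry": registry or DEFAULT_REGISTRY,
        "stimuli_catalog": catalog or DEFAULT_CATALOG,
    }


# Passing a plain dict instead of os.environ keeps the example reproducible.
print(resolve_scoring_config({"HACKATHON_MODEL_REGISTRY": "my_registry.json"}))
```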

Versioning

Once this contract is enforced in code, it should include a contract_version so that scores can be traced to the rules they were computed under.