
Evaluation Contract (Draft)

This document defines how submissions are evaluated using real model forward passes and CKA. It replaces the dummy embeddings in src/hackathon/data.py.

Scope

  • Applies to Blue Team (model selection) and Red Team (stimulus selection) submissions.
  • All scoring uses real model forward passes to compute embeddings, then linear CKA.

Entities

Stimulus

A stimulus is identified by:

  • dataset_name: canonical dataset id (e.g., cifar100, imagenet1k)
  • image_identifier: path relative to dataset root (e.g., val/n01440764/ILSVRC2012_val_00000964.JPEG)

Model

A model is identified by:

  • model_name: unique registry key (e.g., resnet50, clip_vit_b32)

Model Registry Spec (planned location: configs/model_registry.json)

Each entry defines how to load a model and extract embeddings.

Required fields:

  • model_name: string, unique
  • source: string (torchvision, timm, open_clip, custom)
  • weights: string or null (pretrained identifier)
  • layer: string module path or alias (e.g., fc, classifier.4, visual)
  • embedding: string strategy (pool, cls, flatten, mean)
  • input_size: [height, width]
  • preprocess: {mean: [...], std: [...], resize: int, crop: int}
  • output_dim: int (expected embedding dimension)

Optional fields:

  • model_parameters: object for model constructor
  • forward_args: object for forward call
  • notes: string

Example:

{
  "model_name": "resnet50",
  "source": "torchvision",
  "weights": "IMAGENET1K_V2",
  "layer": "avgpool",
  "embedding": "flatten",
  "input_size": [224, 224],
  "preprocess": {
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
    "resize": 256,
    "crop": 224
  },
  "output_dim": 2048
}
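A registry entry like the one above can be checked mechanically before any model is loaded. The following is a minimal sketch, not the project's validator: the function name `check_registry_entry` is hypothetical, and treating the listed `source` and `embedding` values as closed enumerations is an assumption.

```python
import json

# Required fields and their expected JSON types, taken from the list above.
REQUIRED_FIELDS = {
    "model_name": str,
    "source": str,
    "weights": (str, type(None)),
    "layer": str,
    "embedding": str,
    "input_size": list,
    "preprocess": dict,
    "output_dim": int,
}
# Assumption: the values listed in the spec are the only ones allowed.
ALLOWED_SOURCES = {"torchvision", "timm", "open_clip", "custom"}
ALLOWED_EMBEDDINGS = {"pool", "cls", "flatten", "mean"}


def check_registry_entry(entry):
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing required field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for field: {field}")
    source = entry.get("source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source}")
    embedding = entry.get("embedding")
    if isinstance(embedding, str) and embedding not in ALLOWED_EMBEDDINGS:
        problems.append(f"unknown embedding strategy: {embedding}")
    size = entry.get("input_size")
    if isinstance(size, list) and len(size) != 2:
        problems.append("input_size must be [height, width]")
    return problems


example = json.loads("""
{
  "model_name": "resnet50",
  "source": "torchvision",
  "weights": "IMAGENET1K_V2",
  "layer": "avgpool",
  "embedding": "flatten",
  "input_size": [224, 224],
  "preprocess": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225],
                 "resize": 256, "crop": 224},
  "output_dim": 2048
}
""")
print(check_registry_entry(example))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in an entry at once.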

Stimuli Catalog Spec (planned location: configs/stimuli_catalog.jsonl)

Each line is one stimulus with:

  • dataset_name
  • image_identifier

Example lines:

{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}
{"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}
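Reading the catalog amounts to parsing one JSON object per line. The sketch below is illustrative only: `load_stimuli_catalog` is a hypothetical helper, and rejecting duplicate catalog rows is an assumption that this contract does not state.

```python
import io
import json


def load_stimuli_catalog(lines):
    """Parse JSONL lines into a list of (dataset_name, image_identifier) keys."""
    keys = []
    seen = set()
    for line_number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        record = json.loads(raw)
        key = (record["dataset_name"], record["image_identifier"])
        if key in seen:
            # Assumption: a stimulus may appear in the catalog only once.
            raise ValueError(f"duplicate stimulus on line {line_number}: {key}")
        seen.add(key)
        keys.append(key)
    return keys


# An in-memory stand-in for configs/stimuli_catalog.jsonl.
sample = io.StringIO(
    '{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}\n'
    '{"dataset_name": "imagenet1k", '
    '"image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}\n'
)
catalog = load_stimuli_catalog(sample)
print(len(catalog))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the `io.StringIO` stand-in.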

Submission Contract

Blue Team

  • models: list of model_name strings.
  • Each model_name must exist in the model registry.
  • Minimum 2 models; no duplicates.

Red Team

  • differentiating_images: list of stimulus objects.
  • Each stimulus must exist in the stimuli catalog.
  • Minimum 2 stimuli; no duplicates.

Evaluation Procedure

Blue Team scoring

  1. Load the stimuli catalog (full evaluation set).
  2. For each submitted model, run a forward pass on all stimuli and extract embeddings.
  3. Compute mean pairwise linear CKA across submitted models.

Red Team scoring

  1. Load the model registry (full evaluation model set).
  2. For each model, run a forward pass on the submitted stimuli and extract embeddings.
  3. Compute mean pairwise linear CKA across all models; the score is 1 - mean CKA.
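Both procedures reduce to the same aggregation once embeddings are available. The sketch below assumes embeddings arrive as a dict from `model_name` to a `[num_samples, dim]` NumPy array. The helper names are hypothetical, and the inline `linear_cka` is a stand-in written in the feature-space form, not the code in src/cka/compute.py.

```python
from itertools import combinations

import numpy as np


def linear_cka(x, y):
    """Linear CKA between two [num_samples, dim] arrays (feature-space form)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center each feature column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_pairwise_cka(embeddings):
    """Average linear CKA over all unordered pairs of models."""
    pairs = list(combinations(sorted(embeddings), 2))
    if not pairs:
        raise ValueError("need at least two models")
    values = [linear_cka(embeddings[a], embeddings[b]) for a, b in pairs]
    return float(np.mean(values))


def blue_score(embeddings):
    """Blue Team: mean pairwise CKA across the submitted models."""
    return mean_pairwise_cka(embeddings)


def red_score(embeddings):
    """Red Team: 1 minus mean pairwise CKA across all registry models."""
    return 1.0 - mean_pairwise_cka(embeddings)


rng = np.random.default_rng(0)
base = rng.normal(size=(20, 5))
rotation = np.linalg.qr(rng.normal(size=(5, 5)))[0]  # random orthogonal matrix
# Three toy "models" whose embeddings differ only by scaling or rotation.
toy = {"model_a": base, "model_b": 2.0 * base, "model_c": base @ rotation}
print(blue_score(toy), red_score(toy))
```

Linear CKA is invariant to isotropic scaling and orthogonal transforms, so the three toy models are treated as identical: the Blue Team score is 1 and the Red Team score is 0, up to floating-point error.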

Embedding Extraction Requirements

  • Call model.eval() and wrap all forward passes in torch.no_grad().
  • Use deterministic settings: fix random seeds and keep dropout disabled (model.eval() already does this for standard dropout layers).
  • Embeddings must be 2D arrays shaped [num_samples, dim].
  • If a layer produces spatial features, apply the registry's embedding strategy (e.g., global average pool then flatten).
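The shape requirement can be illustrated independently of any model. This contract names the embedding strategies but does not define them, so the semantics in this sketch are assumptions: `pool` averages over the spatial axes of `[N, C, H, W]` features, `cls` takes the first token of `[N, T, D]` features, `mean` averages over tokens, and `flatten` reshapes everything after the batch axis. The function name `to_embedding` is hypothetical.

```python
import numpy as np


def to_embedding(features, strategy):
    """Reduce raw layer output to a 2D [num_samples, dim] array."""
    features = np.asarray(features)
    if strategy == "flatten":
        out = features.reshape(features.shape[0], -1)
    elif strategy == "pool":
        if features.ndim != 4:
            raise ValueError("pool expects [N, C, H, W] features")
        out = features.mean(axis=(2, 3))  # global average pool
    elif strategy == "cls":
        if features.ndim != 3:
            raise ValueError("cls expects [N, T, D] features")
        out = features[:, 0, :]  # first (class) token
    elif strategy == "mean":
        if features.ndim != 3:
            raise ValueError("mean expects [N, T, D] features")
        out = features.mean(axis=1)  # average over tokens
    else:
        raise ValueError(f"unknown embedding strategy: {strategy}")
    if out.ndim != 2:
        raise ValueError("embedding must be [num_samples, dim]")
    return out


conv_features = np.ones((4, 8, 7, 7))    # e.g. a convolutional feature map
token_features = np.zeros((4, 5, 16))    # e.g. transformer token outputs
print(to_embedding(conv_features, "pool").shape)
print(to_embedding(token_features, "cls").shape)
```

In a real pipeline the `features` array would come from the registry's `layer` after a forward pass, and the resulting dimension would be compared against `output_dim`.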

CKA Definition

  • Use the linear CKA implementation in src/cka/compute.py (biased HSIC estimator by default).
  • Arrays are converted to float64 before CKA.
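For reference, the commonly used definition of linear CKA with the biased HSIC estimator is given below. It assumes that src/cka/compute.py follows this standard form, which the contract does not spell out. Here X and Y are the `[n, d_1]` and `[n, d_2]` embedding matrices of two models over the same n stimuli.

```latex
K = X X^\top, \qquad L = Y Y^\top, \qquad
H = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^\top

\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H)

\mathrm{CKA}(K, L) =
  \frac{\mathrm{HSIC}(K, L)}
       {\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}
```

The normalization factor cancels in the ratio, so only the centering by H affects the CKA value under the biased estimator.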

Storage and Paths

  • Dataset roots come from env vars (see AGENTS.md path hygiene).
  • Cache embeddings per model, layer, and dataset version on a Modal volume.
  • Durable logs and final scores go to /orcd/data/....
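One way to key the embedding cache is to derive a deterministic path from the model, layer, and dataset version. Everything in this sketch is hypothetical, including the function name, the `/cache` root, the version string, and the file layout. The contract specifies only the granularity of the cache.

```python
import hashlib
from pathlib import PurePosixPath


def embedding_cache_path(cache_root, model_name, layer, dataset_name, dataset_version):
    """Build a deterministic cache path for one model/layer/dataset-version combination."""
    # Hash the full key so layer paths such as "classifier.4" cannot collide
    # or produce awkward file names.
    key = "|".join([model_name, layer, dataset_name, dataset_version])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    filename = f"{dataset_name}-{dataset_version}-{digest}.npy"
    return PurePosixPath(cache_root) / model_name / filename


# Hypothetical cache root and dataset version, for illustration only.
path_a = embedding_cache_path("/cache", "resnet50", "avgpool", "cifar100", "v1")
path_b = embedding_cache_path("/cache", "resnet50", "fc", "cifar100", "v1")
print(path_a)
```

Changing any component of the key changes the digest, so stale embeddings from a different layer or dataset version are never picked up by mistake.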

Validation Rules (for future validator)

  • JSON schema checks for required fields.
  • Name and stimulus existence checks.
  • Minimum counts and uniqueness.
  • Dataset path resolution errors are surfaced as submission failures.
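The count, uniqueness, and existence rules can be sketched as two small functions. This is not the logic of scripts/validate_submission.py. The function names and error messages are hypothetical, and the registry and catalog are assumed to be available as in-memory sets.

```python
def validate_blue(submission, registry_names):
    """Check a Blue Team submission; return a list of error strings."""
    models = submission.get("models")
    if not isinstance(models, list) or not all(isinstance(m, str) for m in models):
        return ["models must be a list of model_name strings"]
    errors = []
    if len(models) < 2:
        errors.append("at least 2 models are required")
    if len(set(models)) != len(models):
        errors.append("duplicate model_name entries")
    for name in sorted(set(models)):
        if name not in registry_names:
            errors.append(f"unknown model_name: {name}")
    return errors


def validate_red(submission, catalog_keys):
    """Check a Red Team submission; return a list of error strings."""
    stimuli = submission.get("differentiating_images")
    if not isinstance(stimuli, list):
        return ["differentiating_images must be a list of stimulus objects"]
    errors = []
    keys = []
    for item in stimuli:
        dataset = item.get("dataset_name") if isinstance(item, dict) else None
        image = item.get("image_identifier") if isinstance(item, dict) else None
        if not isinstance(dataset, str) or not isinstance(image, str):
            errors.append("each stimulus needs dataset_name and image_identifier")
            continue
        keys.append((dataset, image))
    if len(stimuli) < 2:
        errors.append("at least 2 stimuli are required")
    if len(set(keys)) != len(keys):
        errors.append("duplicate stimuli")
    for dataset, image in sorted(set(keys)):
        if (dataset, image) not in catalog_keys:
            errors.append(f"unknown stimulus: {dataset}/{image}")
    return errors


registry = {"resnet50", "clip_vit_b32"}
catalog = {("cifar100", "test/bear/image_0007.png"),
           ("imagenet1k", "val/n03445777/ILSVRC2012_val_00003572.JPEG")}
print(validate_blue({"models": ["resnet50", "clip_vit_b32"]}, registry))
```

An empty list means the submission passed every check shown here. Dataset path resolution is left out because it depends on the environment.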

Validation Script

  • scripts/validate_submission.py validates JSON submissions.
  • Optional environment variables: HACKATHON_MODEL_REGISTRY, HACKATHON_STIMULI_CATALOG.

Modal Scoring (optional)

  • Set HACKATHON_MODAL_ENABLE=true to route scoring through Modal.
  • Requires HACKATHON_MODEL_REGISTRY and HACKATHON_STIMULI_CATALOG.
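The toggle and its two required variables could be read as in the sketch below. The function name `modal_scoring_config` is hypothetical, and accepting any capitalization of `true` is an assumption. The contract documents only the lowercase value.

```python
import os

REQUIRED_FOR_MODAL = ("HACKATHON_MODEL_REGISTRY", "HACKATHON_STIMULI_CATALOG")


def modal_scoring_config(env=None):
    """Return None when Modal is off, else the registry and catalog settings."""
    env = os.environ if env is None else env
    # Assumption: the flag is compared case-insensitively.
    if env.get("HACKATHON_MODAL_ENABLE", "").strip().lower() != "true":
        return None
    missing = [name for name in REQUIRED_FOR_MODAL if not env.get(name)]
    if missing:
        raise RuntimeError("Modal scoring enabled but missing: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_FOR_MODAL}


# A plain dict stands in for os.environ so the example has no side effects.
config = modal_scoring_config({
    "HACKATHON_MODAL_ENABLE": "true",
    "HACKATHON_MODEL_REGISTRY": "configs/model_registry.json",
    "HACKATHON_STIMULI_CATALOG": "configs/stimuli_catalog.jsonl",
})
print(config)
```

Failing fast when a required variable is missing keeps a misconfigured deployment from silently falling back to local scoring.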

Versioning

  • This contract should include a contract_version when enforced in code.