Evaluation Contract (Draft)
This document defines how submissions are evaluated using real model forward passes and CKA.
It replaces the dummy embeddings in src/hackathon/data.py.
Scope
- Applies to Blue Team (model selection) and Red Team (stimulus selection) submissions.
- All scoring uses real model forward passes to compute embeddings, then linear CKA.
Entities
Stimulus
A stimulus is identified by:
- dataset_name: canonical dataset id (e.g., cifar100, imagenet1k)
- image_identifier: path relative to dataset root (e.g., val/n01440764/ILSVRC2012_val_00000964.JPEG)
Model
A model is identified by:
- model_name: unique registry key (e.g., resnet50, clip_vit_b32)
Model Registry Spec (planned location: configs/model_registry.json)
Each entry defines how to load a model and extract embeddings.
Required fields:
- model_name: string, unique
- source: string (torchvision, timm, open_clip, custom)
- weights: string or null (pretrained identifier)
- layer: string module path or alias (e.g., fc, classifier.4, visual)
- embedding: string strategy (pool, cls, flatten, mean)
- input_size: [height, width]
- preprocess: {mean: [...], std: [...], resize: int, crop: int}
- output_dim: int (expected embedding dimension)
Optional fields:
- model_parameters: object for model constructor
- forward_args: object for forward call
- notes: string
Example:
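{
  "model_name": "resnet50",
  "source": "torchvision",
  "weights": "IMAGENET1K_V2",
  "layer": "avgpool",
  "embedding": "flatten",
  "input_size": [224, 224],
  "preprocess": {
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
    "resize": 256,
    "crop": 224
  },
  "output_dim": 2048
}
Note: in torchvision's resnet50, avgpool yields 2048 features after flattening, while fc yields 1000 class logits. The layer and output_dim must agree, so an entry that uses fc should set output_dim to 1000.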
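A registry entry can be checked against the required fields listed above before any model is loaded. The following is a minimal sketch, not the project's validator: the function name `validate_registry_entry` is hypothetical, and treating the listed `source` and `embedding` values as closed sets is an assumption.

```python
# Required fields and their expected JSON types, taken from the spec above.
REQUIRED_FIELDS = {
    "model_name": str,
    "source": str,
    "weights": (str, type(None)),  # string or null
    "layer": str,
    "embedding": str,
    "input_size": list,
    "preprocess": dict,
    "output_dim": int,
}
# Assumption: the values listed in the spec are the only allowed ones.
ALLOWED_SOURCES = {"torchvision", "timm", "open_clip", "custom"}
ALLOWED_EMBEDDINGS = {"pool", "cls", "flatten", "mean"}
PREPROCESS_KEYS = {"mean", "std", "resize", "crop"}


def validate_registry_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passed."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            errors.append(f"missing required field: {field}")
        elif not isinstance(entry[field], expected):
            errors.append(f"wrong type for {field}: {type(entry[field]).__name__}")
    if errors:
        return errors  # value checks below assume the basic shape is right

    if entry["source"] not in ALLOWED_SOURCES:
        errors.append(f"unknown source: {entry['source']}")
    if entry["embedding"] not in ALLOWED_EMBEDDINGS:
        errors.append(f"unknown embedding strategy: {entry['embedding']}")
    if len(entry["input_size"]) != 2:
        errors.append("input_size must be [height, width]")
    missing = sorted(PREPROCESS_KEYS - entry["preprocess"].keys())
    if missing:
        errors.append(f"preprocess is missing keys: {missing}")
    if entry["output_dim"] <= 0:
        errors.append("output_dim must be positive")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a validator report every issue in an entry at once.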
Stimuli Catalog Spec (planned location: configs/stimuli_catalog.jsonl)
Each line is one stimulus with:
- dataset_name
- image_identifier
Example lines:
{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}
{"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}
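Reading the catalog is a line-by-line JSON parse. The sketch below assumes a hypothetical helper name (`parse_stimuli_catalog`) and assumes that duplicate catalog entries should be rejected, which the spec does not state.

```python
import json


def parse_stimuli_catalog(lines):
    """Parse JSONL lines into a list of (dataset_name, image_identifier) tuples.

    `lines` can be an open file or any iterable of strings.
    """
    stimuli = []
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        try:
            key = (record["dataset_name"], record["image_identifier"])
        except KeyError as exc:
            raise ValueError(f"line {line_no}: missing field {exc}") from exc
        if key in seen:  # assumption: the catalog must not repeat a stimulus
            raise ValueError(f"line {line_no}: duplicate stimulus {key}")
        seen.add(key)
        stimuli.append(key)
    return stimuli
```

Using the (dataset_name, image_identifier) pair as the key matches how a stimulus is identified in the Entities section.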
Submission Contract
Blue Team
- models: list of model_name strings.
- Each model_name must exist in the model registry.
- Minimum 2 models; no duplicates.
Red Team
- differentiating_images: list of stimulus objects.
- Each stimulus must exist in the stimuli catalog.
- Minimum 2 stimuli; no duplicates.
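The submission rules for both teams reduce to the same three checks: minimum count, uniqueness, and existence. A rough sketch follows; the function names are hypothetical and this is not the behavior of scripts/validate_submission.py, which is not shown here.

```python
def _check_members(items, known, label):
    """Shared checks: at least 2 items, no duplicates, all items known."""
    errors = []
    if len(items) < 2:
        errors.append(f"at least 2 {label} are required, got {len(items)}")
    if len(set(items)) != len(items):
        errors.append(f"duplicate {label} are not allowed")
    unknown = sorted(set(items) - set(known))
    if unknown:
        errors.append(f"unknown {label}: {unknown}")
    return errors


def validate_blue(submission, registry_names):
    """Blue Team: `models` is a list of model_name strings from the registry."""
    models = submission.get("models")
    if not isinstance(models, list) or not all(isinstance(m, str) for m in models):
        return ["'models' must be a list of model_name strings"]
    return _check_members(models, registry_names, "models")


def validate_red(submission, catalog_keys):
    """Red Team: `differentiating_images` is a list of stimulus objects.

    `catalog_keys` is a set of (dataset_name, image_identifier) tuples.
    """
    stimuli = submission.get("differentiating_images")
    if not isinstance(stimuli, list):
        return ["'differentiating_images' must be a list of stimulus objects"]
    keys = []
    for stim in stimuli:
        if not isinstance(stim, dict):
            return ["each stimulus must be an object"]
        name = stim.get("dataset_name")
        ident = stim.get("image_identifier")
        if not isinstance(name, str) or not isinstance(ident, str):
            return ["each stimulus needs string dataset_name and image_identifier"]
        keys.append((name, ident))
    return _check_members(keys, catalog_keys, "stimuli")
```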
Evaluation Procedure
Blue Team scoring
- Load the stimuli catalog (full evaluation set).
- For each submitted model, run a forward pass on all stimuli and extract embeddings.
- Compute mean pairwise linear CKA across submitted models.
Red Team scoring
- Load the model registry (full evaluation model set).
- For each model, run a forward pass on the submitted stimuli and extract embeddings.
- Compute mean pairwise linear CKA across all models; the score is 1 - mean CKA.
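Both procedures end in the same aggregation: average linear CKA over every unordered pair of models. The sketch below illustrates that step with NumPy. The local `linear_cka` is a stand-in written for this example (the interface of src/cka/compute.py is not shown in this document), and `mean_pairwise_cka` and `red_team_score` are hypothetical names.

```python
from itertools import combinations

import numpy as np


def linear_cka(x, y):
    """Linear CKA between two [num_samples, dim] arrays (dims may differ)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 2 or y.ndim != 2 or x.shape[0] != y.shape[0]:
        raise ValueError("expected 2D arrays with the same number of samples")
    # Center each feature column across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_pairwise_cka(embeddings):
    """Average linear CKA over all unordered model pairs.

    `embeddings` maps model_name -> [num_samples, dim] array, where every
    model was run on the same stimuli in the same order.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    if not pairs:
        raise ValueError("need at least two models")
    return float(np.mean([linear_cka(embeddings[a], embeddings[b]) for a, b in pairs]))


def red_team_score(embeddings):
    """Red Team score as stated above: 1 - mean pairwise CKA."""
    return 1.0 - mean_pairwise_cka(embeddings)
```

Because CKA compares sample-by-sample structure, the rows of every embedding array must correspond to the same stimuli in the same order; the embedding dimension may differ between models.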
Embedding Extraction Requirements
- model.eval() and torch.no_grad() for all forward passes.
- Deterministic settings (seed, disable dropout).
- Embeddings must be 2D arrays shaped [num_samples, dim].
- If a layer produces spatial features, apply the registry's embedding strategy (e.g., global average pool then flatten).
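The reduction from raw layer output to a 2D embedding could look like the sketch below. The spec names the strategies (pool, cls, flatten, mean) but does not define them, so the semantics here are assumptions: pool averages over the spatial axes of [N, C, H, W] features, cls takes the first token of [N, tokens, dim] features, mean averages over tokens, and flatten reshapes everything after the batch axis. The function name `to_embedding` is hypothetical, and NumPy arrays stand in for model outputs.

```python
import numpy as np


def to_embedding(features, strategy):
    """Reduce layer output to a [num_samples, dim] array.

    Strategy semantics are assumed, not taken from the contract:
      flatten: [N, ...]       -> [N, prod(...)]
      pool:    [N, C, H, W]   -> [N, C]  (global average pool)
      cls:     [N, T, D]      -> [N, D]  (first token)
      mean:    [N, T, D]      -> [N, D]  (average over tokens)
    """
    feats = np.asarray(features)
    if feats.ndim < 2:
        raise ValueError("features need a batch axis and at least one feature axis")
    if strategy == "flatten":
        out = feats.reshape(feats.shape[0], -1)
    elif strategy == "pool":
        if feats.ndim != 4:
            raise ValueError("pool expects [N, C, H, W] features")
        out = feats.mean(axis=(2, 3))
    elif strategy == "cls":
        if feats.ndim != 3:
            raise ValueError("cls expects [N, tokens, dim] features")
        out = feats[:, 0, :]
    elif strategy == "mean":
        if feats.ndim != 3:
            raise ValueError("mean expects [N, tokens, dim] features")
        out = feats.mean(axis=1)
    else:
        raise ValueError(f"unknown embedding strategy: {strategy}")
    return out
```

Comparing the second dimension of the result against the registry's output_dim is a cheap way to catch a mismatched layer or strategy before scoring.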
CKA Definition
- Use src/cka/compute.py linear CKA (biased HSIC by default).
- Arrays are converted to float64 before CKA.
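For reference, the standard form of linear CKA with the biased HSIC estimator is given below, for embeddings X and Y with n rows each. Whether src/cka/compute.py uses exactly this normalization is an assumption; the (n-1)^2 factors cancel in the ratio in any case.

```latex
\[
\mathrm{HSIC}_b(K, L) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top
\]
\[
\mathrm{CKA}(X, Y)
  = \frac{\mathrm{HSIC}_b(K, L)}{\sqrt{\mathrm{HSIC}_b(K, K)\,\mathrm{HSIC}_b(L, L)}},
\qquad K = X X^\top,\quad L = Y Y^\top
\]
\[
\mathrm{CKA}(X, Y)
  = \frac{\lVert Y_c^\top X_c \rVert_F^2}
         {\lVert X_c^\top X_c \rVert_F \,\lVert Y_c^\top Y_c \rVert_F},
\qquad X_c = H X,\quad Y_c = H Y
\]
```

The last line is the equivalent feature-space form, which avoids building the n-by-n Gram matrices when the embedding dimension is smaller than the number of stimuli.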
Storage and Paths
- Dataset roots come from env vars (see AGENTS.md path hygiene).
- Cache embeddings per model/layer/dataset version (Modal volume).
- Durable logs and final scores go to /orcd/data/....
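One way to keep paths out of the code is to resolve dataset roots from the environment and derive cache locations from the model/layer/dataset-version triple. Everything named in this sketch is hypothetical: the env var scheme `HACKATHON_DATASET_ROOT_<NAME>`, the helper names, and the `.npy` file layout are illustrations, not the conventions in AGENTS.md.

```python
import os
from pathlib import Path


def dataset_root(dataset_name, env=None):
    """Resolve a dataset root from an env var (hypothetical naming scheme)."""
    env = os.environ if env is None else env
    var = f"HACKATHON_DATASET_ROOT_{dataset_name.upper()}"
    try:
        return Path(env[var])
    except KeyError:
        raise RuntimeError(f"dataset root not configured: set {var}") from None


def embedding_cache_path(cache_root, model_name, layer, dataset_name, dataset_version):
    """Build a cache path keyed by model, layer, and dataset version.

    Dots in the layer path (e.g. classifier.4) are replaced so the layer
    maps to a single directory name.
    """
    safe_layer = layer.replace(".", "_")
    filename = f"{dataset_name}-{dataset_version}.npy"  # assumed file format
    return Path(cache_root) / model_name / safe_layer / filename
```

Including the dataset version in the key means a changed catalog produces a cache miss instead of silently reusing stale embeddings.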
Validation Rules (for future validator)
- JSON schema checks for required fields.
- Name and stimulus existence checks.
- Minimum counts and uniqueness.
- Dataset path resolution errors are surfaced as submission failures.
Validation Script
- scripts/validate_submission.py validates JSON submissions.
- Optional envs: HACKATHON_MODEL_REGISTRY, HACKATHON_STIMULI_CATALOG.
Modal Scoring (optional)
- Set HACKATHON_MODAL_ENABLE=true to route scoring through Modal.
- Requires HACKATHON_MODEL_REGISTRY and HACKATHON_STIMULI_CATALOG.
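The environment switches above could be read in one place so that a misconfigured Modal run fails early. This is a sketch only: `load_scoring_config` is a hypothetical helper, and accepting "true" case-insensitively is an assumption.

```python
import os

REGISTRY_VAR = "HACKATHON_MODEL_REGISTRY"
CATALOG_VAR = "HACKATHON_STIMULI_CATALOG"
MODAL_VAR = "HACKATHON_MODAL_ENABLE"


def load_scoring_config(env=None):
    """Read scoring settings from env vars; fail fast if Modal is misconfigured."""
    env = os.environ if env is None else env
    registry = env.get(REGISTRY_VAR)
    catalog = env.get(CATALOG_VAR)
    # Assumption: any casing of "true" enables Modal; everything else disables it.
    use_modal = env.get(MODAL_VAR, "").strip().lower() == "true"
    if use_modal:
        missing = [
            name
            for name, value in ((REGISTRY_VAR, registry), (CATALOG_VAR, catalog))
            if not value
        ]
        if missing:
            raise RuntimeError(f"Modal scoring requires: {', '.join(missing)}")
    return {"use_modal": use_modal, "model_registry": registry, "stimuli_catalog": catalog}
```

Passing a plain dict as `env` keeps the function easy to exercise without touching the real process environment.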
Versioning
- This contract should include a contract_version when enforced in code.