| # Evaluation Contract (Draft) |
|
|
| This document defines how submissions are evaluated using real model forward passes and CKA. |
| It replaces the dummy embeddings in `src/hackathon/data.py`. |
|
|
| ## Scope |
|
|
| - Applies to Blue Team (model selection) and Red Team (stimulus selection) submissions. |
| - All scoring uses real model forward passes to compute embeddings, then linear CKA. |
|
|
| ## Entities |
|
|
| ### Stimulus |
|
|
| A stimulus is identified by: |
| - dataset_name: canonical dataset id (e.g., cifar100, imagenet1k) |
| - image_identifier: path relative to dataset root (e.g., val/n01440764/ILSVRC2012_val_00000964.JPEG) |
|
|
| ### Model |
|
|
| A model is identified by: |
| - model_name: unique registry key (e.g., resnet50, clip_vit_b32) |
| |
| ## Model Registry Spec (planned location: configs/model_registry.json) |
|
|
| Each entry defines how to load a model and extract embeddings. |
|
|
| Required fields: |
| - model_name: string, unique |
| - source: string (torchvision, timm, open_clip, custom) |
| - weights: string or null (pretrained identifier) |
| - layer: string module path or alias (e.g., fc, classifier.4, visual) |
| - embedding: string strategy (pool, cls, flatten, mean) |
| - input_size: [height, width] |
| - preprocess: {mean: [...], std: [...], resize: int, crop: int} |
| - output_dim: int (expected embedding dimension) |
|
|
| Optional fields: |
| - model_parameters: object for model constructor |
| - forward_args: object for forward call |
| - notes: string |
|
|
| Example: |
| ```json |
| { |
|   "model_name": "resnet50", |
|   "source": "torchvision", |
|   "weights": "IMAGENET1K_V2", |
|   "layer": "avgpool", |
|   "embedding": "flatten", |
|   "input_size": [224, 224], |
|   "preprocess": { |
|     "mean": [0.485, 0.456, 0.406], |
|     "std": [0.229, 0.224, 0.225], |
|     "resize": 256, |
|     "crop": 224 |
|   }, |
|   "output_dim": 2048 |
| } |
| ``` |
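
A registry entry can be checked against the required fields before any model is loaded. The sketch below is one possible check, not the project's validator: the function name `check_registry_entry` and the error strings are hypothetical, and the allowed `source` and `embedding` values are taken from the field list above.

```python
import json

# Required fields and the JSON types they are expected to decode to.
REQUIRED_FIELDS = {
    "model_name": str,
    "source": str,
    "weights": (str, type(None)),
    "layer": str,
    "embedding": str,
    "input_size": list,
    "preprocess": dict,
    "output_dim": int,
}
ALLOWED_SOURCES = {"torchvision", "timm", "open_clip", "custom"}
ALLOWED_EMBEDDINGS = {"pool", "cls", "flatten", "mean"}


def check_registry_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry passed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for field: {field}")
    source = entry.get("source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source}")
    embedding = entry.get("embedding")
    if isinstance(embedding, str) and embedding not in ALLOWED_EMBEDDINGS:
        problems.append(f"unknown embedding strategy: {embedding}")
    size = entry.get("input_size")
    if isinstance(size, list) and len(size) != 2:
        problems.append("input_size must be [height, width]")
    return problems


# A deliberately incomplete entry, to show how problems are reported.
partial = json.loads('{"model_name": "resnet50", "source": "torchvision"}')
for problem in check_registry_entry(partial):
    print(problem)
```

Returning a list rather than raising on the first problem lets a validator report every issue in one pass.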
|
|
| ## Stimuli Catalog Spec (planned location: configs/stimuli_catalog.jsonl) |
| |
| Each line is one stimulus with: |
| - dataset_name |
| - image_identifier |
| |
| Example lines: |
| ```json |
| {"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"} |
| {"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"} |
| ``` |
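
Reading the catalog amounts to parsing one JSON object per line and keying each stimulus by its two identifying fields. The sketch below assumes the catalog text is already in memory; `parse_stimuli_catalog` is a hypothetical helper, and rejecting duplicate catalog lines is an assumption rather than a stated rule.

```python
import json


def parse_stimuli_catalog(text: str) -> set[tuple[str, str]]:
    """Parse JSONL text into a set of (dataset_name, image_identifier) keys."""
    catalog = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        try:
            key = (record["dataset_name"], record["image_identifier"])
        except KeyError as missing:
            raise ValueError(f"line {line_number}: missing field {missing}") from None
        if key in catalog:
            # Assumption: a repeated stimulus is treated as a catalog error.
            raise ValueError(f"line {line_number}: duplicate stimulus {key}")
        catalog.add(key)
    return catalog


sample = "\n".join([
    '{"dataset_name": "cifar100", "image_identifier": "test/bear/image_0007.png"}',
    '{"dataset_name": "imagenet1k", "image_identifier": "val/n03445777/ILSVRC2012_val_00003572.JPEG"}',
])
catalog = parse_stimuli_catalog(sample)
print(len(catalog), "stimuli loaded")
```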
| |
| ## Submission Contract |
| |
| ### Blue Team |
| |
| - `models`: list of model_name strings. |
| - Each model_name must exist in the model registry. |
| - Minimum 2 models; no duplicates. |
| |
| ### Red Team |
| |
| - `differentiating_images`: list of stimulus objects. |
| - Each stimulus must exist in the stimuli catalog. |
| - Minimum 2 stimuli; no duplicates. |
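
The Blue Team and Red Team rules above translate into a few set operations. This is a minimal sketch and not the code in `scripts/validate_submission.py`; the function names and error messages are hypothetical, and the registry and catalog are assumed to be available as in-memory sets.

```python
def validate_blue(submission: dict, registry_names: set[str]) -> list[str]:
    """Check a Blue Team submission against the model registry."""
    models = submission.get("models", [])
    errors = []
    if len(models) < 2:
        errors.append("need at least 2 models")
    if len(set(models)) != len(models):
        errors.append("duplicate model names")
    for name in sorted(set(models) - registry_names):
        errors.append(f"unknown model: {name}")
    return errors


def validate_red(submission: dict, catalog_keys: set[tuple[str, str]]) -> list[str]:
    """Check a Red Team submission against the stimuli catalog."""
    stimuli = submission.get("differentiating_images", [])
    keys = [(s.get("dataset_name"), s.get("image_identifier")) for s in stimuli]
    errors = []
    if len(keys) < 2:
        errors.append("need at least 2 stimuli")
    if len(set(keys)) != len(keys):
        errors.append("duplicate stimuli")
    for key in keys:
        if key not in catalog_keys:
            errors.append(f"unknown stimulus: {key}")
    return errors


registry_names = {"resnet50", "clip_vit_b32"}
print(validate_blue({"models": ["resnet50", "clip_vit_b32"]}, registry_names))
print(validate_blue({"models": ["resnet50"]}, registry_names))
```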
| |
| ## Evaluation Procedure |
| |
| ### Blue Team scoring |
| |
| 1. Load the stimuli catalog (full evaluation set). |
| 2. For each submitted model, run forward pass on all stimuli and extract embeddings. |
| 3. Compute mean pairwise linear CKA across submitted models. |
| |
| ### Red Team scoring |
| |
| 1. Load the model registry (full evaluation model set). |
| 2. For each model, run forward pass on submitted stimuli and extract embeddings. |
| 3. Compute mean pairwise linear CKA across all models; the score is 1 - mean CKA. |
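
Both procedures reduce to the same computation once embeddings exist: average linear CKA over every unordered pair of models. The sketch below uses NumPy and random arrays in place of real forward passes. `linear_cka` here is a stand-in written for illustration, not the function in `src/cka/compute.py`, and the model names are placeholders.

```python
from itertools import combinations

import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two [num_samples, dim] arrays (biased-HSIC form)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center each feature column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mean_pairwise_cka(embeddings: dict[str, np.ndarray]) -> float:
    """Average linear CKA over all unordered pairs of models."""
    pairs = combinations(sorted(embeddings), 2)
    return float(np.mean([linear_cka(embeddings[a], embeddings[b]) for a, b in pairs]))


# Random stand-ins for embeddings of the same 16 stimuli from three models.
rng = np.random.default_rng(0)
base = rng.normal(size=(16, 8))
embeddings = {
    "model_a": base,
    "model_b": base * 2.0,  # scaled copy; linear CKA ignores isotropic scaling
    "model_c": rng.normal(size=(16, 5)),
}
mean_cka = mean_pairwise_cka(embeddings)
red_score = 1.0 - mean_cka  # Red Team score as defined in step 3
print(f"mean CKA = {mean_cka:.3f}, red score = {red_score:.3f}")
```

Embedding dimensions may differ between models; only the number of stimuli (rows) has to match.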
| |
| ## Embedding Extraction Requirements |
| |
| - Call `model.eval()` and run every forward pass under `torch.no_grad()`. |
| - Use deterministic settings: fix the random seed and keep dropout disabled (`model.eval()` does this for standard dropout layers). |
| - Embeddings must be 2D arrays shaped [num_samples, dim]. |
| - If a layer produces spatial features, apply the registry's embedding strategy |
| (e.g., global average pool then flatten). |
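
To illustrate the shape requirement, the sketch below reduces raw layer outputs to 2D arrays with NumPy. The meaning given to each strategy name is an assumption, since this contract lists the names without defining them: `pool` averages over the spatial axes of an `[N, C, H, W]` map, `cls` keeps token 0 of an `[N, T, D]` sequence, `mean` averages over tokens, and `flatten` reshapes to one row per sample. `to_embedding` is a hypothetical helper.

```python
import numpy as np


def to_embedding(features: np.ndarray, strategy: str) -> np.ndarray:
    """Reduce a raw layer output to a 2D [num_samples, dim] array.

    Assumed layouts: [N, C, H, W] for convolutional feature maps and
    [N, T, D] for token sequences with the class token at index 0.
    """
    if strategy == "pool":
        out = features.mean(axis=(2, 3))  # global average pool over H and W
    elif strategy == "cls":
        out = features[:, 0, :]  # keep the class token only
    elif strategy == "mean":
        out = features.mean(axis=1)  # average over tokens
    elif strategy == "flatten":
        out = features.reshape(features.shape[0], -1)  # one row per sample
    else:
        raise ValueError(f"unknown embedding strategy: {strategy}")
    if out.ndim != 2:
        raise ValueError(f"expected a 2D result, got shape {out.shape}")
    return out


conv_map = np.zeros((4, 2048, 7, 7))
tokens = np.zeros((4, 50, 768))
print(to_embedding(conv_map, "pool").shape)
print(to_embedding(tokens, "cls").shape)
```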
| |
| ## CKA Definition |
| |
| - Use `src/cka/compute.py` linear CKA (biased HSIC by default). |
| - Arrays are converted to float64 before CKA. |
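
For reference, the standard formulation of linear CKA with the biased HSIC estimator is given below. It is stated here as the textbook definition and has not been checked against `src/cka/compute.py`. Let $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ be embeddings of the same $n$ stimuli, $K = XX^\top$, $L = YY^\top$, and $H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$ the centering matrix.

$$
\mathrm{HSIC}_b(K, L) = \frac{1}{(n-1)^2}\,\mathrm{tr}(K H L H),
\qquad
\mathrm{CKA}(X, Y) = \frac{\mathrm{HSIC}_b(K, L)}{\sqrt{\mathrm{HSIC}_b(K, K)\,\mathrm{HSIC}_b(L, L)}}
$$

With linear kernels the normalization constant cancels and the expression reduces to

$$
\mathrm{CKA}(X, Y) = \frac{\lVert Y^\top H X \rVert_F^2}{\lVert X^\top H X \rVert_F \,\lVert Y^\top H Y \rVert_F}.
$$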
| |
| ## Storage and Paths |
| |
| - Dataset roots come from env vars (see `AGENTS.md` path hygiene). |
| - Cache embeddings on a Modal volume, keyed by model, layer, and dataset version. |
| - Durable logs and final scores go to `/orcd/data/...`. |
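
One way to key the cache by model, layer, and dataset version is to hash those fields into the file name. The layout below is purely illustrative: `embedding_cache_path`, the directory structure, the `.npy` extension, and the `/cache` root are all hypothetical, not the project's actual scheme.

```python
import hashlib
from pathlib import PurePosixPath


def embedding_cache_path(
    cache_root: str,
    model_name: str,
    layer: str,
    dataset_name: str,
    dataset_version: str,
) -> PurePosixPath:
    """Build a deterministic cache path from the fields that define an embedding."""
    key = "|".join([model_name, layer, dataset_name, dataset_version])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return PurePosixPath(cache_root) / model_name / f"{dataset_name}-{digest}.npy"


# "/cache" and "v1" are placeholder values for illustration only.
print(embedding_cache_path("/cache", "resnet50", "avgpool", "imagenet1k", "v1"))
```

Because the layer and dataset version feed the hash, changing either one yields a new path instead of silently reusing stale embeddings.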
| |
| ## Validation Rules (for future validator) |
| |
| - JSON schema checks for required fields. |
| - Name and stimulus existence checks. |
| - Minimum counts and uniqueness. |
| - Dataset path resolution errors are surfaced as submission failures. |
| |
| ## Validation Script |
| |
| - `scripts/validate_submission.py` validates JSON submissions. |
| - Optional environment variables: `HACKATHON_MODEL_REGISTRY`, `HACKATHON_STIMULI_CATALOG`. |
| |
| ## Modal Scoring (optional) |
| |
| - Set `HACKATHON_MODAL_ENABLE=true` to route scoring through Modal. |
| - Requires `HACKATHON_MODEL_REGISTRY` and `HACKATHON_STIMULI_CATALOG`. |
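
The two rules above can be enforced at startup with a small environment check. This is a sketch under assumptions: `modal_scoring_config` is a hypothetical helper, and treating the flag as case-insensitive is a choice made here, not something the contract specifies.

```python
import os
from typing import Mapping, Optional

REQUIRED_WHEN_ENABLED = ("HACKATHON_MODEL_REGISTRY", "HACKATHON_STIMULI_CATALOG")


def modal_scoring_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return the paths needed for Modal scoring, or None when it is disabled."""
    # Assumption: any capitalization of "true" enables Modal scoring.
    enabled = env.get("HACKATHON_MODAL_ENABLE", "").strip().lower() == "true"
    if not enabled:
        return None
    missing = [name for name in REQUIRED_WHEN_ENABLED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Modal scoring enabled but unset: {', '.join(missing)}")
    return {
        "model_registry": env["HACKATHON_MODEL_REGISTRY"],
        "stimuli_catalog": env["HACKATHON_STIMULI_CATALOG"],
    }


print(modal_scoring_config({}))
```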
| |
| ## Versioning |
| |
| - This contract should include a `contract_version` when enforced in code. |
| |