arc-state-norman-gears-corrected
Leak-corrected, fine-tuned Arc State checkpoint on the Norman 2019 K562 perturbation dataset, produced by VCBench v1.0.0 so that Arc State's perturbation-prediction performance can be reproduced independently under a clean train/test split. A forensic_artifacts/ subfolder ships the deprecated/leaked checkpoint and both runs' raw eval AnnData for full auditability.
What this is
This release supersedes Arc Institute's published Arc State Norman fine-tune for benchmark purposes. The published norman_fewshot.toml configuration in ArcInstitute/state contains a misconfigured cell-type filter ([zeroshot] "norman.double_perts" = "test") that matches zero cells in the Norman dataset (which has only cell_type == "A549"). With zero cells held out as test, all 107 nominally-held-out test perturbations remain in the training pool. The published Arc State Norman PRR of 0.963 is therefore the result of training-set memorisation, not genuine generalisation.
This checkpoint was fine-tuned using configs/dim_a/arc_state_norman_gears_split.toml from the VCBench repository at tag v1.0.0, which explicitly enumerates 139 training perturbations and 107 held-out test perturbations matching the GEARS simulation split (seed=1) used by every other foundation model evaluated in VCBench.
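Before running anything, the split discipline of the corrected TOML can be audited directly from the file. A minimal sketch, assuming the train list lives under [fewshot."norman.A549"].train as in the training-recipe table below, and that the 107 held-out perturbations sit under a sibling test key (the exact key layout is the repository's, so adjust if it differs):

import tomllib  # stdlib, Python 3.11+

with open("data_split_leak_corrected.toml", "rb") as fh:
    cfg = tomllib.load(fh)

train = set(cfg["fewshot"]["norman.A549"]["train"])  # 139 training perturbations
test = set(cfg["fewshot"]["norman.A549"]["test"])    # assumed key for the 107 held-out perturbations

assert len(train) == 139 and len(test) == 107
assert not train & test, "train/test perturbation overlap detected"
print("split OK: 139 train / 107 test, zero overlap")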
Headline metric: PRR = 0.402 on the 107 held-out Norman test perturbations (vcbench evaluate_dim_a, real-control anchor: the canonical convention for VC Level decisions). Cross-validated against upstream cell-eval pearson_delta = 0.408 (agreement to 2e-6 absolute under the matched anchor convention). See Cross-evaluator anchor reconciliation below.
Reproducing the headline numbers (paste-able, <5 min, CPU-only)
from huggingface_hub import snapshot_download
import anndata as ad
from vcbench.dimensions.dim_a_perturbation import evaluate_dim_a
repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
root = snapshot_download(repo, allow_patterns=["forensic_artifacts/leak_corrected/*"])
pred = ad.read_h5ad(f"{root}/forensic_artifacts/leak_corrected/adata_pred.h5ad")
real = ad.read_h5ad(f"{root}/forensic_artifacts/leak_corrected/adata_real.h5ad")
# Canonical real-anchor PRR (used for VC Level decisions): 0.402
res_canonical = evaluate_dim_a(pred, real, perturbation_col="condition", control_label="ctrl")
assert round(res_canonical.mean_pearson_r_delta, 4) == 0.4021
# Cell-eval cross-validation under matched anchor: 0.408 (matches cell-eval pearson_delta to 2e-6)
res_xval = evaluate_dim_a(pred, real, perturbation_col="condition", control_label="ctrl",
                          control_anchor="pred")
assert round(res_xval.mean_pearson_r_delta, 4) == 0.4076
End-to-end retrain via the VCBench wrapper (requires GPU, ~4-5h on A40):
from vcbench.models import ArcState
arc = ArcState() # defaults to leak-corrected config
arc.load_pretrained(ckpt_path="final.ckpt") # raises ArcStateLeakError if config has overlap
result = arc.run_dim_a()  # full pipeline -> DimAResult
print(f"PRR: {result.mean_pearson_r_delta:.4f}")  # -> 0.4021 (real anchor, canonical)
Cross-evaluator anchor reconciliation (vcbench vs cell-eval)
VCBench's evaluate_dim_a and Arc Institute's upstream cell-eval pearson_delta differ in one design choice: the control-anchor convention used to form the Δ-expression vectors fed into the per-perturbation Pearson R.
| Convention | Definition | Role |
|---|---|---|
| Real anchor (vcbench default, control_anchor="real") | pred_delta = pert_pred − ctrl_real and real_delta = pert_real − ctrl_real (both anchored on the observed real control) | CANONICAL: used for VC Level decisions. The right convention for cross-model benchmarking (no per-model free baseline) and leak forensics (the model cannot hide baseline memorisation in ctrl_pred). |
| Pred anchor (cell-eval, control_anchor="pred") | pred_delta = pert_pred − ctrl_pred (the model's own predicted control), real_delta = pert_real − ctrl_real | For cell-eval cross-validation only. Reproduces upstream cell-eval pearson_delta to 1e-6 absolute. NOT used for VC Level decisions. |
Under matched conventions the two evaluators agree to numerical precision (1e-6 on real Arc State predictions; locked by tests/unit/test_dim_a_evaluate.py::test_control_anchor_pred_reproduces_cell_eval_algorithm in the source repo). The asymmetry (real anchor is strictly more conservative than pred anchor under per-gene baseline drift) is locked by test_real_anchor_is_more_conservative_than_pred_anchor_under_baseline_drift.
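The practical difference between the two conventions is just which control vector the predicted delta is formed against. A toy sketch of the per-perturbation computation in plain numpy/scipy (illustrative only; evaluate_dim_a itself operates on AnnData objects, not bare arrays):

import numpy as np
from scipy.stats import pearsonr

def prr_single_pert(pert_pred, pert_real, ctrl_pred, ctrl_real, anchor="real"):
    """Per-perturbation Pearson R on delta-expression under either anchor convention."""
    real_delta = pert_real - ctrl_real      # always anchored on the observed control
    if anchor == "real":                    # canonical: no per-model free baseline
        pred_delta = pert_pred - ctrl_real
    else:                                   # "pred": cell-eval convention
        pred_delta = pert_pred - ctrl_pred
    r, _ = pearsonr(pred_delta, real_delta)
    return r

# Under per-gene baseline drift (ctrl_pred != ctrl_real) the real anchor keeps the
# drift inside pred_delta, while the pred anchor lets the model subtract it back out,
# which is why the real anchor is the more conservative of the two.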
Training recipe
| Field | Value |
|---|---|
| Base model | arc-state==0.10.2 (state model variant) |
| Dataset | Norman 2019 K562 (GSE133344, via GEARS API) |
| Train perturbations | 139 (per [fewshot."norman.A549"].train in the config TOML) |
| Test perturbations | 107 (matches GEARS simulation split, seed=1, used by scGPT and others) |
| Train/test overlap | 0 perturbations, 0 cells (verified by vcbench.models.arc_state.ArcState._verify_no_train_test_overlap) |
| Architecture | LLaMA bidirectional backbone, num_hidden_layers=8, hidden_dim=768, cell_set_len=512, n_attention_heads=12 |
| Total params | 110 M (86 M trainable) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁴ |
| Batch size | 8 |
| Max steps | 40,000 |
| Loss | energy distance (samples loss) |
| Random seed | 42 |
| Hardware | NVIDIA A40 (46 GB), CUDA 12.4 |
| Wall clock | 4h12m end-to-end (training only; predict + eval ~10 min on top) |
| Train loss | 2.94 → 0.027 (full convergence) |
| Val loss | oscillated 0.26–0.61, ended 0.402 (overfit signature consistent with a held-out split on a small training set) |
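To cross-check the recipe above against the config this repo actually shipped, dump training_config.yaml and compare the learning rate, batch size, max steps and seed by eye (the key names inside the file follow arc-state's own Hydra schema, which is not reproduced here):

import yaml
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("VibeCodingScientist/arc-state-norman-gears-corrected",
                           "training_config.yaml")
with open(cfg_path) as fh:
    cfg = yaml.safe_load(fh)
print(yaml.dump(cfg, sort_keys=False))  # compare against the table above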
Results
Evaluated on the 107 GEARS test perturbations using both cell-eval==0.x (Arc Institute's official evaluator) and vcbench.dimensions.dim_a_perturbation.evaluate_dim_a (VCBench's reimplementation):
| Evaluator / convention | Mean Pearson R on Δ-expression (PRR) | Direction score (top-20 DEG sign agreement) |
|---|---|---|
| vcbench.evaluate_dim_a (real anchor, CANONICAL) | 0.4021 | 0.7514 |
| vcbench.evaluate_dim_a (pred anchor) | 0.4076 | 0.7846 |
| cell-eval pearson_delta | 0.4076 | n/a |
vcbench (pred anchor) and cell-eval agree to 2e-6 absolute. Per-perturbation results are in eval_per_perturbation.csv; aggregate metrics are in eval_aggregate.csv.
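As a consistency check on the shipped CSVs, the mean of the per-perturbation pearson_delta column should reproduce the aggregate cell-eval number; the column name used below is an assumption, so inspect the header if it differs:

import pandas as pd
from huggingface_hub import hf_hub_download

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
per_pert = pd.read_csv(hf_hub_download(repo, "eval_per_perturbation.csv"))
print(per_pert.shape)                    # expect 107 rows
print(per_pert["pearson_delta"].mean())  # expect ~0.4076 (pred-anchor / cell-eval convention)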
VC Level
Under the VCBench pre-registration, Arc State scores VC Level 1 on Norman: it exceeds the no-change baseline (PRR 0.000) on Dim A but does not exceed the mean-prediction baseline (PRR 0.579). The VC Level decision is unchanged whether one uses 0.115 (the v0.1 buggy number) or 0.402 (the canonical v1.0.0 number): both are below the binding 0.579 threshold. Arc Institute's published 0.963 on the leaky config would have placed Arc State at Level 3+.
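The decision itself is a threshold comparison against the pre-registered baselines. A toy sketch of the Dim A logic as stated above; only the Level 0/1 boundary and the binding 0.579 threshold are taken from this card, and the higher levels are placeholders for the full rubric in configs/pre_registration.yaml:

def dim_a_vc_level(prr: float,
                   no_change_baseline: float = 0.000,
                   mean_prediction_baseline: float = 0.579) -> int:
    """Sketch only: Level 1 requires beating the no-change baseline; anything
    beating the mean-prediction baseline clears the binding threshold (the
    exact Level 2/3+ criteria live in the pre-registration, not here)."""
    if prr <= no_change_baseline:
        return 0
    if prr <= mean_prediction_baseline:
        return 1
    return 2  # placeholder for the higher levels

print(dim_a_vc_level(0.402))  # -> 1 (both 0.115 and 0.402 land here)
print(dim_a_vc_level(0.963))  # the leaky number clears the binding threshold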
Files
Public release (root)
| File | Size | Description |
|---|---|---|
| final.ckpt | 1.13 GB | Final model state at step 40,000 (the canonical artefact) |
| best.ckpt | 1.13 GB | Model state at lowest validation loss (step 27,999, val_loss 0.263) |
| training_config.yaml | 2.6 KB | Resolved Hydra config that arc-state v0.10.2 used at runtime |
| data_split_leak_corrected.toml | 4.2 KB | The leak-corrected GEARS-split TOML (the binding artefact) |
| eval_aggregate.csv | 3.6 KB | Aggregate cell-eval metrics across all 107 test perts (under the cell-eval / pred-anchor convention) |
| eval_per_perturbation.csv | 41 KB | Per-perturbation cell-eval metrics (107 rows × all metrics) |
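To fetch only the public-release artefacts (for example the canonical checkpoint and the binding split TOML) without pulling the multi-GB forensic folder, download them individually; a sketch using huggingface_hub:

from huggingface_hub import hf_hub_download

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
ckpt = hf_hub_download(repo, "final.ckpt")                       # canonical 40k-step checkpoint
split = hf_hub_download(repo, "data_split_leak_corrected.toml")  # binding GEARS-split TOML
# The checkpoint path can then be passed to the VCBench wrapper shown above:
#   ArcState().load_pretrained(ckpt_path=ckpt)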
Forensic artefacts (forensic_artifacts/)
Auditability companion to CHANGELOG v1.0.0 § Verified. Lets a reviewer reproduce the leak forensic numbers without running the 4h12m A40 retrain. Full layout + paste-able reproduction snippets in forensic_artifacts/README.md.
| File | Size | Description |
|---|---|---|
| forensic_artifacts/README.md | 3.0 KB | Reproduction instructions + layout |
| forensic_artifacts/leak_corrected/adata_pred.h5ad | 1.02 GB | State predict CLI output, leak-corrected ckpt |
| forensic_artifacts/leak_corrected/adata_real.h5ad | 1.02 GB | Matched real-cells output |
| forensic_artifacts/deprecated/final.ckpt | 1.13 GB | 40k-step checkpoint under the leaked norman_fewshot.toml |
| forensic_artifacts/deprecated/adata_pred.h5ad | 1.02 GB | State predict output (with --toml override to the leak-corrected split; without the override the deprecated test loader is empty and the predict CLI raises AttributeError) |
| forensic_artifacts/deprecated/adata_real.h5ad | 1.02 GB | Matched real-cells output |
| forensic_artifacts/deprecated/training_metrics.csv | 94 KB | The training-time leak signature: 8 columns, no val_loss, no val/decoder_loss. Lightning's val DataLoader is empty under the leaked TOML, so the validation step never fires. The leak is visible in training dynamics, not just at eval time (checked in the sketch below). |
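The training-time signature in the last row can be verified without any model code, since the deprecated run's metrics CSV simply lacks the validation columns:

import pandas as pd
from huggingface_hub import hf_hub_download

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
metrics = pd.read_csv(hf_hub_download(
    repo, "forensic_artifacts/deprecated/training_metrics.csv"))
print(len(metrics.columns), list(metrics.columns))  # expect 8 columns
assert "val_loss" not in metrics.columns
assert "val/decoder_loss" not in metrics.columns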
Provenance + reproducibility
- Source repo: https://github.com/VibeCodingScientist/VCBench (tag v1.0.0)
- Forensic test that proves the leak vector (no GPU; the static-config part runs in ~7 s without any data download; the full empirical part needs Norman on disk, ~7 min): tests/integration/test_arc_state_leak_forensic.py
- Cross-evaluator anchor reconciliation tests: tests/unit/test_dim_a_evaluate.py (5 tests covering pred-anchor cell-eval equivalence to 1e-9, real-anchor strict conservatism, uniform-shift invariance, fallback warning, invalid-value rejection)
- Pre-registration: configs/pre_registration.yaml
- Manuscript: VCBench (2026)
- CHANGELOG: see the v1.0.0 entries "Arc State Norman PRR: 0.115 → 0.402 (gene-vocabulary alignment fix)" and "Cross-evaluator anchor-convention reconciliation" for the full bug story + diff of the fix + asymmetric-role rationale.
Citation
@misc{vcbench-arc-state-norman-gears-corrected,
author = {{VCBench contributors}},
title = {Arc State Norman GEARS-split leak-corrected checkpoint},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Hub},
howpublished = {\url{https://huggingface.co/VibeCodingScientist/arc-state-norman-gears-corrected}},
note = {Companion artefact to VCBench v1.0.0 (github.com/VibeCodingScientist/VCBench, release tag v1.0.0)},
}
License
MIT β same as the upstream ArcInstitute/state codebase.
Notes for reviewers
This release exists because the published Arc State Norman PRR is not directly reproducible from the published norman_fewshot.toml without inheriting the train-test leak. Arc Institute was notified prior to preprint posting. We retain the deprecated configuration in the VCBench repo at configs/dim_a/arc_state_norman_fewshot_DEPRECATED.toml for auditability, behind a use_deprecated_fewshot=True opt-in flag in the vcbench.models.arc_state.ArcState wrapper.
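Reproducing the deprecated run through the wrapper therefore requires an explicit opt-in; a minimal sketch of the documented flag (the downstream behaviour beyond selecting the deprecated TOML is not specified here):

from vcbench.models import ArcState

arc = ArcState()                                        # default: leak-corrected GEARS split
arc_deprecated = ArcState(use_deprecated_fewshot=True)  # audit-only: leaked norman_fewshot split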
Three independent forensic signatures of the leak (in increasing order of evidential strength):
- Training-time signature. Under the deprecated TOML, Lightning's val DataLoader is empty (the cell-type filter matches 0 cells), so validation_step is never invoked. The training metrics CSV has 8 columns and no val_loss column at all, visible without ever running an evaluation. (forensic_artifacts/deprecated/training_metrics.csv is included for inspection; the leak-corrected run's metrics CSV has both val_loss and val/decoder_loss for comparison.)
- Inference-time signature. state tx predict structurally cannot run on the deprecated checkpoint: it raises AttributeError: 'list' object has no attribute 'batch_sampler' at state/_cli/_tx/_predict.py:263 because the test loader is an empty list, not a DataLoader. To produce a number at all we ran predict with --toml=arc_state_norman_gears_split.toml, which feeds the proper 107 GEARS test perts as the eval set; the model itself was trained on all 246 perts under the leak config.
- Test-time signature. PRR = 0.949 (real anchor) / 0.964 (cell-eval cross-validation) on the 107 perts the model trained on, vs 0.402 / 0.408 on the same 107 perts correctly held out: ΔPRR ≈ 0.55, roughly half the dynamic range of the metric (reproduced in the sketch below).
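The test-time signature can be reproduced from the shipped forensic AnnData alone, using the same evaluate_dim_a call as the headline numbers; a sketch (downloads roughly 4 GB of .h5ad files):

import anndata as ad
from huggingface_hub import snapshot_download
from vcbench.dimensions.dim_a_perturbation import evaluate_dim_a

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
root = snapshot_download(repo, allow_patterns=["forensic_artifacts/leak_corrected/*.h5ad",
                                               "forensic_artifacts/deprecated/*.h5ad"])

def prr(subdir):
    pred = ad.read_h5ad(f"{root}/forensic_artifacts/{subdir}/adata_pred.h5ad")
    real = ad.read_h5ad(f"{root}/forensic_artifacts/{subdir}/adata_real.h5ad")
    return evaluate_dim_a(pred, real, perturbation_col="condition",
                          control_label="ctrl").mean_pearson_r_delta

leaked, corrected = prr("deprecated"), prr("leak_corrected")
print(f"leaked {leaked:.3f} vs corrected {corrected:.3f}, delta {leaked - corrected:.2f}")
# expect roughly 0.949 vs 0.402 under the canonical real anchor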