arc-state-norman-gears-corrected
Leak-corrected, fine-tuned Arc State checkpoint on the Norman 2019 K562 perturbation dataset, produced by VCBench v1.0.0 so that Arc State's perturbation-prediction performance can be reproduced independently under a clean train/test split. A forensic_artifacts/ subfolder ships the deprecated/leaked checkpoint and both runs' raw eval AnnData for full auditability.
What this is
This release supersedes Arc Institute's published Arc State Norman fine-tune for benchmark purposes. The published norman_fewshot.toml configuration in ArcInstitute/state contains a misconfigured cell-type filter ([zeroshot] "norman.double_perts" = "test") that matches zero cells in the Norman dataset (which has only cell_type == "A549"). With zero cells held out as test, all 107 nominally-held-out test perturbations remain in the training pool. The published Arc State Norman PRR of 0.963 is therefore the result of training-set memorisation, not genuine generalisation.
This checkpoint was fine-tuned using configs/dim_a/arc_state_norman_gears_split.toml from the VCBench repository at tag v1.0.0, which explicitly enumerates 139 training perturbations and 107 held-out test perturbations matching the GEARS simulation split (seed=1) used by every other foundation model evaluated in VCBench.
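Before running anything, the split discipline of the corrected TOML can be audited directly from the file. A minimal sketch, assuming the train list lives under [fewshot."norman.A549"].train as in the training-recipe table below, and that the 107 held-out perturbations sit under a sibling test key (the exact key layout is the repository's, so adjust if it differs):

import tomllib  # stdlib, Python 3.11+

with open("data_split_leak_corrected.toml", "rb") as fh:
    cfg = tomllib.load(fh)

train = set(cfg["fewshot"]["norman.A549"]["train"])  # 139 training perturbations
test = set(cfg["fewshot"]["norman.A549"]["test"])    # assumed key for the 107 held-out perturbations

assert len(train) == 139 and len(test) == 107
assert not train & test, "train/test perturbation overlap detected"
print("split OK: 139 train / 107 test, zero overlap")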
Headline metric: PRR = 0.402 on the 107 held-out Norman test perturbations (vcbench evaluate_dim_a, real-control anchor: the canonical convention for VC Level decisions). Cross-validated against upstream cell-eval pearson_delta = 0.408 (agreement to 2e-6 absolute under the matched anchor convention). See Cross-evaluator anchor reconciliation below.
Reproducing the headline numbers (paste-able, <5 min, CPU-only)
from huggingface_hub import snapshot_download
import anndata as ad
from vcbench.dimensions.dim_a_perturbation import evaluate_dim_a
repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
root = snapshot_download(repo, allow_patterns=["forensic_artifacts/leak_corrected/*"])
pred = ad.read_h5ad(f"{root}/forensic_artifacts/leak_corrected/adata_pred.h5ad")
real = ad.read_h5ad(f"{root}/forensic_artifacts/leak_corrected/adata_real.h5ad")
# Canonical real-anchor PRR (used for VC Level decisions): 0.402
res_canonical = evaluate_dim_a(pred, real, perturbation_col="condition", control_label="ctrl")
assert round(res_canonical.mean_pearson_r_delta, 4) == 0.4021
# Cell-eval cross-validation under matched anchor: 0.408 (matches cell-eval pearson_delta to 2e-6)
res_xval = evaluate_dim_a(pred, real, perturbation_col="condition", control_label="ctrl",
                          control_anchor="pred")
assert round(res_xval.mean_pearson_r_delta, 4) == 0.4076
End-to-end retrain via the VCBench wrapper (requires GPU, ~4-5h on A40):
from vcbench.models import ArcState
arc = ArcState() # defaults to leak-corrected config
arc.load_pretrained(ckpt_path="final.ckpt") # raises ArcStateLeakError if config has overlap
result = arc.run_dim_a()  # full pipeline -> DimAResult
print(f"PRR: {result.mean_pearson_r_delta:.4f}")  # -> 0.4021 (real anchor, canonical)
Cross-evaluator anchor reconciliation (vcbench vs cell-eval)
VCBench's evaluate_dim_a and Arc Institute's upstream cell-eval pearson_delta differ in one design choice: the control-anchor convention used to form the Δ-expression vectors fed into the per-perturbation Pearson R.
| Convention | Definition | Role |
|---|---|---|
| Real anchor (vcbench default, control_anchor="real") | pred_delta = pert_pred − ctrl_real and real_delta = pert_real − ctrl_real (both anchored on the observed real control) | CANONICAL: used for VC Level decisions. The right convention for cross-model benchmarking (no per-model free baseline) and leak forensics (the model cannot hide baseline memorisation in ctrl_pred). |
| Pred anchor (cell-eval, control_anchor="pred") | pred_delta = pert_pred − ctrl_pred (the model's own predicted control), real_delta = pert_real − ctrl_real | For cell-eval cross-validation only. Reproduces upstream cell-eval pearson_delta to 1e-6 absolute. NOT used for VC Level decisions. |
Under matched conventions the two evaluators agree to numerical precision (1e-6 on real Arc State predictions; locked by tests/unit/test_dim_a_evaluate.py::test_control_anchor_pred_reproduces_cell_eval_algorithm in the source repo). The asymmetry (real anchor is strictly more conservative than pred anchor under per-gene baseline drift) is locked by test_real_anchor_is_more_conservative_than_pred_anchor_under_baseline_drift.
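The practical difference between the two conventions is just which control vector the predicted delta is formed against. A toy sketch of the per-perturbation computation in plain numpy/scipy (illustrative only; evaluate_dim_a itself operates on AnnData objects, not bare arrays):

import numpy as np
from scipy.stats import pearsonr

def prr_single_pert(pert_pred, pert_real, ctrl_pred, ctrl_real, anchor="real"):
    """Per-perturbation Pearson R on delta-expression under either anchor convention."""
    real_delta = pert_real - ctrl_real      # always anchored on the observed control
    if anchor == "real":                    # canonical: no per-model free baseline
        pred_delta = pert_pred - ctrl_real
    else:                                   # "pred": cell-eval convention
        pred_delta = pert_pred - ctrl_pred
    r, _ = pearsonr(pred_delta, real_delta)
    return r

# Under per-gene baseline drift (ctrl_pred != ctrl_real) the real anchor keeps the
# drift inside pred_delta, while the pred anchor lets the model subtract it back out,
# which is why the real anchor is the more conservative of the two.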
Training recipe
| Field | Value |
|---|---|
| Base model | arc-state==0.10.2 (state model variant) |
| Dataset | Norman 2019 K562 (GSE133344, via GEARS API) |
| Train perturbations | 139 (per [fewshot."norman.A549"].train in the config TOML) |
| Test perturbations | 107 (matches GEARS simulation split, seed=1, used by scGPT and others) |
| Train/test overlap | 0 perturbations, 0 cells (verified by vcbench.models.arc_state.ArcState._verify_no_train_test_overlap) |
| Architecture | LLaMA bidirectional backbone, num_hidden_layers=8, hidden_dim=768, cell_set_len=512, n_attention_heads=12 |
| Total params | 110 M (86 M trainable) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁴ |
| Batch size | 8 |
| Max steps | 40,000 |
| Loss | energy distance (samples loss) |
| Random seed | 42 |
| Hardware | NVIDIA A40 (46 GB), CUDA 12.4 |
| Wall clock | 4h12m end-to-end (training only; predict + eval ~10 min on top) |
| Train loss | 2.94 → 0.027 (full convergence) |
| Val loss | oscillated 0.26–0.61, ended 0.402 (overfit signature consistent with a held-out split on a small training set) |
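To cross-check the recipe above against the config this repo actually shipped, dump training_config.yaml and compare the learning rate, batch size, max steps and seed by eye (the key names inside the file follow arc-state's own Hydra schema, which is not reproduced here):

import yaml
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("VibeCodingScientist/arc-state-norman-gears-corrected",
                           "training_config.yaml")
with open(cfg_path) as fh:
    cfg = yaml.safe_load(fh)
print(yaml.dump(cfg, sort_keys=False))  # compare against the table above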
Results
Evaluated on the 107 GEARS test perturbations using both cell-eval==0.x (Arc Institute's official evaluator) and vcbench.dimensions.dim_a_perturbation.evaluate_dim_a (VCBench's reimplementation):
| Evaluator / convention | Mean Pearson R on Δ-expression (PRR) | Direction score (top-20 DEG sign agreement) |
|---|---|---|
| vcbench.evaluate_dim_a (real anchor, CANONICAL) | 0.4021 | 0.7514 |
| vcbench.evaluate_dim_a (pred anchor) | 0.4076 | 0.7846 |
| cell-eval pearson_delta | 0.4076 | n/a |
vcbench (pred anchor) and cell-eval agree to 2e-6 absolute. Per-perturbation results are in eval_per_perturbation.csv; aggregate metrics are in eval_aggregate.csv.
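As a consistency check on the shipped CSVs, the mean of the per-perturbation pearson_delta column should reproduce the aggregate cell-eval number; the column name used below is an assumption, so inspect the header if it differs:

import pandas as pd
from huggingface_hub import hf_hub_download

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
per_pert = pd.read_csv(hf_hub_download(repo, "eval_per_perturbation.csv"))
print(per_pert.shape)                    # expect 107 rows
print(per_pert["pearson_delta"].mean())  # expect ~0.4076 (pred-anchor / cell-eval convention)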
VC Level
Under the VCBench pre-registration, Arc State scores VC Level 1 on Norman: it exceeds the no-change baseline (PRR 0.000) on Dim A but does not exceed the mean-prediction baseline (PRR 0.579). The VC Level decision is unchanged whether one uses 0.115 (the v0.1 buggy number) or 0.402 (the canonical v1.0.0 number): both are below the binding 0.579 threshold. Arc Institute's published 0.963 on the leaky config would have placed Arc State at Level 3+.
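The decision itself is a threshold comparison against the pre-registered baselines. A toy sketch of the Dim A logic as stated above; only the Level 0/1 boundary and the binding 0.579 threshold are taken from this card, and the higher levels are placeholders for the full rubric in configs/pre_registration.yaml:

def dim_a_vc_level(prr: float,
                   no_change_baseline: float = 0.000,
                   mean_prediction_baseline: float = 0.579) -> int:
    """Sketch only: Level 1 requires beating the no-change baseline; anything
    beating the mean-prediction baseline clears the binding threshold (the
    exact Level 2/3+ criteria live in the pre-registration, not here)."""
    if prr <= no_change_baseline:
        return 0
    if prr <= mean_prediction_baseline:
        return 1
    return 2  # placeholder for the higher levels

print(dim_a_vc_level(0.402))  # -> 1 (both 0.115 and 0.402 land here)
print(dim_a_vc_level(0.963))  # the leaky number clears the binding threshold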
Files
Public release (root)
| File | Size | Description |
|---|---|---|
| final.ckpt | 1.13 GB | Final model state at step 40,000 (the canonical artefact) |
| best.ckpt | 1.13 GB | Model state at lowest validation loss (step 27,999, val_loss 0.263) |
| training_config.yaml | 2.6 KB | Resolved Hydra config that arc-state v0.10.2 used at runtime |
| data_split_leak_corrected.toml | 4.2 KB | The leak-corrected GEARS-split TOML (the binding artefact) |
| eval_aggregate.csv | 3.6 KB | Aggregate cell-eval metrics across all 107 test perts (under the cell-eval / pred-anchor convention) |
| eval_per_perturbation.csv | 41 KB | Per-perturbation cell-eval metrics (107 rows × all metrics) |
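To fetch only the public-release artefacts (for example the canonical checkpoint and the binding split TOML) without pulling the multi-GB forensic folder, download them individually; a sketch using huggingface_hub:

from huggingface_hub import hf_hub_download

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
ckpt = hf_hub_download(repo, "final.ckpt")                       # canonical 40k-step checkpoint
split = hf_hub_download(repo, "data_split_leak_corrected.toml")  # binding GEARS-split TOML
# The checkpoint path can then be passed to the VCBench wrapper shown above:
#   ArcState().load_pretrained(ckpt_path=ckpt)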
Forensic artefacts (forensic_artifacts/)
Auditability companion to CHANGELOG v1.0.0 § Verified. Lets a reviewer reproduce the leak forensic numbers without running the 4h12m A40 retrain. Full layout + paste-able reproduction snippets in forensic_artifacts/README.md.
| File | Size | Description |
|---|---|---|
| forensic_artifacts/README.md | 3.0 KB | Reproduction instructions + layout |
| forensic_artifacts/leak_corrected/adata_pred.h5ad | 1.02 GB | State predict CLI output, leak-corrected ckpt |
| forensic_artifacts/leak_corrected/adata_real.h5ad | 1.02 GB | Matched real-cells output |
| forensic_artifacts/deprecated/final.ckpt | 1.13 GB | 40k-step checkpoint under the leaked norman_fewshot.toml |
| forensic_artifacts/deprecated/adata_pred.h5ad | 1.02 GB | State predict output (with --toml override to the leak-corrected split; without the override the deprecated test loader is empty and the predict CLI raises AttributeError) |
| forensic_artifacts/deprecated/adata_real.h5ad | 1.02 GB | Matched real-cells output |
| forensic_artifacts/deprecated/training_metrics.csv | 94 KB | The training-time leak signature: 8 columns, no val_loss, no val/decoder_loss. Lightning's val DataLoader is empty under the leaked TOML, so the validation step never fires. The leak is visible in training dynamics, not just at eval time (checked in the sketch below). |
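The training-time signature in the last row can be verified without any model code, since the deprecated run's metrics CSV simply lacks the validation columns:

import pandas as pd
from huggingface_hub import hf_hub_download

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
metrics = pd.read_csv(hf_hub_download(
    repo, "forensic_artifacts/deprecated/training_metrics.csv"))
print(len(metrics.columns), list(metrics.columns))  # expect 8 columns
assert "val_loss" not in metrics.columns
assert "val/decoder_loss" not in metrics.columns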
Provenance + reproducibility
- Source repo: https://github.com/VibeCodingScientist/VCBench (tag v1.0.0)
- Forensic test that proves the leak vector (no GPU; the static-config part runs in ~7 s without any data download; the full empirical part needs Norman on disk, ~7 min): tests/integration/test_arc_state_leak_forensic.py
- Cross-evaluator anchor reconciliation tests: tests/unit/test_dim_a_evaluate.py (5 tests covering pred-anchor cell-eval equivalence to 1e-9, real-anchor strict conservatism, uniform-shift invariance, fallback warning, invalid-value rejection)
- Pre-registration: configs/pre_registration.yaml
- Manuscript: VCBench (2026)
- CHANGELOG: see the v1.0.0 entries "Arc State Norman PRR: 0.115 → 0.402 (gene-vocabulary alignment fix)" and "Cross-evaluator anchor-convention reconciliation" for the full bug story + diff of the fix + asymmetric-role rationale.
Citation
@misc{vcbench-arc-state-norman-gears-corrected,
author = {{VCBench contributors}},
title = {Arc State Norman GEARS-split leak-corrected checkpoint},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Hub},
howpublished = {\url{https://huggingface.co/VibeCodingScientist/arc-state-norman-gears-corrected}},
note = {Companion artefact to VCBench v1.0.0 (github.com/VibeCodingScientist/VCBench, release tag v1.0.0)},
}
License
MIT β same as the upstream ArcInstitute/state codebase.
Notes for reviewers
This release exists because the published Arc State Norman PRR is not directly reproducible from the published norman_fewshot.toml without inheriting the train-test leak. Arc Institute was notified prior to preprint posting. We retain the deprecated configuration in the VCBench repo at configs/dim_a/arc_state_norman_fewshot_DEPRECATED.toml for auditability, behind a use_deprecated_fewshot=True opt-in flag in the vcbench.models.arc_state.ArcState wrapper.
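Reproducing the deprecated run through the wrapper therefore requires an explicit opt-in; a minimal sketch of the documented flag (the downstream behaviour beyond selecting the deprecated TOML is not specified here):

from vcbench.models import ArcState

arc = ArcState()                                        # default: leak-corrected GEARS split
arc_deprecated = ArcState(use_deprecated_fewshot=True)  # audit-only: leaked norman_fewshot split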
Three independent forensic signatures of the leak (in increasing order of evidential strength):
- Training-time signature. Under the deprecated TOML, Lightning's val DataLoader is empty (the cell-type filter matches 0 cells), so validation_step is never invoked. The training metrics CSV has 8 columns and no val_loss column at all, visible without ever running an evaluation. (forensic_artifacts/deprecated/training_metrics.csv is included for inspection; the leak-corrected run's metrics CSV has both val_loss and val/decoder_loss for comparison.)
- Inference-time signature. state tx predict structurally cannot run on the deprecated checkpoint: it raises AttributeError: 'list' object has no attribute 'batch_sampler' at state/_cli/_tx/_predict.py:263 because the test loader is an empty list, not a DataLoader. To produce a number at all we ran predict with --toml=arc_state_norman_gears_split.toml, which feeds the proper 107 GEARS test perts as the eval set; the model itself was trained on all 246 perts under the leak config.
- Test-time signature. PRR = 0.949 (real anchor) / 0.964 (cell-eval cross-validation) on the 107 perts the model trained on, vs 0.402 / 0.408 on the same 107 perts correctly held out: ΔPRR ≈ 0.55, roughly half the dynamic range of the metric (reproduced in the sketch below).
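The test-time signature can be reproduced from the shipped forensic AnnData alone, using the same evaluate_dim_a call as the headline numbers; a sketch (downloads roughly 4 GB of .h5ad files):

import anndata as ad
from huggingface_hub import snapshot_download
from vcbench.dimensions.dim_a_perturbation import evaluate_dim_a

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
root = snapshot_download(repo, allow_patterns=["forensic_artifacts/leak_corrected/*.h5ad",
                                               "forensic_artifacts/deprecated/*.h5ad"])

def prr(subdir):
    pred = ad.read_h5ad(f"{root}/forensic_artifacts/{subdir}/adata_pred.h5ad")
    real = ad.read_h5ad(f"{root}/forensic_artifacts/{subdir}/adata_real.h5ad")
    return evaluate_dim_a(pred, real, perturbation_col="condition",
                          control_label="ctrl").mean_pearson_r_delta

leaked, corrected = prr("deprecated"), prr("leak_corrected")
print(f"leaked {leaked:.3f} vs corrected {corrected:.3f}, delta {leaked - corrected:.2f}")
# expect roughly 0.949 vs 0.402 under the canonical real anchor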