arc-state-norman-gears-corrected

A leak-corrected Arc State checkpoint fine-tuned on the Norman 2019 K562 perturbation dataset, produced by VCBench v1.0.0 so that Arc State's perturbation-prediction performance can be reproduced independently under a clean train/test split. A forensic_artifacts/ subfolder ships the deprecated (leaked) checkpoint and the raw eval AnnData from both runs for full auditability.

What this is

This release supersedes Arc Institute's published Arc State Norman fine-tune for benchmark purposes. The published norman_fewshot.toml configuration in ArcInstitute/state contains a misconfigured cell-type filter ([zeroshot] "norman.double_perts" = "test") that matches zero cells in the Norman dataset (which has only cell_type == "A549"). With zero cells held out as test, all 107 nominally-held-out test perturbations remain in the training pool. The published Arc State Norman PRR of 0.963 is therefore the result of training-set memorisation, not genuine generalisation.
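The failure mode is easy to demonstrate. The following is a minimal sketch with made-up cells (not the real Norman AnnData): when every cell carries cell_type == "A549", a filter keyed on the string "norman.double_perts" selects nothing, so the test split is empty and every perturbation stays in the training pool.

```python
import pandas as pd

# Illustrative obs table: every Norman cell has cell_type == "A549";
# perturbation names here are placeholders, not the real Norman labels.
obs = pd.DataFrame({
    "cell_type": ["A549"] * 6,
    "condition": ["ctrl", "ctrl", "g1+g2", "g1+g2", "g3+ctrl", "g3+ctrl"],
})

# The misconfigured filter: a value that never appears in cell_type
test_mask = obs["cell_type"] == "norman.double_perts"
print(int(test_mask.sum()))        # 0 cells held out as test

train = obs[~test_mask]
print(len(train) == len(obs))      # True: nothing was removed from training
```

Any split definition that keys on a value absent from the column silently degenerates this way, which is why the corrected config enumerates perturbations explicitly instead.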

This checkpoint was fine-tuned using configs/dim_a/arc_state_norman_gears_split.toml from the VCBench repository at tag v1.0.0, which explicitly enumerates 139 training perturbations and 107 held-out test perturbations matching the GEARS simulation split (seed=1) used by every other foundation model evaluated in VCBench.

Headline metric: PRR = 0.402 on the 107 held-out Norman test perturbations (vcbench evaluate_dim_a, real-control anchor: the canonical convention for VC Level decisions). Cross-validated against upstream cell-eval pearson_delta = 0.408, which agrees to within 2e-6 absolute under the matched anchor convention. See Cross-evaluator anchor reconciliation below.

Reproducing the headline numbers (paste-able, <5 min, CPU-only)

```python
from huggingface_hub import snapshot_download
import anndata as ad
from vcbench.dimensions.dim_a_perturbation import evaluate_dim_a

repo = "VibeCodingScientist/arc-state-norman-gears-corrected"
root = snapshot_download(repo, allow_patterns=["forensic_artifacts/leak_corrected/*"])

pred = ad.read_h5ad(f"{root}/forensic_artifacts/leak_corrected/adata_pred.h5ad")
real = ad.read_h5ad(f"{root}/forensic_artifacts/leak_corrected/adata_real.h5ad")

# Canonical real-anchor PRR (used for VC Level decisions): 0.402
res_canonical = evaluate_dim_a(pred, real, perturbation_col="condition", control_label="ctrl")
assert round(res_canonical.mean_pearson_r_delta, 4) == 0.4021

# Cell-eval cross-validation under matched anchor: 0.408 (matches cell-eval pearson_delta to 2e-6)
res_xval = evaluate_dim_a(pred, real, perturbation_col="condition", control_label="ctrl",
                          control_anchor="pred")
assert round(res_xval.mean_pearson_r_delta, 4) == 0.4076
```

End-to-end retrain via the VCBench wrapper (requires a GPU, ~4–5 h on an A40):

```python
from vcbench.models import ArcState

arc = ArcState()                                    # defaults to the leak-corrected config
arc.load_pretrained(ckpt_path="final.ckpt")         # raises ArcStateLeakError if the config has overlap
result = arc.run_dim_a()                            # full pipeline → DimAResult
print(f"PRR: {result.mean_pearson_r_delta:.4f}")    # ≈ 0.4021 (real anchor, canonical)
```

Cross-evaluator anchor reconciliation (vcbench ↔ cell-eval)

VCBench's evaluate_dim_a and Arc Institute's upstream cell-eval pearson_delta differ in one design choice: the control-anchor convention used to form the Δ-expression vectors fed into per-perturbation Pearson R.

| Convention | Definition | Role |
| --- | --- | --- |
| Real anchor (vcbench default, control_anchor="real") | pred_delta = pert_pred − ctrl_real and real_delta = pert_real − ctrl_real (both anchored on the observed real control) | CANONICAL: used for VC Level decisions. The right convention for cross-model benchmarking (no per-model free baseline) and leak forensics (a model can't hide baseline memorisation in ctrl_pred). |
| Pred anchor (cell-eval, control_anchor="pred") | pred_delta = pert_pred − ctrl_pred (the model's own predicted control), real_delta = pert_real − ctrl_real | For cell-eval cross-validation only. Reproduces upstream cell-eval pearson_delta to 1e-6 absolute. NOT used for VC Level decisions. |

Under matched conventions the two evaluators agree to numerical precision (1e-6 on real Arc State predictions; locked by tests/unit/test_dim_a_evaluate.py::test_control_anchor_pred_reproduces_cell_eval_algorithm in the source repo). The asymmetry (the real anchor is strictly more conservative than the pred anchor under per-gene baseline drift) is locked by test_real_anchor_is_more_conservative_than_pred_anchor_under_baseline_drift.
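The asymmetry can be seen on synthetic data. The following is an illustrative NumPy sketch of the two anchor conventions (not the vcbench implementation itself): a hypothetical model whose predicted control carries a memorised per-gene baseline drift scores better under the pred anchor, because subtracting its own ctrl_pred cancels the drift, while the real anchor leaves the drift in the predicted delta.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 200

ctrl_real = rng.normal(0.0, 1.0, n_genes)     # observed control expression
true_delta = rng.normal(0.0, 1.0, n_genes)    # true perturbation effect
pert_real = ctrl_real + true_delta

# Hypothetical model: weak effect estimate plus a memorised per-gene baseline drift
drift = rng.normal(0.0, 2.0, n_genes)
ctrl_pred = ctrl_real + drift
pert_pred = ctrl_pred + 0.5 * true_delta + rng.normal(0.0, 0.5, n_genes)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

real_delta = pert_real - ctrl_real
r_real_anchor = pearson(pert_pred - ctrl_real, real_delta)  # drift stays in pred_delta
r_pred_anchor = pearson(pert_pred - ctrl_pred, real_delta)  # drift cancels out

print(r_real_anchor < r_pred_anchor)  # True: real anchor is the stricter score
```

This is the intuition behind the conservatism test named above: a model cannot improve its real-anchor score by memorising the control baseline.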

Training recipe

| Field | Value |
| --- | --- |
| Base model | arc-state==0.10.2 (state model variant) |
| Dataset | Norman 2019 K562 (GSE133344, via the GEARS API) |
| Train perturbations | 139 (per [fewshot."norman.A549"].train in the config TOML) |
| Test perturbations | 107 (matches the GEARS simulation split, seed=1, used by scGPT and others) |
| Train/test overlap | 0 perturbations, 0 cells (verified by vcbench.models.arc_state.ArcState._verify_no_train_test_overlap) |
| Architecture | LLaMA bidirectional backbone, num_hidden_layers=8, hidden_dim=768, cell_set_len=512, n_attention_heads=12 |
| Total params | 110 M (86 M trainable) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁴ |
| Batch size | 8 |
| Max steps | 40,000 |
| Loss | energy distance (samples loss) |
| Random seed | 42 |
| Hardware | NVIDIA A40 (46 GB), CUDA 12.4 |
| Wall clock | 4 h 12 m end-to-end (training only; predict + eval add ~10 min) |
| Train loss | 2.94 → 0.027 (full convergence) |
| Val loss | oscillated 0.26–0.61, ended at 0.402 (overfit signature consistent with a held-out split on a small training set) |
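The zero-overlap row is the load-bearing invariant. The following is an illustrative sketch of that check (not the actual ArcState._verify_no_train_test_overlap implementation): the two perturbation sets must be disjoint and must have the expected GEARS-split sizes.

```python
# Illustrative disjointness check; perturbation names below are made up.
def verify_no_overlap(train_perts, test_perts,
                      n_train_expected=139, n_test_expected=107):
    train, test = set(train_perts), set(test_perts)
    overlap = train & test
    if overlap:
        # Any shared perturbation is a leak: refuse to proceed.
        raise ValueError(f"train/test leak: {sorted(overlap)[:5]} ...")
    assert len(train) == n_train_expected, "train split size drifted"
    assert len(test) == n_test_expected, "test split size drifted"
    return True

# Toy usage: disjoint sets of the right sizes pass
verify_no_overlap([f"g{i}+ctrl" for i in range(139)],
                  [f"h{i}+ctrl" for i in range(107)])
```

Under the deprecated TOML this kind of check fails immediately, since all 107 test perturbations also appear in the training pool.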

Results

Evaluated on the 107 GEARS test perturbations using both cell-eval==0.x (Arc Institute's official evaluator) and vcbench.dimensions.dim_a_perturbation.evaluate_dim_a (VCBench's reimplementation):

| Evaluator / convention | Mean Pearson R on Δ-expression (PRR) | Direction score (top-20 DEG sign agreement) |
| --- | --- | --- |
| vcbench.evaluate_dim_a (real anchor, CANONICAL) | 0.4021 | 0.7514 |
| vcbench.evaluate_dim_a (pred anchor) | 0.4076 | 0.7846 |
| cell-eval pearson_delta | 0.4076 | n/a |

The vcbench pred-anchor run and cell-eval agree to within 2e-6 absolute. Per-perturbation results are in eval_per_perturbation.csv; aggregate metrics are in eval_aggregate.csv.

VC Level

Under the VCBench pre-registration, Arc State scores VC Level 1 on Norman: it exceeds the no-change baseline (PRR 0.000) on Dim A but does not exceed the mean-prediction baseline (PRR 0.579). The VC Level decision is unchanged whether one uses 0.115 (the v0.1 buggy number) or 0.402 (the canonical v1.0.0 number): both are below the binding 0.579 threshold. Arc Institute's published 0.963 on the leaky config would have placed Arc State at Level 3+.
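The binding comparison reduces to two threshold checks. The following is a simplified sketch using only the two baselines named above (the full Level semantics live in the pre-registration; this function and its name are illustrative, not the VCBench API):

```python
# Simplified sketch of the binding Dim A baseline comparison.
# Thresholds are taken from the text above: no-change baseline PRR 0.000,
# mean-prediction baseline PRR 0.579.
def exceeds_baselines(prr, no_change=0.000, mean_prediction=0.579):
    return {"no_change": prr > no_change,
            "mean_prediction": prr > mean_prediction}

print(exceeds_baselines(0.402))  # clears no-change only -> Level 1
print(exceeds_baselines(0.115))  # same decision: still below 0.579
print(exceeds_baselines(0.963))  # the leaky number clears both baselines
```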

Files

Public release (root)

| File | Size | Description |
| --- | --- | --- |
| final.ckpt | 1.13 GB | Final model state at step 40,000 (the canonical artefact) |
| best.ckpt | 1.13 GB | Model state at the lowest validation loss (step 27,999, val_loss 0.263) |
| training_config.yaml | 2.6 KB | Resolved Hydra config that arc-state v0.10.2 used at runtime |
| data_split_leak_corrected.toml | 4.2 KB | The leak-corrected GEARS-split TOML (the binding artefact) |
| eval_aggregate.csv | 3.6 KB | Aggregate cell-eval metrics across all 107 test perturbations (cell-eval / pred-anchor convention) |
| eval_per_perturbation.csv | 41 KB | Per-perturbation cell-eval metrics (107 rows × all metrics) |

Forensic artefacts (forensic_artifacts/)

Auditability companion to CHANGELOG v1.0.0 § Verified. Lets a reviewer reproduce the leak-forensic numbers without running the 4 h 12 m A40 retrain. Full layout and paste-able reproduction snippets are in forensic_artifacts/README.md.

| File | Size | Description |
| --- | --- | --- |
| forensic_artifacts/README.md | 3.0 KB | Reproduction instructions + layout |
| forensic_artifacts/leak_corrected/adata_pred.h5ad | 1.02 GB | State predict CLI output, leak-corrected checkpoint |
| forensic_artifacts/leak_corrected/adata_real.h5ad | 1.02 GB | Matched real-cells output |
| forensic_artifacts/deprecated/final.ckpt | 1.13 GB | 40k-step checkpoint trained under the leaked norman_fewshot.toml |
| forensic_artifacts/deprecated/adata_pred.h5ad | 1.02 GB | State predict output (with a --toml override to the leak-corrected split; without the override the deprecated test loader is empty and the predict CLI raises AttributeError) |
| forensic_artifacts/deprecated/adata_real.h5ad | 1.02 GB | Matched real-cells output |
| forensic_artifacts/deprecated/training_metrics.csv | 94 KB | The training-time leak signature: 8 columns, no val_loss, no val/decoder_loss. Lightning's val DataLoader is empty under the leaked TOML, so the validation step never fires. The leak is visible in training dynamics, not just at eval time. |

Provenance + reproducibility

  • Source repo: https://github.com/VibeCodingScientist/VCBench (tag v1.0.0)
  • Forensic test that proves the leak vector (no GPU; static-config part runs in ~7s without any data download; full empirical part needs Norman on disk, ~7 min): tests/integration/test_arc_state_leak_forensic.py
  • Cross-evaluator anchor reconciliation tests: tests/unit/test_dim_a_evaluate.py (5 tests covering pred-anchor cell-eval equivalence to 1e-9, real-anchor strict-conservatism, uniform-shift invariance, fallback warning, invalid-value rejection)
  • Pre-registration: configs/pre_registration.yaml
  • Manuscript: VCBench (2026)
  • CHANGELOG: see the v1.0.0 entries Arc State Norman PRR: 0.115 → 0.402 (gene-vocabulary alignment fix) and Cross-evaluator anchor-convention reconciliation for the full bug story, the diff of the fix, and the asymmetric-role rationale.

Citation

@misc{vcbench-arc-state-norman-gears-corrected,
  author       = {{VCBench contributors}},
  title        = {Arc State Norman GEARS-split leak-corrected checkpoint},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/VibeCodingScientist/arc-state-norman-gears-corrected}},
  note         = {Companion artefact to VCBench v1.0.0 (github.com/VibeCodingScientist/VCBench, release tag v1.0.0)},
}

License

MIT, same as the upstream ArcInstitute/state codebase.

Notes for reviewers

This release exists because the published Arc State Norman PRR is not directly reproducible from the published norman_fewshot.toml without inheriting the train-test leak. Arc Institute was notified prior to preprint posting. We retain the deprecated configuration in the VCBench repo at configs/dim_a/arc_state_norman_fewshot_DEPRECATED.toml for auditability, behind a use_deprecated_fewshot=True opt-in flag in the vcbench.models.arc_state.ArcState wrapper.

Three independent forensic signatures of the leak (in increasing order of evidential strength):

  1. Training-time signature. Under the deprecated TOML, Lightning's val DataLoader is empty (the cell-type filter matches 0 cells), so validation_step is never invoked. The training-metrics CSV has 8 columns and no val_loss column at all, visible without ever running an evaluation. (See forensic_artifacts/deprecated/training_metrics.csv; the leak-corrected run's metrics CSV has both val_loss and val/decoder_loss for comparison.)
  2. Inference-time signature. state tx predict structurally cannot run on the deprecated checkpoint: it raises AttributeError: 'list' object has no attribute 'batch_sampler' at state/_cli/_tx/_predict.py:263, because the test loader is an empty list, not a DataLoader. To produce a number at all we ran predict with --toml=arc_state_norman_gears_split.toml, which feeds the proper 107 GEARS test perturbations as the eval set; the model itself was trained on all 246 perturbations under the leak config.
  3. Test-time signature. PRR = 0.949 (real anchor) / 0.964 (cell-eval cross-validation) on the 107 perturbations the model trained on, vs 0.402 / 0.408 on the same 107 perturbations correctly held out: ΔPRR ≈ 0.55, roughly half the dynamic range of the metric.
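The first signature is checkable with a few lines of stdlib Python. The following is a sketch of that header inspection; the column names follow the description above, and the CSV content here is synthetic:

```python
import csv
import io

def has_validation_signal(csv_text):
    """Return True if a Lightning metrics CSV header contains any validation column."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return any(col in header for col in ("val_loss", "val/decoder_loss"))

# Synthetic stand-ins for the two metrics files described above
leaked = "step,train_loss\n0,2.94\n40000,0.027\n"
corrected = "step,train_loss,val_loss,val/decoder_loss\n0,2.94,0.61,0.60\n"

print(has_validation_signal(leaked))     # False: the leak signature
print(has_validation_signal(corrected))  # True: validation actually ran
```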