# scgpt-replogle-random-ft
scGPT finetuned on Replogle K562 perturbations with a frozen table of random
per-gene priors (each row a Gaussian draw matched to the corresponding ESM2
row's mean, std, and L2 norm; seed 42). Architecturally identical to
matthewshu/scgpt-replogle-esm-ft, but the gene prior carries no
protein-specific information.
This is the control for the question: does ESM-specific protein knowledge drive scGPT's K562→RPE1 OOD gain, or does any consistent frozen gene-indexed vector field suffice?
## RPE1 (held-out cell line) results
| condition | mean pearson_delta |
|---|---|
| ESM (treatment, scgpt-replogle-esm-ft) | 0.508 |
| Random-prior retrain (this model) | 0.376 |
| base scGPT, no prior (scgpt-replogle-base-ft) | 0.183 |
| ESM with shuffled gene→protein at inference time | 0.085 |
Random-prior is +0.193 above base and −0.132 below ESM: ~59% of the ESM-vs-base OOD gap is captured by the architectural pathway alone (any consistent frozen gene-indexed vector field), and ~41% comes from ESM-specific protein information.
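The decomposition above is simple arithmetic over the table, and can be sanity-checked directly:

```python
# Mean pearson_delta on held-out RPE1, from the results table above.
esm, rand_prior, base = 0.508, 0.376, 0.183

gap = esm - base                        # total ESM-vs-base OOD gap: 0.325
arch_frac = (rand_prior - base) / gap   # captured by the architectural pathway alone
esm_frac = (esm - rand_prior) / gap     # attributable to ESM-specific protein info

print(f"architectural: {arch_frac:.0%}, ESM-specific: {esm_frac:.0%}")
# → architectural: 59%, ESM-specific: 41%
```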
## Training
- batch_size 192, max_seq_len 1200, lr 1e-4, num_epochs up to 30, early_stop=10 on val pearson_delta, seed 42
- Stopped at epoch 15 (best val pearson_delta 0.179 at epoch 5)
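The stop at epoch 15 is consistent with the stated patience rule: best val pearson_delta at epoch 5, then 10 epochs without improvement. A minimal sketch of that rule (function name and history format are hypothetical, not from the training code):

```python
def should_stop(val_history: list[float], patience: int = 10) -> bool:
    """Patience-based early stopping: stop once `patience` epochs have
    passed with no improvement over the best val pearson_delta so far."""
    if not val_history:
        return False
    best_epoch = max(range(len(val_history)), key=val_history.__getitem__)
    return (len(val_history) - 1) - best_epoch >= patience

# Best at epoch 5 (index 4), then 10 non-improving epochs → stop at epoch 15.
history = [0.05, 0.10, 0.14, 0.16, 0.179] + [0.15] * 10
```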
The random table was generated by tools/make_random_gene_prior.py --mode per_row: it preserves each gene's embedding magnitude (matching each row's mean and std from ESM2) while randomizing the per-dimension content.
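A minimal sketch of what `--mode per_row` does, assuming the ESM2 embeddings are a 2-D array with one row per gene (this is an illustrative reimplementation, not the actual script):

```python
import numpy as np

def make_random_prior(esm: np.ndarray, seed: int = 42) -> np.ndarray:
    """Replace each ESM2 row with an independent Gaussian draw matched to
    that row's mean and std. Matching both also matches the expected
    squared L2 norm, since E[||x||^2] = d * (mu^2 + sigma^2)."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(esm)
    for i, row in enumerate(esm):
        out[i] = rng.normal(loc=row.mean(), scale=row.std(), size=row.shape)
    return out
```

Each row keeps the per-gene statistics of its ESM2 counterpart while carrying no protein-specific per-dimension structure, which is exactly the property the control relies on.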
Note: args.json and vocab.json are scGPT's pretraining metadata
(included for parity with sibling repos); they do NOT reflect the
finetune hyperparameters above. See training_stats.json for actual
finetune behavior.
Source commit: github.com/mattshu0410/sc-interp @ feat/model-diffing