# scgpt-replogle-random-ft
scGPT finetuned on Replogle K562 perturbations with a frozen table of random
per-gene priors (each row a Gaussian draw matched to the corresponding ESM2
row's mean, std, and L2 norm; seed 42). Architecturally identical to
matthewshu/scgpt-replogle-esm-ft, but the gene prior carries no
protein-specific information.
This is the control for the question: does ESM-specific protein knowledge drive scGPT's K562→RPE1 OOD gain, or does any consistent frozen gene-indexed vector field suffice?
## RPE1 (held-out cell line) results
| condition | mean pearson_delta |
|---|---|
| ESM (treatment, scgpt-replogle-esm-ft) | 0.508 |
| Random-prior retrain (this model) | 0.376 |
| base scGPT, no prior (scgpt-replogle-base-ft) | 0.183 |
| ESM with shuffled gene→protein at inference time | 0.085 |
Random-prior is +0.193 above base and −0.132 below ESM: ~59% of the ESM-vs-base OOD gap is captured by the architectural pathway alone (any consistent frozen gene-indexed vector field), and ~41% comes from ESM-specific protein information.
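The decomposition above is simple arithmetic over the table, and can be sanity-checked directly:

```python
# Mean pearson_delta on held-out RPE1, from the results table above.
esm, rand_prior, base = 0.508, 0.376, 0.183

gap = esm - base                        # total ESM-vs-base OOD gap: 0.325
arch_frac = (rand_prior - base) / gap   # captured by the architectural pathway alone
esm_frac = (esm - rand_prior) / gap     # attributable to ESM-specific protein info

print(f"architectural: {arch_frac:.0%}, ESM-specific: {esm_frac:.0%}")
# → architectural: 59%, ESM-specific: 41%
```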
## Training
- batch_size 192, max_seq_len 1200, lr 1e-4, num_epochs up to 30, early_stop=10 on val pearson_delta, seed 42
- Stopped at epoch 15 (best val pearson_delta 0.179 at epoch 5)
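The stop at epoch 15 is consistent with the stated patience rule: best val pearson_delta at epoch 5, then 10 epochs without improvement. A minimal sketch of that rule (function name and history format are hypothetical, not from the training code):

```python
def should_stop(val_history: list[float], patience: int = 10) -> bool:
    """Patience-based early stopping: stop once `patience` epochs have
    passed with no improvement over the best val pearson_delta so far."""
    if not val_history:
        return False
    best_epoch = max(range(len(val_history)), key=val_history.__getitem__)
    return (len(val_history) - 1) - best_epoch >= patience

# Best at epoch 5 (index 4), then 10 non-improving epochs → stop at epoch 15.
history = [0.05, 0.10, 0.14, 0.16, 0.179] + [0.15] * 10
```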
The random table was generated by tools/make_random_gene_prior.py --mode per_row: it preserves each gene's embedding magnitude (matching each row's mean and std from ESM2) while randomizing the per-dimension content.
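A minimal sketch of what `--mode per_row` does, assuming the ESM2 embeddings are a 2-D array with one row per gene (this is an illustrative reimplementation, not the actual script):

```python
import numpy as np

def make_random_prior(esm: np.ndarray, seed: int = 42) -> np.ndarray:
    """Replace each ESM2 row with an independent Gaussian draw matched to
    that row's mean and std. Matching both also matches the expected
    squared L2 norm, since E[||x||^2] = d * (mu^2 + sigma^2)."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(esm)
    for i, row in enumerate(esm):
        out[i] = rng.normal(loc=row.mean(), scale=row.std(), size=row.shape)
    return out
```

Each row keeps the per-gene statistics of its ESM2 counterpart while carrying no protein-specific per-dimension structure, which is exactly the property the control relies on.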
Note: args.json and vocab.json are scGPT's pretraining metadata
(included for parity with sibling repos); they do NOT reflect the
finetune hyperparameters above. See training_stats.json for actual
finetune behavior.
Source commit: github.com/mattshu0410/sc-interp @ feat/model-diffing