nano-scGPT

The simplest, fastest repository for scGPT inference, with minimal dependencies; finetuning and training are coming soon. It reimplements the original scGPT from scratch. nano_scgpt/model.py is pure PyTorch in ~270 lines of code, and nano_scgpt/scGPT_tokenizer.py turns raw scRNA data into model input. The whole thing is small enough to read in one sitting and hack on.


Embeddings are numerically equivalent to the original scGPT.
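One way to check a claim like this yourself is to compare the two embedding matrices element-wise. The sketch below is only an illustration: `compare_embeddings` is a hypothetical helper (not part of this repo), the tolerances are assumed values you would tune to your precision settings, and the arrays are random stand-ins for real model outputs.

```python
import numpy as np


def compare_embeddings(ours, reference, rtol=1e-4, atol=1e-5):
    """Summarize how closely two [N, D] embedding matrices agree.

    rtol/atol are illustrative defaults, not values taken from nano-scGPT.
    """
    ours = np.asarray(ours, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if ours.shape != reference.shape:
        raise ValueError(f"shape mismatch: {ours.shape} vs {reference.shape}")
    abs_diff = np.abs(ours - reference)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "allclose": bool(np.allclose(ours, reference, rtol=rtol, atol=atol)),
    }


# Stand-in data: random "reference" embeddings plus tiny floating-point noise.
rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 8))
ours = reference + 1e-7 * rng.normal(size=reference.shape)

report = compare_embeddings(ours, reference)
print(report)
```

In practice you would pass the embeddings from both implementations for the same cells, and loosen the tolerances when running under mixed precision.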

Why nano-scGPT

Cell modeling is potentially the most exciting and under-indexed AI/ML area. The hope is to make state-of-the-art cell models more accessible to run, understand, and tinker with.

Runtime

nano-scGPT produces embeddings numerically equivalent to the original scGPT while running 1.38× faster, mostly thanks to a cleaner forward pass plus torch.compile.

Model        Total (65,847 cells)   Per cell   Throughput    Speedup
nano-scGPT   77.7 s                 1.18 ms    847 cells/s   1.38×
scGPT        107.0 s                1.63 ms    615 cells/s   1.00×

Benchmark setup: single A100, batch size 256, AMP autocast, 10 warmup batches, Tabula Sapiens lung (65,847 cells).
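For readers who want to reproduce this kind of measurement, here is a minimal timing sketch with warmup. It is not the benchmark script used for the numbers above: `benchmark`, `speedup`, and `fake_model` are hypothetical names, and the warmup scheme (run the first few batches untimed, then time a full pass) is an assumption about how such a benchmark is usually set up.

```python
import time


def benchmark(run_batch, batches, warmup=10):
    """Time run_batch over all batches after `warmup` untimed warmup calls."""
    batches = list(batches)
    if not batches:
        raise ValueError("need at least one batch")

    # Warmup: results are discarded and not timed. This gives things like
    # compilation and caches a chance to settle before measuring.
    for batch in batches[:warmup]:
        run_batch(batch)

    n_items = 0
    start = time.perf_counter()
    for batch in batches:
        run_batch(batch)
        n_items += len(batch)
    # On a GPU, synchronize the device before reading the clock
    # (e.g. torch.cuda.synchronize()), since kernels launch asynchronously.
    elapsed = max(time.perf_counter() - start, 1e-12)

    return {
        "n_items": n_items,
        "total_s": elapsed,
        "per_item_ms": elapsed / n_items * 1000.0,
        "throughput": n_items / elapsed,
    }


def speedup(baseline_total_s, candidate_total_s):
    """Speedup of the candidate relative to the baseline (higher is faster)."""
    return baseline_total_s / candidate_total_s


# Stand-in "model": just sums each batch and records that it was called.
calls = []


def fake_model(batch):
    calls.append(len(batch))
    return sum(batch)


batches = [[1.0] * 256 for _ in range(20)]
stats = benchmark(fake_model, batches, warmup=10)
print(stats["n_items"], round(speedup(107.0, 77.7), 2))
```

The last line recomputes the speedup from the rounded totals reported above. Per-cell and throughput figures recomputed from those rounded totals can differ from the reported ones in the last digit.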

Install

git clone https://github.com/Danqi7/nano-scGPT.git
cd nano-scGPT

Option A: using uv

uv sync
source .venv/bin/activate

Option B: using pip

python -m venv .venv && source .venv/bin/activate
pip install -e .

Quick Start

# scGPT Embedding example
import numpy as np
from nano_scgpt.scGPT_tokenizer import scGPTTokenizer
from nano_scgpt.model import scGPTModel

model = scGPTModel.from_pretrained("scGPT_human") # weights pulled from HF hub
model.eval()

tokenizer = scGPTTokenizer.from_pretrained("scGPT_human")
genes = ['DUX4L30', 'CTB-52I2.4', 'USP17L16P', 'RPL7P23']
exprs = np.array([[1.0, 0.0, 23.0, 6.0], [0.0, 0.0, 3.0, 7.0]])
encoded = tokenizer.encode(exprs, genes)
embeddings = model.encode(encoded["gene_ids"], encoded["exprs"], encoded["padding_mask"]) # [N, D_embd]
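For expression matrices with many cells, it is common to embed in fixed-size batches and concatenate the results rather than encoding everything in one call. The sketch below shows that pattern only: `embed_in_batches` and `fake_embed` are hypothetical names, not part of nano-scGPT, and `embed_fn` stands in for a function that would wrap the `tokenizer.encode` and `model.encode` calls shown above.

```python
import numpy as np


def embed_in_batches(exprs, embed_fn, batch_size=256):
    """Apply embed_fn to row-wise chunks of a [N, G] matrix and stack results."""
    exprs = np.asarray(exprs)
    if exprs.shape[0] == 0:
        raise ValueError("exprs has no cells")
    chunks = []
    for start in range(0, exprs.shape[0], batch_size):
        chunk = exprs[start:start + batch_size]  # last chunk may be smaller
        chunks.append(np.asarray(embed_fn(chunk)))
    return np.concatenate(chunks, axis=0)


def fake_embed(chunk):
    # Stand-in for a real model call: maps [n, G] -> [n, 2].
    return np.stack([chunk.sum(axis=1), chunk.mean(axis=1)], axis=1)


exprs = np.arange(40, dtype=float).reshape(10, 4)  # 10 cells, 4 genes
emb = embed_in_batches(exprs, fake_embed, batch_size=3)
print(emb.shape)
```

Because each cell is embedded independently in this sketch, the batched result matches a single call over the full matrix; the batch size only trades memory for throughput.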

Task: Embed .h5ad scRNA data

# Example: Tabula Sapiens lung data (downloaded automatically)
python tasks/embedding.py

# Or on your own local file
python tasks/embedding.py \
    --input <path to local .h5ad file> \
    --output <path to save embeddings>

# Or from a remote URL
python tasks/embedding.py \
    --input_url <URL to a remote .h5ad file> \
    --output <path to save embeddings>

TODO

  • Finetuning for perturbation response prediction
  • Training from scratch

Let me know what tasks or even models you'd like to see next!

Acknowledgments

  1. This repository reimplements scGPT from scratch. All credit for the original model and method goes to the authors (Cui et al., Nature Methods, 2024). See the original repo and paper.
  2. nano-scGPT is inspired by Andrej Karpathy's nanoGPT and Chris Hayduk's minAlphaFold2.

License

MIT. See LICENSE

Model size: 50.3M params (Safetensors, tensor type F32)