BioTitan: Neural Long-Term Memory for Genomic Foundation Modeling

First application of the TITANS architecture to single-cell genomics, enabling test-time adaptive gene embeddings.

BioTitan applies TITANS (Behrouz et al., Google Research, NeurIPS 2025) to single-cell transcriptomics. Unlike existing genomic foundation models, whose gene representations are fixed after training, BioTitan's neural memory updates its weights during inference: gene embeddings improve as the model processes more cells, without any retraining.

Headline Result

Test-time memory adaptation closes 54% of the gap to Geneformer V1, without any retraining.

BioTitan Static:    0.636 avg AUC (53 tasks)
BioTitan CTX 254K:  0.716 avg AUC  ← +12.6% relative improvement, zero retraining
Geneformer V1:      0.782 avg AUC  (trained on 120× more data)

On the 23 Expression tasks, the family where single-cell models are expected to excel, BioTitan CTX reaches 0.815, outperforming Gene2vec (0.773) and approaching Geneformer (0.869) despite being trained on 120× less data.

Contextualization saturates at ~60K cells (+0.002 from 60K→254K), indicating that clinically-relevant sample sizes are sufficient for effective memory adaptation.

IBM Gene Benchmark (53 Tasks, 5 Families)

All results verified on the same machine using BiomedSciAI/gene-benchmark. Geneformer and Gene2vec baselines reproduced locally. Published baselines from the IBM benchmark paper (Kan-Tor et al., 2024).

Task Family Averages

| Family | Geneformer V1 | Gene2vec | BioTitan Static | BioTitan CTX | Tasks |
| --- | --- | --- | --- | --- | --- |
| Expression | 0.869 | 0.773 | 0.732 | 0.815 | 23 |
| Genomic Properties | 0.782 | 0.725 | 0.640 | 0.687 | 7 |
| Regulatory Functions | 0.759 | 0.769 | 0.623 | 0.704 | 4 |
| Localization | 0.725 | 0.668 | 0.616 | 0.699 | 2 |
| Protein Properties | 0.678 | 0.641 | 0.571 | 0.598 | 17 |
| Overall | 0.782 | 0.715 | 0.636 | 0.716 | 53 |

Comparison with All Published Baselines

Family averages from the IBM benchmark paper's Figure 2 heatmap; BioTitan run locally.

Expression / Localization (23 tasks), BioTitan's strongest family:

| Model | Type | Avg AUC |
| --- | --- | --- |
| Geneformer | RNA-seq (30M cells) | 0.869 |
| cellPLM | RNA-seq (11M cells) | ~0.85 |
| ScGPT-H | RNA-seq (33M cells) | ~0.84 |
| Gene2vec | Bulk co-expression | ~0.82 |
| BioTitan CTX | RNA-seq (254K cells) | 0.815 |
| ScGPT-B | RNA-seq (10.3M blood) | ~0.75 |
| ESM-1 / ESM-2 | Protein sequence | ~0.74–0.75 |
| MPNet / DNABert-2 | Text / DNA | ~0.72 |
| MTEB-S / MTEB-L | Text | ~0.67–0.71 |
| Bag of Words | Text | ~0.69 |

BioTitan CTX outperforms all text, protein, and DNA models on expression tasks, as well as every RNA-seq model trained on less diverse tissue data.

Genomic Properties (7 tasks):

| Model | Type | Avg AUC |
| --- | --- | --- |
| ESM-2 | Protein sequence | 0.84 |
| MTEB-L / Bag of Words | Text | 0.81 |
| ScGPT-H / MPNet | Mixed | 0.80 |
| Geneformer | RNA-seq (30M cells) | 0.79 |
| DNABert-2 | DNA sequence | 0.79 |
| cellPLM | RNA-seq (11M cells) | 0.76 |
| Gene2vec | Bulk co-expression | 0.73 |
| BioTitan CTX | RNA-seq (254K cells) | 0.687 |
| ScGPT-B | RNA-seq (10.3M blood) | 0.67 |

Regulatory Functions (4 tasks):

| Model | Type | Avg AUC |
| --- | --- | --- |
| MTEB-S | Text (335M) | 0.81 |
| ESM-1 / ESM-2 | Protein sequence | 0.79 |
| ScGPT-H | RNA-seq (33M cells) | 0.77 |
| cellPLM | RNA-seq (11M cells) | 0.75 |
| Geneformer / Bag of Words | Mixed | 0.74 |
| Gene2vec | Bulk co-expression | 0.73 |
| BioTitan CTX | RNA-seq (254K cells) | 0.704 |
| ScGPT-B | RNA-seq (10.3M blood) | 0.68 |
| DNABert-2 | DNA sequence | 0.66 |

Selected Binary Tasks (detail)

11 of 53 tasks. Overall averages in the family table above are computed across all 53 tasks (including 42 categorical tasks not shown here).

| Task | Geneformer V1 | Gene2vec | BioTitan Static | BioTitan CTX |
| --- | --- | --- | --- | --- |
| Dosage sensitive TFs | 0.919 | 0.878 | 0.723 | 0.891 |
| Bivalent vs lys4-only | 0.925 | 0.894 | 0.797 | 0.889 |
| Bivalent vs non-methylated | 0.827 | 0.688 | 0.616 | 0.676 |
| CCD Transcript | 0.797 | 0.744 | 0.638 | 0.647 |
| N1 network | 0.805 | 0.796 | 0.733 | 0.719 |
| HLA class I vs II | 0.745 | 0.925 | 0.445 | 0.730 |
| Gene2Gene | 0.730 | 0.695 | 0.643 | 0.702 |
| TF vs non-TF | 0.749 | 0.719 | 0.630 | 0.698 |
| N1 targets | 0.736 | 0.635 | 0.684 | 0.668 |
| Long vs short range TF | 0.726 | 0.614 | 0.520 | 0.459 |
| CCD Protein | 0.552 | 0.559 | 0.539 | 0.545 |

What This Tells Us

1. Test-time learning is a unique capability. Contextualization improved BioTitan by +0.080 AUC across 53 tasks (0.636 → 0.716), closing 54% of the gap to Geneformer without any retraining. No other model in this benchmark can do this; their embeddings are architecturally fixed after training.

2. BioTitan excels where expression models should. On Expression tasks (23 tasks), BioTitan CTX (0.815) outperforms every non-RNA-seq model and places 5th among all 13 models evaluated, despite training on 120× less data.

3. The gap is data, not architecture. Among RNA-seq models, performance broadly tracks training-data scale: ScGPT-B (10.3M cells, single tissue) < BioTitan CTX (254K cells, 8 tissues) < Gene2vec (bulk) < cellPLM (11M) < Geneformer (30M) < ScGPT-H (33M). BioTitan sits where its data volume predicts, and test-time learning pushes it above its "data class".

4. Contextualization saturates efficiently. Moving from 60K to 254K inference cells yields only +0.002 avg AUC. This means clinically relevant sample sizes (~10K–60K cells) are sufficient for effective memory adaptation, a practical advantage for real-world deployment.
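As a sanity check, the two headline numbers above follow directly from the family-table averages:

```python
# Average AUCs over the 53 IBM benchmark tasks (from the tables above)
static, ctx, geneformer = 0.636, 0.716, 0.782

relative_gain = (ctx - static) / static              # ~0.126 -> "+12.6% relative improvement"
gap_closed = (ctx - static) / (geneformer - static)  # ~0.55  -> "closes ~54% of the gap"
```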

What Is Test-Time Learning?

Existing models (Geneformer, scGPT, AIDO.Cell, scFoundation, cellPLM) process every cell identically at inference; their weights are frozen. BioTitan's TITANS memory MLP updates its own weights during the forward pass via gradient descent on a surprise signal:

Cell 1:      Memory is fresh. Gene representations are generic.
Cell 1,000:  Memory has learned tissue-specific co-expression patterns.
Cell 60,000: Memory has seen diverse cellular contexts.
             Gene representations are now RICHER than the static embedding table.
             Further cells provide diminishing returns.

This happens at inference speed (~36 cells/sec on RTX 3090). No optimizer, no backward pass through the full model, no labeled data needed.
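The mechanism can be illustrated with a toy associative memory. This is a deliberately simplified linear stand-in for BioTitan's 2-layer memory MLP; the class name, learning rate, and loss scaling are illustrative, not the actual implementation:

```python
import numpy as np

class LinearMemory:
    """Toy test-time memory: a linear map M from keys to values, updated
    during inference by one gradient step per chunk on the 'surprise'
    loss ||M @ k - v||^2. No optimizer state, no labels, no outer loop."""

    def __init__(self, dim, lr=0.1):
        self.M = np.zeros((dim, dim))
        self.lr = lr

    def read(self, k):
        return self.M @ k

    def update(self, k, v):
        err = self.M @ k - v                   # prediction error = surprise signal
        self.M -= self.lr * np.outer(err, k)   # gradient step on memory weights only
        return float(err @ err)                # scalar surprise (before the step)

# Surprise shrinks as the same association is revisited:
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 1.0])
mem = LinearMemory(dim=4)
surprises = [mem.update(k, v) for _ in range(3)]
```

The same shape of update (surprise-weighted gradient descent on memory parameters during the forward pass) is what TITANS applies to its memory MLP at each chunk.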

Practical implications:

  • Feed the model a patient's cells → memory adapts → adapted gene representations in minutes
  • No retraining, no fine-tuning, no GPU cluster needed for adaptation
  • The same model binary works for every patient, every tissue, every disease
  • ~60K cells is sufficient for near-optimal adaptation

Architecture

TITANS Memory-as-Context (MAC) variant with 6 stacked blocks:

| Component | Details |
| --- | --- |
| Parameters | 18.7M |
| Architecture | TITANS MAC (6 layers, 256 dim, 4 heads) |
| Gene vocabulary | 25,424 (Geneformer-compatible tokenization) |
| Memory | 2-layer MLP per block, chunk-wise gradient updates (128 tokens/step) |
| Persistent memory | 32 learnable tokens per block |
| FFN | SwiGLU, hidden dim 512 |
| Pre-training | Masked gene prediction (15% masking rate) |
| Training data | 254,394 cells from Tabula Sapiens (8 human tissues) |
| Compute | 2 epochs, AdamW, cosine LR, 2× RTX 3090 (~8 hours) |
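For reference, the architecture table can be captured as a config object. The field names below are illustrative, not the actual titans-trainer config schema:

```python
from dataclasses import dataclass

@dataclass
class BioTitanConfig:
    # Values mirror the architecture table; field names are illustrative.
    n_blocks: int = 6              # stacked TITANS MAC blocks
    dim: int = 256                 # model width
    n_heads: int = 4
    ffn_hidden: int = 512          # SwiGLU hidden dim
    vocab_size: int = 25_424       # Geneformer-compatible gene vocabulary
    memory_layers: int = 2         # memory MLP depth per block
    memory_chunk: int = 128        # tokens per chunk-wise memory update
    persistent_tokens: int = 32    # learnable persistent-memory tokens per block
    mask_rate: float = 0.15        # masked gene prediction rate

cfg = BioTitanConfig()
```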

Using Pre-computed Embeddings (no model code needed)

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load contextualized gene embeddings
df = pd.read_parquet("gene_embeddings_ctx_254k.parquet")
# columns: symbol, dim_0, dim_1, ..., dim_255

# Get embedding for a specific gene
tp53 = df[df['symbol'] == 'TP53'].iloc[:, 1:].values

# Find most similar genes
symbols = df['symbol'].values
embeddings = df.iloc[:, 1:].values
sims = cosine_similarity(tp53, embeddings)[0]
top_10 = np.argsort(-sims)[1:11]
for i in top_10:
    print(f"  {symbols[i]}: {sims[i]:.3f}")
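Downstream, the IBM benchmark scores embeddings by fitting lightweight supervised probes on top of them. The sketch below illustrates that protocol with logistic regression and synthetic embeddings standing in for the parquet columns; the benchmark's own probe and the real gene labels may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 "genes" with 256-dim embeddings and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy separable task

# Fit a probe on frozen embeddings, report AUC (the benchmark's metric)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```

Swapping the synthetic `X` for the parquet's `dim_0 … dim_255` columns and `y` for a benchmark task's labels reproduces the evaluation setup used in the tables above.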

Loading Model Weights

pip install titans-trainer

from titans_trainer import TitansModel

model = TitansModel.from_pretrained("biotitan-20m-tabula-sapiens.pt")
model.eval()

# Get surprise scores (test-time learning)
surprise = model.get_surprise_scores(token_ids)  # (batch, seq_len, n_layers)

# Get cell embeddings
cell_emb = model.get_embeddings(token_ids)  # (batch, 256)

Run IBM Gene Benchmark

# Use pre-computed embeddings directly with the benchmark
# No BioTitan code needed -- just the parquet file
# See: https://github.com/BiomedSciAI/gene-benchmark

Training Framework

BioTitan was trained using titans-trainer, a HuggingFace-style training framework for the TITANS architecture.

Training Data

Tabula Sapiens: 254,394 cells from 8 human tissues (Blood, Lung, Heart, Liver, Kidney, Pancreas, Neural, Bone Marrow), tokenized using rank-value encoding with median normalization.
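Rank-value encoding can be sketched as follows. This is a simplified reconstruction of the Geneformer-style scheme named above; the exact tokenizer details (vocabulary handling, truncation) live in the training code:

```python
import numpy as np

def rank_value_encode(expr, gene_medians, max_len=2048):
    """Sketch of rank-value encoding with median normalization:
    scale each gene's count by that gene's corpus-wide median, then
    emit expressed gene indices sorted by normalized value, highest first."""
    norm = np.where(expr > 0, expr / gene_medians, 0.0)
    order = np.argsort(-norm, kind="stable")   # stable sort: ties keep index order
    expressed = order[norm[order] > 0]         # drop unexpressed genes
    return expressed[:max_len].tolist()

# One toy cell over a 4-gene vocabulary:
tokens = rank_value_encode(
    expr=np.array([0.0, 4.0, 2.0, 8.0]),
    gene_medians=np.array([1.0, 2.0, 1.0, 8.0]),
)
# tokens == [1, 2, 3]: normalized values 2.0, 2.0, 1.0; gene 0 is unexpressed
```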

Limitations

  • Gene-level only. Cell-level tasks (cell type annotation, perturbation prediction) are not yet benchmarked.
  • Small training set. 254K cells vs 30–50M for Geneformer/scGPT/AIDO.Cell. Performance scales with data; scaling up is expected to close the remaining gap.
  • 8 tissues. Broader tissue coverage would improve gene representation diversity.
  • Contextualization overhead. Extracting contextualized embeddings requires a forward pass over reference cells (~36 cells/sec on RTX 3090). Static embeddings are instant.
  • Some tasks regress with contextualization. 3 of 11 binary tasks show small decreases, suggesting memory-saturation effects on certain task types.
  • Model weights require titans-trainer. Run pip install titans-trainer to load the .pt file. Pre-computed embeddings in parquet format can be used without any dependencies.

Roadmap

  • Scale to 30M cells (Genecorpus-30M); expected to match or exceed Geneformer
  • 150M parameter model
  • Full IBM benchmark (multi-label and regression tasks)
  • Cell-level benchmarks (cell type annotation, zero-shot clustering)
  • Disease-specific test-time learning demo (cardiomyopathy, Alzheimer's)
  • BERT ablation (same architecture without TITANS memory)

Citation

@article{yermekov2026biotitan,
  title={BioTitan: Neural Long-Term Memory for Genomic Foundation Modeling},
  author={Yermekov, Akbar},
  year={2026}
}

@article{behrouz2025titans,
  title={Titans: Learning to Memorize at Test Time},
  author={Behrouz, Ali and Zhong, Peilin and Mirrokni, Vahab},
  journal={NeurIPS},
  year={2025}
}

License

Apache 2.0
