ak-yermek commited on
Commit
4964446
·
0 Parent(s):

BioTitan: TITANS genomic foundation model with test-time learning

Browse files

18.7M parameter model trained on 254K Tabula Sapiens cells. Test-time memory adaptation improves gene embeddings by +12.6% AUC
(0.636 -> 0.716 across 53 IBM gene-benchmark tasks), closing 54% of the gap to Geneformer V1 (30M cells) without retraining.

Includes pre-computed gene embeddings (static + contextualized).

.gitattributes ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ *.pt filter=lfs diff=lfs merge=lfs -text
2
+ *.pkl filter=lfs diff=lfs merge=lfs -text
3
+ *.parquet filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,215 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language: en
3
+ license: apache-2.0
4
+ tags:
5
+ - genomics
6
+ - single-cell
7
+ - transcriptomics
8
+ - gene-expression
9
+ - foundation-model
10
+ - titans
11
+ - test-time-learning
12
+ - biology
13
+ datasets:
14
+ - tabula-sapiens
15
+ library_name: pytorch
16
+ pipeline_tag: feature-extraction
17
+ ---
18
+
19
+ # BioTitan: Neural Long-Term Memory for Genomic Foundation Modeling
20
+
21
+ **First application of the TITANS architecture to single-cell genomics, enabling test-time adaptive gene embeddings.**
22
+
23
+ BioTitan applies [TITANS](https://arxiv.org/abs/2501.00663) (Behrouz et al., Google Research, NeurIPS 2025) to single-cell transcriptomics. Unlike existing genomic foundation models whose gene representations are fixed after training, BioTitan's neural memory **updates its weights during inference** — gene embeddings improve as the model processes more cells, without any retraining.
24
+
25
+ ## Headline Result
26
+
27
+ Test-time memory adaptation closes **54% of the gap** to Geneformer V1 — without any retraining.
28
+
29
+ ```
30
+ BioTitan Static: 0.636 avg AUC (53 tasks)
31
+ BioTitan CTX 254K: 0.716 avg AUC ← +12.6% relative improvement, zero retraining
32
+ Geneformer V1: 0.782 avg AUC (trained on 120× more data)
33
+ ```
34
+
35
+ On Expression tasks (23 tasks) — the family where single-cell models are expected to excel — BioTitan CTX reaches **0.815**, outperforming Gene2vec (0.773) and approaching Geneformer (0.869), trained on 120× less data.
36
+
37
+ Contextualization saturates at ~60K cells (+0.002 from 60K→254K), indicating that clinically-relevant sample sizes are sufficient for effective memory adaptation.
38
+
39
+ ## IBM Gene Benchmark (53 Tasks, 5 Families)
40
+
41
+ All results verified on the same machine using [BiomedSciAI/gene-benchmark](https://github.com/BiomedSciAI/gene-benchmark). Geneformer and Gene2vec baselines reproduced locally. Published baselines from the [IBM benchmark paper](https://arxiv.org/abs/2412.04075) (Kan-Tor et al., 2024).
42
+
43
+ ### Task Family Averages
44
+
45
+ | Family | Geneformer V1 | Gene2vec | BioTitan Static | **BioTitan CTX** | Tasks |
46
+ |--------|:---:|:---:|:---:|:---:|:---:|
47
+ | Expression | **0.869** | 0.773 | 0.732 | **0.815** | 23 |
48
+ | Genomic Properties | **0.782** | 0.725 | 0.640 | 0.687 | 7 |
49
+ | Regulatory Functions | 0.759 | **0.769** | 0.623 | 0.704 | 4 |
50
+ | Localization | **0.725** | 0.668 | 0.616 | 0.699 | 2 |
51
+ | Protein Properties | **0.678** | 0.641 | 0.571 | 0.598 | 17 |
52
+ | **Overall** | **0.782** | 0.715 | 0.636 | **0.716** | **53** |
53
+
54
+ ### Comparison with All Published Baselines
55
+
56
+ Family averages from the IBM benchmark paper's Figure 2 heatmap; BioTitan run locally.
57
+
58
+ **Expression / Localization (23 tasks) — BioTitan's strongest family:**
59
+
60
+ | Model | Type | Avg AUC |
61
+ |-------|------|:---:|
62
+ | Geneformer | RNA-seq (30M cells) | **0.869** |
63
+ | cellPLM | RNA-seq (11M cells) | ~0.85 |
64
+ | ScGPT-H | RNA-seq (33M cells) | ~0.84 |
65
+ | Gene2vec | Bulk co-expression | ~0.82 |
66
+ | **BioTitan CTX** | **RNA-seq (255K cells)** | **0.815** |
67
+ | ScGPT-B | RNA-seq (10.3M blood) | ~0.75 |
68
+ | ESM-1 / ESM-2 | Protein sequence | ~0.74–0.75 |
69
+ | MPNet / DNABert-2 | Text / DNA | ~0.72 |
70
+ | MTEB-S / MTEB-L | Text | ~0.67–0.71 |
71
+ | Bag of Words | Text | ~0.69 |
72
+
73
+ BioTitan CTX outperforms all text, protein, and DNA models on expression tasks — and all RNA-seq models trained on fewer diverse tissues.
74
+
75
+ **Genomic Properties (7 tasks):**
76
+
77
+ | Model | Type | Avg AUC |
78
+ |-------|------|:---:|
79
+ | ESM-2 | Protein sequence | 0.84 |
80
+ | MTEB-L / Bag of Words | Text | 0.81 |
81
+ | ScGPT-H / MPNet | Mixed | 0.80 |
82
+ | Geneformer | RNA-seq (30M cells) | 0.79 |
83
+ | DNABert-2 | DNA sequence | 0.79 |
84
+ | cellPLM | RNA-seq (11M cells) | 0.76 |
85
+ | Gene2vec | Bulk co-expression | 0.73 |
86
+ | **BioTitan CTX** | **RNA-seq (255K cells)** | **0.687** |
87
+ | ScGPT-B | RNA-seq (10.3M blood) | 0.67 |
88
+
89
+ **Regulatory Functions (4 tasks):**
90
+
91
+ | Model | Type | Avg AUC |
92
+ |-------|------|:---:|
93
+ | MTEB-S | Text (335M) | 0.81 |
94
+ | ESM-1 / ESM-2 | Protein sequence | 0.79 |
95
+ | ScGPT-H | RNA-seq (33M cells) | 0.77 |
96
+ | cellPLM | RNA-seq (11M cells) | 0.75 |
97
+ | Geneformer / Bag of Words | Mixed | 0.74 |
98
+ | Gene2vec | Bulk co-expression | 0.73 |
99
+ | **BioTitan CTX** | **RNA-seq (255K cells)** | **0.704** |
100
+ | ScGPT-B | RNA-seq (10.3M blood) | 0.68 |
101
+ | DNABert-2 | DNA sequence | 0.66 |
102
+
103
+ ### Selected Binary Tasks (detail)
104
+
105
+ 11 of 53 tasks. Overall averages in the family table above are computed across all 53 tasks (including 42 categorical tasks not shown here).
106
+
107
+ | Task | Geneformer V1 | Gene2vec | BioTitan Static | **BioTitan CTX** |
108
+ |------|:---:|:---:|:---:|:---:|
109
+ | Dosage sensitive TFs | **0.919** | 0.878 | 0.723 | 0.891 |
110
+ | Bivalent vs lys4-only | **0.925** | 0.894 | 0.797 | 0.889 |
111
+ | Bivalent vs non-methylated | **0.827** | 0.688 | 0.616 | 0.676 |
112
+ | CCD Transcript | **0.797** | 0.744 | 0.638 | 0.647 |
113
+ | N1 network | **0.805** | 0.796 | 0.733 | 0.719 |
114
+ | HLA class I vs II | 0.745 | **0.925** | 0.445 | 0.730 |
115
+ | Gene2Gene | **0.730** | 0.695 | 0.643 | 0.702 |
116
+ | TF vs non-TF | **0.749** | 0.719 | 0.630 | 0.698 |
117
+ | N1 targets | **0.736** | 0.635 | 0.684 | 0.668 |
118
+ | Long vs short range TF | **0.726** | 0.614 | 0.520 | 0.459 |
119
+ | CCD Protein | 0.552 | **0.559** | 0.539 | 0.545 |
120
+
121
+ ### What This Tells Us
122
+
123
+ **1. Test-time learning is a unique capability.** Contextualization improved BioTitan by +0.080 AUC across 53 tasks (0.636→0.716), closing 54% of the gap to Geneformer without any retraining. No other model in this benchmark can do this — their embeddings are architecturally fixed after training.
124
+
125
+ **2. BioTitan excels where expression models should.** On Expression tasks (23 tasks), BioTitan CTX (0.815) outperforms every non-RNA-seq model and places 5th among all 13 models evaluated, despite training on 120× less data.
126
+
127
+ **3. The gap is data, not architecture.** Among RNA-seq models, performance scales with training data: ScGPT-B (10M, single tissue) < BioTitan CTX (255K, 8 tissues) < Gene2vec (bulk) < cellPLM (11M) < Geneformer (30M) < ScGPT-H (33M). BioTitan sits where its data volume predicts — and test-time learning pushes it above its "data class."
128
+
129
+ **4. Contextualization saturates efficiently.** Moving from 60K to 254K inference cells yields only +0.002 avg AUC. This means clinically-relevant sample sizes (~10K–60K cells) are sufficient for effective memory adaptation — a practical advantage for real-world deployment.
130
+
131
+ ## What Is Test-Time Learning?
132
+
133
+ Existing models (Geneformer, scGPT, AIDO.Cell, scFoundation, cellPLM) process every cell identically at inference — their weights are frozen. BioTitan's TITANS memory MLP updates its own weights during the forward pass via gradient descent on a surprise signal:
134
+
135
+ ```
136
+ Cell 1: Memory is fresh. Gene representations are generic.
137
+ Cell 1,000: Memory has learned tissue-specific co-expression patterns.
138
+ Cell 60,000: Memory has seen diverse cellular contexts.
139
+ Gene representations are now RICHER than the static embedding table.
140
+ Further cells provide diminishing returns.
141
+ ```
142
+
143
+ This happens at inference speed (~36 cells/sec on RTX 3090). No optimizer, no backward pass through the full model, no labeled data needed.
144
+
145
+ **Practical implications:**
146
+ - Feed the model a patient's cells → memory adapts → adapted gene representations in minutes
147
+ - No retraining, no fine-tuning, no GPU cluster needed for adaptation
148
+ - The same model binary works for every patient, every tissue, every disease
149
+ - ~60K cells is sufficient for near-optimal adaptation
150
+
151
+ ## Architecture
152
+
153
+ TITANS Memory-as-Context (MAC) variant with 6 stacked blocks:
154
+
155
+ | Component | Details |
156
+ |-----------|---------|
157
+ | Parameters | 18.7M |
158
+ | Architecture | TITANS MAC (6 layers, 256 dim, 4 heads) |
159
+ | Gene vocabulary | 25,424 (Geneformer-compatible tokenization) |
160
+ | Memory | 2-layer MLP per block, chunk-wise gradient updates (128 tokens/step) |
161
+ | Persistent memory | 32 learnable tokens per block |
162
+ | FFN | SwiGLU, hidden dim 512 |
163
+ | Pre-training | Masked gene prediction (15% masking rate) |
164
+ | Training data | 254,394 cells from Tabula Sapiens (8 human tissues) |
165
+ | Compute | 2 epochs, AdamW, cosine LR, 2×RTX 3090 (~8 hours) |
166
+
167
+ ## Training Framework
168
+
169
+ BioTitan was trained using [titans-trainer](https://github.com/pafos-ai/titans-trainer), a HuggingFace-style training framework for the TITANS architecture.
170
+
171
+ ```bash
172
+ pip install titans-trainer
173
+ ```
174
+
175
+ ## Training Data
176
+
177
+ [Tabula Sapiens](https://tabula-sapiens-portal.ds.czbiohub.org/) — 254,394 cells from 8 human tissues (Blood, Lung, Heart, Liver, Kidney, Pancreas, Neural, Bone Marrow), tokenized using rank-value encoding with median normalization.
178
+
179
+ ## Limitations
180
+
181
+ - **Gene-level only.** Cell-level tasks (cell type annotation, perturbation prediction) not yet benchmarked.
182
+ - **Small training set.** 255K cells vs 30–50M for Geneformer/scGPT/AIDO.Cell. Performance scales with data — scaling is expected to close the remaining gap.
183
+ - **8 tissues.** Broader tissue coverage would improve gene representation diversity.
184
+ - **Contextualization overhead.** Extracting contextualized embeddings requires a forward pass over reference cells (~36 cells/sec on RTX 3090). Static embeddings are instant.
185
+ - **Some tasks regress with contextualization.** 3 of 11 binary tasks show small decreases, suggesting memory saturation effects on certain task types.
186
+
187
+ ## Roadmap
188
+
189
+ - [ ] Scale to 30M cells (Genecorpus-30M) — expected to match/exceed Geneformer
190
+ - [ ] 150M parameter model
191
+ - [ ] Full IBM benchmark (multi-label and regression tasks)
192
+ - [ ] Cell-level benchmarks (cell type annotation, zero-shot clustering)
193
+ - [ ] Disease-specific test-time learning demo (cardiomyopathy, Alzheimer's)
194
+ - [ ] BERT ablation (same architecture without TITANS memory)
195
+
196
+ ## Citation
197
+
198
+ ```bibtex
199
+ @article{yermekov2026biotitan,
200
+ title={BioTitan: Neural Long-Term Memory for Genomic Foundation Modeling},
201
+ author={Yermekov, Akbar},
202
+ year={2026}
203
+ }
204
+
205
+ @article{behrouz2025titans,
206
+ title={Titans: Learning to Memorize at Test Time},
207
+ author={Behrouz, Ali and Zhong, Peilin and Mirrokni, Vahab},
208
+ journal={NeurIPS},
209
+ year={2025}
210
+ }
211
+ ```
212
+
213
+ ## License
214
+
215
+ Apache 2.0
biotitan-20m-tabula-sapiens.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e79b40c7fdd87dbf2cd5e831b6545a65fb99b5705f039d7cf2e7bd6a8e7473b
3
+ size 226574859
config.json ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_type": "biotitan",
3
+ "architecture": "titans",
4
+ "n_genes": 25424,
5
+ "d_model": 256,
6
+ "n_layers": 6,
7
+ "n_heads": 4,
8
+ "d_ff": 512,
9
+ "max_seq_len": 2048,
10
+ "memory_depth": 2,
11
+ "n_persistent": 32,
12
+ "dropout": 0.02,
13
+ "vocab_size": 25426,
14
+ "pad_token_id": 0,
15
+ "mask_token_id": 25425,
16
+ "training": {
17
+ "dataset": "tabula_sapiens",
18
+ "n_cells": 254394,
19
+ "epochs": 2,
20
+ "batch_size": 32,
21
+ "learning_rate": 5e-4,
22
+ "weight_decay": 0.001,
23
+ "warmup_steps": 300,
24
+ "mask_prob": 0.15,
25
+ "optimizer": "AdamW",
26
+ "mixed_precision": true
27
+ },
28
+ "files": {
29
+ "model_weights": "biotitan-20m-tabula-sapiens.pt",
30
+ "token_dictionary": "token_dictionary.pkl",
31
+ "contextualized_embeddings": "gene_embeddings_ctx_254k.parquet",
32
+ "static_embeddings": "gene_embeddings_static.parquet"
33
+ },
34
+ "benchmark_results": {
35
+ "overall_53_tasks": {
36
+ "static_auc": 0.636,
37
+ "ctx_254k_auc": 0.716,
38
+ "geneformer_v1_auc": 0.782
39
+ },
40
+ "family_averages": {
41
+ "expression_23_tasks": { "static": 0.732, "ctx_254k": 0.815, "geneformer": 0.869 },
42
+ "genomic_properties_7_tasks": { "static": 0.640, "ctx_254k": 0.687, "geneformer": 0.782 },
43
+ "regulatory_functions_4_tasks": { "static": 0.623, "ctx_254k": 0.704, "geneformer": 0.759 },
44
+ "localization_2_tasks": { "static": 0.616, "ctx_254k": 0.699, "geneformer": 0.725 },
45
+ "protein_properties_17_tasks": { "static": 0.571, "ctx_254k": 0.598, "geneformer": 0.678 }
46
+ }
47
+ }
48
+ }
gene_embeddings_ctx_254k.parquet ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e2309268d0025cce8a0f27d80a1cf0c621fe3f905200aa0d21172e3ce691b3b
3
+ size 52740260
gene_embeddings_static.parquet ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:63980543960452f144b25f384b6a658680b1e8275916805b8daa96e06594d2a2
3
+ size 54446255
token_dictionary.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab9dc40973fa5224d77b793e2fd114cacf3d08423ed9c4c49caf0ba9c7f218f1
3
+ size 788424