---
license: mit
tags:
  - rna-seq
  - bulk-rna
  - cancer
  - transcriptomics
  - graph-neural-network
  - transformer
  - performer
  - gcn
  - pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---

# 🧬 CancerTranscriptome-Mini-48M  
*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*

**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.  
It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder.

This model is a proof-of-concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.

---

## 🔬 Origin & References

### **Primary Reference (BulkFormer)**
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.  
**“A large-scale foundation model for bulk transcriptomes.”**  
bioRxiv (2025).  
doi: https://doi.org/10.1101/2025.06.11.659222

### **This Model (CancerTranscriptome-Mini-48M)**
A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.  
Source Code: https://github.com/alwalt/BioFM

---

## 📊 Data Source

All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository:

**ARCHS4 Reference:**  
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.  
**“Massive mining of publicly available RNA-seq data from human and mouse.”**  
*Nature Communications* 9, 1366 (2018).  
Dataset: https://maayanlab.cloud/archs4/

### **Filtering Procedure**
- Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5  
- Selected samples matching:  
  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`  
- Removed samples lacking clear disease annotations  
- Used ARCHS4 log-TPM matrices (gene × sample)  
- Final dataset: ~76k cancer samples, 19,357 genes

No private, clinical, controlled-access, or proprietary data were used.
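
The keyword filter described above can be sketched as a case-insensitive regex over free-text disease annotations. This only illustrates the matching rule; the actual ARCHS4 HDF5 field names and metadata access are omitted here:

```python
import re

# Disease keywords used to select cancer samples (taken from the list above)
CANCER_PATTERN = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)

def is_cancer_sample(annotation: str) -> bool:
    """True if a free-text disease annotation matches any cancer keyword.

    Empty or missing annotations are dropped, mirroring the filtering step
    that removes samples lacking clear disease annotations.
    """
    return bool(annotation) and CANCER_PATTERN.search(annotation) is not None

annotations = [
    "Lung adenocarcinoma, stage II",
    "healthy donor PBMC",
    "acute myeloid leukemia",
    "",  # missing annotation -> dropped
]
kept = [a for a in annotations if is_cancer_sample(a)]
```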

---

## 🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene  
- Projected into model dimension (320)
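
A minimal sketch of this step, assuming 1280-d ESM2 embeddings (the width of ESM2-650M; the actual variant used is not stated here) and a single linear projection:

```python
import torch

# Hypothetical shapes: esm2_dim=1280 is an assumption about the ESM2 variant;
# n_genes and model_dim come from the card (19,357 genes, dim 320).
esm2_dim, model_dim, n_genes = 1280, 320, 19357

gene_emb = torch.randn(n_genes, esm2_dim)   # stand-in for the precomputed table
proj = torch.nn.Linear(esm2_dim, model_dim) # project into model dimension
gene_tokens = proj(gene_emb)                # (19357, 320) identity embeddings
```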

### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding  
- Masked positions zeroed (mask token value = -10)
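
A sketch of such a deterministic sinusoidal embedding for continuous expression values, with masked positions zeroed; the exact REE formulation in the repository may differ in its frequency schedule:

```python
import math
import torch

def rotary_expression_embedding(x, dim=320, mask_value=-10.0):
    # x: (batch, genes) log-TPM values; masked entries hold mask_value (-10).
    half = torch.arange(0, dim, 2, dtype=torch.float32)
    freqs = torch.exp(-half * (math.log(10000.0) / dim))          # (dim/2,)
    angles = x.unsqueeze(-1) * freqs                              # (batch, genes, dim/2)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], -1)   # (batch, genes, dim)
    # Zero out embeddings at masked positions
    return emb.masked_fill((x == mask_value).unsqueeze(-1), 0.0)
```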

### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph  
- Injects biological prior knowledge
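
The model uses `GCNConv` from PyTorch Geometric on a sparse `edge_index`; the propagation rule it implements can be sketched densely in plain PyTorch (a teaching simplification, not the production code path):

```python
import torch

def gcn_propagate(x, edge_index, weight):
    # One GCN layer a la Kipf & Welling: H' = D^{-1/2} (A + I) D^{-1/2} H W
    # x: (genes, dim); edge_index: (2, edges) gene-gene graph; weight: (dim, dim_out)
    n = x.shape[0]
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0
    adj = adj + torch.eye(n)                     # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)      # symmetric normalization
    norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
    return norm_adj @ x @ weight
```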

### **4. Expression Binning**
- Learnable importance scores sort genes  
- Genes divided into 10 bins  
- Each bin receives its own **local Performer** attention
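
The sort-and-split step can be sketched as follows; the equal-size split and descending sort are assumptions about the exact binning scheme:

```python
import torch

def assign_bins(importance_logits, n_bins=10):
    # Sort genes by a learnable importance score and split them into
    # roughly equal-size bins; returns a bin index per gene.
    genes = importance_logits.shape[0]
    order = torch.argsort(importance_logits, descending=True)  # most important first
    bin_size = (genes + n_bins - 1) // n_bins
    bin_ids = torch.empty(genes, dtype=torch.long)
    bin_ids[order] = torch.arange(genes) // bin_size
    return bin_ids
```

Each resulting bin of genes would then be fed to its own local Performer attention block.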

### **5. Global Performer Attention**
- 2 stacked Performer layers across all genes
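
Performer layers achieve linear-time attention across all 19,357 genes via a kernel feature map. The sketch below uses a simple ReLU feature map to show the mechanism; the real Performer uses FAVOR+ positive random features:

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized (Performer-style) attention in O(seq) memory.
    # q, k, v: (batch, heads, seq, dim_head); ReLU is a simplified feature map.
    q, k = torch.relu(q) + eps, torch.relu(k) + eps
    kv = torch.einsum("bhsd,bhse->bhde", k, v)            # sum_t phi(k_t) v_t^T
    z = 1.0 / torch.einsum("bhsd,bhd->bhs", q, k.sum(2))  # per-query normalizer
    return torch.einsum("bhsd,bhde,bhs->bhse", q, kv, z)
```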

### **6. Prediction Head**
- MLP → scalar value per gene  
- Used for masked-expression reconstruction
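
A sketch of such a head, mapping each gene's 320-d contextual embedding to one reconstructed expression value; the hidden width and activation are assumptions:

```python
import torch

head = torch.nn.Sequential(
    torch.nn.Linear(320, 128),  # hidden width 128 is an assumption
    torch.nn.GELU(),
    torch.nn.Linear(128, 1),    # one scalar per gene
)

h = torch.randn(1, 19357, 320)  # encoder output: (batch, genes, dim)
pred = head(h).squeeze(-1)      # (1, 19357) reconstructed log-TPM values
```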

Total parameters: **48,336,162 (~48M)**

---

## 🎯 Intended Use

This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:

- Tumor subtype prediction  
- Drug response modeling  
- Immune infiltration scoring  
- Survival / risk modeling  
- Gene expression imputation  
- Dimensionality reduction  
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets  

---

## 🚀 How to Use

Download & run:

```python
import torch
from model import BulkFormer   # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),   # provide your graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2
)

state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]
```