---
license: mit
tags:
- rna-seq
- bulk-rna
- cancer
- transcriptomics
- graph-neural-network
- transformer
- performer
- gcn
- pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---
# 🧬 CancerTranscriptome-Mini-48M
*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*
**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder.
It is a proof-of-concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
---
## 🔬 Origin & References
### **Primary Reference (BulkFormer)**
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
**“A large-scale foundation model for bulk transcriptomes.”**
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222
### **This Model (CancerTranscriptome-Mini-48M)**
A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source Code: https://github.com/alwalt/BioFM
---
## 📊 Data Source
All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository:
**ARCHS4 Reference:**
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
**“Massive mining of publicly available RNA-seq data from human and mouse.”**
*Nature Communications* 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/
### **Filtering Procedure**
- Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5
- Selected samples matching:
`cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used ARCHS4 log-TPM matrices (gene × sample)
- Final dataset: ~76k cancer samples, 19,357 genes
No private, clinical, controlled-access, or proprietary data were used.
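The keyword-selection step above can be sketched as a substring match over free-text disease annotations. Field names and sample IDs below are illustrative, not the actual ARCHS4 metadata schema:

```python
import re

# Keywords from the filtering procedure; substring matching so that
# e.g. "adenocarcinoma" matches "carcinoma".
CANCER_PATTERN = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)

def is_cancer_sample(disease_annotation: str) -> bool:
    """Return True if a free-text disease annotation matches a cancer keyword."""
    if not disease_annotation or not disease_annotation.strip():
        return False  # drop samples lacking clear disease annotations
    return bool(CANCER_PATTERN.search(disease_annotation))

# Hypothetical sample metadata (GSM IDs are made up)
samples = {
    "GSM001": "lung adenocarcinoma",
    "GSM002": "healthy control",
    "GSM003": "acute myeloid leukemia",
    "GSM004": "",
}
kept = [gsm for gsm, ann in samples.items() if is_cancer_sample(ann)]
print(kept)  # ['GSM001', 'GSM003']
```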
---
## 🧠 Model Architecture (Summary)
CancerTranscriptome-Mini-48M includes:
### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene
- Projected into model dimension (320)
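As a rough illustration of the projection step, assuming a 1280-dimensional ESM2 variant (the exact variant is not stated here), with random weights standing in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-gene ESM2 embeddings (1280-d is the 650M ESM2 variant;
# the actual variant used by the model may differ).
esm2_emb = rng.standard_normal((19357, 1280)).astype(np.float32)

# Learned linear projection to the model dimension (random stand-in here).
W = (rng.standard_normal((1280, 320)) / np.sqrt(1280)).astype(np.float32)

gene_id_emb = esm2_emb @ W  # (19357, 320) gene-identity embeddings
print(gene_id_emb.shape)    # (19357, 320)
```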
### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = -10)
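A minimal numpy sketch of a deterministic sinusoidal embedding of continuous expression values, with masked positions zeroed. The frequency schedule follows the standard transformer convention and may differ from the released model:

```python
import numpy as np

MASK_TOKEN = -10.0  # mask value from the model card

def sinusoidal_expression_embedding(values: np.ndarray, dim: int = 320) -> np.ndarray:
    """Embed continuous expression values deterministically.

    values: (n_genes,) log-TPM values, with MASK_TOKEN marking masked genes
    returns: (n_genes, dim) embeddings; masked rows are zeroed
    """
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))  # (half,)
    angles = values[:, None] * freqs[None, :]            # (n_genes, half)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    emb[values == MASK_TOKEN] = 0.0                      # zero out masked positions
    return emb

x = np.array([0.5, 3.2, MASK_TOKEN])
E = sinusoidal_expression_embedding(x)
print(E.shape)  # (3, 320)
```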
### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph
- Injects biological prior knowledge
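The propagation rule that GCNConv implements (Kipf & Welling) can be written in plain numpy on a toy gene graph; the learned weight matrix is omitted for clarity:

```python
import numpy as np

def gcn_propagate(x: np.ndarray, edge_index: np.ndarray, n_nodes: int) -> np.ndarray:
    """One GCN propagation step: H = D^{-1/2} (A + I) D^{-1/2} X."""
    A = np.eye(n_nodes)                    # self-loops
    src, dst = edge_index
    A[src, dst] = 1.0
    A[dst, src] = 1.0                      # treat the gene-gene graph as undirected
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt @ x

x = np.arange(12, dtype=float).reshape(4, 3)   # 4 genes, 3 features each
edge_index = np.array([[0, 1, 2], [1, 2, 3]])  # toy gene-gene edges
h = gcn_propagate(x, edge_index, 4)
print(h.shape)  # (4, 3)
```

Each gene's features become a degree-normalized average of its graph neighbors, which is how the curated gene-gene graph injects biological prior knowledge into the embeddings.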
### **4. Expression Binning**
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own **local Performer** attention
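The binning step can be sketched as an argsort over importance scores followed by an even split into 10 groups; in the model the scores are learnable, so a fixed random vector stands in here:

```python
import numpy as np

def bin_genes_by_importance(importance: np.ndarray, n_bins: int = 10) -> list:
    """Sort gene indices by descending importance and split into n_bins groups.

    Returns a list of n_bins index arrays (sizes differ by at most 1).
    Each bin would then be processed by its own local Performer attention.
    """
    order = np.argsort(-importance)  # descending importance
    return np.array_split(order, n_bins)

rng = np.random.default_rng(0)
scores = rng.standard_normal(19357)  # stand-in for learnable per-gene scores
bins = bin_genes_by_importance(scores)
print(len(bins), len(bins[0]))  # 10 1936
```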
### **5. Global Performer Attention**
- 2 stacked Performer layers across all genes
### **6. Prediction Head**
- MLP → scalar value per gene
- Used for masked-expression reconstruction
Total parameters: **48,336,162 (~48M)**
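The masked-expression reconstruction objective can be illustrated with a numpy stand-in for the model output; the 15% masking rate and plain MSE are assumptions for illustration, not values confirmed by the paper:

```python
import numpy as np

MASK_TOKEN = -10.0

def masked_reconstruction_loss(pred: np.ndarray, target: np.ndarray,
                               masked: np.ndarray) -> float:
    """MSE computed over masked positions only."""
    diff = (pred - target)[masked]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(1)
target = rng.standard_normal(19357)           # true log-TPM profile
masked = rng.random(19357) < 0.15             # illustrative 15% masking rate
x_in = np.where(masked, MASK_TOKEN, target)   # what the model would see as input
pred = target + 0.1 * rng.standard_normal(19357)  # stand-in for model output
loss = masked_reconstruction_loss(pred, target, masked)
print(f"{loss:.4f}")
```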
---
## 🎯 Intended Use
This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:
- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets
---
## 🚀 How to Use
Download & run:
```python
import torch
from model import BulkFormer # from this repo
import safetensors.torch as st
# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),        # provide your gene-gene graph
    gene_emb=torch.load("esm2_gene_emb.pt"),  # precomputed ESM2 gene embeddings
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2,
)
state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)
with torch.no_grad():
    out = model(x)
print(out.shape)  # torch.Size([1, 19357])
```