|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- rna-seq |
|
|
- bulk-rna |
|
|
- cancer |
|
|
- transcriptomics |
|
|
- graph-neural-network |
|
|
- transformer |
|
|
- performer |
|
|
- gcn |
|
|
- pytorch |
|
|
model_size: 48M |
|
|
pipeline_tag: feature-extraction |
|
|
library_name: pytorch |
|
|
--- |
|
|
|
|
|
# 🧬 CancerTranscriptome-Mini-48M
|
|
*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq* |
|
|
|
|
|
**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq. |
|
|
It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder. |
|
|
|
|
|
This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
|
|
|
|
|
--- |
|
|
|
|
|
## 🔬 Origin & References
|
|
|
|
|
### **Primary Reference (BulkFormer)** |
|
|
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui. |
|
|
**"A large-scale foundation model for bulk transcriptomes."**
|
|
bioRxiv (2025). |
|
|
doi: https://doi.org/10.1101/2025.06.11.659222 |
|
|
|
|
|
### **This Model (CancerTranscriptome-Mini-48M)** |
|
|
A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency. |
|
|
Source Code: https://github.com/alwalt/BioFM |
|
|
|
|
|
--- |
|
|
|
|
|
# 📊 Data Source
|
|
|
|
|
All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository: |
|
|
|
|
|
**ARCHS4 Reference:** |
|
|
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al. |
|
|
**"Massive mining of publicly available RNA-seq data from human and mouse."**
|
|
*Nature Communications* 9, 1366 (2018). |
|
|
Dataset: https://maayanlab.cloud/archs4/ |
|
|
|
|
|
### **Filtering Procedure** |
|
|
- Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5 |
|
|
- Selected samples matching: |
|
|
`cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma` |
|
|
- Removed samples lacking clear disease annotations |
|
|
- Used ARCHS4 log-TPM matrices (gene × sample)
|
|
- Final dataset: ~76k cancer samples, 19,357 genes |
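The keyword selection above can be sketched as a simple case-insensitive regex filter over free-text disease annotations. This is an illustration only: the actual ARCHS4 HDF5 metadata fields and the repo's exact matching logic are assumptions here, and the toy annotation strings stand in for real sample metadata.

```python
import re

# Keyword filter for cancer samples (matches the list in this card).
CANCER_RE = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)

def is_cancer_sample(annotation: str) -> bool:
    """True if a free-text disease annotation matches the keyword list."""
    return bool(CANCER_RE.search(annotation))

# Toy stand-in for ARCHS4 sample metadata strings:
annotations = [
    "acute myeloid leukemia cell line",
    "healthy donor PBMC",
    "Lung Carcinoma, stage II",
    "",  # samples lacking a clear disease annotation are dropped
]
keep = [a for a in annotations if is_cancer_sample(a)]
print(len(keep))  # 2
```

In practice the annotation strings would come from the ARCHS4 v2.5 HDF5 metadata (e.g. via `h5py`), with the matching samples' columns then sliced out of the log-TPM matrix.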
|
|
|
|
|
No private, clinical, controlled-access, or proprietary data were used. |
|
|
|
|
|
--- |
|
|
|
|
|
# 🧠 Model Architecture (Summary)
|
|
|
|
|
CancerTranscriptome-Mini-48M includes: |
|
|
|
|
|
### **1. Gene Identity Embeddings** |
|
|
- Precomputed **ESM2 embeddings** for each protein-coding gene |
|
|
- Projected into model dimension (320) |
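The projection step can be sketched as a single learned linear map. The ESM2 embedding width used below (1280, the ESM2-650M hidden size) is an assumption; the repo's precomputed `esm2_gene_emb.pt` may use a different checkpoint width.

```python
import torch
import torch.nn as nn

n_genes, esm_dim, model_dim = 19357, 1280, 320  # esm_dim is an assumption

gene_emb = torch.randn(n_genes, esm_dim)  # stand-in for esm2_gene_emb.pt
proj = nn.Linear(esm_dim, model_dim)      # learned projection into d=320

gene_tokens = proj(gene_emb)              # one 320-dim identity vector per gene
print(gene_tokens.shape)                  # [19357, 320]
```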
|
|
|
|
|
### **2. Rotary Expression Embeddings (REE)** |
|
|
- Deterministic sinusoidal continuous-value embedding |
|
|
- Masked positions zeroed (mask token = -10)
|
|
|
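A minimal sketch of such a deterministic continuous-value embedding, using the standard transformer frequency schedule (the exact schedule used by BulkFormer/this repo is an assumption), with masked positions zeroed out:

```python
import torch

def ree(x: torch.Tensor, dim: int = 320, mask_value: float = -10.0) -> torch.Tensor:
    """Sinusoidal embedding of continuous expression values.

    x: [batch, genes] log-TPM values; returns [batch, genes, dim].
    """
    half = dim // 2
    freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(10000.0)) / half))
    ang = x.unsqueeze(-1) * freqs                        # [B, G, dim/2]
    emb = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
    emb[x == mask_value] = 0.0                           # zero masked positions
    return emb

x = torch.tensor([[1.5, -10.0, 3.2]])  # gene 1 is masked
e = ree(x, dim=8)
print(e.shape)               # [1, 3, 8]
print(float(e[0, 1].abs().sum()))  # 0.0 — masked gene embedding is all zeros
```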
|
|
### **3. Graph Neural Network Layer** |
|
|
- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph |
|
|
- Injects biological prior knowledge |
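The repo uses `torch_geometric`'s `GCNConv` on a sparse `edge_index`; for illustration, the same Kipf & Welling propagation rule, D^(-1/2)(A + I)D^(-1/2) X W, written densely in plain PyTorch on a toy graph:

```python
import torch

def gcn_propagate(x: torch.Tensor, adj: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One GCN layer on a dense adjacency (illustrative, not the repo's code)."""
    a_hat = adj + torch.eye(adj.size(0))          # add self-loops
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)       # D^{-1/2}
    norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return norm @ x @ weight                      # propagate, then project

# Toy 4-gene path graph with 320-dim features:
adj = torch.tensor([[0., 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
x = torch.randn(4, 320)
w = torch.randn(320, 320) * 0.01
out = gcn_propagate(x, adj, w)
print(out.shape)  # [4, 320]
```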
|
|
|
|
|
### **4. Expression Binning** |
|
|
- Learnable importance scores sort genes |
|
|
- Genes divided into 10 bins |
|
|
- Each bin receives its own **local Performer** attention |
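The binning step can be sketched as sorting genes by a learnable per-gene score and chunking the resulting order into 10 groups (the exact tie-breaking and bin sizing in the repo are assumptions):

```python
import torch

n_genes, bins = 19357, 10
importance = torch.nn.Parameter(torch.randn(n_genes))  # learnable scores

order = torch.argsort(importance, descending=True)     # sort genes by score
bin_ids = order.chunk(bins)                            # ~1936 genes per bin

# Each element of bin_ids indexes the genes routed to one local Performer.
print(len(bin_ids), bin_ids[0].numel())  # 10 1936
```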
|
|
|
|
|
### **5. Global Performer Attention** |
|
|
- 2 stacked Performer layers across all genes |
|
|
|
|
|
### **6. Prediction Head** |
|
|
- MLP → scalar value per gene
|
|
- Used for masked-expression reconstruction |
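The masked-expression objective can be sketched as follows; the mask rate, the MSE loss, and the linear stand-in for the encoder are all assumptions for illustration:

```python
import torch
import torch.nn as nn

MASK_TOKEN, MASK_RATE = -10.0, 0.15

x = torch.randn(2, 100).abs()          # toy log-TPM: 2 samples x 100 genes
mask = torch.rand_like(x) < MASK_RATE  # which genes to hide
x_in = x.masked_fill(mask, MASK_TOKEN)

head = nn.Linear(100, 100)             # stand-in for the full encoder + head
pred = head(x_in)                      # one predicted value per gene

# Reconstruction loss is computed only on the masked positions.
loss = ((pred - x)[mask] ** 2).mean()
print(float(loss) >= 0.0)
```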
|
|
|
|
|
Total parameters: **48,336,162 (~48M)** |
|
|
|
|
|
--- |
|
|
|
|
|
# 🎯 Intended Use
|
|
|
|
|
This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks: |
|
|
|
|
|
- Tumor subtype prediction |
|
|
- Drug response modeling |
|
|
- Immune infiltration scoring |
|
|
- Survival / risk modeling |
|
|
- Gene expression imputation |
|
|
- Dimensionality reduction |
|
|
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets |
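For imputation, the masked-reconstruction head can fill in unmeasured genes directly: set missing entries to the mask token, run the model, and keep predictions only where values were missing. A hedged sketch, with a placeholder linear layer standing in for the loaded BulkFormer model:

```python
import torch

model = torch.nn.Linear(19357, 19357)   # placeholder for the loaded model

x = torch.randn(1, 19357).abs()          # observed log-TPM profile
missing = torch.zeros_like(x, dtype=torch.bool)
missing[0, :100] = True                  # pretend the first 100 genes are unmeasured

x_in = x.masked_fill(missing, -10.0)     # mask token from this card
with torch.no_grad():
    pred = model(x_in)

imputed = torch.where(missing, pred, x)  # fill only the missing genes
print(imputed.shape)  # [1, 19357]
```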
|
|
|
|
|
--- |
|
|
|
|
|
# 🚀 How to Use
|
|
|
|
|
Download & run: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from model import BulkFormer # from this repo |
|
|
import safetensors.torch as st |
|
|
|
|
|
# Load model + weights |
|
|
model = BulkFormer( |
|
|
dim=320, |
|
|
graph=torch.load("edge_index.pt"), # provide your graph |
|
|
gene_emb=torch.load("esm2_gene_emb.pt"), |
|
|
gene_length=19357, |
|
|
bin_head=8, |
|
|
full_head=4, |
|
|
bins=10, |
|
|
gb_repeat=1, |
|
|
p_repeat=2 |
|
|
) |
|
|
|
|
|
state = st.load_file("model.safetensors") |
|
|
model.load_state_dict(state) |
|
|
model.eval() |
|
|
|
|
|
# Example input: 19,357-gene log-TPM vector |
|
|
x = torch.randn(1, 19357) |
|
|
|
|
|
with torch.no_grad(): |
|
|
out = model(x) |
|
|
|
|
|
print(out.shape) # [1, 19357] |
|
|
|