---
license: mit
tags:
- rna-seq
- bulk-rna
- cancer
- transcriptomics
- graph-neural-network
- transformer
- performer
- gcn
- pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---
# 🧬 CancerTranscriptome-Mini-48M
*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*
**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq.
It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder.
This model is a proof of concept intended for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.
---
## 🔬 Origin & References
### **Primary Reference (BulkFormer)**
Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
**"A large-scale foundation model for bulk transcriptomes."**
bioRxiv (2025).
doi: https://doi.org/10.1101/2025.06.11.659222
### **This Model (CancerTranscriptome-Mini-48M)**
A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.
Source Code: https://github.com/alwalt/BioFM
---
## 📊 Data Source
All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository:
**ARCHS4 Reference:**
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
**"Massive mining of publicly available RNA-seq data from human and mouse."**
*Nature Communications* 9, 1366 (2018).
Dataset: https://maayanlab.cloud/archs4/
### **Filtering Procedure**
- Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5
- Selected samples matching:
`cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used ARCHS4 log-TPM matrices (gene Γ— sample)
- Final dataset: ~76k cancer samples, 19,357 genes
No private, clinical, controlled-access, or proprietary data were used.
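The keyword filter described above can be sketched in a few lines. The metadata field values and sample IDs below are illustrative only, not the actual ARCHS4 schema:

```python
import re
from typing import Optional

# Case-insensitive keyword match used to keep cancer samples (a sketch;
# the real ARCHS4 HDF5 metadata fields differ).
CANCER_PATTERN = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma",
    re.IGNORECASE,
)

def is_cancer_sample(disease_annotation: Optional[str]) -> bool:
    """Keep a sample only if it carries a clear cancer annotation."""
    if not disease_annotation:
        return False  # drop samples lacking disease annotations
    return bool(CANCER_PATTERN.search(disease_annotation))

samples = {
    "GSM0001": "lung adenocarcinoma",
    "GSM0002": "healthy donor",
    "GSM0003": None,
    "GSM0004": "acute myeloid leukemia",
}
kept = [gsm for gsm, ann in samples.items() if is_cancer_sample(ann)]
print(kept)  # ['GSM0001', 'GSM0004']
```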
---
## 🧠 Model Architecture (Summary)
CancerTranscriptome-Mini-48M includes:
### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene
- Projected into model dimension (320)
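The gene-identity pathway amounts to a linear projection of frozen per-gene embeddings. `ESM2_DIM = 1280` below is an assumption; use the width of whichever ESM2 checkpoint produced the embeddings:

```python
import torch
import torch.nn as nn

ESM2_DIM, MODEL_DIM, N_GENES = 1280, 320, 19357  # ESM2_DIM is assumed

gene_emb = torch.randn(N_GENES, ESM2_DIM)  # stand-in for precomputed embeddings
project = nn.Linear(ESM2_DIM, MODEL_DIM)   # project into model dimension (320)
gene_tokens = project(gene_emb)
print(gene_tokens.shape)  # torch.Size([19357, 320])
```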
### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = -10)
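A minimal sketch of a deterministic sinusoidal embedding of continuous expression values, with masked positions zeroed. The exact frequency schedule in the released model may differ:

```python
import math
import torch

def expression_embedding(x, dim=320, mask_value=-10.0):
    """Sinusoidal embedding of continuous values (a sketch, not the exact REE)."""
    half = dim // 2
    freqs = torch.exp(
        -torch.arange(half, dtype=torch.float32) * (math.log(10000.0) / half)
    )
    angles = x.unsqueeze(-1) * freqs                   # [..., genes, dim/2]
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
    emb[x == mask_value] = 0.0                         # masked positions zeroed
    return emb

x = torch.tensor([[1.5, -10.0, 0.2]])                  # -10 marks a masked gene
e = expression_embedding(x)
print(e.shape)                     # torch.Size([1, 3, 320])
print(e[0, 1].abs().sum().item())  # 0.0 (masked position)
```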
### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph
- Injects biological prior knowledge
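The released model uses `GCNConv` from PyTorch Geometric; the same propagation step (Kipf & Welling) can be written in plain PyTorch on a small dense toy graph:

```python
import torch

def gcn_propagate(h, edge_index, weight):
    """One GCN step: H' = D^{-1/2} (A + I) D^{-1/2} H W.
    Mirrors GCNConv's math on a dense adjacency (sketch only)."""
    n = h.size(0)
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0
    adj = torch.maximum(adj, adj.T)            # symmetrize (undirected graph)
    adj = adj + torch.eye(n)                   # add self-loops
    d_inv_sqrt = adj.sum(1).pow(-0.5)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    return norm_adj @ h @ weight

# Toy 4-gene graph standing in for the curated gene-gene graph
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
h = torch.randn(4, 320)
w = torch.randn(320, 320) * 0.02
out = gcn_propagate(h, edge_index, w)
print(out.shape)  # torch.Size([4, 320])
```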
### **4. Expression Binning**
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own **local Performer** attention
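The binning step can be sketched as an argsort over learnable scores followed by chunking; the local attention block each bin would pass through is omitted here:

```python
import torch
import torch.nn as nn

N_GENES, DIM, N_BINS = 19357, 320, 10

# Learnable per-gene importance scores (trained jointly with the model)
importance = nn.Parameter(torch.randn(N_GENES))

h = torch.randn(1, N_GENES, DIM)                    # per-gene embeddings
order = torch.argsort(importance, descending=True)  # sort genes by score
h_sorted = h[:, order]
bins = torch.chunk(h_sorted, N_BINS, dim=1)         # 10 bins of ~1,936 genes
# each bin would then be fed to its own local Performer attention block
print(len(bins), bins[0].shape)
```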
### **5. Global Performer Attention**
- 2 stacked Performer layers across all genes
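Performer makes global attention over all 19,357 genes tractable by replacing softmax attention with a kernel feature map, giving O(N) cost. A minimal sketch, with `elu(x) + 1` standing in for Performer's FAVOR+ random features:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N) kernelized attention: phi(Q) (phi(K)^T V), row-normalized.
    elu(x)+1 is a simple stand-in for FAVOR+ random features."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)               # key/value summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(1, 19357, 64)  # every gene attends to every gene
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([1, 19357, 64])
```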
### **6. Prediction Head**
- MLP β†’ scalar value per gene
- Used for masked-expression reconstruction
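A hypothetical shape-compatible head (the released model's layer sizes and activation may differ): each 320-d gene embedding is mapped to one scalar, compared against the true expression at masked positions.

```python
import torch
import torch.nn as nn

# Illustrative MLP head: one predicted expression value per gene
head = nn.Sequential(nn.Linear(320, 320), nn.GELU(), nn.Linear(320, 1))

h = torch.randn(1, 19357, 320)  # encoder output, one vector per gene
pred = head(h).squeeze(-1)      # scalar per gene for reconstruction loss
print(pred.shape)  # torch.Size([1, 19357])
```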
Total parameters: **48,336,162 (~48M)**
---
## 🎯 Intended Use
This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:
- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets
---
## 🚀 How to Use
Download & run:
```python
import torch
import safetensors.torch as st
from model import BulkFormer  # from this repo

# Load model + weights
model = BulkFormer(
    dim=320,                                  # model embedding dimension
    graph=torch.load("edge_index.pt"),        # gene-gene graph (provide your own)
    gene_emb=torch.load("esm2_gene_emb.pt"),  # precomputed ESM2 gene embeddings
    gene_length=19357,                        # number of genes
    bin_head=8,                               # attention heads per local bin
    full_head=4,                              # heads for global attention
    bins=10,                                  # number of expression bins
    gb_repeat=1,
    p_repeat=2,                               # global Performer layers
)
state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: a 19,357-gene log-TPM vector
x = torch.randn(1, 19357)
with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]
```