---
license: mit
tags:
- rna-seq
- bulk-rna
- cancer
- transcriptomics
- graph-neural-network
- transformer
- performer
- gcn
- pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---

# 🧬 CancerTranscriptome-Mini-48M

*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*

**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq. It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder.

This model is a proof-of-concept designed for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.

---

## 🔬 Origin & References

### **Primary Reference (BulkFormer)**

Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
**“A large-scale foundation model for bulk transcriptomes.”**
bioRxiv (2025). doi: https://doi.org/10.1101/2025.06.11.659222

### **This Model (CancerTranscriptome-Mini-48M)**

A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.

Source Code: https://github.com/alwalt/BioFM

---

# 📊 Data Source

All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository.

**ARCHS4 Reference:**
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
**“Massive mining of publicly available RNA-seq data from human and mouse.”**
*Nature Communications* 9, 1366 (2018).

Dataset: https://maayanlab.cloud/archs4/

### **Filtering Procedure**

- Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
- Selected samples matching:
  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used ARCHS4 log-TPM matrices (gene × sample)
- Final dataset: ~76k cancer samples, 19,357 genes
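For reference, here is a minimal sketch of the keyword-based selection step, assuming `h5py` and an ARCHS4-style HDF5 layout. The file name and metadata field paths (`meta/samples/title`, `meta/samples/characteristics_ch1`) are assumptions and may differ between ARCHS4 releases; this is an illustration, not the exact pipeline used for training.

```python
import re
import h5py
import numpy as np

# Keywords used to flag cancer-related samples
KEYWORDS = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma", re.I
)

def select_cancer_samples(path="human_gene_v2.5.h5"):
    """Return indices of samples whose metadata matches a cancer keyword.

    NOTE: the metadata field names below ('meta/samples/title',
    'meta/samples/characteristics_ch1') are assumptions; check the
    actual layout of your ARCHS4 HDF5 file before use.
    """
    with h5py.File(path, "r") as f:
        titles = [t.decode() if isinstance(t, bytes) else t
                  for t in f["meta/samples/title"][:]]
        chars = [c.decode() if isinstance(c, bytes) else c
                 for c in f["meta/samples/characteristics_ch1"][:]]

    keep = [i for i, (t, c) in enumerate(zip(titles, chars))
            if KEYWORDS.search(t) or KEYWORDS.search(c)]
    return np.asarray(keep)

idx = select_cancer_samples()
print(f"Selected {len(idx)} candidate cancer samples")
```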
No private, clinical, controlled-access, or proprietary data were used.

---

# 🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene
- Projected into model dimension (320)

### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = -10)

### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph
- Injects biological prior knowledge

### **4. Expression Binning**
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own **local Performer** attention

### **5. Global Performer Attention**
- 2 stacked Performer layers across all genes

### **6. Prediction Head**
- MLP → scalar value per gene
- Used for masked-expression reconstruction

Total parameters: **48,336,162 (~48M)**

---

# 🎯 Intended Use

This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:

- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets

---

# 🚀 How to Use

Download & run:

```python
import torch
from model import BulkFormer  # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),        # provide your graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2,
)
state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]
```
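Since the pretraining objective is masked-expression reconstruction, the same forward pass can, in principle, be used for imputation. The snippet below is a rough sketch under the assumption that masked genes are marked with the -10 mask token described above and that the model's output at those positions is the reconstructed expression; the exact masking convention in the released code may differ.

```python
# Hypothetical imputation sketch: hide a few genes and read back the
# model's reconstruction at those positions. Assumes `model` was built
# and loaded as above, and that -10 marks masked positions (assumption).
import torch

MASK_TOKEN = -10.0

x = torch.randn(1, 19357)               # stand-in for a log-TPM expression vector
mask_idx = torch.tensor([5, 42, 1000])  # arbitrary example genes to impute

x_masked = x.clone()
x_masked[0, mask_idx] = MASK_TOKEN

with torch.no_grad():
    recon = model(x_masked)             # [1, 19357] per-gene predictions

imputed = recon[0, mask_idx]
print("Imputed values:", imputed.tolist())
```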