---
license: mit
tags:
- rna-seq
- bulk-rna
- cancer
- transcriptomics
- graph-neural-network
- transformer
- performer
- gcn
- pytorch
model_size: 48M
pipeline_tag: feature-extraction
library_name: pytorch
---

# 🧬 CancerTranscriptome-Mini-48M

*A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq*

**CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq. It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder.

This model is a proof-of-concept designed for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes.

---

## 🔬 Origin & References

### **Primary Reference (BulkFormer)**

Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui.
**“A large-scale foundation model for bulk transcriptomes.”**
bioRxiv (2025). doi: https://doi.org/10.1101/2025.06.11.659222

### **This Model (CancerTranscriptome-Mini-48M)**

A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency.

Source Code: https://github.com/alwalt/BioFM

---

# 📊 Data Source

All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository.

**ARCHS4 Reference:**
Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al.
**“Massive mining of publicly available RNA-seq data from human and mouse.”**
*Nature Communications* 9, 1366 (2018).

Dataset: https://maayanlab.cloud/archs4/

### **Filtering Procedure**

- Loaded all human bulk RNA-seq metadata from the ARCHS4 v2.5 HDF5 file
- Selected samples matching:
  `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma`
- Removed samples lacking clear disease annotations
- Used ARCHS4 log-TPM matrices (gene × sample)
- Final dataset: ~76k cancer samples, 19,357 genes
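For reference, here is a minimal sketch of the keyword-based selection step, assuming `h5py` and an ARCHS4-style HDF5 layout. The file name and metadata field paths (`meta/samples/title`, `meta/samples/characteristics_ch1`) are assumptions and may differ between ARCHS4 releases; this is an illustration, not the exact pipeline used for training.

```python
import re
import h5py
import numpy as np

# Keywords used to flag cancer-related samples
KEYWORDS = re.compile(
    r"cancer|tumor|carcinoma|leukemia|lymphoma|melanoma|glioma", re.I
)

def select_cancer_samples(path="human_gene_v2.5.h5"):
    """Return indices of samples whose metadata matches a cancer keyword.

    NOTE: the metadata field names below ('meta/samples/title',
    'meta/samples/characteristics_ch1') are assumptions; check the
    actual layout of your ARCHS4 HDF5 file before use.
    """
    with h5py.File(path, "r") as f:
        titles = [t.decode() if isinstance(t, bytes) else t
                  for t in f["meta/samples/title"][:]]
        chars = [c.decode() if isinstance(c, bytes) else c
                 for c in f["meta/samples/characteristics_ch1"][:]]

    keep = [i for i, (t, c) in enumerate(zip(titles, chars))
            if KEYWORDS.search(t) or KEYWORDS.search(c)]
    return np.asarray(keep)

idx = select_cancer_samples()
print(f"Selected {len(idx)} candidate cancer samples")
```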
No private, clinical, controlled-access, or proprietary data were used.

---

# 🧠 Model Architecture (Summary)

CancerTranscriptome-Mini-48M includes:

### **1. Gene Identity Embeddings**
- Precomputed **ESM2 embeddings** for each protein-coding gene
- Projected into model dimension (320)

### **2. Rotary Expression Embeddings (REE)**
- Deterministic sinusoidal continuous-value embedding
- Masked positions zeroed (mask token = -10)

### **3. Graph Neural Network Layer**
- **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph
- Injects biological prior knowledge

### **4. Expression Binning**
- Learnable importance scores sort genes
- Genes divided into 10 bins
- Each bin receives its own **local Performer** attention

### **5. Global Performer Attention**
- 2 stacked Performer layers across all genes

### **6. Prediction Head**
- MLP → scalar value per gene
- Used for masked-expression reconstruction

Total parameters: **48,336,162 (~48M)**

---

# 🎯 Intended Use

This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks:

- Tumor subtype prediction
- Drug response modeling
- Immune infiltration scoring
- Survival / risk modeling
- Gene expression imputation
- Dimensionality reduction
- Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets

---

# 🚀 How to Use

Download & run:

```python
import torch
from model import BulkFormer  # from this repo
import safetensors.torch as st

# Load model + weights
model = BulkFormer(
    dim=320,
    graph=torch.load("edge_index.pt"),        # provide your graph
    gene_emb=torch.load("esm2_gene_emb.pt"),
    gene_length=19357,
    bin_head=8,
    full_head=4,
    bins=10,
    gb_repeat=1,
    p_repeat=2,
)
state = st.load_file("model.safetensors")
model.load_state_dict(state)
model.eval()

# Example input: 19,357-gene log-TPM vector
x = torch.randn(1, 19357)

with torch.no_grad():
    out = model(x)

print(out.shape)  # [1, 19357]
```
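Since the pretraining objective is masked-expression reconstruction, the same forward pass can, in principle, be used for imputation. The snippet below is a rough sketch under the assumption that masked genes are marked with the -10 mask token described above and that the model's output at those positions is the reconstructed expression; the exact masking convention in the released code may differ.

```python
# Hypothetical imputation sketch: hide a few genes and read back the
# model's reconstruction at those positions. Assumes `model` was built
# and loaded as above, and that -10 marks masked positions (assumption).
import torch

MASK_TOKEN = -10.0

x = torch.randn(1, 19357)               # stand-in for a log-TPM expression vector
mask_idx = torch.tensor([5, 42, 1000])  # arbitrary example genes to impute

x_masked = x.clone()
x_masked[0, mask_idx] = MASK_TOKEN

with torch.no_grad():
    recon = model(x_masked)             # [1, 19357] per-gene predictions

imputed = recon[0, mask_idx]
print("Imputed values:", imputed.tolist())
```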