---
license: apache-2.0
datasets:
- JFLa/GF-CAB_Datasets
language:
- en
metrics:
- accuracy
base_model:
- ctheodoris/Geneformer
pipeline_tag: token-classification
library_name: transformers
tags:
- biology
- single-cell
- transcriptomics
---
# 🧬 Geneformer-CAB: Benchmarking Scale and Architecture in Foundation Models for Single-Cell Transcriptomics
## Model Overview
**Geneformer-CAB (Cumulative-Assignment-Blocking)** is a benchmarked variant of the Geneformer architecture for modeling single-cell transcriptomic data.
Rather than introducing an entirely new model, the GF-CAB work systematically evaluates how **data scale** and **architectural refinements** interact to influence generalization, predictive diversity, and robustness to batch effects.
This model integrates two architectural enhancements:
- **Cumulative probability recalibration**, which adjusts token-level prediction dynamics to reduce overconfident, frequency-driven outputs.
- **Similarity-based regularization**, which penalizes redundant token predictions to promote diversity and alignment with rank-ordered gene expression profiles.
Together, these mechanisms provide insight into the **limits of scale** in single-cell foundation models, revealing that scaling up pretraining data does not always yield superior downstream performance.
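The released training code is not reproduced in this card; the PyTorch sketch below only illustrates one plausible reading of the two mechanisms. `cab_decode` resolves masked positions sequentially and down-weights genes in proportion to the probability mass already assigned to them, while `similarity_penalty` adds a loss term that discourages near-identical predictive distributions at different masked positions. The function names and the `alpha` / `lam` coefficients are illustrative assumptions, not the repository's API.

```python
import torch
import torch.nn.functional as F

def cab_decode(logits: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative cumulative-assignment blocking at prediction time.

    logits: (n_masked, vocab) scores for the masked positions of one cell.
    Each gene's score is reduced by the probability mass already assigned
    to it, so high-frequency genes cannot win every masked slot.
    """
    cumulative = torch.zeros(logits.size(-1))
    picks = []
    for row in logits:
        probs = F.softmax(row - alpha * cumulative, dim=-1)
        picks.append(int(probs.argmax()))
        cumulative += probs  # recalibrates all subsequent positions
    return torch.tensor(picks)

def similarity_penalty(logits: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Illustrative similarity-based regularizer: penalize pairs of masked
    positions whose predicted distributions overlap heavily, pushing the
    model toward diverse, rank-consistent outputs."""
    probs = F.softmax(logits, dim=-1)                 # (n_masked, vocab)
    sim = probs @ probs.T                             # pairwise overlap
    off_diag = sim - torch.diag(torch.diagonal(sim))  # ignore self-overlap
    return lam * off_diag.mean()
```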
---
## Key Results
| Task Type | Comparison | Key Finding |
|------------|-------------|-------------|
| **Pretraining Objectives** | GF-CAB vs. Geneformer | Higher masked prediction accuracy and diversity across scales |
| **Classification Tasks** | GF-CAB-1M vs. Geneformer-1M | Comparable or improved accuracy, narrowing the scale gap |
| **Zero-shot Batch Mitigation** | GF-CAB vs. Geneformer | Stronger generalization across datasets, less scale-dependent |
> Scaling pretraining data from 1M to 30M profiles improved discriminative tasks but reduced cross-dataset robustness, while architectural calibration in GF-CAB balanced both.
---
## Model Architecture
- **Base architecture:** Transformer encoder (BERT-style masked modeling)
- **Input representation:** Ranked gene expression profiles per cell
- **Masking objective:** Predict the identities of masked genes within the rank-ordered sequence, with the loss computed only on masked positions
- **Innovations:**
- Cumulative probability recalibration (adjusted decoding dynamics)
- Similarity-based penalty loss (reduces redundancy in token predictions)
---
## Pretraining Data
| Dataset | Description | Size |
|----------|--------------|------|
| **Genecorpus-1M** | Random subset of ranked single-cell profiles from public scRNA-seq datasets | 1 million profiles |
| **Genecorpus-30M** | Large-scale extension incorporating additional datasets and donors | 30 million profiles |
---
## Downstream Evaluation
1. **Cell-type classification** (3 benchmark tasks)
2. **Zero-shot batch-effect mitigation** (4 public datasets)
Evaluation followed standardized pipelines based on Theodoris et al. (for classification) and Kedzierska et al. (for zero-shot robustness).
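As a rough illustration of the classification protocol (not the exact benchmark code), one can extract per-cell embeddings from the encoder and fit a linear probe. The checkpoint ID below is a hypothetical placeholder, and the mean-pooling and logistic-regression probe are assumptions of this sketch.

```python
import torch
from transformers import AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical checkpoint id -- substitute the released GF-CAB weights.
model = AutoModel.from_pretrained("JFLa/GF-CAB")
model.eval()

@torch.no_grad()
def embed(input_ids: torch.Tensor) -> torch.Tensor:
    """Mean-pool the final hidden states of one rank-encoded cell."""
    out = model(input_ids=input_ids.unsqueeze(0))
    return out.last_hidden_state.mean(dim=1).squeeze(0)

# Assumed inputs: train_cells / test_cells are rank-encoded token-ID
# tensors (see the encoding sketch above); y_train / y_test are labels.
X_train = torch.stack([embed(c) for c in train_cells]).numpy()
X_test = torch.stack([embed(c) for c in test_cells]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```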
---
## Intended Use
This model is designed for:
- Benchmarking **foundation models** on single-cell gene expression tasks
- Studying **scaling effects** in biological pretraining
- Investigating **rank-based profile modeling** and representation diversity
---