JFLa committed on
Commit 0653c88 · verified · 1 Parent(s): 9ea683c

Update README.md

Files changed (1): README.md +59 -8
README.md CHANGED
@@ -16,16 +16,67 @@ tags:
  - transcriptomics
  ---

- # Geneformer-CAB: Benchmarking Scale and Architecture in Foundation Models for Single-Cell Transcriptomics
- - Model Overview:

- Geneformer-CAB (Cumulative-Assignment-Blocking, GF-CAB) is a benchmarked variant of the Geneformer architecture for modeling single-cell transcriptomic data.
- Rather than introducing an entirely new model, GF-CAB systematically evaluates how data scale and architectural refinements interact to influence model generalization, predictive diversity, and robustness to batch effects.

- - This model integrates two architectural enhancements:

- 1. Cumulative probability recalibration, which adjusts token-level prediction dynamics to reduce overconfident, frequency-driven outputs.

- 2. Similarity-based regularization, which penalizes redundant token predictions to promote diversity and alignment with rank-ordered gene expression profiles.

- Together, these mechanisms provide insight into the limits of scale in single-cell foundation models, revealing that scaling up pretraining data does not always yield superior downstream performance.
+ # 🧬 Geneformer-CAB: Benchmarking Scale and Architecture in Foundation Models for Single-Cell Transcriptomics
+
+ ## Model Overview
+
+ **Geneformer-CAB (Cumulative-Assignment-Blocking)** is a benchmarked variant of the Geneformer architecture for modeling single-cell transcriptomic data.
+ Rather than introducing an entirely new model, Geneformer-CAB systematically evaluates how **data scale** and **architectural refinements** interact to influence model generalization, predictive diversity, and robustness to batch effects.
+
+ This model integrates two architectural enhancements:
+ - **Cumulative probability recalibration**, which adjusts token-level prediction dynamics to reduce overconfident, frequency-driven outputs.
+ - **Similarity-based regularization**, which penalizes redundant token predictions to promote diversity and alignment with rank-ordered gene expression profiles.
+
+ Together, these mechanisms provide insight into the **limits of scale** in single-cell foundation models, revealing that scaling up pretraining data does not always yield superior downstream performance.
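+
+ The exact formulations of these two terms are not given in this card. As a rough illustration only, the minimal PyTorch sketch below shows one way a frequency-based recalibration and a similarity penalty could be combined with a masked-modeling loss; `cab_loss`, `token_freq`, and both penalty forms are hypothetical assumptions, not the released implementation.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def cab_loss(logits, targets, embedding_matrix, token_freq,
+              alpha=0.1, beta=0.01):
+     """Illustrative composite loss (not the released code):
+     masked-LM cross-entropy plus (1) frequency-based logit
+     recalibration and (2) a similarity penalty on redundant
+     predictions.
+     logits: (batch, seq, vocab); targets: (batch, seq) with -100
+     at unmasked positions; token_freq: (vocab,) float tensor of
+     corpus token frequencies."""
+     # (1) Damp logits of frequent tokens so outputs are not
+     # dominated by overconfident, frequency-driven predictions.
+     recalibrated = logits - alpha * torch.log1p(token_freq)
+
+     # Masked-modeling cross-entropy on the recalibrated logits;
+     # positions labeled -100 (unmasked) are excluded.
+     ce = F.cross_entropy(recalibrated.transpose(1, 2), targets,
+                          ignore_index=-100)
+
+     # (2) Penalize position pairs whose expected prediction
+     # embeddings are similar, discouraging redundant tokens.
+     probs = recalibrated.softmax(dim=-1)
+     pred_emb = F.normalize(probs @ embedding_matrix, dim=-1)
+     sim = pred_emb @ pred_emb.transpose(1, 2)   # (batch, seq, seq)
+     off_diag = sim - torch.diag_embed(
+         torch.diagonal(sim, dim1=1, dim2=2))
+     return ce + beta * off_diag.abs().mean()
+ ```
+
+ The penalty here operates on probability-weighted embeddings rather than hard argmax predictions so it stays differentiable; the released code may differ in both form and weighting.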
+
+ ---
+
+ ## Key Results
+
+ | Task Type | Comparison | Key Finding |
+ |------------|-------------|-------------|
+ | **Pretraining Objectives** | GF-CAB vs. Geneformer | Higher masked prediction accuracy and diversity across scales |
+ | **Classification Tasks** | GF-CAB-1M vs. Geneformer-1M | Comparable or improved accuracy, narrowing the scale gap |
+ | **Zero-shot Batch Mitigation** | GF-CAB vs. Geneformer | Stronger generalization across datasets, less scale-dependent |
+
+ > Scaling pretraining data from 1M to 30M profiles improved discriminative tasks but reduced cross-dataset robustness, while architectural calibration in GF-CAB balanced both.
+
+ ---
+
+ ## Model Architecture
+
+ - **Base architecture:** Transformer encoder (BERT-style masked modeling)
+ - **Input representation:** Ranked gene expression profiles per cell (see the sketch below)
+ - **Masking objective:** Predict masked gene ranks, with the loss computed only at masked positions
+ - **Innovations:**
+   - Cumulative probability recalibration (adjusted decoding dynamics)
+   - Similarity-based penalty loss (reduces redundancy in token predictions)
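+
+ As a toy illustration of the rank-based input pipeline (a sketch under assumed conventions, not the model's released tokenizer; `rank_encode`, `mask_tokens`, and the token-id layout are hypothetical):
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+
+ def rank_encode(expression, gene_ids, max_len=2048):
+     """Encode one cell as a rank-ordered token sequence:
+     most highly expressed genes first, unexpressed genes dropped."""
+     order = np.argsort(expression)[::-1]       # descending expression
+     expressed = order[expression[order] > 0]   # drop zero-expression genes
+     return gene_ids[expressed][:max_len]
+
+ def mask_tokens(tokens, mask_id, mask_prob=0.15):
+     """BERT-style masking: loss is computed only at masked positions."""
+     mask = rng.random(tokens.shape) < mask_prob
+     inputs = np.where(mask, mask_id, tokens)
+     labels = np.where(mask, tokens, -100)      # -100 = ignored by the loss
+     return inputs, labels
+
+ # Toy cell with 6 genes; ids 0/1 reserved for special tokens.
+ expr = np.array([0.0, 5.2, 1.1, 0.0, 9.7, 3.3])
+ gene_ids = np.arange(2, 8)
+ tokens = rank_encode(expr, gene_ids)           # -> [6, 3, 7, 4]
+ inputs, labels = mask_tokens(tokens, mask_id=1)
+ ```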
+
+ ---
+
+ ## Pretraining Data
+
+ | Dataset | Description | Size |
+ |----------|--------------|------|
+ | **Genecorpus-1M** | Random subset of ranked single-cell profiles from public scRNA-seq datasets | 1 million profiles |
+ | **Genecorpus-30M** | Large-scale extension incorporating additional datasets and donors | 30 million profiles |
+
+ ---
+
+ ## Downstream Evaluation
+
+ 1. **Cell-type classification** (3 benchmark tasks)
+ 2. **Zero-shot batch-effect mitigation** (4 public datasets)
+
+ Evaluation followed standardized pipelines based on Theodoris et al. (for classification) and Kedzierska et al. (for zero-shot robustness).
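+
+ For the batch-mitigation setting, silhouette-based scores are one common way such benchmarks are reported (the exact metrics are not listed in this card); a minimal sketch, where `embeddings`, `cell_types`, and `batches` are assumed inputs:
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import silhouette_score
+
+ def batch_mitigation_report(embeddings, cell_types, batches):
+     """Score cell embeddings for zero-shot batch-effect mitigation:
+     high cell-type silhouette -> biology preserved,
+     low batch silhouette -> batch effects mitigated."""
+     return {
+         "cell_type_asw": silhouette_score(embeddings, cell_types),
+         "batch_asw": silhouette_score(embeddings, batches),
+     }
+
+ # Toy usage with random embeddings and labels.
+ emb = np.random.default_rng(1).normal(size=(200, 64))
+ types = np.repeat(["T", "B", "NK", "Mono"], 50)
+ batch = np.tile(["donor1", "donor2"], 100)
+ print(batch_mitigation_report(emb, types, batch))
+ ```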
+
+ ---
+
+ ## Intended Use
+
+ This model is designed for:
+ - Benchmarking **foundation models** on single-cell gene expression tasks
+ - Studying **scaling effects** in biological pretraining
+ - Investigating **rank-based profile modeling** and representation diversity
+
+ ---