---
license: apache-2.0
datasets:
- JFLa/GF-CAB_Datasets
language:
- en
metrics:
- accuracy
base_model:
- ctheodoris/Geneformer
pipeline_tag: token-classification
library_name: transformers
tags:
- biology
- single-cell
- transcriptomics
---
|
|
|
|
|
# 🧬 Geneformer-CAB: Benchmarking Scale and Architecture in Foundation Models for Single-Cell Transcriptomics |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
**Geneformer-CAB (Cumulative-Assignment-Blocking)** is a benchmarked variant of the Geneformer architecture for modeling single-cell transcriptomic data. |
|
|
Rather than introducing an entirely new model, Geneformer-CAB systematically evaluates how **data scale** and **architectural refinements** interact to influence model generalization, predictive diversity, and robustness to batch effects. |
|
|
|
|
|
This model integrates two architectural enhancements (sketched in code after the list):

- **Cumulative probability recalibration**, which adjusts token-level prediction dynamics to reduce overconfident, frequency-driven outputs.
- **Similarity-based regularization**, which penalizes redundant token predictions to promote diversity and alignment with rank-ordered gene expression profiles.
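
A minimal PyTorch sketch of how these two mechanisms could be realized. The function names, the log-prior form of the recalibration, the cosine-similarity penalty, and the `tau`/`weight` hyperparameters are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def recalibrate_logits(logits, cumulative_counts, tau=1.0):
    """Cumulative probability recalibration (assumed form): down-weight
    tokens that have accumulated many past assignments, damping
    frequency-driven, overconfident predictions."""
    counts = cumulative_counts.float()
    prior = counts / counts.sum()
    return logits - tau * torch.log(prior + 1e-8)

def similarity_penalty(masked_logits, token_embeddings, weight=0.1):
    """Similarity-based regularization (assumed form): penalize pairs of
    masked positions whose expected token embeddings are too similar,
    encouraging diverse, rank-consistent predictions.

    masked_logits:    (M, V) logits at the M masked positions
    token_embeddings: (V, D) embedding matrix
    """
    probs = F.softmax(masked_logits, dim=-1)     # (M, V)
    expected = probs @ token_embeddings          # (M, D) expected embedding
    normed = F.normalize(expected, dim=-1)
    sim = normed @ normed.T                      # (M, M) cosine similarity
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return weight * sim[off_diag].clamp(min=0).mean()
```

In training, the penalty would simply be added to the masked-language-modeling loss, with the recalibration applied to the logits before decoding.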
|
|
|
|
|
Together, these mechanisms provide insight into the **limits of scale** in single-cell foundation models — revealing that scaling up pretraining data does not always yield superior downstream performance. |
|
|
|
|
|
--- |
|
|
|
|
|
## Key Results |
|
|
|
|
|
| Task Type | Comparison | Key Finding |
|-----------|------------|-------------|
| **Pretraining Objectives** | GF-CAB vs. Geneformer | Higher masked-prediction accuracy and diversity across scales |
| **Classification Tasks** | GF-CAB-1M vs. Geneformer-1M | Comparable or improved accuracy, narrowing the scale gap |
| **Zero-shot Batch Mitigation** | GF-CAB vs. Geneformer | Stronger generalization across datasets, less scale-dependent |
|
|
|
|
|
> Scaling pretraining data from 1M to 30M profiles improved performance on discriminative tasks but reduced cross-dataset robustness; architectural calibration in GF-CAB balanced both.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Base architecture:** Transformer encoder (BERT-style masked modeling)
- **Input representation:** Rank-ordered gene expression profiles per cell (see the tokenization sketch below)
- **Masking objective:** Predict masked gene ranks; unmasked positions are excluded from the loss
- **Innovations:**
  - Cumulative probability recalibration (adjusted decoding dynamics)
  - Similarity-based penalty loss (reduces redundancy in token predictions)
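
A minimal sketch of the rank-based input encoding, assuming Geneformer-style tokenization (genes sorted by normalized expression, highest first, with zeros dropped). The helper and its arguments are illustrative; the actual tokenizer ships with the Geneformer repository:

```python
import numpy as np

def rank_tokenize(expr, gene_token_ids, max_len=2048):
    """Turn one cell's expression vector into a rank-ordered token sequence.

    expr:           (G,) array of normalized expression values for G genes
    gene_token_ids: (G,) array giving the vocabulary id of each gene
    Returns the ids of expressed genes, most highly expressed first.
    """
    expressed = np.flatnonzero(expr > 0)
    order = expressed[np.argsort(-expr[expressed])]   # descending expression
    return gene_token_ids[order][:max_len].tolist()
```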
|
|
|
|
|
--- |
|
|
|
|
|
## Pretraining Data |
|
|
|
|
|
| Dataset | Description | Size |
|---------|-------------|------|
| **Genecorpus-1M** | Random subset of ranked single-cell profiles from public scRNA-seq datasets | 1 million profiles |
| **Genecorpus-30M** | Large-scale extension incorporating additional datasets and donors | 30 million profiles |
|
|
|
|
|
--- |
|
|
|
|
|
## Downstream Evaluation |
|
|
|
|
|
1. **Cell-type classification** (3 benchmark tasks) |
|
|
2. **Zero-shot batch-effect mitigation** (4 public datasets) |
|
|
|
|
|
Evaluation followed standardized pipelines based on Theodoris et al. (for classification) and Kedzierska et al. (for zero-shot robustness). |
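
For the zero-shot setting, a common recipe is to embed each cell with the frozen encoder and then score batch mixing on those embeddings. The mean pooling below is an assumption; the cited pipelines may pool differently:

```python
import torch

@torch.no_grad()
def cell_embeddings(model, input_ids, attention_mask):
    """Mean-pool final hidden states over real tokens to get one
    embedding per cell (assumed pooling choice)."""
    out = model(input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True)
    h = out.hidden_states[-1]                       # (B, T, D)
    mask = attention_mask.unsqueeze(-1).float()     # (B, T, 1)
    return (h * mask).sum(dim=1) / mask.sum(dim=1)  # (B, D)
```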
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
- Benchmarking **foundation models** on single-cell gene expression tasks |
|
|
- Studying **scaling effects** in biological pretraining |
|
|
- Investigating **rank-based profile modeling** and representation diversity |
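
A hedged loading sketch with `transformers`; the repo id `JFLa/GF-CAB` is inferred from the dataset namespace above and may not match the actual hub id of this model:

```python
from transformers import AutoModel, AutoModelForTokenClassification

# Repo id is an assumption; substitute the actual hub id of this model card.
MODEL_ID = "JFLa/GF-CAB"

# For the token-classification head advertised by this card:
clf = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

# For zero-shot embedding work (see the pooling sketch above):
encoder = AutoModel.from_pretrained(MODEL_ID)
```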
|
|
|
|
|
--- |