🧬 Geneformer Distilled

A compact, efficient distillation of Geneformer optimized for resource-constrained environments


πŸ“‹ Model Overview

This distilled version of Geneformer brings the power of single-cell sequence representation to researchers with limited computational resources. Built on a BERT-like architecture, it preserves core capabilities while dramatically reducing model size and inference time.

Architecture Details

| Component | Specification |
|---|---|
| Parameters | 4.3M |
| Hidden Size | 128 |
| Layers | 4 |
| Attention Heads | 4 |
| Intermediate Size | 512 |
| Max Sequence Length | 2048 |
| Vocabulary Size | 25,426 |
| Dropout | 0.1 |
| Format | PyTorch (.pt) |
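
Since the model is built on a BERT-like architecture, the table maps directly onto a Hugging Face `BertConfig`. The snippet below is a minimal sketch of such a configuration using the values above; the exact config class used to build this checkpoint is an assumption.

```python
from transformers import BertConfig

# Sketch of a configuration matching the table above (assumes a BertConfig-
# compatible backbone; not necessarily the exact class used for training).
config = BertConfig(
    vocab_size=25426,
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=2048,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
```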

πŸ”¬ Distillation Process

Knowledge distillation was employed to transfer the learned representations from the original Geneformer to this compact student model.

  • Teacher Model: Original Geneformer
  • Distillation Method: Temperature-softened Knowledge Distillation
  • Loss Function: Weighted combination of KL Divergence and Cross-Entropy
  • Objective: Reproduce the teacher's output distributions as faithfully as possible while minimizing computational requirements (see the loss sketch below)
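
As a concrete illustration, a temperature-softened distillation loss of the kind described above can be written as follows. This is a minimal sketch: the temperature, the weighting `alpha`, and the function name are assumptions, not the actual training code for this checkpoint.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted combination of KL divergence (soft targets) and
    cross-entropy (hard targets); hyperparameter values are illustrative."""
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard masked-LM cross-entropy against true token ids.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * soft + (1 - alpha) * hard
```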

πŸ“Š Performance Metrics

Masked Language Modeling

| Metric | Teacher | Student | Delta |
|---|---|---|---|
| MLM Accuracy | 30.59% | 25.34% | -5.25% |
| Perplexity | 15.40 | 22.48 | +7.08 |
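
For reference, the two MLM metrics are linked: perplexity is the exponential of the mean cross-entropy loss on masked tokens. A quick sanity check (the evaluation script itself is not shown here):

```python
import math

# Perplexity = exp(mean masked-LM cross-entropy loss), by definition.
def perplexity(mean_ce_loss: float) -> float:
    return math.exp(mean_ce_loss)

# The student's perplexity of 22.48 implies a mean loss of ln(22.48) ~= 3.11;
# the teacher's 15.40 implies ~= 2.73.
```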

Downstream Classification Task

| Metric | Teacher | Student |
|---|---|---|
| Accuracy | 83.02% | 72.67% |
| Macro F1 | 79.04% | 66.73% |

Note: Across these benchmarks the student retains roughly 83% or more of the teacher's performance while using only a fraction of the parameters, making it well suited to deployment in resource-limited settings.


🎯 Intended Use Cases

This model is designed for:

  • Single-cell RNA-seq analysis with limited GPU memory
  • Rapid prototyping of genomic ML pipelines
  • Educational purposes and reproducibility studies
  • Downstream tasks including:
    • Cell type classification
    • Gene expression clustering
    • Transfer learning for specialized datasets (see the fine-tuning sketch below)
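
For the classification-style tasks above, a typical starting point is to attach a sequence-classification head to the pretrained backbone. This is a hypothetical sketch: the head class choice, `num_labels`, and training details are assumptions, not a documented recipe for this model.

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical fine-tuning setup for cell type classification;
# num_labels depends on your annotation scheme and is an assumption.
clf = AutoModelForSequenceClassification.from_pretrained(
    "kkkamur07/geneformer-4.3M",
    num_labels=8,
)
# From here, fine-tune on your labeled single-cell data, e.g. with
# transformers' Trainer or a standard PyTorch training loop.
```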

⚠️ Limitations

  • Performance metrics are lower than the full-scale Geneformer
  • Reduced model capacity may limit representation of complex biological patterns
  • Best suited for tasks where computational efficiency is prioritized over maximum accuracy

πŸš€ Getting Started

```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load the distilled model and its tokenizer
model = AutoModel.from_pretrained("kkkamur07/geneformer-4.3M")
tokenizer = AutoTokenizer.from_pretrained("kkkamur07/geneformer-4.3M")

# The training corpus is available pre-tokenized
dataset = load_dataset("ctheodoris/Genecorpus-30M")

# Example usage (gene_sequences is a placeholder for your input sequences)
inputs = tokenizer(gene_sequences, return_tensors="pt", padding=True)
outputs = model(**inputs)
```

πŸ“„ License

This model is released under the Apache 2.0 License, inheriting the licensing terms of the original Geneformer model.


🀝 Acknowledgments

Built upon the foundational work of the Geneformer team and trained on the Genecorpus-30M dataset.
