# 🧬 Geneformer Distilled
A compact, efficient distillation of Geneformer optimized for resource-constrained environments
## 📋 Model Overview
This distilled version of Geneformer brings the power of single-cell sequence representation to researchers with limited computational resources. Built on a BERT-like architecture, it preserves core capabilities while dramatically reducing model size and inference time.
### Architecture Details
| Component | Specification |
|---|---|
| Parameters | 4.3M |
| Hidden Size | 128 |
| Layers | 4 |
| Attention Heads | 4 |
| Intermediate Size | 512 |
| Max Sequence Length | 2048 |
| Vocabulary Size | 25,426 |
| Dropout | 0.1 |
| Format | PyTorch (.pt) |
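For orientation, the table above maps onto a standard Hugging Face `BertConfig` as shown below. This is an illustrative reconstruction from the listed hyperparameters, not the model's shipped configuration file:

```python
from transformers import BertConfig

# Reconstructed from the Architecture Details table above; assumes the
# standard BERT hyperparameter-to-field mapping used by Hugging Face.
config = BertConfig(
    vocab_size=25426,
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=2048,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
```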
## 🔬 Distillation Process
Knowledge distillation was employed to transfer the learned representations from the original Geneformer to this compact student model.
- Teacher Model: Original Geneformer (ctheodoris/Geneformer)
- Distillation Method: Temperature-softened Knowledge Distillation
- Loss Function: Weighted combination of KL Divergence and Cross-Entropy
- Objective: Reproduce the teacher's predictions as faithfully as possible while minimizing computational requirements (see the sketch below)
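As a concrete illustration, a temperature-softened KD objective of the kind described above typically looks like the following in PyTorch. The `temperature` and `alpha` values here are placeholders; the actual training hyperparameters are not documented in this card:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted KL-divergence + cross-entropy loss for masked-LM distillation.

    `temperature` and `alpha` are illustrative placeholders; the values used
    to train this model are not documented in the card.
    """
    vocab = student_logits.size(-1)

    # Temperature-softened distributions; the KL term is scaled by T^2 so its
    # gradient magnitude stays comparable across temperatures.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Standard MLM cross-entropy against the true masked-token labels
    # (non-masked positions are conventionally set to -100 and ignored).
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1),
                         ignore_index=-100)

    return alpha * kd + (1.0 - alpha) * ce
```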
## 📊 Performance Metrics
### Masked Language Modeling
| Metric | Teacher | Student | Delta |
|---|---|---|---|
| MLM Accuracy | 30.59% | 25.34% | −5.25 pp |
| Perplexity | 15.40 | 22.48 | +7.08 |
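Since perplexity is the exponential of the mean per-token cross-entropy, the perplexities above correspond to per-token losses of roughly 2.73 nats (teacher) and 3.11 nats (student):

```python
import math

# Perplexity = exp(mean cross-entropy), so the table's perplexities imply:
teacher_loss = math.log(15.40)  # ≈ 2.73 nats per masked token
student_loss = math.log(22.48)  # ≈ 3.11 nats per masked token
```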
### Downstream Classification Task
| Metric | Teacher | Student |
|---|---|---|
| Accuracy | 83.02% | 72.67% |
| Macro F1 | 79.04% | 66.73% |
*Note: The student retains roughly 83% of the teacher's MLM accuracy (25.34 / 30.59 ≈ 0.83) and 84–88% of its downstream classification performance, while using only a fraction of the parameters, making it well suited for deployment in resource-limited settings.*
## 🎯 Intended Use Cases
This model is designed for:
- Single-cell RNA-seq analysis with limited GPU memory
- Rapid prototyping of genomic ML pipelines
- Educational purposes and reproducibility studies
- Downstream tasks including:
  - Cell type classification (see the fine-tuning sketch below)
  - Gene expression clustering
  - Transfer learning for specialized datasets
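For the cell type classification use case, a hypothetical fine-tuning setup might look like the sketch below. The label count, dummy dataset, and hyperparameters are all placeholders, not values from the original training:

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Hypothetical fine-tuning setup for cell type classification.
model = AutoModelForSequenceClassification.from_pretrained(
    "kkkamur07/geneformer-4.3M",
    num_labels=10,  # placeholder: set to the number of cell types in your data
)

# Placeholder labeled dataset; replace with your own tokenized cells.
# Each example needs "input_ids" and an integer "label".
labeled_cells = Dataset.from_dict({
    "input_ids": [[5, 17, 42, 8]] * 8,  # dummy rank-value token ids
    "label": [0, 1] * 4,
})

args = TrainingArguments(
    output_dir="geneformer-celltype",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=labeled_cells).train()
```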
## ⚠️ Limitations
- Performance is lower than that of the full-scale Geneformer (see metrics above)
- Reduced model capacity may limit representation of complex biological patterns
- Best suited for tasks where computational efficiency is prioritized over maximum accuracy
## 🚀 Getting Started
```python
import torch
from datasets import load_dataset
from transformers import AutoModel

# Load the distilled model
model = AutoModel.from_pretrained("kkkamur07/geneformer-4.3M")

# Genecorpus-30M is pre-tokenized (rank-value encoded), so no text tokenizer is needed
dataset = load_dataset("ctheodoris/Genecorpus-30M")

# Example usage: embed one pre-tokenized cell
# (split and field names assume the standard Geneformer tokenized format)
input_ids = torch.tensor([dataset["train"][0]["input_ids"]])
outputs = model(input_ids=input_ids)
```
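Continuing from the example above, one simple way to collapse the token-level outputs into a single embedding per cell is mean pooling over the sequence dimension; this pooling choice is illustrative, not prescribed by this card:

```python
# Mean-pool the final hidden states into one 128-dimensional vector per cell.
# Other strategies (e.g., using the first token's state) are also common.
cell_embeddings = outputs.last_hidden_state.mean(dim=1)  # shape: (batch, hidden_size)
```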
## 📄 License
This model is released under the Apache 2.0 License, inheriting the licensing terms of the original Geneformer model.
## 🤝 Acknowledgments
Built upon the foundational work of the Geneformer team and trained on the Genecorpus-30M dataset.