🧬 Geneformer Distilled

A compact, efficient distillation of Geneformer optimized for resource-constrained environments


πŸ“‹ Model Overview

This distilled version of Geneformer brings the power of single-cell sequence representation to researchers with limited computational resources. Built on a BERT-like architecture, it preserves core capabilities while dramatically reducing model size and inference time.

Architecture Details

| Component | Specification |
|---|---|
| Parameters | 4.3M |
| Hidden Size | 128 |
| Layers | 4 |
| Attention Heads | 4 |
| Intermediate Size | 512 |
| Max Sequence Length | 2048 |
| Vocabulary Size | 25,426 |
| Dropout | 0.1 |
| Format | PyTorch (.pt) |
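
Since the model is built on a BERT-like architecture, the table maps directly onto a Hugging Face `BertConfig`. The snippet below is a minimal sketch of such a configuration using the values above; the exact config class used to build this checkpoint is an assumption.

```python
from transformers import BertConfig

# Sketch of a configuration matching the table above (assumes a BertConfig-
# compatible backbone; not necessarily the exact class used for training).
config = BertConfig(
    vocab_size=25426,
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=2048,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
```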

πŸ”¬ Distillation Process

Knowledge distillation was employed to transfer the learned representations from the original Geneformer to this compact student model.

  • Teacher Model: Original Geneformer
  • Distillation Method: Temperature-softened Knowledge Distillation
  • Loss Function: Weighted combination of KL Divergence and Cross-Entropy
  • Objective: Reproduce the teacher's output distributions as faithfully as possible while minimizing computational requirements (see the loss sketch below)
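
As a concrete illustration, a temperature-softened distillation loss of the kind described above can be written as follows. This is a minimal sketch: the temperature, the weighting `alpha`, and the function name are assumptions, not the actual training code for this checkpoint.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted combination of KL divergence (soft targets) and
    cross-entropy (hard targets); hyperparameter values are illustrative."""
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: standard masked-LM cross-entropy against true token ids.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * soft + (1 - alpha) * hard
```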

πŸ“Š Performance Metrics

Masked Language Modeling

| Metric | Teacher | Student | Delta |
|---|---|---|---|
| MLM Accuracy | 30.59% | 25.34% | -5.25% |
| Perplexity | 15.40 | 22.48 | +7.08 |
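
For reference, the two MLM metrics are linked: perplexity is the exponential of the mean cross-entropy loss on masked tokens. A quick sanity check (the evaluation script itself is not shown here):

```python
import math

# Perplexity = exp(mean masked-LM cross-entropy loss), by definition.
def perplexity(mean_ce_loss: float) -> float:
    return math.exp(mean_ce_loss)

# The student's perplexity of 22.48 implies a mean loss of ln(22.48) ~= 3.11;
# the teacher's 15.40 implies ~= 2.73.
```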

Downstream Classification Task

| Metric | Teacher | Student |
|---|---|---|
| Accuracy | 83.02% | 72.67% |
| Macro F1 | 79.04% | 66.73% |

Note: Across these benchmarks the student retains roughly 83% or more of the teacher's performance while using only a fraction of the parameters, making it well suited to deployment in resource-limited settings.


🎯 Intended Use Cases

This model is designed for:

  • Single-cell RNA-seq analysis with limited GPU memory
  • Rapid prototyping of genomic ML pipelines
  • Educational purposes and reproducibility studies
  • Downstream tasks including:
    • Cell type classification
    • Gene expression clustering
    • Transfer learning for specialized datasets (see the fine-tuning sketch below)
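
For the classification-style tasks above, a typical starting point is to attach a sequence-classification head to the pretrained backbone. This is a hypothetical sketch: the head class choice, `num_labels`, and training details are assumptions, not a documented recipe for this model.

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical fine-tuning setup for cell type classification;
# num_labels depends on your annotation scheme and is an assumption.
clf = AutoModelForSequenceClassification.from_pretrained(
    "kkkamur07/geneformer-4.3M",
    num_labels=8,
)
# From here, fine-tune on your labeled single-cell data, e.g. with
# transformers' Trainer or a standard PyTorch training loop.
```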

⚠️ Limitations

  • Performance metrics are lower than the full-scale Geneformer
  • Reduced model capacity may limit representation of complex biological patterns
  • Best suited for tasks where computational efficiency is prioritized over maximum accuracy

πŸš€ Getting Started

```python
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset

# Load the distilled model and its tokenizer
model = AutoModel.from_pretrained("kkkamur07/geneformer-4.3M")
tokenizer = AutoTokenizer.from_pretrained("kkkamur07/geneformer-4.3M")

# The training corpus is available pre-tokenized
dataset = load_dataset("ctheodoris/Genecorpus-30M")

# Example usage (gene_sequences is a placeholder for your input sequences)
inputs = tokenizer(gene_sequences, return_tensors="pt", padding=True)
outputs = model(**inputs)
```

πŸ“„ License

This model is released under the Apache 2.0 License, inheriting the licensing terms of the original Geneformer model.


🀝 Acknowledgments

Built upon the foundational work of the Geneformer team and trained on the Genecorpus-30M dataset.
