Add nano-Geneformer as a community reference implementation

#588
by nqhuya - opened

What is nano-Geneformer?

I recently built nano-Geneformer, a lightweight and faithful reimplementation of Geneformer. It is designed to make the core code easier to read, reproduce, benchmark, and extend while preserving the original architecture and inference behavior.

Repository:
https://github.com/huynguyen250896/nano-Geneformer

Highlights

  • Supports all official Geneformer checkpoints (V1, V2-104M, V2-104M_CLcancer, and V2-316M)
  • Faithfully reproduces the original Geneformer architecture and inference pipeline
  • Cleaner, modern PyTorch implementation with simplified installation and dependency management
  • Suitable for learning, benchmarking, experimentation, fine-tuning, and future training from scratch

Validation

I carefully benchmarked nano-Geneformer against the official implementation to ensure it can serve as a practical community reference implementation.

Compared with the official implementation, nano-Geneformer:

  • reduces peak GPU memory by up to 56.8% for the largest Geneformer model (V2-316M)
  • achieves 1.06–1.15× faster inference
  • reproduces cell embeddings with mean cosine similarity ≈ 1.000000
  • preserves local/global representation geometry and pairwise distance structure across all official checkpoints

The full benchmark notebook is available in the repository.
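
For readers unfamiliar with the embedding comparison, here is a minimal sketch of how a mean cosine similarity between two sets of cell embeddings could be computed. It is not taken from the benchmark notebook: the function names `cosine_similarity` and `mean_cosine_similarity` are hypothetical, the toy vectors are made up for illustration, and it uses plain Python rather than PyTorch so it runs without extra dependencies.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(reference, candidate):
    """Average the per-cell cosine similarity over two embedding lists.

    Row i of `reference` and row i of `candidate` are assumed to be
    embeddings of the same cell from the two implementations.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    sims = [cosine_similarity(r, c) for r, c in zip(reference, candidate)]
    return sum(sims) / len(sims)


# Toy embeddings (illustrative values only, not benchmark data).
official = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
reimplemented = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]

score = mean_cosine_similarity(official, reimplemented)
print(f"mean cosine similarity: {score:.6f}")
```

A value close to 1 means the paired embeddings point in nearly the same direction. Cosine similarity ignores vector magnitude, which is one reason a check on pairwise distance structure is a useful complement.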

Why this PR?

The goal of nano-Geneformer is not to replace the official implementation, but to provide a lightweight community resource for users who want a smaller, easier-to-read implementation for learning, reproducibility, benchmarking, and research.

This PR only adds a link under Community Projects. It does not modify any code, pretrained models, datasets, checkpoints, or model behavior.

Thank you for your consideration.
