---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# sCellTransformer
|
|
sCellTransformer (sCT) is a long-range foundation model designed for zero-shot
prediction tasks on single-cell RNA-seq and spatial transcriptomics data. It processes
raw gene expression profiles across multiple cells to predict discretized gene
expression levels for unseen cells without retraining. The model handles up to 20,000
protein-coding genes over a bag of 50 cells from the same sample, i.e. around one
million gene-expression tokens per input. This long context allows it to learn
cross-cell relationships, capture long-range dependencies in gene expression data,
and mitigate the sparsity typical of single-cell datasets.
|
|
sCT is pre-trained on a large single-cell RNA-seq dataset and finetuned on spatial
transcriptomics data. Evaluation tasks include zero-shot imputation of masked gene
expression and zero-shot prediction of cell types.
|
|
| **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) |
|
|
| ### Model Sources |
|
|
|
|
- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data](https://openreview.net/pdf?id=VdX9tL3VXH)
|
|
| ### How to use |
|
|
Until its next release, the transformers library needs to be installed from source with
the following command in order to use the model. PyTorch is also required.
|
|
| ``` |
| pip install --upgrade git+https://github.com/huggingface/transformers.git |
| pip install torch |
| ``` |
|
|
The following snippet shows how to run inference with the model on random
input.
|
|
| ``` |
| import torch |
| from transformers import AutoModel |
| |
| model = AutoModel.from_pretrained( |
| "InstaDeepAI/sCellTransformer", |
| trust_remote_code=True, |
| ) |
| num_cells = model.config.num_cells |
| dummy_gene_expressions = torch.randint(0, 5, (1, 19968 * num_cells)) |
| torch_output = model(dummy_gene_expressions) |
| ``` |
|
|
A more concrete example is provided in the example notebook on one of the downstream
evaluation datasets.
|
|
| #### Training data |
|
|
| The model was trained following a two-step procedure: |
| pre-training on single-cell data, then finetuning on spatial transcriptomics data. |
The single-cell data used for pre-training comes from the
[Cellxgene Census collection datasets](https://cellxgene.cziscience.com/)
used to train the scGPT models. It consists of around 50 million
cells and approximately 60,000 genes. The spatial data comes from both the
[human breast cell atlas](https://cellxgene.cziscience.com/collections/4195ab4c-20bd-4cd3-8b3d-65601277e731)
and [the human heart atlas](https://www.heartcellatlas.org/).
|
|
| #### Training procedure |
|
|
As detailed in the paper, the gene expressions are first discretized into a pre-defined
number of bins. This helps the model learn the distribution of gene expressions by
mitigating sparsity, reducing noise, and handling extreme values.
The training objective is then to predict the masked gene expression bins in a cell,
following a BERT-style masked-modeling setup.
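To illustrate the binning step, here is a minimal sketch; the bin count, clipping threshold, and equal-width scheme are illustrative assumptions, not the paper's exact procedure:

```python
# Hedged sketch of value binning for gene-expression counts.
# Bin edges and parameters are illustrative only.
def bin_expression(counts, num_bins=5, max_count=10):
    """Map raw counts to discrete bin indices in [0, num_bins - 1]."""
    width = max_count / num_bins
    binned = []
    for c in counts:
        clipped = min(c, max_count)              # extreme-value handling
        idx = min(int(clipped / width), num_bins - 1)
        binned.append(idx)
    return binned

print(bin_expression([0, 1, 3, 7, 50]))  # [0, 0, 1, 3, 4]
```

Zeros and near-zero counts collapse into the lowest bin, which is one way discretization mitigates the extreme sparsity of single-cell count matrices.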
|
|
| ### BibTeX entry and citation info |
|
|
| ``` |
| @misc{joshi2025a, |
| title={A long range foundation model for zero-shot predictions in single-cell and |
| spatial transcriptomics data}, |
| author={Ameya Joshi and Raphael Boige and Lee Zamparo and Ugo Tanielian and Juan Jose |
| Garau-Luis and Michail Chatzianastasis and Priyanka Pandey and Janik Sielemann and |
| Alexander Seifert and Martin Brand and Maren Lang and Karim Beguir and Thomas PIERROT}, |
| year={2025}, |
| url={https://openreview.net/forum?id=VdX9tL3VXH} |
| } |
| ``` |