InstaDeepAI/sCellTransformer

@@ -1,15 +1,23 @@
 ---
 tags:
-- model_hub_mixin
-- pytorch_model_hub_mixin
 ---
 # sCellTransformer
-sCellTransformer (sCT) is a long-range foundation model designed for zero-shot prediction tasks
-in single-cell RNA-seq and spatial transcriptomics data. It processes raw gene expression profiles across multiple cells to predict discretized
-gene expression levels for unseen cells without retraining. The model handles up to 20,000 protein-coding genes and outputs around a million
-gene expression tokens, mitigating the sparsity typical in single-cell datasets.
 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
@@ -17,13 +25,15 @@ gene expression tokens, mitigating the sparsity typical in single-cell datasets.
 <!-- Provide the basic links for the model. -->
-- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
-- **Paper:** [A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data](https://openreview.net/pdf?id=VdX9tL3VXH)
 ### How to use
-Until its next release, the transformers library needs to be installed from source with the following command in order to use the models.
 PyTorch should also be installed.
 ```
@@ -31,7 +41,8 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 pip install torch
 ```
-A small snippet of code is given here in order to infer with the model from random input.
 ```
 import torch
@@ -44,4 +55,40 @@ model = AutoModel.from_pretrained(
 num_cells = model.config.num_cells
 dummy_gene_expressions = torch.randint(0, 5, (1, 19968 * num_cells))
 torch_output = model(dummy_gene_expressions)
 ```

 ---
 tags:
+  - model_hub_mixin
+  - pytorch_model_hub_mixin
 ---
 # sCellTransformer
+sCellTransformer (sCT) is a long-range foundation model designed for zero-shot
+prediction tasks in single-cell RNA-seq and spatial transcriptomics data. It processes
+raw gene expression profiles across multiple cells to predict discretized gene
+expression levels for unseen cells without retraining. The model can handle up to 20,000
+protein-coding genes and a bag of 50 cells from the same sample. This capacity
+(around a million gene expression tokens) allows it to learn cross-cell
+relationships and capture long-range dependencies in gene expression data,
+and to mitigate the sparsity typical of single-cell datasets.
+sCT is pre-trained on a large single-cell RNA-seq dataset and finetuned on spatial
+transcriptomics data. Evaluation tasks include zero-shot imputation of masked gene
+expression and zero-shot prediction of cell types.
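As a quick check of the "around a million" figure, the arithmetic can be sketched as below. The 19968 genes per cell come from the dummy input in the usage snippet and the 50 cells from the description above; that `model.config.num_cells` equals 50 is an assumption, not something this card states.

```python
# Values taken from the model card: the usage snippet feeds 19968 gene tokens
# per cell, and the description mentions a bag of 50 cells per sample.
GENES_PER_CELL = 19968
CELLS_PER_SAMPLE = 50  # assumption: the value held in model.config.num_cells


def sequence_length(genes_per_cell: int, cells_per_sample: int) -> int:
    """Number of gene expression tokens in one flattened input sample."""
    return genes_per_cell * cells_per_sample


total_tokens = sequence_length(GENES_PER_CELL, CELLS_PER_SAMPLE)
print(total_tokens)  # 998400
```

Under these assumptions a single input is a flat sequence of 998,400 tokens, which is where the "around a million" figure comes from.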
 **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
 <!-- Provide the basic links for the model. -->
+- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
+- **Paper:** [A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data](https://openreview.net/pdf?id=VdX9tL3VXH)
 ### How to use
+Until its next release, the transformers library needs to be installed from source with
+the following command in order to use the models.
 PyTorch should also be installed.
 ```
 pip install torch
 ```
+The short snippet below runs inference with the model on random
+input.
 ```
 import torch
 num_cells = model.config.num_cells
 dummy_gene_expressions = torch.randint(0, 5, (1, 19968 * num_cells))
 torch_output = model(dummy_gene_expressions)
+```
+A more concrete example is provided in the example notebook, which uses one of the
+downstream evaluation datasets.
+#### Training data
+The model was trained following a two-step procedure:
+pre-training on single-cell data, then finetuning on spatial transcriptomics data.
+The single-cell data used for pre-training comes from the
+[Cellxgene Census collection datasets](https://cellxgene.cziscience.com/)
+used to train the scGPT models. It consists of around 50 million
+cells and approximately 60,000 genes. The spatial data comes from both the [human
+breast cell atlas](https://cellxgene.cziscience.com/collections/4195ab4c-20bd-4cd3-8b3d-65601277e731)
+and [the human heart atlas](https://www.heartcellatlas.org/).
+#### Training procedure
+As detailed in the paper, gene expression values are first discretized into a
+predefined number of bins. This helps the model learn the distribution of gene
+expression by mitigating sparsity, reducing noise, and handling extreme values.
+The training objective is then to predict the masked gene expression values in a cell,
+following a BERT-style masked-prediction scheme.
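The two steps above can be illustrated with a minimal, standard-library-only sketch. It is not the implementation from the paper: the log-scaled binning rule, the clipping threshold `max_count`, the 15% default mask ratio, and the `MASK_ID` and `IGNORE` values are all assumptions chosen for illustration. `NUM_BINS = 5` is picked only to match the 0 to 4 range of the dummy input in the usage snippet.

```python
import math
import random

NUM_BINS = 5        # assumption: matches the 0..4 range of the dummy input
MASK_ID = NUM_BINS  # hypothetical extra token id reserved for masked positions
IGNORE = -1         # hypothetical label for positions that are not scored


def bin_expression(counts, num_bins=NUM_BINS, max_count=1000.0):
    """Map raw counts to integer bins in [0, num_bins - 1].

    Zero counts stay in bin 0; positive counts are clipped at max_count
    (extreme-value handling) and spread over the other bins on a log scale.
    """
    log_max = math.log1p(max_count)
    binned = []
    for c in counts:
        if c <= 0:
            binned.append(0)
            continue
        scaled = math.log1p(min(c, max_count)) / log_max  # in (0, 1]
        binned.append(min(num_bins - 1, 1 + int(scaled * (num_bins - 1))))
    return binned


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Hide a random subset of tokens and keep their bins as targets."""
    rng = random.Random(seed)
    num_masked = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    masked = set(rng.sample(range(len(tokens)), num_masked))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    labels = [t if i in masked else IGNORE for i, t in enumerate(tokens)]
    return inputs, labels


raw_counts = [0, 1, 10, 100, 1000, 5000]
tokens = bin_expression(raw_counts)
print(tokens)  # [0, 1, 2, 3, 4, 4]

inputs, labels = mask_tokens(tokens, mask_ratio=0.5)
print(inputs, labels)
```

In a real training loop, a model would receive `inputs` and be scored with a classification loss over the bins, only at positions where `labels` is not `IGNORE`.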
+### BibTeX entry and citation info
+```
+@misc{joshi2025a,
+title={A long range foundation model for zero-shot predictions in single-cell and
+spatial transcriptomics data},
+author={Ameya Joshi and Raphael Boige and Lee Zamparo and Ugo Tanielian and Juan Jose
+Garau-Luis and Michail Chatzianastasis and Priyanka Pandey and Janik Sielemann and
+Alexander Seifert and Martin Brand and Maren Lang and Karim Beguir and Thomas PIERROT},
+year={2025},
+url={https://openreview.net/forum?id=VdX9tL3VXH}
+}
 ```