| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - biology |
| - single-cell |
| - rna-seq |
| - scRNA-seq |
| - embeddings |
| --- |
| |
| # SCimilarity: Extended Model |
|
|
| An extended version of [SCimilarity](https://github.com/Genentech/scimilarity), a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in: |
|
|
| > Heimberg et al., **"A cell atlas foundation model for scalable search of similar human cells"**, *Nature*, 2024. https://doi.org/10.1038/s41586-024-08411-y |
|
|
| --- |
|
|
| ## What's different here |
|
|
| The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus about five times larger (39.5 million cells), extracted from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform). |
|
|
| | | Original | This model | |
| |---|---|---| |
| | Training cells | 7.9 M | **39.5 M** | |
| | Search index cells | 23.4 M | **45.5 M** | |
|
|
| --- |
|
|
| ## Repository contents |
|
|
| ``` |
| ├── encoder.ckpt          # encoder weights (use this for embedding) |
| ├── decoder.ckpt          # decoder weights (reconstruction) |
| ├── gene_order.tsv        # 28,231 gene symbols the model expects as input |
| ├── layer_sizes.json      # network architecture |
| ├── hyperparameters.json  # training hyperparameters |
| ├── label_ints.csv        # cell type label to integer mappings |
| ├── metadata.json         # dataset metadata |
| ├── reference_labels.tsv  # per-cell metadata for all reference cells |
| │                         # (cell type, donor, tissue, dataset) |
| ├── annotation/ |
| │   └── labelled_kNN.bin  # kNN index for cell type annotation |
| └── cellsearch/ |
|     └── full_kNN.bin      # kNN index for similarity search |
| ``` |
|
|
| **The index files (`annotation/` and `cellsearch/`) are large (~160 GB combined) but optional.** To embed cells into the latent space (for clustering, visualization, or building your own index), you only need `encoder.ckpt`, `gene_order.tsv`, and `layer_sizes.json`. |
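If this repository is hosted on the Hugging Face Hub (an assumption: the front matter suggests it, but the repository ID is not stated here), fetching only the three encoder files might look like the sketch below. `missing_encoder_files` and `download_encoder_only` are illustrative helper names, not part of SCimilarity, and `repo_id` is a placeholder you must supply yourself.

```python
from pathlib import Path

# Files listed above as sufficient for embedding without the kNN indices.
ENCODER_ONLY_FILES = ["encoder.ckpt", "gene_order.tsv", "layer_sizes.json"]


def missing_encoder_files(model_dir):
    """Return the encoder-only files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in ENCODER_ONLY_FILES if not (root / name).is_file()]


def download_encoder_only(repo_id, local_dir):
    """Fetch just the encoder-only files from a Hugging Face Hub repository.

    repo_id is a placeholder (assumption): substitute the real repository ID.
    """
    # Imported lazily so the local check above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=ENCODER_ONLY_FILES,  # skip the large index files
        local_dir=local_dir,
    )
```

Calling `missing_encoder_files("/path/to/model_v0")` before constructing the embedding object gives a quick check that the partial download is complete.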
|
|
| --- |
|
|
| ## Installation |
|
|
| ```bash |
| pip install scimilarity |
| ``` |
|
|
| Or from source: |
|
|
| ```bash |
| git clone https://github.com/Genentech/scimilarity |
| cd scimilarity |
| pip install -e . |
| ``` |
|
|
| --- |
|
|
| ## Usage |
|
|
| For full usage examples including cell type annotation and similarity search, see the [original SCimilarity notebooks](https://github.com/Genentech/scimilarity/tree/main/docs/notebooks). Simply point `model_path` to your local copy of this repository instead of the original model directory. |
|
|
| ### Encoder-only (no index required) |
|
|
| If you want to embed cells without downloading the full index: |
|
|
| ```python |
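| import scanpy as sc |
| from scimilarity import CellEmbedding |
| from scimilarity.utils import align_dataset, lognorm_counts |
| |
| # Load the encoder; the kNN index files are not needed for this step |
| ce = CellEmbedding(model_path="/path/to/model_v0") |
| |
| adata = sc.read_h5ad("your_data.h5ad") |
| # Reorder genes to match the model's expected input |
| adata = align_dataset(adata, ce.gene_order) |
| # Log-normalize the counts |
| adata = lognorm_counts(adata) |
| |
| # Embed each cell into the 128-dimensional latent space |
| embeddings = ce.get_embeddings(adata.X) |
| adata.obsm["X_scimilarity"] = embeddings |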
| ``` |
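Before running the alignment step above, it can be useful to check how many of the model's genes your dataset actually contains, since a small overlap leaves most of the input vector empty. The helper below is a minimal sketch in plain Python; `gene_overlap` is a hypothetical name, not a SCimilarity function, and the gene lists are toy values. In practice you would pass `adata.var_names` (from the freshly loaded dataset) and `ce.gene_order`.

```python
def gene_overlap(dataset_genes, model_genes):
    """Summarize how many of the model's genes are present in a dataset."""
    model_set = set(model_genes)
    # dict.fromkeys drops duplicate symbols while keeping the original order
    shared = [g for g in dict.fromkeys(dataset_genes) if g in model_set]
    return {
        "n_model_genes": len(model_set),
        "n_shared": len(shared),
        "fraction_of_model": len(shared) / len(model_set) if model_set else 0.0,
        "unused_dataset_genes": sorted(set(dataset_genes) - model_set),
    }


# Toy example with made-up gene lists (illustration only).
report = gene_overlap(
    ["CD3E", "CD19", "MS4A1", "NOT_A_GENE"],
    ["CD3E", "CD19", "MS4A1", "CD14"],
)
print(report)
```

A low `fraction_of_model` usually points to a gene identifier mismatch (for example, Ensembl IDs where gene symbols are expected) rather than a genuinely sparse panel.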
|
|
| --- |
|
|
| ## Model architecture |
|
|
| | Parameter | Value | |
| |---|---| |
| | Input genes | 28,231 | |
| | Hidden layers | 3 × 1,024 | |
| | Embedding dimension | 128 | |
| | Normalization | L2 (unit hypersphere) | |
| | Loss | Triplet (semi-hard) + MSE reconstruction | |
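Because embeddings are L2-normalized, cosine similarity between two cells reduces to a dot product, and squared Euclidean distance equals `2 - 2 * cosine`. The sketch below illustrates this with a brute-force nearest-neighbor search in NumPy. It uses synthetic vectors as a stand-in for real embeddings, and `l2_normalize` and `top_k_cosine` are illustrative names, not part of SCimilarity. For millions of cells an approximate index would be more practical than this exhaustive search.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so it lies on the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_cosine(query, reference, k=5):
    """Return indices and cosine similarities of the k nearest reference rows."""
    q = l2_normalize(np.asarray(query, dtype=np.float64))
    r = l2_normalize(np.asarray(reference, dtype=np.float64))
    sims = q @ r.T  # for unit vectors, the dot product is the cosine similarity
    idx = np.argsort(-sims, axis=1)[:, :k]  # most similar first
    return idx, np.take_along_axis(sims, idx, axis=1)


# Synthetic stand-in for 128-dimensional embeddings (not real model output).
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 128))
query = reference[:3] + 0.01 * rng.normal(size=(3, 128))  # slightly perturbed copies

idx, sims = top_k_cosine(query, reference, k=5)
```

With real data, `reference` and `query` would be arrays such as `adata.obsm["X_scimilarity"]` from two datasets.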