---
license: apache-2.0
language:
- en
tags:
- biology
- single-cell
- rna-seq
- scRNA-seq
- embeddings
---

# SCimilarity: Extended Model

An extended version of [SCimilarity](https://github.com/Genentech/scimilarity), a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space.

The original model and method are described in:

> Heimberg et al., **"A cell atlas foundation model for scalable search of similar human cells"**, *Nature*, 2024. https://doi.org/10.1038/s41586-024-08411-y

---

## What's different here

The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus about five times larger, extracted from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform).

| | Original | This model |
|---|---|---|
| Training cells | 7.9 M | **39.5 M** |
| Search index cells | 23.4 M | **45.5 M** |

---

## Repository contents

```
├── encoder.ckpt              # encoder weights (use this for embedding)
├── decoder.ckpt              # decoder weights (reconstruction)
├── gene_order.tsv            # 28,231 gene symbols the model expects as input
├── layer_sizes.json          # network architecture
├── hyperparameters.json      # training hyperparameters
├── label_ints.csv            # cell type label → integer mappings
├── metadata.json             # dataset metadata
├── reference_labels.tsv      # per-cell metadata for all reference cells
│                             # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin      # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin          # kNN index for similarity search
```

**The index files (`annotation/` and `cellsearch/`) are large (~160 GB combined) but optional.** If you only need to embed cells into the latent space (for clustering, visualization, or building your own index), you only need `encoder.ckpt`, `gene_order.tsv`, and `layer_sizes.json`.

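
Before loading the model, it can help to confirm that a partial download contains the files listed above. The following is a minimal sketch using only the Python standard library; the helper names `missing_files` and `has_search_index` are hypothetical and not part of the `scimilarity` package, and the file lists are taken directly from the tree above.

```python
from pathlib import Path

# Files this README lists as sufficient for encoder-only embedding.
ENCODER_ONLY_FILES = ("encoder.ckpt", "gene_order.tsv", "layer_sizes.json")

# Optional kNN index files, needed only for annotation and similarity search.
INDEX_FILES = ("annotation/labelled_kNN.bin", "cellsearch/full_kNN.bin")


def missing_files(model_dir, required=ENCODER_ONLY_FILES):
    """Return the entries of `required` that are not files under model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


def has_search_index(model_dir):
    """Return True only if both optional kNN index files are present."""
    return not missing_files(model_dir, INDEX_FILES)
```

Calling `missing_files("/path/to/model_v0")` returns an empty list when the encoder-only files are all in place; `has_search_index` reports whether the large index files were downloaded as well.
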

---

## Installation

```bash
pip install scimilarity
```

Or from source:

```bash
git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .
```

---

## Usage

For full usage examples, including cell type annotation and similarity search, see the [original SCimilarity notebooks](https://github.com/Genentech/scimilarity/tree/main/docs/notebooks). Point `model_path` at your local copy of this repository instead of the original model directory.

### Encoder-only (no index required)

If you want to embed cells without downloading the full index:

```python
import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

# Load the encoder from a local copy of this repository
ce = CellEmbedding(model_path="/path/to/model_v0")

adata = sc.read_h5ad("your_data.h5ad")

# Reorder genes to match the model's expected input (gene_order.tsv)
adata = align_dataset(adata, ce.gene_order)
# Normalize and log-transform; this is expected to read raw counts from
# adata.layers["counts"] (check the scimilarity docs for your version)
adata = lognorm_counts(adata)

# Embed each cell into the 128-dimensional latent space
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings
```

---

## Model architecture

| Parameter | Value |
|---|---|
| Input genes | 28,230 |
| Hidden layers | 3 × 1,024 |
| Embedding dimension | 128 |
| Normalization | L2 (unit hypersphere) |
| Loss | Triplet (semi-hard) + MSE reconstruction |
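
One practical consequence of the L2 normalization listed in the table: for unit-length vectors, squared Euclidean distance and cosine similarity are related by `||a - b||² = 2 - 2·cos(a, b)`, so both rank neighbors identically. The sketch below illustrates this in pure Python with made-up 128-dimensional vectors; it is not taken from the `scimilarity` code, the function names are hypothetical, and in practice you would likely use NumPy on the real embeddings.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (a point on the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    """Dot product; equals cosine similarity when u and v have unit length."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy 128-dimensional vectors standing in for unnormalized encoder outputs.
a = l2_normalize([float(i % 7 + 1) for i in range(128)])
b = l2_normalize([float(i % 5 + 1) for i in range(128)])

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
lhs = squared_distance(a, b)
rhs = 2.0 - 2.0 * cosine_similarity(a, b)
assert math.isclose(lhs, rhs, abs_tol=1e-12)
```

This is why an index built on either metric should return the same nearest neighbors for embeddings that lie on the unit hypersphere.
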