---
license: apache-2.0
language:
- en
tags:
- biology
- single-cell
- rna-seq
- scRNA-seq
- embeddings
---
# SCimilarity: Extended Model
An extended version of [SCimilarity](https://github.com/Genentech/scimilarity), a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in:
> Heimberg et al., **"A cell atlas foundation model for scalable search of similar human cells"**, *Nature*, 2024. https://doi.org/10.1038/s41586-024-08411-y
---
## What's different here
The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus roughly five times larger (39.5 million training cells), extracted from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform).
| | Original | This model |
|---|---|---|
| Training cells | 7.9 M | **39.5 M** |
| Search index cells | 23.4 M | **45.5 M** |
---
## Repository contents
```
├── encoder.ckpt              # encoder weights (use this for embedding)
├── decoder.ckpt              # decoder weights (reconstruction)
├── gene_order.tsv            # 28,231 gene symbols the model expects as input
├── layer_sizes.json          # network architecture
├── hyperparameters.json      # training hyperparameters
├── label_ints.csv            # cell type label → integer mappings
├── metadata.json             # dataset metadata
├── reference_labels.tsv      # per-cell metadata for all reference cells
│                             # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin      # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin          # kNN index for similarity search
```
**The index files (`annotation/` and `cellsearch/`) are large (~160 GB combined) but optional.** If you only need to embed cells into the latent space (for clustering, visualization, or building your own index), you only need `encoder.ckpt`, `gene_order.tsv`, and `layer_sizes.json`.
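
Before loading the model, it can help to confirm that a partial download actually contains the encoder-only files named above. The following is a minimal sketch using only the standard library; `missing_encoder_files` and `has_search_indexes` are hypothetical helper names, not part of scimilarity, and the file names are taken from the repository listing.

```python
from pathlib import Path

# Files this README lists as sufficient for encoder-only embedding.
ENCODER_ONLY_FILES = ("encoder.ckpt", "gene_order.tsv", "layer_sizes.json")


def missing_encoder_files(model_dir):
    """Return the encoder-only files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in ENCODER_ONLY_FILES if not (root / name).is_file()]


def has_search_indexes(model_dir):
    """Report which of the optional kNN index files are present."""
    root = Path(model_dir)
    return {
        "annotation": (root / "annotation" / "labelled_kNN.bin").is_file(),
        "cellsearch": (root / "cellsearch" / "full_kNN.bin").is_file(),
    }
```

An empty list from `missing_encoder_files("/path/to/model_v0")` suggests the directory is ready for embedding, even when `has_search_indexes` reports both indexes as absent.
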
---
## Installation
```bash
pip install scimilarity
```
Or from source:
```bash
git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .
```
---
## Usage
For full usage examples including cell type annotation and similarity search, see the [original SCimilarity notebooks](https://github.com/Genentech/scimilarity/tree/main/docs/notebooks). Simply point `model_path` to your local copy of this repository instead of the original model directory.
### Encoder-only (no index required)
If you want to embed cells without downloading the full index:
```python
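import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

# Load the encoder from a local copy of this repository.
ce = CellEmbedding(model_path="/path/to/model_v0")
adata = sc.read_h5ad("your_data.h5ad")

# Reorder and subset genes to match the model's expected input order.
adata = align_dataset(adata, ce.gene_order)

# Log-normalize raw counts (the SCimilarity tutorials keep them in adata.layers["counts"]).
adata = lognorm_counts(adata)

# Compute the 128-dimensional embeddings and store them on the AnnData object.
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings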
```
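
With embeddings in hand, a small in-memory search can stand in for the prebuilt index on modest datasets. The sketch below is a brute-force cosine search in plain Python. `top_k_similar` is a hypothetical helper, not part of scimilarity, and it re-normalizes its inputs rather than assuming they are already unit length.

```python
import math


def top_k_similar(query, reference, k=5):
    """Return (index, cosine similarity) for the k reference rows closest to query."""

    def unit(vec):
        # Convert to plain floats so NumPy rows and Python lists both work.
        values = [float(x) for x in vec]
        norm = math.sqrt(sum(x * x for x in values))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in values]

    q = unit(query)
    scores = []
    for idx, row in enumerate(reference):
        r = unit(row)
        # On unit vectors the dot product equals the cosine similarity.
        scores.append((idx, sum(a * b for a, b in zip(q, r))))

    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

For example, `top_k_similar(embeddings[0], embeddings, k=10)` would rank the cells most similar to the first cell, with that cell itself as the top hit. This loop scales linearly with the number of reference cells, so a vectorized or approximate nearest-neighbor library is likely a better fit beyond a few thousand cells.
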
---
## Model architecture
| Parameter | Value |
|---|---|
| Input genes | 28,231 |
| Hidden layers | 3 × 1,024 |
| Embedding dimension | 128 |
| Normalization | L2 (unit hypersphere) |
| Loss | Triplet (semi-hard) + MSE reconstruction |
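
To make the normalization and loss rows concrete, here is a minimal plain-Python sketch of L2 normalization, the standard triplet hinge loss, the usual semi-hard condition on negatives, and a mean-squared reconstruction error. It is not the training code: the function names are illustrative, the `margin` default is a placeholder rather than the value used in training, and how the triplet and MSE terms are weighted against each other is not shown.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.05):
    """Hinge loss max(0, d(a, p) - d(a, n) + margin); margin is a placeholder."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def is_semi_hard(anchor, positive, negative, margin=0.05):
    """True when the negative is farther than the positive but still inside the margin."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return d_ap < d_an < d_ap + margin


def mse(x, x_hat):
    """Mean squared error between an input profile and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

Because embeddings are unit length, squared Euclidean distance and cosine similarity carry the same ranking information: for unit vectors `a` and `b`, `euclidean(a, b) ** 2` equals `2 - 2 * cos(a, b)`.
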