---
license: apache-2.0
language:
  - en
tags:
  - biology
  - single-cell
  - rna-seq
  - scRNA-seq
  - embeddings
---

# SCimilarity: Extended Model

An extended version of [SCimilarity](https://github.com/Genentech/scimilarity), a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in:

> Heimberg et al., **"A cell atlas foundation model for scalable search of similar human cells"**, *Nature*, 2024. https://doi.org/10.1038/s41586-024-08411-y

---

## What's different here

The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus about five times larger (39.5 million cells) extracted from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/), using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform).

| | Original | This model |
|---|---|---|
| Training cells | 7.9 M | **39.5 M** |
| Search index cells | 23.4 M | **45.5 M** |

---

## Repository contents

```
├── encoder.ckpt            # encoder weights (use this for embedding)
├── decoder.ckpt            # decoder weights (reconstruction)
├── gene_order.tsv          # 28,231 gene symbols the model expects as input
├── layer_sizes.json        # network architecture
├── hyperparameters.json    # training hyperparameters
├── label_ints.csv          # cell type label → integer mappings
├── metadata.json           # dataset metadata
├── reference_labels.tsv    # per-cell metadata for all reference cells
│                           # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin    # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin        # kNN index for similarity search
```

**The index files (`annotation/` and `cellsearch/`) are large (~160 GB combined) but optional.** If you only want to embed cells into the latent space (for clustering, visualization, or building your own index), you need just `encoder.ckpt`, `gene_order.tsv`, and `layer_sizes.json`.
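To fetch just those three files and skip the index directories, something like the following sketch may work. It assumes the `huggingface_hub` package is installed; the `repo_id` and `local_dir` values are placeholders you must supply, and the helper name `download_encoder_only` is hypothetical.

```python
# Files that the text above lists as sufficient for encoder-only use.
ENCODER_ONLY_FILES = ["encoder.ckpt", "gene_order.tsv", "layer_sizes.json"]


def download_encoder_only(repo_id, local_dir):
    """Download only the encoder-related files from a Hub repository.

    `repo_id` is a placeholder for this repository's Hub identifier;
    it is not filled in here because this README does not state it.
    """
    # Imported lazily so the module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=ENCODER_ONLY_FILES,  # everything else is skipped
    )
```

The returned path can then be passed as `model_path` when constructing the embedding object.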

---

## Installation

```bash
pip install scimilarity
```

Or from source:

```bash
git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .
```

---

## Usage

For full usage examples including cell type annotation and similarity search, see the [original SCimilarity notebooks](https://github.com/Genentech/scimilarity/tree/main/docs/notebooks). Simply point `model_path` to your local copy of this repository instead of the original model directory.

### Encoder-only (no index required)

If you want to embed cells without downloading the full index:

```python
import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

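# Load the encoder from a local copy of this repository.
ce = CellEmbedding(model_path="/path/to/model_v0")

adata = sc.read_h5ad("your_data.h5ad")
# Reorder and subset genes to match the model's expected input order.
adata = align_dataset(adata, ce.gene_order)
# Log-normalize raw counts (the upstream tutorials keep these in
# adata.layers["counts"]).
adata = lognorm_counts(adata)

# One 128-dimensional embedding per cell.
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings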
```
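Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so a small "own index" can be sketched with plain NumPy. The example below is a brute-force illustration on random stand-in data, not the library's search API; `cosine_top_k` is a hypothetical helper, and the arrays stand in for the `embeddings` matrix produced above.

```python
import numpy as np


def cosine_top_k(query, reference, k=5):
    """Return (indices, similarities) of the k most similar reference rows
    for each query row, using brute-force cosine similarity."""
    # Re-normalize defensively; for unit-norm embeddings this is a no-op.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = q @ r.T                      # cosine similarity matrix
    k = min(k, r.shape[0])
    top = np.argsort(-sims, axis=1)[:, :k]  # best matches first
    return top, np.take_along_axis(sims, top, axis=1)


# Toy stand-in data: 1,000 reference "cells" in a 128-dimensional space,
# and 3 queries that are slightly perturbed copies of the first 3 rows.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 128))
query = reference[:3] + 0.01 * rng.normal(size=(3, 128))

idx, sim = cosine_top_k(query, reference, k=5)
```

Brute force is fine for thousands of cells; for millions, an approximate nearest-neighbor library would be the more practical choice.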

---

## Model architecture

| Parameter | Value |
|---|---|
| Input genes | 28,230 |
| Hidden layers | 3 × 1,024 |
| Embedding dimension | 128 |
| Normalization | L2 (unit hypersphere) |
| Loss | Triplet (semi-hard) + MSE reconstruction |
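
As a rough sense of scale, the encoder's size can be estimated from the table. The sketch below assumes plain fully connected layers with biases (input → 1,024 → 1,024 → 1,024 → 128) and ignores any normalization or dropout layers, so the real checkpoint may differ; `dense_param_count` is a hypothetical helper and the gene count is the one given in the table.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix + bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Layer widths taken from the table above (assumed layout, see lead-in).
encoder_sizes = [28_230, 1_024, 1_024, 1_024, 128]
n_params = dense_param_count(encoder_sizes)
print(f"Estimated encoder parameters: {n_params:,}")
```

Almost all of the estimate comes from the first layer, since the input dimension is far larger than the hidden width.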