	---
	license: apache-2.0
	language:
	- en
	tags:
	- biology
	- single-cell
	- rna-seq
	- scRNA-seq
	- embeddings
	---

	# SCimilarity — Extended Model

	An extended version of [SCimilarity](https://github.com/Genentech/scimilarity), a metric-learning model for single-cell RNA-seq that maps cells to a unified 128-dimensional embedding space. The original model and method are described in:

	> Heimberg et al., "A cell atlas foundation model for scalable search of similar human cells", Nature, 2024. https://doi.org/10.1038/s41586-024-08411-y

	---

	## What's different here

	The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus about five times larger (39.5 million cells), extracted from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform).

	| | Original | This model |
	|---|---|---|
	| Training cells | 7.9 M | 39.5 M |
	| Search index cells | 23.4 M | 45.5 M |

	---

	## Repository contents

	```
	├── encoder.ckpt           # encoder weights (use this for embedding)
	├── decoder.ckpt           # decoder weights (reconstruction)
	├── gene_order.tsv         # 28,231 gene symbols the model expects as input
	├── layer_sizes.json       # network architecture
	├── hyperparameters.json   # training hyperparameters
	├── label_ints.csv         # cell type label → integer mappings
	├── metadata.json          # dataset metadata
	├── reference_labels.tsv   # per-cell metadata for all reference cells
	│                          # (cell type, donor, tissue, dataset)
	├── annotation/
	│   └── labelled_kNN.bin   # kNN index for cell type annotation
	└── cellsearch/
	    └── full_kNN.bin       # kNN index for similarity search
	```

	The index files (`annotation/` and `cellsearch/`) are large (~160 GB combined) but optional. To embed cells into the latent space (for clustering, visualization, or building your own index), only `encoder.ckpt`, `gene_order.tsv`, and `layer_sizes.json` are required.
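
A minimal sketch of fetching just the encoder files and checking that they are in place. It assumes the repository id is `hussenmi/scimilarity_expanded_model` (taken from this page, not verified here) and that `huggingface_hub` is installed. The helper names `missing_encoder_files` and `download_encoder_only` are hypothetical and not part of SCimilarity.

```python
from pathlib import Path

# Assumption: repository id as shown on the Hugging Face page for this model.
REPO_ID = "hussenmi/scimilarity_expanded_model"

# The three files this README lists as sufficient for embedding.
ENCODER_ONLY_FILES = ["encoder.ckpt", "gene_order.tsv", "layer_sizes.json"]


def missing_encoder_files(model_dir):
    """Return the encoder-only files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in ENCODER_ONLY_FILES if not (root / name).is_file()]


def download_encoder_only(local_dir):
    """Download only the encoder files, skipping the large kNN indices."""
    # Imported lazily so the file check above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=ENCODER_ONLY_FILES,
        local_dir=local_dir,
    )
```

Calling `download_encoder_only("scimilarity_expanded_model")` and then `missing_encoder_files("scimilarity_expanded_model")` should return an empty list if the download succeeded.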

	---

	## Installation

	```bash
	pip install scimilarity
	```

	Or from source:

	```bash
	git clone https://github.com/Genentech/scimilarity
	cd scimilarity
	pip install -e .
	```

	---

	## Usage

	For full usage examples including cell type annotation and similarity search, see the [original SCimilarity notebooks](https://github.com/Genentech/scimilarity/tree/main/docs/notebooks). Simply point `model_path` to your local copy of this repository instead of the original model directory.

	### Encoder-only (no index required)

	If you want to embed cells without downloading the full index:

	```python
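	import scanpy as sc
	from scimilarity import CellEmbedding
	from scimilarity.utils import align_dataset, lognorm_counts

	# Load the encoder; model_path is the local directory holding this repository's files.
	ce = CellEmbedding(model_path="/path/to/model_v0")

	# Load your data. The SCimilarity tutorials keep raw counts in adata.layers["counts"].
	adata = sc.read_h5ad("your_data.h5ad")

	# Reorder and subset genes to match the model's expected gene order.
	adata = align_dataset(adata, ce.gene_order)

	# Apply the log-normalization the model expects as input.
	adata = lognorm_counts(adata)

	# Embed each cell into the 128-dimensional latent space and store the result.
	embeddings = ce.get_embeddings(adata.X)
	adata.obsm["X_scimilarity"] = embeddings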
	```
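
Once cells are embedded, a small similarity lookup can be run without the prebuilt indices. The sketch below is a brute-force illustration on a toy array standing in for `adata.obsm["X_scimilarity"]`. It is not how the `cellsearch/` index works, and the helper name `top_k_similar` is hypothetical.

```python
import numpy as np


def top_k_similar(embeddings, query_idx, k=5):
    """Return indices and cosine similarities of the k cells closest to one query cell."""
    # Re-normalize defensively so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query cell itself
    k = min(k, len(sims) - 1)
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy stand-in for real embeddings: 4 cells in 3 dimensions.
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
neighbor_idx, neighbor_sims = top_k_similar(toy, query_idx=0, k=2)
```

Brute force scales linearly with the number of cells per query, so for large collections an approximate nearest-neighbor index is the more practical choice.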

	---

	## Model architecture

	| Parameter | Value |
	|---|---|
	| Input genes | 28,231 |
	| Hidden layers | 3 × 1,024 |
	| Embedding dimension | 128 |
	| Normalization | L2 (unit hypersphere) |
	| Loss | Triplet (semi-hard) + MSE reconstruction |
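
To make the loss row concrete, here is a toy sketch of a triplet margin loss, the usual semi-hard condition, and a weighted sum with an MSE reconstruction term. The squared Euclidean distance, the margin value, and the weighting scheme are all assumptions for illustration; the values actually used in training are those in `hyperparameters.json`, not these.

```python
import math


def l2_normalize(vec):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther than the positive by at least `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def is_semi_hard(anchor, positive, negative, margin):
    """Semi-hard: negative is farther than the positive, but still inside the margin."""
    d_ap = sq_dist(anchor, positive)
    d_an = sq_dist(anchor, negative)
    return d_ap < d_an < d_ap + margin


def mse(original, reconstructed):
    """Mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)


def combined_loss(triplet_term, mse_term, weight):
    """Hypothetical weighting of the two terms; `weight` is not taken from the source."""
    return weight * triplet_term + (1.0 - weight) * mse_term


# Toy 2-D embeddings; the margin of 0.2 is illustrative only.
anchor = l2_normalize([1.0, 0.0])
positive = l2_normalize([0.9, 0.1])
negative = l2_normalize([0.8, 0.3])
example_loss = triplet_loss(anchor, positive, negative, margin=0.2)
```

In this toy case the negative is farther from the anchor than the positive but still within the margin, so the triplet is semi-hard and contributes a non-zero loss.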