Update README.md
README.md
CHANGED
|
@@ -1,3 +1,159 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
pipeline_tag: feature-extraction
|
| 4 |
+
tags:
|
| 5 |
+
- Single-cell
|
| 6 |
+
---
|
| 7 |
+
# scConcept
|
| 8 |
+
|
| 9 |
+
[![Tests][badge-tests]][tests]
|
| 10 |
+
[![Documentation][badge-docs]][documentation]
|
| 11 |
+
|
| 12 |
+
[badge-tests]: https://img.shields.io/github/actions/workflow/status/theislab/scConcept/test.yaml?branch=main
|
| 13 |
+
[badge-docs]: https://img.shields.io/readthedocs/scConcept
|
| 14 |
+
|
| 15 |
+
This repository contains the Python package for training and using scConcept (Single-cell contrastive cell pre-training), a method for single-cell transcriptomics.
|
| 16 |
+
|
| 17 |
+
<!-- ## Getting started
|
| 18 |
+
|
| 19 |
+
Please refer to the [documentation][],
|
| 20 |
+
in particular, the [API documentation][]. -->
|
| 21 |
+
|
| 22 |
+
## Installation
|
| 23 |
+
|
| 24 |
+
You need to have Python 3.12 or newer installed on your system.
|
| 25 |
+
If you don't have Python installed, we recommend installing [uv][].
|
| 26 |
+
|
| 27 |
+
### Default installation
|
| 28 |
+
|
| 29 |
+
Install the latest release of `sc-concept` from [PyPI][]:
|
| 30 |
+
|
| 31 |
+
```bash
|
| 32 |
+
pip install sc-concept
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
### Latest development version
|
| 36 |
+
|
| 37 |
+
To install the latest development version directly from GitHub:
|
| 38 |
+
|
| 39 |
+
```bash
|
| 40 |
+
pip install git+https://github.com/theislab/scConcept.git@main
|
| 41 |
+
```
|
| 42 |
+
|
| 43 |
+
### Optional Flash Attention speedup
|
| 44 |
+
|
| 45 |
+
The standard installation is enough for loading pretrained models, extracting embeddings, and light adaptation. For faster inference, embedding extraction, adaptation, or large-scale training, install [Flash Attention][] with one of the following options.
|
| 46 |
+
|
| 47 |
+
1. Recommended: `cd` to the project root and run [`./scripts/setup_env.sh`](https://github.com/theislab/scConcept/blob/main/scripts/setup_env.sh), which installs uv if needed and creates a virtual environment with the training dependencies.
|
| 48 |
+
|
| 49 |
+
2. Manual: make sure a CUDA-enabled version of PyTorch is installed. More information is available in the [PyTorch installation guide](https://pytorch.org/get-started/locally/). Then install Flash Attention:
|
| 50 |
+
|
| 51 |
+
```bash
|
| 52 |
+
MAX_JOBS=4 pip install "flash-attn>=2.7" --no-build-isolation
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
The build can take up to an hour, depending on your system specifications and on whether a pre-built release of `flash-attn` is available for your exact versions of Python, PyTorch, and CUDA. If it takes too long, we recommend using the setup script instead.
|
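Before choosing between the standard and the Flash Attention setup, it can help to check what is already installed. The following is a minimal sketch, not part of scConcept: `check_environment` is a hypothetical helper, and it only checks that the packages are importable by name (`torch`, `flash_attn`), not that CUDA actually works.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 12)):
    """Report whether the interpreter and optional packages look ready."""
    return {
        # The installation notes above ask for Python 3.12 or newer.
        "python_ok": sys.version_info >= min_python,
        # find_spec returns None when a top-level package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

If `torch_installed` is true, `torch.cuda.is_available()` is the usual next check before attempting the manual `flash-attn` build.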
| 56 |
+
|
| 57 |
+
|
| 58 |
+
## How to use
|
| 59 |
+
|
| 60 |
+
scConcept provides a simple API to load and adapt [pre-trained models](https://huggingface.co/theislab/scConcept/tree/main) and extract embeddings from scRNA-seq data.
|
| 61 |
+
|
| 62 |
+
### Pre-trained models
|
| 63 |
+
|
| 64 |
+
The following models are available from the [scConcept Hugging Face repository](https://huggingface.co/theislab/scConcept/tree/main). Use the value in the `model_name` column with `concept.load_config_and_model(model_name=...)`.
|
| 65 |
+
|
| 66 |
+
| `model_name` | Training corpus | Architecture | Max tokens | Species | Notes |
|
| 67 |
+
| --- | --- | --- | ---: | --- | --- |
|
| 68 |
+
| `corpus360M[multi-species]-model170M` | 360M cells (CellxGene 2026 + scBaseCount 2025) | 170M parameters, 16 layers, 1024 hidden size, 16 heads | 20,000 | 16 species | Largest multi-species checkpoint; best suited for cross-species applications with sufficient memory. |
|
| 69 |
+
| `corpus40M-model30M` | 40M cells (CellxGene 2023) | 30M parameters, 8 layers, 512 hidden size, 8 heads | 1,000 | Human | Recommended default for embedding extraction and light adaptation. |
|
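The choice between the two checkpoints can be written down as a small lookup. This is an illustrative sketch only: the `PRETRAINED_MODELS` dict and the `pick_model_name` helper are hypothetical and not part of the package; the model names and token limits are copied from the table above.

```python
# Values copied from the table above; the dict layout is an assumption for illustration.
PRETRAINED_MODELS = {
    "corpus360M[multi-species]-model170M": {"max_tokens": 20_000, "multi_species": True},
    "corpus40M-model30M": {"max_tokens": 1_000, "multi_species": False},
}


def pick_model_name(cross_species=False):
    """Return a model_name from the table (hypothetical helper).

    The smaller human model is the recommended default; the larger
    checkpoint is the only one listed as multi-species.
    """
    for name, info in PRETRAINED_MODELS.items():
        if info["multi_species"] == cross_species:
            return name
    raise ValueError("no matching model in the table")


print(pick_model_name())  # → corpus40M-model30M
```

The returned string is what the table says to pass as `model_name` to `concept.load_config_and_model(model_name=...)`.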
| 70 |
+
|
| 71 |
+
Here's a basic example:
|
| 72 |
+
|
| 73 |
+
```python
|
| 74 |
+
from concept import scConcept
|
| 75 |
+
import scanpy as sc
|
| 76 |
+
|
| 77 |
+
# Load your single-cell data
|
| 78 |
+
adata = sc.read_h5ad("your_data.h5ad")
|
| 79 |
+
|
| 80 |
+
# Initialize scConcept and load a pretrained model
|
| 81 |
+
concept = scConcept(cache_dir='./cache/')
|
| 82 |
+
|
| 83 |
+
# Option 1: Load a model directly from HuggingFace
|
| 84 |
+
concept.load_config_and_model(model_name='corpus40M-model30M')
|
| 85 |
+
|
| 86 |
+
# Option 2: Load any local model
|
| 87 |
+
concept.load_config_and_model(
|
| 88 |
+
    config='<path-to-config.yaml>',
|
| 89 |
+
    model_path='<path-to-model.ckpt>',
|
| 90 |
+
    gene_mappings_path='<path-to-gene-mappings-directory>',
|
| 91 |
+
)
|
| 92 |
+
|
| 93 |
+
# scConcept expects Ensembl gene IDs as input. If your data uses gene names, the built-in helper can map them:
|
| 94 |
+
adata.var['gene_id'] = concept.map_gene_names_to_ids(
|
| 95 |
+
    species='hsapiens',  # see concept.species for available species names
|
| 96 |
+
    gene_names=adata.var_names.tolist(),
|
| 97 |
+
)
|
| 98 |
+
|
| 99 |
+
# Extract embeddings; adata.var['gene_id'] should contain Ensembl IDs of the form ENSGXXXXXXXXXXX
|
| 100 |
+
result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
|
| 101 |
+
|
| 102 |
+
# Use embeddings for downstream analysis
|
| 103 |
+
adata.obsm['X_scConcept'] = result['cls_cell_emb']
|
| 104 |
+
```
|
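Since the model expects Ensembl gene IDs, a quick sanity check on the `gene_id` column before extraction can catch unmapped names early. The sketch below is not part of scConcept: `find_invalid_gene_ids` is a hypothetical helper that covers human IDs only (`ENSG` plus 11 digits) and assumes an optional `.N` version suffix is acceptable, which the package may or may not allow. Other species use different prefixes.

```python
import re

# Human Ensembl gene IDs: "ENSG" followed by 11 digits.
# Accepting a trailing ".N" version suffix is an assumption of this sketch.
HUMAN_ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def find_invalid_gene_ids(gene_ids):
    """Return the entries that do not look like human Ensembl gene IDs."""
    return [
        g for g in gene_ids
        if not isinstance(g, str) or HUMAN_ENSEMBL_GENE_ID.match(g) is None
    ]


ids = ["ENSG00000141510", "TP53", None, "ENSG00000012048.23"]
print(find_invalid_gene_ids(ids))  # → ['TP53', None]
```

With an AnnData object, the same check could be run as `find_invalid_gene_ids(adata.var['gene_id'].tolist())`.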
| 105 |
+
|
| 106 |
+
### Model adaptation
|
| 107 |
+
|
| 108 |
+
```python
|
| 109 |
+
# Adapt a pre-trained model on your own data
|
| 110 |
+
concept.train(adata, max_steps=10000, batch_size=128)
|
| 111 |
+
|
| 112 |
+
# Important: for multiple datasets, pass them separately as a list
|
| 113 |
+
concept.train([adata1, adata2, ...], max_steps=20000, batch_size=128)
|
| 114 |
+
|
| 115 |
+
result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
|
| 116 |
+
adata.obsm['X_scConcept_adapted'] = result['cls_cell_emb']
|
| 117 |
+
```
|
| 118 |
+
<!-- For a more detailed example, see the [example notebook](docs/notebooks/embedding_extraction.ipynb). -->
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
## Large-scale pre-training from scratch
|
| 122 |
+
|
| 123 |
+
`scConcept.train()` is intended only for light adaptation of pretrained models or small on-the-fly training runs. Use [train.py](https://github.com/theislab/scConcept/blob/main/src/concept/train.py) for distributed pre-training from scratch on a large corpus of data.
|
| 124 |
+
|
| 125 |
+
Before using `train.py`, follow the [lamindb](https://github.com/laminlabs/lamindb) instructions for setting up a lamin instance.
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
## Troubleshooting
|
| 129 |
+
|
| 130 |
+
If you encounter an error when loading a pre-trained model, try the following:
|
| 131 |
+
|
| 132 |
+
1. Remove the repository and clone the most recent version
|
| 133 |
+
2. Remove the cache directory (`cache/` by default)
|
| 134 |
+
3. Run again
|
| 135 |
+
|
| 136 |
+
This will force a fresh download of the pre-trained model and should resolve most loading issues.
|
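Step 2 can also be done from Python. This is a minimal sketch assuming the cache lives at `./cache/`, as in the usage example above; `clear_cache` is a hypothetical helper, not a scConcept function. Adjust the path if you passed a different `cache_dir`.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir="./cache/"):
    """Delete the cache directory if it exists; return True if it was removed."""
    path = Path(cache_dir)
    if not path.is_dir():
        # Nothing to delete, so there is no stale download to worry about.
        return False
    shutil.rmtree(path)
    return True
```

After calling `clear_cache()`, loading the model again should trigger the fresh download described above.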
| 137 |
+
|
| 138 |
+
<!-- ## Release notes
|
| 139 |
+
|
| 140 |
+
See the [changelog][]. -->
|
| 141 |
+
|
| 142 |
+
<!-- ## Contact
|
| 143 |
+
|
| 144 |
+
For questions and help requests, you can reach out in the [scverse discourse][].
|
| 145 |
+
If you found a bug, please use the [issue tracker][]. -->
|
| 146 |
+
|
| 147 |
+
## Citation
|
| 148 |
+
|
| 149 |
+
> Bahrami, M., Tejada-Lapuerta, A., Becker, S., Hashemi G, F.S. and Theis, F.J., 2025. scConcept: Contrastive pretraining for technology-agnostic single-cell representations beyond reconstruction. bioRxiv, pp.2025-10. doi: https://doi.org/10.1101/2025.10.14.682419
|
| 150 |
+
|
| 151 |
+
[uv]: https://github.com/astral-sh/uv
|
| 152 |
+
[Flash Attention]: https://github.com/Dao-AILab/flash-attention
|
| 153 |
+
[scverse discourse]: https://discourse.scverse.org/
|
| 154 |
+
[issue tracker]: https://github.com/theislab/scConcept/issues
|
| 155 |
+
[tests]: https://github.com/theislab/scConcept/actions/workflows/test.yaml
|
| 156 |
+
[documentation]: https://scConcept.readthedocs.io
|
| 157 |
+
[changelog]: https://scConcept.readthedocs.io/en/latest/changelog.html
|
| 158 |
+
[api documentation]: https://scConcept.readthedocs.io/en/latest/api.html
|
| 159 |
+
[pypi]: https://pypi.org/project/sc-concept
|