mojtababahrami committed
Commit de7c301 · verified · 1 Parent(s): 45d8b03

Update README.md

Files changed (1)
  1. README.md +159 -3
README.md CHANGED
@@ -1,3 +1,159 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ pipeline_tag: feature-extraction
4
+ tags:
5
+ - Single-cell
6
+ ---
7
+ # scConcept
8
+
9
+ [![Tests][badge-tests]][tests]
10
+ [![Documentation][badge-docs]][documentation]
11
+
12
+ [badge-tests]: https://img.shields.io/github/actions/workflow/status/theislab/scConcept/test.yaml?branch=main
13
+ [badge-docs]: https://img.shields.io/readthedocs/scConcept
14
+
15
+ This repository contains the Python package for training and using scConcept (Single-cell contrastive cell pre-training), a method for single-cell transcriptomics.
16
+
17
+ <!-- ## Getting started
18
+
19
+ Please refer to the [documentation][],
20
+ in particular, the [API documentation][]. -->
21
+
22
+ ## Installation
23
+
24
+ You need to have Python 3.12 or newer installed on your system.
25
+ If you don't have Python installed, we recommend installing [uv][].
26
+
27
+ ### Default installation
28
+
29
+ Install the latest release of `sc-concept` from [PyPI][]:
30
+
31
+ ```bash
32
+ pip install sc-concept
33
+ ```
34
+
35
+ ### Latest development version
36
+
37
+ To install the latest development version directly from GitHub:
38
+
39
+ ```bash
40
+ pip install git+https://github.com/theislab/scConcept.git@main
41
+ ```
42
+
43
+ ### Optional Flash Attention speedup
44
+
45
+ The standard installation is enough for loading pre-trained models, extracting embeddings, and light adaptation. For faster inference, embedding extraction, adaptation, or large-scale training, install [Flash Attention][] with one of the following options.
46
+
47
+ 1. Recommended: `cd` to the project root and run [`./scripts/setup_env.sh`](https://github.com/theislab/scConcept/blob/main/scripts/setup_env.sh), which installs uv if needed and creates a virtual environment with the training dependencies.
48
+
49
+ 2. Manual: make sure a CUDA-enabled version of PyTorch is installed. More information is available in the [PyTorch installation guide](https://pytorch.org/get-started/locally/). Then install Flash Attention:
50
+
51
+ ```bash
52
+ MAX_JOBS=4 pip install "flash-attn>=2.7" --no-build-isolation
53
+ ```
54
+
55
+ This can take up to an hour depending on the system specifications and whether a pre-built release of `flash-attn` is available for your exact versions of Python, PyTorch, and CUDA. If this takes long, we recommend using the setup script instead.
56
+
57
+
58
+ ## How to use
59
+
60
+ scConcept provides a simple API to load and adapt [pre-trained models](https://huggingface.co/theislab/scConcept/tree/main) and extract embeddings from scRNA-seq data.
61
+
62
+ ### Pre-trained models
63
+
64
+ The following models are available from the [scConcept Hugging Face repository](https://huggingface.co/theislab/scConcept/tree/main). Use the value in the `model_name` column with `concept.load_config_and_model(model_name=...)`.
65
+
66
+ | `model_name` | Training corpus | Architecture | Max tokens | Species | Notes |
67
+ | --- | --- | --- | ---: | --- | --- |
68
+ | `corpus360M[multi-species]-model170M` | 360M cells (CellxGene 2026 + scBaseCount 2025) | 170M parameters, 16 layers, 1024 hidden size, 16 heads | 20,000 | 16 species | Largest multi-species checkpoint; best suited for cross-species applications with sufficient memory. |
69
+ | `corpus40M-model30M` | 40M cells (CellxGene 2023) | 30M parameters, 8 layers, 512 hidden size, 8 heads | 1,000 | Human | Recommended default for embedding extraction and light adaptation. |
70
+
71
+ Here's a basic example:
72
+
73
+ ```python
74
+ from concept import scConcept
75
+ import scanpy as sc
76
+
77
+ # Load your single-cell data
78
+ adata = sc.read_h5ad("your_data.h5ad")
79
+
80
+ # Initialize scConcept and load a pretrained model
81
+ concept = scConcept(cache_dir='./cache/')
82
+
83
+ # Option 1: Load a model directly from HuggingFace
84
+ concept.load_config_and_model(model_name='corpus40M-model30M')
85
+
86
+ # Option 2 (alternative to Option 1): load a local model
87
+ concept.load_config_and_model(
88
+     config='<path-to-config.yaml>',
89
+     model_path='<path-to-model.ckpt>',
90
+     gene_mappings_path='<path-to-gene-mappings-directory>',
91
+ )
92
+
93
+ # scConcept expects Ensembl gene IDs as input. If needed, use the built-in helper to map gene names to IDs:
94
+ adata.var['gene_id'] = concept.map_gene_names_to_ids(
95
+     species='hsapiens',  # see concept.species for available species names
96
+     gene_names=adata.var_names.tolist(),
97
+ )
98
+
99
+ # Extract embeddings; adata.var['gene_id'] must hold Ensembl gene IDs (e.g. ENSGXXXXXXXXXXX)
100
+ result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
101
+
102
+ # Use embeddings for downstream analysis
103
+ adata.obsm['X_scConcept'] = result['cls_cell_emb']
104
+ ```
105
+
106
+ ### Model adaptation
107
+
108
+ ```python
109
+ # Adapt a pre-trained model on your own data
110
+ concept.train(adata, max_steps=10000, batch_size=128)
111
+
112
+ # Important: pass multiple datasets as separate AnnData objects in a list
113
+ concept.train([adata1, adata2, ...], max_steps=20000, batch_size=128)
114
+
115
+ result = concept.extract_embeddings(adata=adata, gene_id_column='gene_id')
116
+ adata.obsm['X_scConcept_adapted'] = result['cls_cell_emb']
117
+ ```
118
+ <!-- For more detailed example, see the [notebook example](docs/notebooks/embedding_extraction.ipynb). -->
119
+
120
+
121
+ ## Large-scale pre-training from scratch
122
+
123
+ `scConcept.train()` is intended only for light adaptation of pre-trained models or small on-the-fly training runs. For distributed pre-training from scratch over a large corpus of data, use [train.py](https://github.com/theislab/scConcept/blob/main/src/concept/train.py).
124
+
125
+ Before using `train.py`, follow the [lamindb](https://github.com/laminlabs/lamindb) instructions for setting up a lamin instance.
126
+
127
+
128
+ ## Troubleshooting
129
+
130
+ If you encounter an error when loading a pre-trained model, try the following:
131
+
132
+ 1. Remove the repository and clone the most recent version
133
+ 2. Remove the cache directory (`cache/` by default)
134
+ 3. Run your code again
135
+
136
+ This will force a fresh download of the pre-trained model and should resolve most loading issues.
137
+
138
+ <!-- ## Release notes
139
+
140
+ See the [changelog][]. -->
141
+
142
+ <!-- ## Contact
143
+
144
+ For questions and help requests, you can reach out in the [scverse discourse][].
145
+ If you found a bug, please use the [issue tracker][]. -->
146
+
147
+ ## Citation
148
+
149
+ > Bahrami, M., Tejada-Lapuerta, A., Becker, S., Hashemi G, F.S. and Theis, F.J., 2025. scConcept: Contrastive pretraining for technology-agnostic single-cell representations beyond reconstruction. bioRxiv, pp.2025-10. doi: https://doi.org/10.1101/2025.10.14.682419
150
+
151
+ [uv]: https://github.com/astral-sh/uv
152
+ [Flash Attention]: https://github.com/Dao-AILab/flash-attention
153
+ [scverse discourse]: https://discourse.scverse.org/
154
+ [issue tracker]: https://github.com/theislab/scConcept/issues
155
+ [tests]: https://github.com/theislab/scConcept/actions/workflows/test.yaml
156
+ [documentation]: https://scConcept.readthedocs.io
157
+ [changelog]: https://scConcept.readthedocs.io/en/latest/changelog.html
158
+ [api documentation]: https://scConcept.readthedocs.io/en/latest/api.html
159
+ [pypi]: https://pypi.org/project/sc-concept