---
license: mit
language:
- en
base_model:
- aletlvl/Nicheformer
tags:
- single-cell
- biology
- transcriptomics
---
|
|
# Nicheformer |
|
|
|
|
|
Nicheformer is a transformer-based foundation model for single-cell and spatial omics, designed to understand and predict cellular niches and their interactions. It is pre-trained with a masked language modeling objective to learn representations of cellular contexts and their relationships.
|
|
|
|
|
## Model Description |
|
|
|
|
|
Nicheformer is built on a transformer architecture with the following key features: |
|
|
|
|
|
- **Architecture**: Transformer encoder with customizable number of layers and attention heads |
|
|
- **Pre-training**: Masked Language Modeling (MLM) objective with dynamic masking |
|
|
- **Input Processing**: Handles cell type, assay, and modality information |
|
|
- **Positional Encoding**: Supports both learnable and fixed positional embeddings |
|
|
- **Masking Strategy** (see the sketch after this list):
  - 80% of selected tokens are replaced with `[MASK]`
  - 10% are replaced with random tokens
  - 10% remain unchanged
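
A minimal sketch of this 80/10/10 dynamic masking scheme, using a hypothetical `mask_tokens` helper (the model's actual implementation may differ):

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                vocab_size: int, mlm_prob: float = 0.15):
    """Illustrative BERT-style dynamic masking with the 80/10/10 split."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select 15% of positions as prediction targets; ignore the rest in the loss
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100

    # 80% of the selected tokens are replaced with [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining selected tokens (10% overall) get a random token
    randomized = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & selected & ~masked)
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # The final 10% of selected tokens are left unchanged
    return input_ids, labels
```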
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- Transformer encoder layers: 12 |
|
|
- Hidden dimension: 512 |
|
|
- Attention heads: 16 |
|
|
- Feedforward dimension: 1024 |
|
|
- Maximum sequence length: 1500 |
|
|
- Vocabulary size: 25000 |
|
|
- Masking probability: 15% |
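
For reference, these hyperparameters map onto a configuration along the following lines (the key names below are illustrative, not the repository's actual config fields):

```python
# Illustrative hyperparameter summary; actual config field names may differ.
nicheformer_config = {
    "num_layers": 12,            # transformer encoder layers
    "hidden_dim": 512,           # hidden dimension
    "num_attention_heads": 16,   # attention heads
    "ffn_dim": 1024,             # feedforward dimension
    "max_seq_len": 1500,         # maximum sequence length
    "vocab_size": 25_000,        # vocabulary size
    "mlm_probability": 0.15,     # masking probability
}
```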
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForMaskedLM, AutoTokenizer |
|
|
import anndata as ad
import numpy as np
|
|
|
|
|
# Load model and tokenizer |
|
|
model = AutoModelForMaskedLM.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True) |
|
|
tokenizer = AutoTokenizer.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True) |
|
|
|
|
|
# Load the technology-specific mean expression and attach it to the tokenizer
technology_mean_path = "technology_mean.npy"
technology_mean = np.load(technology_mean_path)
tokenizer._load_technology_mean(technology_mean)
|
|
|
|
|
# Load your single-cell data |
|
|
adata = ad.read_h5ad("your_data.h5ad") |
|
|
|
|
|
# Tokenize the data |
|
|
inputs = tokenizer(adata) |
|
|
|
|
|
# Get embeddings |
|
|
embeddings = model.get_embeddings( |
|
|
input_ids=inputs["input_ids"], |
|
|
attention_mask=inputs["attention_mask"], |
|
|
layer=-1, |
|
|
with_context=False |
|
|
) |
|
|
``` |
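
Assuming `get_embeddings` returns one embedding per cell as a `torch.Tensor`, the representations can be stored back in the AnnData object for downstream analysis (the `obsm` key below is arbitrary):

```python
# Attach the per-cell embeddings to the AnnData object (key name is arbitrary)
adata.obsm["X_nicheformer"] = embeddings.detach().cpu().numpy()
```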
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on single-cell gene expression data from various tissues and organisms. It supports: |
|
|
|
|
|
- **Modalities**: spatial and dissociated |
|
|
- **Species**: human and mouse |
|
|
- **Technologies**: |
|
|
- MERFISH |
|
|
- CosMx |
|
|
- Xenium |
|
|
- 10x Genomics (various versions) |
|
|
- CITE-seq |
|
|
- Smart-seq v4 |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- The model is specifically designed for gene expression data and may not generalize to other types of biological data |
|
|
- Performance may vary depending on the quality and type of input data |
|
|
- The model works best with data from supported species and technologies |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT License. See the LICENSE file for more details. |
|
|
|
|
|
|
|
|
|
|
This is the official model repository for **Nicheformer: a foundation model for single-cell and spatial omics**.
|
|
|
|
|
[](https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1) |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use our tool or build upon our concepts in your own work, please cite it as:
|
|
|
|
|
``` |
|
|
Schaar, A.C., Tejada-Lapuerta, A., et al. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv (2024). doi: https://doi.org/10.1101/2024.04.15.589472 |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions and help requests, please open an issue on the [issue tracker][issue-tracker] or email the corresponding author (alejandro.tejadalapuerta@helmholtz-munich.de).
|
|
|
|
|
|
|
|
[issue-tracker]: https://github.com/theislab/nicheformer/issues |