---
license: mit
language:
- en
base_model:
- aletlvl/Nicheformer
tags:
- single-cell
- biology
- transcriptomics
---
# Nicheformer
Nicheformer is a transformer-based model designed for understanding and predicting cellular niches and their interactions. The model uses masked language modeling to learn representations of cellular contexts and their relationships.
## Model Description
Nicheformer is built on a transformer architecture with the following key features:
- **Architecture**: Transformer encoder with customizable number of layers and attention heads
- **Pre-training**: Masked Language Modeling (MLM) objective with dynamic masking
- **Input Processing**: Handles cell type, assay, and modality information
- **Positional Encoding**: Supports both learnable and fixed positional embeddings
- **Masking Strategy**:
  - 80% of selected tokens are replaced with [MASK]
  - 10% are replaced with random tokens
  - 10% remain unchanged
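
The masking strategy above, together with the 15% masking probability listed under Model Architecture, can be sketched in a few lines of NumPy. This is a minimal illustration of BERT-style dynamic masking, not the training code used for Nicheformer: the function name `mask_tokens`, the `mask_token_id` and `special_ids` arguments, and the `-100` ignore label (a common PyTorch convention) are all assumptions made for the example.

```python
import numpy as np


def mask_tokens(input_ids, mask_token_id, vocab_size,
                special_ids=(), mask_prob=0.15, rng=None):
    """Illustrative 80/10/10 masking; all names here are hypothetical."""
    rng = np.random.default_rng() if rng is None else rng
    input_ids = np.asarray(input_ids)

    # Labels hold the original token at selected positions and -100 elsewhere
    # (assumed ignore index, so unselected positions do not enter the loss).
    labels = np.full_like(input_ids, -100)

    # Select roughly `mask_prob` of the non-special positions for prediction.
    candidates = ~np.isin(input_ids, list(special_ids))
    selected = candidates & (rng.random(input_ids.shape) < mask_prob)
    labels[selected] = input_ids[selected]

    # Split the selected positions: 80% [MASK], 10% random token, 10% unchanged.
    roll = rng.random(input_ids.shape)
    to_mask = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)

    masked = input_ids.copy()
    masked[to_mask] = mask_token_id
    # Simplification: random replacements are drawn from the whole vocabulary,
    # so they may occasionally be special tokens.
    masked[to_random] = rng.integers(0, vocab_size, size=int(to_random.sum()))
    return masked, labels
```

Because the positions are redrawn on every call, each pass over the data sees a different mask, which is what "dynamic masking" refers to.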
### Model Architecture
- Transformer encoder layers: 12
- Hidden dimension: 512
- Attention heads: 16
- Feedforward dimension: 1024
- Maximum sequence length: 1500
- Vocabulary size: 25000
- Masking probability: 15%
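
As a quick sanity check on these numbers, the sketch below collects them in a small config object and derives the per-head dimension and a rough encoder parameter count. The class `EncoderConfig` is hypothetical and is not part of the Nicheformer code base; the parameter estimate assumes a standard Transformer encoder layer (four attention projections with biases, a two-layer feedforward block, two layer norms) and ignores embeddings and any model-specific heads, so treat it as an order-of-magnitude figure only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the hyperparameters listed above."""
    n_layers: int = 12
    d_model: int = 512
    n_heads: int = 16
    d_ff: int = 1024
    max_seq_len: int = 1500
    vocab_size: int = 25000

    @property
    def head_dim(self) -> int:
        # The hidden dimension must split evenly across attention heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

    def layer_params(self) -> int:
        # Assumes a standard encoder layer; the real model may differ.
        d, ff = self.d_model, self.d_ff
        attention = 4 * (d * d + d)          # Q, K, V, and output projections
        feedforward = 2 * d * ff + ff + d    # two linear layers with biases
        layer_norms = 2 * 2 * d              # scale and shift, two norms
        return attention + feedforward + layer_norms


cfg = EncoderConfig()
print(cfg.head_dim)                       # 512 / 16 = 32
print(cfg.n_layers * cfg.layer_params())  # rough encoder-stack parameter count
```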
## Usage
```python
import anndata as ad
import numpy as np
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer (custom code is fetched from the model repository)
model = AutoModelForMaskedLM.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)

# Set technology mean for HF tokenizer
technology_mean_path = "technology_mean.npy"
technology_mean = np.load(technology_mean_path)
tokenizer._load_technology_mean(technology_mean)

# Load your single-cell data
adata = ad.read_h5ad("your_data.h5ad")

# Tokenize the data
inputs = tokenizer(adata)

# Get embeddings from the last layer
embeddings = model.get_embeddings(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    layer=-1,
    with_context=False,
)
```
## Training Data
The model was trained on single-cell gene expression data from various tissues and organisms. It supports:
- **Modalities**: spatial and dissociated
- **Species**: human and mouse
- **Technologies**:
  - MERFISH
  - CosMx
  - Xenium
  - 10x Genomics (various versions)
  - CITE-seq
  - Smart-seq v4
## Limitations
- The model is specifically designed for gene expression data and may not generalize to other types of biological data
- Performance may vary depending on the quality and type of input data
- The model works best with data from supported species and technologies
## License
This model is released under the MIT License. See the LICENSE file for more details.
## Contact
For questions and issues, please open an issue on the GitHub repository or contact the maintainers.
# nicheformer
This is the official repository for **Nicheformer: a foundation model for single-cell and spatial omics**
[![Preprint](https://img.shields.io/badge/preprint-available-brightgreen)](https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1)  
## Citation
If you use our tool or build upon our concepts in your own work, please cite it as:
```
Schaar, A.C., Tejada-Lapuerta, A., et al. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv (2024). doi: https://doi.org/10.1101/2024.04.15.589472
```
## Contact
For questions and help requests, open an issue on the [GitHub issue tracker][issue-tracker] or email the corresponding author (alejandro.tejadalapuerta@helmholtz-munich.de).

[issue-tracker]: https://github.com/theislab/nicheformer/issues