---
license: mit
language:
- en
base_model:
- aletlvl/Nicheformer
tags:
- single-cell
- biology
- transcriptomics
---

# Nicheformer

Nicheformer is a transformer-based model designed for understanding and predicting cellular niches and their interactions. The model uses masked language modeling to learn representations of cellular contexts and their relationships.

## Model Description

Nicheformer is built on a transformer architecture with the following key features:

- **Architecture**: Transformer encoder with a configurable number of layers and attention heads
- **Pre-training**: Masked Language Modeling (MLM) objective with dynamic masking
- **Input Processing**: Handles cell type, assay, and modality information
- **Positional Encoding**: Supports both learnable and fixed positional embeddings
- **Masking Strategy**:
  - 80% of selected tokens are replaced with [MASK]
  - 10% are replaced with random tokens
  - 10% remain unchanged

### Model Architecture

- Transformer encoder layers: 12
- Hidden dimension: 512
- Attention heads: 16
- Feedforward dimension: 1024
- Maximum sequence length: 1500
- Vocabulary size: 25000
- Masking probability: 15%

## Usage

```python
import numpy as np
import anndata as ad
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)

# Set the technology mean for the HF tokenizer
technology_mean_path = "technology_mean.npy"
technology_mean = np.load(technology_mean_path)
tokenizer._load_technology_mean(technology_mean)

# Load your single-cell data
adata = ad.read_h5ad("your_data.h5ad")

# Tokenize the data
inputs = tokenizer(adata)

# Get embeddings
embeddings = model.get_embeddings(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    layer=-1,
    with_context=False,
)
```

## Training Data

The model was trained on single-cell gene expression data from
various tissues and organisms. It supports:

- **Modalities**: spatial and dissociated
- **Species**: human and mouse
- **Technologies**:
  - MERFISH
  - CosMx
  - Xenium
  - 10x Genomics (various versions)
  - CITE-seq
  - Smart-seq v4

## Limitations

- The model is designed specifically for gene expression data and may not generalize to other types of biological data
- Performance may vary depending on the quality and type of input data
- The model works best with data from supported species and technologies

## License

This model is released under the MIT License. See the LICENSE file for more details.

## Contact

For questions and issues, please open an issue on the GitHub repository or contact the maintainers.

# nicheformer

This is the official repository for **Nicheformer: a foundation model for single-cell and spatial omics**.

[![Preprint](https://img.shields.io/badge/preprint-available-brightgreen)](https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1)

## Citation

If you use our tool or build upon our concepts in your own work, please cite it as

```
Schaar, A.C., Tejada-Lapuerta, A., et al. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv (2024). doi: https://doi.org/10.1101/2024.04.15.589472
```

## Contact

For questions and help requests, you can reach out on GitHub or email the corresponding author (alejandro.tejadalapuerta@helmholtz-munich.de).

[issue-tracker]: https://github.com/theislab/nicheformer/issues
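
## Appendix: Masking Strategy Sketch

The 80/10/10 dynamic masking scheme described in the model card is the standard BERT-style MLM corruption. The sketch below illustrates it with NumPy; `apply_mlm_masking`, its parameter names, and the `-100` ignore label are illustrative assumptions, not Nicheformer's actual implementation.

```python
import numpy as np


def apply_mlm_masking(token_ids, mask_token_id, vocab_size,
                      mask_prob=0.15, rng=None):
    """BERT-style dynamic masking: of the positions selected for
    prediction (mask_prob of all tokens), 80% become [MASK],
    10% become a random token, and 10% keep the original token."""
    if rng is None:
        rng = np.random.default_rng()
    token_ids = np.asarray(token_ids)
    masked = token_ids.copy()

    # Select which positions the model must predict.
    selected = rng.random(token_ids.shape) < mask_prob

    # Labels: original ids at selected positions, -100 (ignored) elsewhere.
    labels = np.where(selected, token_ids, -100)

    # Split the selected positions 80/10/10.
    roll = rng.random(token_ids.shape)
    masked[selected & (roll < 0.8)] = mask_token_id
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)
    masked[random_pos] = rng.integers(0, vocab_size, size=int(random_pos.sum()))
    # The remaining 10% of selected positions stay unchanged.

    return masked, labels
```

Because masking is re-sampled each call, every epoch sees a different corruption of the same sequence, which is what "dynamic masking" refers to.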