	---
	license: mit
	language:
	- en
	base_model:
	- aletlvl/Nicheformer
	tags:
	- single-cell
	- biology
	- transcriptomics
	---
	# Nicheformer

	Nicheformer is a transformer-based model designed for understanding and predicting cellular niches and their interactions. The model uses masked language modeling to learn representations of cellular contexts and their relationships.

	## Model Description

	Nicheformer is built on a transformer architecture with the following key features:

	- Architecture: Transformer encoder with customizable number of layers and attention heads
	- Pre-training: Masked Language Modeling (MLM) objective with dynamic masking
	- Input Processing: Handles cell type, assay, and modality information
	- Positional Encoding: Supports both learnable and fixed positional embeddings
	- Masking Strategy:
	  - 80% of selected tokens are replaced with `[MASK]`
	  - 10% are replaced with random tokens
	  - 10% remain unchanged
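The 80/10/10 masking strategy above can be sketched in NumPy as follows. This is a minimal illustration of BERT-style dynamic masking, not the Nicheformer training code: the function name `mask_tokens`, the `special_ids` argument, the `-100` ignore label (a common convention in PyTorch-based MLM losses), and the token ids used in the usage note are all assumptions made for the example.

```python
import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size,
                special_ids=(), mlm_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one MLM training step.

    Hypothetical helper: names and the -100 ignore label are assumptions,
    not taken from the Nicheformer code base.
    """
    rng = np.random.default_rng() if rng is None else rng
    input_ids = np.asarray(input_ids)
    masked = input_ids.copy()
    labels = np.full(input_ids.shape, -100, dtype=np.int64)

    # Select ~mlm_prob of the non-special tokens as prediction targets.
    candidates = ~np.isin(input_ids, list(special_ids))
    selected = candidates & (rng.random(input_ids.shape) < mlm_prob)
    labels[selected] = input_ids[selected]

    # Split the selected positions 80% / 10% / 10%.
    roll = rng.random(input_ids.shape)
    to_mask = selected & (roll < 0.8)
    to_random = selected & (roll >= 0.8) & (roll < 0.9)
    # The remaining ~10% of selected positions keep their original token.

    masked[to_mask] = mask_token_id
    # Random replacements are drawn from the whole vocabulary, so they can
    # occasionally coincide with the original or a special token.
    masked[to_random] = rng.integers(0, vocab_size, size=int(to_random.sum()))
    return masked, labels
```

Because the selection is redrawn on every call, the same cell receives a different mask each time it is seen, which is what "dynamic masking" refers to. A call might look like `mask_tokens(ids, mask_token_id=1, vocab_size=25000, special_ids=(0, 1, 2))`, where the ids `0`, `1`, and `2` are placeholders rather than the model's actual special-token ids.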

	### Model Architecture

	- Transformer encoder layers: 12
	- Hidden dimension: 512
	- Attention heads: 16
	- Feedforward dimension: 1024
	- Maximum sequence length: 1500
	- Vocabulary size: 25000
	- Masking probability: 15%
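The hyperparameters above can be collected in a small configuration object, which also makes derived quantities easy to check. The sketch below is illustrative only: `EncoderConfig` and `approx_encoder_params` are hypothetical names, and the parameter estimate assumes a standard Transformer encoder layer (four attention projections with biases, a two-layer feedforward block, two layer norms) rather than the exact layout of the released checkpoint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Hyperparameters from the model card; the class itself is hypothetical."""
    n_layers: int = 12
    hidden_dim: int = 512
    n_heads: int = 16
    ffn_dim: int = 1024
    max_seq_len: int = 1500
    vocab_size: int = 25000
    mlm_prob: float = 0.15

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden dimension evenly across heads.
        if self.hidden_dim % self.n_heads != 0:
            raise ValueError("hidden_dim must be divisible by n_heads")
        return self.hidden_dim // self.n_heads

def approx_encoder_params(cfg: EncoderConfig) -> int:
    """Rough weight count for the encoder stack, excluding embeddings.

    Assumes a standard encoder layer; the real model may differ.
    """
    d, f = cfg.hidden_dim, cfg.ffn_dim
    attention = 4 * (d * d + d)              # Q, K, V and output projections
    feedforward = (d * f + f) + (f * d + d)  # two linear layers with biases
    layer_norms = 2 * (2 * d)                # scale and shift, twice per layer
    return cfg.n_layers * (attention + feedforward + layer_norms)

cfg = EncoderConfig()
print(cfg.head_dim, approx_encoder_params(cfg))
```

With 512 hidden units and 16 heads, each attention head operates on 32 dimensions. Under the stated assumptions the encoder stack comes to roughly 25 million weights before embeddings are counted; treat this as an order-of-magnitude estimate, not the size of the published checkpoint.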

	## Usage

	```python
	import anndata as ad
	import numpy as np
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	# Load model and tokenizer (trust_remote_code is needed for the custom model classes)
	model = AutoModelForMaskedLM.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)

	# Set technology mean for HF tokenizer
	technology_mean_path = "technology_mean.npy"
	technology_mean = np.load(technology_mean_path)
	tokenizer._load_technology_mean(technology_mean)

	# Load your single-cell data
	adata = ad.read_h5ad("your_data.h5ad")

	# Tokenize the data
	inputs = tokenizer(adata)

	# Get embeddings from the last layer
	embeddings = model.get_embeddings(
	    input_ids=inputs["input_ids"],
	    attention_mask=inputs["attention_mask"],
	    layer=-1,
	    with_context=False,
	)
	```
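For large datasets, tokenizing and embedding every cell in a single call may not fit in memory. One option is to process the cells in batches. The helper below is a hypothetical sketch (`batch_slices` is not part of the Nicheformer API), and the commented loop assumes that the tokenizer accepts a subset of the `AnnData` object, which has not been verified here.

```python
from typing import Iterator

def batch_slices(n_items: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices that cover range(n_items) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))

# Sketch of use with the objects from the usage example above
# (assumes tokenizer(adata[rows]) works on a subset of cells):
#
# chunks = []
# for rows in batch_slices(adata.n_obs, 256):
#     inputs = tokenizer(adata[rows])
#     chunks.append(model.get_embeddings(
#         input_ids=inputs["input_ids"],
#         attention_mask=inputs["attention_mask"],
#         layer=-1,
#         with_context=False,
#     ))
```

The batch size of 256 is an arbitrary placeholder; a suitable value depends on sequence length and available memory.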

	## Training Data

	The model was trained on single-cell gene expression data from various tissues and organisms. It supports:

	- Modalities: spatial and dissociated
	- Species: human and mouse
	- Technologies:
	  - MERFISH
	  - CosMx
	  - Xenium
	  - 10x Genomics (various versions)
	  - CITE-seq
	  - Smart-seq v4

	## Limitations

	- The model is specifically designed for gene expression data and may not generalize to other types of biological data
	- Performance may vary depending on the quality and type of input data
	- The model works best with data from supported species and technologies

	## License

	This model is released under the MIT License. See the LICENSE file for more details.

	## Contact

	For questions and issues, please open an issue on the GitHub repository or contact the maintainers.

	# nicheformer

	This is the official repository for Nicheformer, a foundation model for single-cell and spatial omics.

	[![Preprint](https://img.shields.io/badge/preprint-available-brightgreen)](https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1)

	## Citation

	If you use our tool or build upon our concepts in your own work, please cite it as:

	```
	Schaar, A.C., Tejada-Lapuerta, A., et al. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv (2024). doi: https://doi.org/10.1101/2024.04.15.589472
	```

	## Contact

	For questions and help requests, open an issue on the [GitHub issue tracker][issue-tracker] or email the corresponding author (alejandro.tejadalapuerta@helmholtz-munich.de).


	[issue-tracker]: https://github.com/theislab/nicheformer/issues