---
license: mit
language:
- en
base_model:
- aletlvl/Nicheformer
tags:
- single-cell
- biology
- transcriptomics
---
# Nicheformer

Nicheformer is a transformer-based model designed for understanding and predicting cellular niches and their interactions. The model uses masked language modeling to learn representations of cellular contexts and their relationships.

## Model Description

Nicheformer is built on a transformer architecture with the following key features:

- **Architecture**: Transformer encoder with customizable number of layers and attention heads
- **Pre-training**: Masked Language Modeling (MLM) objective with dynamic masking
- **Input Processing**: Handles cell type, assay, and modality information
- **Positional Encoding**: Supports both learnable and fixed positional embeddings
- **Masking Strategy**: 
  - 80% of selected tokens are replaced with [MASK]
  - 10% are replaced with random tokens
  - 10% remain unchanged
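
The 80/10/10 split above is the standard BERT-style masking recipe. As a minimal illustration (not Nicheformer's actual implementation), dynamic masking over a batch of token IDs can be sketched with NumPy:

```python
import numpy as np

def apply_mlm_masking(input_ids, vocab_size, mask_token_id,
                      mask_prob=0.15, rng=None):
    """BERT-style dynamic masking sketch: of the tokens selected at
    mask_prob, 80% become [MASK], 10% become a random token, and
    10% are left unchanged. Unselected positions get label -100,
    the conventional 'ignore' index for the MLM loss."""
    rng = rng or np.random.default_rng(0)
    input_ids = np.asarray(input_ids).copy()
    labels = np.full_like(input_ids, -100)

    # Select ~15% of positions and record their original IDs as labels
    selected = rng.random(input_ids.shape) < mask_prob
    labels[selected] = input_ids[selected]

    roll = rng.random(input_ids.shape)
    # 80% of selected tokens are replaced with [MASK]
    input_ids[selected & (roll < 0.8)] = mask_token_id
    # 10% are replaced with a random vocabulary token
    random_ids = rng.integers(0, vocab_size, size=input_ids.shape)
    replace_random = selected & (roll >= 0.8) & (roll < 0.9)
    input_ids[replace_random] = random_ids[replace_random]
    # the remaining 10% stay unchanged
    return input_ids, labels
```

Because masking is re-sampled each call ("dynamic"), the same cell yields different masked views across training epochs.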

### Model Architecture

- Transformer encoder layers: 12
- Hidden dimension: 512
- Attention heads: 16
- Feedforward dimension: 1024
- Maximum sequence length: 1500
- Vocabulary size: 25000
- Masking probability: 15%
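
For reference, the hyperparameters above can be collected into a single configuration sketch (the key names here are illustrative assumptions, not the model's actual config fields):

```python
# Illustrative hyperparameter summary; key names are hypothetical,
# not Nicheformer's actual configuration fields.
nicheformer_config = {
    "num_layers": 12,
    "hidden_dim": 512,
    "num_attention_heads": 16,
    "feedforward_dim": 1024,
    "max_seq_len": 1500,
    "vocab_size": 25000,
    "mask_prob": 0.15,
}

# With these dimensions, each attention head operates on
# 512 / 16 = 32 dimensions.
head_dim = (nicheformer_config["hidden_dim"]
            // nicheformer_config["num_attention_heads"])
```
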

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import anndata as ad
import numpy as np

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("aletlvl/Nicheformer", trust_remote_code=True)

# Set the technology mean for the HF tokenizer
technology_mean_path = "technology_mean.npy"
technology_mean = np.load(technology_mean_path)
tokenizer._load_technology_mean(technology_mean)

# Load your single-cell data
adata = ad.read_h5ad("your_data.h5ad")

# Tokenize the data
inputs = tokenizer(adata)

# Get embeddings
embeddings = model.get_embeddings(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    layer=-1,
    with_context=False,
)
```
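
The resulting embeddings (one 512-dimensional vector per cell, matching the hidden dimension above) can feed standard downstream analyses such as neighbor graphs or clustering. A minimal sketch with plain NumPy and random stand-in embeddings, computing pairwise cosine similarity between cells:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row vectors (cells)."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

# Stand-in for real model output: 5 cells x 512-dim embeddings
emb = np.random.default_rng(0).normal(size=(5, 512))
sim = cosine_similarity_matrix(emb)  # shape (5, 5), diagonal == 1
```

In practice you would store the embeddings in `adata.obsm` and continue with your usual single-cell toolkit.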

## Training Data

The model was trained on single-cell gene expression data from various tissues and organisms. It supports:

- **Modalities**: spatial and dissociated
- **Species**: human and mouse
- **Technologies**: 
  - MERFISH
  - CosMx
  - Xenium
  - 10x Genomics (various versions)
  - CITE-seq
  - Smart-seq v4

## Limitations

- The model is specifically designed for gene expression data and may not generalize to other types of biological data
- Performance may vary depending on the quality and type of input data
- The model works best with data from supported species and technologies

## License

This model is released under the MIT License. See the LICENSE file for more details.

## Contact

For questions and issues, please open an issue on the GitHub repository or contact the maintainers.

# nicheformer

This is the official repository for **Nicheformer: a foundation model for single-cell and spatial omics**.

[![Preprint](https://img.shields.io/badge/preprint-available-brightgreen)](https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1)  

## Citation

If you use our tool or build upon our concepts in your own work, please cite it as:

```
Schaar, A.C., Tejada-Lapuerta, A., et al. Nicheformer: a foundation model for single-cell and spatial omics. bioRxiv (2024). doi: https://doi.org/10.1101/2024.04.15.589472
```

## Contact

For questions and help requests, open an issue on the [GitHub issue tracker][issue-tracker] or email the corresponding author (alejandro.tejadalapuerta@helmholtz-munich.de).


[issue-tracker]: https://github.com/theislab/nicheformer/issues