---
language: dna
tags:
- Biology
- DNA
license: agpl-3.0
library_name: multimolecule
---
# Enformer
Transformer-based deep neural network for predicting genomic coverage tracks from long DNA sequences with long-range context.
## Disclaimer
This is an UNOFFICIAL implementation of [Effective gene expression prediction from sequence by integrating long-range interactions](https://doi.org/10.1038/s41592-021-01252-x) by Žiga Avsec, Vikram Agarwal, Daniel Visentin et al.
The OFFICIAL repository of Enformer is at [google-deepmind/deepmind-research/enformer](https://github.com/google-deepmind/deepmind-research/tree/master/enformer).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
**The team that released Enformer did not write a model card for it, so this model card was written by the MultiMolecule team.**
## Model Details
Enformer is the successor to Basenji. It replaces Basenji's dilated convolution tower with a convolution stem followed by a Transformer trunk, which lets it model long-range genomic interactions. The model takes a long DNA window (393,216 bp, about 393 kb) and passes it through a convolution + attention-pooling stem that downsamples the sequence by a factor of `2 ** 7 = 128`. The binned representation is then processed by 11 Transformer blocks with Transformer-XL style relative positional encoding and center-cropped to 896 output bins. Finally, a pointwise head and a per-species linear track projection with a softplus activation produce the predictions. The prediction is **binned**: the output has shape `(batch_size, target_length, num_tracks)`, where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments for the selected species.
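The bin arithmetic described above can be checked with a short sketch. The helper `bin_layout` is hypothetical (it is not part of `multimolecule`), and it assumes that the center crop removes the same number of bins from each side of the sequence.

```python
def bin_layout(sequence_length: int, num_downsamples: int, target_length: int) -> dict:
    """Illustrative helper (not a multimolecule API): derive the binned layout."""
    bin_size = 2 ** num_downsamples           # 2 ** 7 = 128 for the full model
    total_bins = sequence_length // bin_size  # bins entering the Transformer trunk
    trimmed = total_bins - target_length      # bins discarded by the center crop
    # Assumption: the crop is symmetric, so the trimmed bin count must be even.
    if sequence_length % bin_size or trimmed < 0 or trimmed % 2:
        raise ValueError("layout is not compatible with a symmetric center crop")
    return {
        "bin_size": bin_size,
        "total_bins": total_bins,
        "trimmed_per_side": trimmed // 2,
        "predicted_bp": target_length * bin_size,
    }


layout = bin_layout(sequence_length=393_216, num_downsamples=7, target_length=896)
print(layout)
# {'bin_size': 128, 'total_bins': 3072, 'trimmed_per_side': 1088, 'predicted_bp': 114688}
```

Under these assumptions, the 896 output bins cover the central 114,688 bp of the input window, and the flanking sequence contributes only as context.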
### Model Specification
| Input Length | Bin Size | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) |
| ------------ | -------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ |
| 393216 | 128 | 896 | 1536 | 11 | 8 | 5313 | 246.2 |
The table above reports the human output head (5,313 tracks). The mouse head predicts 1,643 tracks.
### Links
- **Code**: [multimolecule.enformer](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/enformer)
- **Weights**: [multimolecule/enformer](https://huggingface.co/multimolecule/enformer)
- **Paper**: [Effective gene expression prediction from sequence by integrating long-range interactions](https://doi.org/10.1038/s41592-021-01252-x)
- **Developed by**: Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, David R. Kelley
- **Original Repository**: [google-deepmind/deepmind-research/enformer](https://github.com/google-deepmind/deepmind-research/tree/master/enformer)
- **PyTorch port used for weights**: [lucidrains/enformer-pytorch](https://github.com/lucidrains/enformer-pytorch) (MIT), checkpoint [`EleutherAI/enformer-official-rough`](https://huggingface.co/EleutherAI/enformer-official-rough)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
You can use this model to predict binned genomic coverage tracks from a DNA sequence:
```python
>>> import torch
>>> from multimolecule import DnaTokenizer, EnformerConfig, EnformerForTokenPrediction
>>> # A deliberately tiny, randomly initialized configuration (no pretrained weights are loaded)
>>> config = EnformerConfig(
...     sequence_length=256, hidden_size=12, num_hidden_layers=1, num_attention_heads=2,
...     attention_head_size=4, num_downsamples=3, dim_divisible_by=2, target_length=16,
...     num_labels=4,
... )
>>> model = EnformerForTokenPrediction(config)
>>> # Random token ids stand in for a tokenized DNA sequence of length 256
>>> output = model(torch.randint(config.vocab_size, (1, 256)))
>>> # Output shape is (batch_size, target_length, num_labels)
>>> output.logits.shape
torch.Size([1, 16, 4])
```
The binned positional axis is treated as the "token" axis: each output position corresponds to one
genomic bin rather than a single nucleotide. The `species` configuration option selects the
`human` (5,313 tracks) or `mouse` (1,643 tracks) output head.
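To relate an output position back to the input sequence, the sketch below maps an output bin index to its base-pair offsets inside the input window. The function `bin_to_window_interval` is hypothetical (not a `multimolecule` API), the defaults are taken from the Model Specification table, and a symmetric center crop is assumed.

```python
def bin_to_window_interval(
    bin_index: int,
    sequence_length: int = 393_216,
    bin_size: int = 128,
    target_length: int = 896,
) -> tuple[int, int]:
    """Illustrative helper: half-open [start, end) offsets in bp within the input window."""
    if not 0 <= bin_index < target_length:
        raise IndexError(f"bin_index must be in [0, {target_length})")
    total_bins = sequence_length // bin_size
    # Assumption: the center crop drops an equal number of bins on each side.
    offset_bins = (total_bins - target_length) // 2
    start = (offset_bins + bin_index) * bin_size
    return start, start + bin_size


print(bin_to_window_interval(0))    # (139264, 139392)
print(bin_to_window_interval(895))  # (253824, 253952)
```

Adding the genomic start coordinate of the window to these offsets would give genomic coordinates for each predicted bin.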
## Training Details
Enformer was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE)
from the human and mouse reference genomes.
### Training Data
The model was trained on a large compendium of functional genomics experiments aligned to the
human (hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows;
for each window the per-128-bp coverage of every experiment served as the regression target.
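As a rough illustration of how a per-128-bp target can be derived from a per-base coverage signal, the sketch below aggregates a 1D coverage vector into fixed-size bins. The function `bin_coverage` is hypothetical, and aggregating by sum is an assumption: this card does not state whether the original pipeline sums or averages within a bin.

```python
import numpy as np


def bin_coverage(per_base_coverage, bin_size: int = 128) -> np.ndarray:
    """Illustrative helper: aggregate per-base coverage into fixed-size bins."""
    coverage = np.asarray(per_base_coverage, dtype=np.float64)
    if coverage.ndim != 1 or coverage.size % bin_size:
        raise ValueError("expected a 1D array whose length is a multiple of bin_size")
    # Assumption: each bin's target is the sum of its per-base values.
    return coverage.reshape(-1, bin_size).sum(axis=1)


toy_coverage = np.ones(256)        # two bins of uniform coverage
print(bin_coverage(toy_coverage))  # [128. 128.]
```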
### Training Procedure
The model was trained to minimize a Poisson regression loss between predicted and observed
coverage, using a softplus output activation to keep the predicted coverage non-negative.
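The objective can be sketched in a few lines of NumPy. This is the standard form of the Poisson negative log-likelihood, not a reproduction of the original training code: the function names, the `eps` stabilizer, and the mean reduction are assumptions made for illustration.

```python
import numpy as np


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); the result is always positive."""
    return np.logaddexp(0.0, x)


def poisson_nll(predicted, observed, eps: float = 1e-8) -> float:
    """Mean Poisson negative log-likelihood, omitting the log(observed!) term,
    which does not depend on the prediction."""
    predicted = np.asarray(predicted, dtype=np.float64)
    observed = np.asarray(observed, dtype=np.float64)
    return float(np.mean(predicted - observed * np.log(predicted + eps)))


raw_outputs = np.array([-2.0, 0.0, 3.0])  # unconstrained head outputs
observed = np.array([0.0, 1.0, 3.0])      # observed coverage per bin
predicted = softplus(raw_outputs)         # non-negative predicted coverage
loss = poisson_nll(predicted, observed)
```

For a fixed observation, this loss is smallest when the predicted coverage equals the observed coverage, which is why it suits count-like targets.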
## Citation
```bibtex
@article{avsec2021effective,
author = {Avsec, {\v{Z}}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
title = {Effective gene expression prediction from sequence by integrating long-range interactions},
journal = {Nature Methods},
year = 2021,
volume = 18,
number = 10,
pages = {1196--1203},
doi = {10.1038/s41592-021-01252-x},
publisher = {Nature Publishing Group}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [Enformer paper](https://doi.org/10.1038/s41592-021-01252-x) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```