---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
library_name: multimolecule
---

# Enformer

Transformer-based deep neural network for predicting genomic coverage tracks from long DNA sequences with long-range context.

## Disclaimer

This is an UNOFFICIAL implementation of [Effective gene expression prediction from sequence by integrating long-range interactions](https://doi.org/10.1038/s41592-021-01252-x) by Žiga Avsec, Vikram Agarwal, Daniel Visentin et al.

The OFFICIAL repository of Enformer is at [google-deepmind/deepmind-research/enformer](https://github.com/google-deepmind/deepmind-research/tree/master/enformer).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing Enformer did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

Enformer is the successor of Basenji. It replaces Basenji's dilated convolution tower with a convolution stem followed by a Transformer trunk, which lets it model long-range genomic interactions.

It consumes a long DNA window (~393 kb) and passes it through a convolution + attention-pooling stem that downsamples the sequence by `2 ** 7 = 128x`. The binned representation is then processed by 11 Transformer blocks with Transformer-XL style relative positional encoding and center-cropped to 896 output bins. Finally, a pointwise head and a per-species linear track projection with a softplus activation produce the predictions.

The prediction is **binned**: the output has shape `(batch_size, target_length, num_tracks)`, where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments for the selected species.
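
The binning arithmetic described above can be worked through with a short sketch. It uses only the figures quoted in this card (393,216 bp input, 7 downsampling stages, 896 output bins) and assumes that each stage halves the sequence length and that the center crop is symmetric. The helper `binned_output_geometry` is a hypothetical name for illustration, not part of the MultiMolecule API.

```python
def binned_output_geometry(input_length=393_216, num_downsamples=7, target_length=896):
    """Derive bin size, bin counts, and crop width for a binned sequence model.

    Defaults are the values quoted in this card; the function itself is a
    hypothetical helper, not part of the multimolecule API.
    """
    bin_size = 2 ** num_downsamples  # assumes each downsampling stage halves the length
    if input_length % bin_size != 0:
        raise ValueError("input_length must be a multiple of the bin size")
    total_bins = input_length // bin_size  # bins before cropping
    trimmed = total_bins - target_length
    if trimmed < 0 or trimmed % 2 != 0:
        raise ValueError("target_length must allow a symmetric center crop")
    return {
        "bin_size": bin_size,
        "total_bins": total_bins,
        "crop_per_side": trimmed // 2,  # bins dropped from each flank
        "predicted_bp": target_length * bin_size,  # span covered by the output
    }


geometry = binned_output_geometry()
print(geometry)
```

With the default values this gives 3,072 bins before cropping, 1,088 bins trimmed from each side, and predictions covering the central 114,688 bp of the input window.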

### Model Specification

| Input Length (bp) | Bin Size (bp) | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) |
| ----------------- | ------------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ |
| 393216            | 128           | 896         | 1536        | 11     | 8     | 5313       | 246.2              |

The table reports the default human output head (5,313 tracks). The mouse head predicts 1,643 tracks.

### Links

- **Code**: [multimolecule.enformer](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/enformer)
- **Weights**: [multimolecule/enformer](https://huggingface.co/multimolecule/enformer)
- **Paper**: [Effective gene expression prediction from sequence by integrating long-range interactions](https://doi.org/10.1038/s41592-021-01252-x)
- **Developed by**: Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, David R. Kelley
- **Original Repository**: [google-deepmind/deepmind-research/enformer](https://github.com/google-deepmind/deepmind-research/tree/master/enformer)
- **PyTorch port used for weights**: [lucidrains/enformer-pytorch](https://github.com/lucidrains/enformer-pytorch) (MIT), checkpoint [`EleutherAI/enformer-official-rough`](https://huggingface.co/EleutherAI/enformer-official-rough)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model to predict binned genomic coverage tracks from a DNA sequence. The example below builds a small, randomly initialized configuration to illustrate the output shape:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, EnformerConfig, EnformerForTokenPrediction
>>> # A deliberately tiny configuration so the example runs quickly
>>> config = EnformerConfig(
...     sequence_length=256, hidden_size=12, num_hidden_layers=1, num_attention_heads=2,
...     attention_head_size=4, num_downsamples=3, dim_divisible_by=2, target_length=16,
...     num_labels=4,
... )
>>> # Built from a config, so the weights are randomly initialized
>>> model = EnformerForTokenPrediction(config)
>>> # Random token ids stand in for a tokenized DNA sequence of length 256
>>> output = model(torch.randint(config.vocab_size, (1, 256)))
>>> output.logits.shape
torch.Size([1, 16, 4])
```

The binned positional axis is treated as the "token" axis: each output position corresponds to one genomic bin rather than a single nucleotide. The `species` configuration option selects the `human` (5,313 tracks) or `mouse` (1,643 tracks) output head.
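
Because each output position is a bin, interpreting a prediction usually means mapping a bin index back to genomic coordinates. The sketch below shows one way to do that, assuming 0-based half-open coordinates, a symmetric center crop, and the full-size values quoted in this card. The helper `bin_to_interval` and its parameters are hypothetical and not part of the MultiMolecule API.

```python
def bin_to_interval(window_start, bin_index, input_length=393_216, bin_size=128, target_length=896):
    """Return the (start, end) genomic interval covered by one output bin.

    Assumes 0-based, half-open coordinates and a symmetric center crop.
    Hypothetical helper, not part of the multimolecule API.
    """
    if not 0 <= bin_index < target_length:
        raise IndexError("bin_index is outside the cropped output")
    # Width of the flank removed on each side by the center crop, in bp
    flank = (input_length - target_length * bin_size) // 2
    start = window_start + flank + bin_index * bin_size
    return start, start + bin_size


first = bin_to_interval(window_start=0, bin_index=0)
last = bin_to_interval(window_start=0, bin_index=895)
print(first, last)
```

Under these assumptions, the first output bin starts 139,264 bp into the input window and the last bin ends 139,264 bp before the end of the window.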

## Training Details

Enformer was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from the human and mouse reference genomes.

### Training Data

The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each window the per-128-bp coverage of every experiment served as the regression target.
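
To make the per-128-bp target concrete, the following sketch aggregates a per-base coverage vector into fixed-width bins. Summing within each bin is an assumption made for illustration; this card does not state how coverage is aggregated, and `bin_coverage` is a hypothetical helper rather than part of the training pipeline.

```python
def bin_coverage(per_base_coverage, bin_size=128):
    """Aggregate per-base coverage into fixed-width bins by summation.

    Summation is an assumption for illustration; the card does not state
    how coverage is aggregated within a bin.
    """
    if len(per_base_coverage) % bin_size != 0:
        raise ValueError("coverage length must be a multiple of bin_size")
    return [
        sum(per_base_coverage[i:i + bin_size])
        for i in range(0, len(per_base_coverage), bin_size)
    ]


# Two 128-bp bins with constant coverage 1 and 2 respectively
toy_coverage = [1] * 128 + [2] * 128
targets = bin_coverage(toy_coverage)
print(targets)  # [128, 256]
```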

### Training Procedure

The model was trained to minimize a Poisson regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative.
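
The objective above can be sketched in plain Python. This is a minimal illustration rather than the original training code: it drops the `log(target!)` term, which does not depend on the prediction, and the mean reduction and the small `eps` inside the logarithm are assumptions. The function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_loss(logits, targets, eps=1e-8):
    """Mean Poisson negative log-likelihood, omitting the log(target!) constant.

    The mean reduction and the eps term are assumptions for illustration.
    """
    total = 0.0
    for logit, target in zip(logits, targets):
        rate = softplus(logit)  # non-negative predicted coverage
        total += rate - target * math.log(rate + eps)
    return total / len(logits)


print(poisson_loss([0.0, 2.0, -1.0], [1.0, 2.0, 0.0]))
```

For a single bin, the per-element loss `rate - target * log(rate)` is minimized when the predicted rate equals the observed coverage, which is why softplus is a convenient way to keep the rate in the valid range.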

## Citation

```bibtex
@article{avsec2021effective,
  author    = {Avsec, {\v{Z}}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
  title     = {Effective gene expression prediction from sequence by integrating long-range interactions},
  journal   = {Nature Methods},
  year      = 2021,
  volume    = 18,
  number    = 10,
  pages     = {1196--1203},
  doi       = {10.1038/s41592-021-01252-x},
  publisher = {Nature Publishing Group}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Enformer paper](https://doi.org/10.1038/s41592-021-01252-x) for questions or comments on the paper or model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```