---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
library_name: multimolecule
---

# Enformer

Transformer-based deep neural network for predicting genomic coverage tracks from long DNA sequences with long-range context.

## Disclaimer

This is an UNOFFICIAL implementation of [Effective gene expression prediction from sequence by integrating long-range interactions](https://doi.org/10.1038/s41592-021-01252-x) by Žiga Avsec, Vikram Agarwal, Daniel Visentin et al.

The OFFICIAL repository of Enformer is at [google-deepmind/deepmind-research/enformer](https://github.com/google-deepmind/deepmind-research/tree/master/enformer).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing Enformer did not write this model card for this model so this model card has been written by the MultiMolecule team.**

## Model Details

Enformer is the successor of Basenji. It replaces Basenji's dilated convolution tower with a convolution stem followed by a Transformer trunk, which lets it model long-range genomic interactions.

It consumes a long DNA window (~393 kb), passes it through a convolution + attention-pooling stem that downsamples the sequence by `2 ** 7 = 128x`, processes the binned representation with 11 Transformer blocks using Transformer-XL style relative positional encoding, center-crops to 896 output bins, and applies a pointwise head plus a per-species linear track projection with a softplus activation.

The prediction is **binned**: the output has shape `(batch_size, target_length, num_tracks)` where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments for the selected species.

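
The binning arithmetic described above can be sketched in a few lines of plain Python. This is an illustration only, not part of the `multimolecule` API: the constant names and the `bin_to_input_span` helper are hypothetical, and the sketch assumes that downsampling is exactly 128x and that the center crop is symmetric, taking the figures in this card at face value.

```python
from typing import Tuple

# Figures taken from the description in this model card.
INPUT_LENGTH = 393_216  # input window, in bp
NUM_DOWNSAMPLES = 7     # each downsampling stage halves the length
TARGET_LENGTH = 896     # output bins kept after center-cropping

BIN_SIZE = 2 ** NUM_DOWNSAMPLES                             # bp per bin
BINS_BEFORE_CROP = INPUT_LENGTH // BIN_SIZE                 # bins entering the crop
TRIMMED_PER_SIDE = (BINS_BEFORE_CROP - TARGET_LENGTH) // 2  # assumes a symmetric crop
COVERED_BP = TARGET_LENGTH * BIN_SIZE                       # bp spanned by the output


def bin_to_input_span(bin_index: int) -> Tuple[int, int]:
    """Return the half-open [start, end) input offsets summarized by one output bin.

    Hypothetical helper for illustration; assumes a symmetric center crop.
    """
    if not 0 <= bin_index < TARGET_LENGTH:
        raise IndexError(f"bin_index must be in [0, {TARGET_LENGTH}), got {bin_index}")
    start = (TRIMMED_PER_SIDE + bin_index) * BIN_SIZE
    return start, start + BIN_SIZE


if __name__ == "__main__":
    print(BIN_SIZE, BINS_BEFORE_CROP, TRIMMED_PER_SIDE, COVERED_BP)
    print(bin_to_input_span(0), bin_to_input_span(TARGET_LENGTH - 1))
```

Under these assumptions, predictions cover only the central part of the input window, and the flanking sequence on either side serves purely as context for the Transformer trunk.
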
### Model Specification

| Input Length | Bin Size | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) |
| ------------ | -------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ |
| 393216       | 128      | 896         | 1536        | 11     | 8     | 5313       | 246.2              |

The default table reports the human output head. The mouse head predicts 1643 tracks.

### Links

- **Code**: [multimolecule.enformer](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/enformer)
- **Weights**: [multimolecule/enformer](https://huggingface.co/multimolecule/enformer)
- **Paper**: [Effective gene expression prediction from sequence by integrating long-range interactions](https://doi.org/10.1038/s41592-021-01252-x)
- **Developed by**: Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, David R. Kelley
- **Original Repository**: [google-deepmind/deepmind-research/enformer](https://github.com/google-deepmind/deepmind-research/tree/master/enformer)
- **PyTorch port used for weights**: [lucidrains/enformer-pytorch](https://github.com/lucidrains/enformer-pytorch) (MIT), checkpoint [`EleutherAI/enformer-official-rough`](https://huggingface.co/EleutherAI/enformer-official-rough)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model to predict binned genomic coverage tracks from a DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, EnformerConfig, EnformerForTokenPrediction
>>> config = EnformerConfig(
...     sequence_length=256, hidden_size=12, num_hidden_layers=1, num_attention_heads=2,
...     attention_head_size=4, num_downsamples=3, dim_divisible_by=2, target_length=16,
...     num_labels=4,
... )
>>> model = EnformerForTokenPrediction(config)
>>> output = model(torch.randint(config.vocab_size, (1, 256)))
>>> output.logits.shape
torch.Size([1, 16, 4])
```

The binned positional axis is treated as the "token" axis: each output position corresponds to one genomic bin rather than a single nucleotide. The `species` configuration option selects the `human` (5,313 tracks) or `mouse` (1,643 tracks) output head.

## Training Details

Enformer was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from the human and mouse reference genomes.

### Training Data

The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each window the per-128-bp coverage of every experiment served as the regression target.

### Training Procedure

The model was trained to minimize a Poisson regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative.

## Citation

```bibtex
@article{avsec2021effective,
  author    = {Avsec, {\v{Z}}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
  title     = {Effective gene expression prediction from sequence by integrating long-range interactions},
  journal   = {Nature Methods},
  year      = 2021,
  volume    = 18,
  number    = 10,
  pages     = {1196--1203},
  doi       = {10.1038/s41592-021-01252-x},
  publisher = {Nature Publishing Group}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Enformer paper](https://doi.org/10.1038/s41592-021-01252-x) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```