--- datasets: - multimolecule/encode - multimolecule/fantom5 - multimolecule/gtex language: dna library_name: multimolecule license: agpl-3.0 pipeline: regulatory-track pipeline_tag: other tags: - Biology - DNA widget: - example_title: tumor protein p53 pipeline_tag: regulatory-track sequence_type: DNA task: regulatory-track text: ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG - example_title: BRCA1 DNA repair associated pipeline_tag: regulatory-track sequence_type: DNA task: regulatory-track text: TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG - example_title: hemoglobin subunit beta pipeline_tag: regulatory-track sequence_type: DNA task: regulatory-track text: CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG - example_title: CF transmembrane conductance regulator pipeline_tag: regulatory-track sequence_type: DNA task: regulatory-track text: ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG - example_title: telomerase reverse transcriptase pipeline_tag: regulatory-track sequence_type: DNA task: regulatory-track text: CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG - example_title: KRAS proto-oncogene pipeline_tag: regulatory-track sequence_type: DNA task: regulatory-track text: GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG - example_title: prion protein (Kanno blood group) pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC - example_title: interleukin 10 pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC - example_title: Zaire ebolavirus pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT - example_title: SARS coronavirus pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT - example_title: insulin pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG - example_title: cyclin dependent kinase inhibitor 2A pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA - example_title: human papillomavirus type 16 E6 pipeline_tag: regulatory-track sequence_type: cDNA task: regulatory-track text: ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA --- # Borzoi Sequence-to-coverage neural network for predicting RNA-seq and chromatin tracks across 524 kb DNA windows at 32 bp resolution. ## Disclaimer This is an UNOFFICIAL implementation of [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1038/s41588-024-02053-6) by Johannes Linder, Divyanshi Srivastava, Han Yuan, et al. The OFFICIAL repository of Borzoi is at [calico/borzoi](https://github.com/calico/borzoi). > [!TIP] > The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation. **The team releasing Borzoi did not write this model card for this model so this model card has been written by the MultiMolecule team.** ## Model Details Borzoi is the successor of [Enformer](https://huggingface.co/multimolecule/enformer). It extends the Enformer recipe (convolution stem + Transformer trunk + binned multi-track output) to a 524,288 bp input window and 32 bp output bins, and adds a U-Net style upsampling tail so the binned positional axis matches a higher-resolution coverage prediction. A long DNA window of 524 kb is downsampled by a convolution stem and a width-growing residual convolution tower, projected to 1,536 channels by a U-Net bottleneck, processed by 8 Transformer blocks with Transformer-XL style relative positional encoding, then upsampled by two skip-connected U-Net stages with depthwise-separable convolutions, center-cropped to 6,144 bins, and projected to per-species coverage tracks with a softplus activation. The output is **binned**: it has shape `(batch_size, target_length, num_tracks)` where each bin summarizes 32 bp of sequence and `num_tracks` is the number of genomic coverage experiments for the selected species. Borzoi was trained jointly on RNA-seq, CAGE, ATAC-seq, DNase-seq, and ChIP-seq tracks. Please refer to the [Training Details](#training-details) section for more information on the training process. ### Variants Borzoi releases separate human and mouse checkpoints for the corresponding species track sets. - **[multimolecule/borzoi-human](https://huggingface.co/multimolecule/borzoi-human)**: human checkpoint with 7,611 human genomic coverage tracks. - **[multimolecule/borzoi-mouse](https://huggingface.co/multimolecule/borzoi-mouse)**: mouse checkpoint with 2,608 mouse genomic coverage tracks. ### Model Specification | Input Length | Bin Size | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) | FLOPs (P) | MACs (P) | | ------------ | -------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ | --------- | -------- | | 524288 | 32 | 6144 | 1536 | 8 | 8 | 7611 | 185.90 | 13.57 | 6.76 | The table reports the human checkpoint. The mouse checkpoint predicts 2,608 tracks. FLOPs and MACs are measured on the canonical 524,288 bp Borzoi input window. ### Links - **Code**: [multimolecule.borzoi](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/borzoi) - **Data**: ENCODE, GTEx, FANTOM5 RNA-seq / CAGE / ATAC-seq / DNase-seq / ChIP-seq tracks aligned to human and mouse genomes - **Paper**: [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1038/s41588-024-02053-6) - **Developed by**: Johannes Linder, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, David R. Kelley - **Model type**: Convolutional stem followed by Transformer trunk and U-Net upsampling tail for binned multi-track RNA-seq and chromatin coverage prediction - **Original Repository**: [calico/borzoi](https://github.com/calico/borzoi) ## Usage The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: ```bash pip install multimolecule ``` ### Direct Use #### Genomic Coverage Prediction You can use this model to predict binned RNA-seq and chromatin coverage tracks from a DNA sequence: ```python >>> import torch >>> from multimolecule import DnaTokenizer, BorzoiConfig, BorzoiForTokenPrediction >>> config = BorzoiConfig( ... sequence_length=512, hidden_size=16, num_hidden_layers=1, num_attention_heads=2, ... attention_head_size=4, attention_value_size=4, num_rel_pos_features=4, ... stem_channels=8, conv_tower_channels=[12], head_hidden_size=8, target_length=16, ... num_labels=4, ... ) >>> model = BorzoiForTokenPrediction(config) >>> output = model(torch.randint(config.vocab_size, (1, 512))) >>> output.logits.shape torch.Size([1, 16, 4]) ``` The binned positional axis is treated as the "token" axis: each output position corresponds to one genomic bin rather than a single nucleotide. The `species` configuration option selects the `human` (7,611 tracks) or `mouse` (2,608 tracks) species track set for the converted checkpoint. ### Interface - **Input length**: fixed 524,288 bp DNA window - **Output binning**: 32 bp per output bin; 6,144 output bins per window (after center-cropping the U-Net upsampling tail) - **Species track set**: select `human` (7,611 tracks) or `mouse` (2,608 tracks) via the `species` config option - **Output**: `(batch_size, target_length, num_tracks)` ## Training Details Borzoi was trained to predict bulk RNA-seq coverage together with chromatin tracks (DNase-seq, ATAC-seq, ChIP-seq) and CAGE from the human and mouse reference genomes. ### Training Data The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into 524 kb windows; for each window the per-32-bp coverage of every experiment served as the regression target. The training set is dominated by RNA-seq coverage (the modality Borzoi extends over Enformer); the remaining tracks include the chromatin and CAGE modalities used by Enformer. ### Training Procedure #### Pre-training The model was trained to minimize a Poisson-multinomial regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative. Training used the Adam optimizer with a warmup schedule and global gradient-norm clipping; reverse-complement and small genomic-shift data augmentations were applied during training. ## Citation ```bibtex @article{linder2025predicting, author = {Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R.}, title = {Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation}, journal = {Nature Genetics}, year = 2025, volume = 57, number = 4, pages = {949--961}, doi = {10.1038/s41588-024-02053-6}, publisher = {Nature Publishing Group} } ``` > [!NOTE] > The artifacts distributed in this repository are part of the MultiMolecule project. > If MultiMolecule supports your research, please cite the MultiMolecule project as follows: ```bibtex @software{chen_2024_12638419, author = {Chen, Zhiyuan and Zhu, Sophia Y.}, title = {MultiMolecule}, doi = {10.5281/zenodo.12638419}, publisher = {Zenodo}, url = {https://doi.org/10.5281/zenodo.12638419}, year = 2024, month = may, day = 4 } ``` ## Contact Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. Please contact the authors of the [Borzoi paper](https://doi.org/10.1038/s41588-024-02053-6) for questions or comments on the paper/model. ## License This model implementation is licensed under the [GNU Affero General Public License](license.md). For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). ```spdx SPDX-License-Identifier: AGPL-3.0-or-later ```