borzoi-human / README.md
metadata
datasets:
  - multimolecule/encode
  - multimolecule/fantom5
  - multimolecule/gtex
language: dna
library_name: multimolecule
license: agpl-3.0
pipeline: regulatory-track
pipeline_tag: other
tags:
  - Biology
  - DNA
widget:
  - example_title: tumor protein p53
    pipeline_tag: regulatory-track
    sequence_type: DNA
    task: regulatory-track
    text: >-
      ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG
  - example_title: BRCA1 DNA repair associated
    pipeline_tag: regulatory-track
    sequence_type: DNA
    task: regulatory-track
    text: >-
      TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG
  - example_title: hemoglobin subunit beta
    pipeline_tag: regulatory-track
    sequence_type: DNA
    task: regulatory-track
    text: >-
      CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
  - example_title: CF transmembrane conductance regulator
    pipeline_tag: regulatory-track
    sequence_type: DNA
    task: regulatory-track
    text: >-
      ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
  - example_title: telomerase reverse transcriptase
    pipeline_tag: regulatory-track
    sequence_type: DNA
    task: regulatory-track
    text: >-
      CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG
  - example_title: KRAS proto-oncogene
    pipeline_tag: regulatory-track
    sequence_type: DNA
    task: regulatory-track
    text: >-
      GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG
  - example_title: prion protein (Kanno blood group)
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC
  - example_title: interleukin 10
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC
  - example_title: Zaire ebolavirus
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: >-
      AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT
  - example_title: SARS coronavirus
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: >-
      ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT
  - example_title: insulin
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: >-
      ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG
  - example_title: cyclin dependent kinase inhibitor 2A
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: >-
      ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA
  - example_title: human papillomavirus type 16 E6
    pipeline_tag: regulatory-track
    sequence_type: cDNA
    task: regulatory-track
    text: >-
      ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA

Borzoi

Sequence-to-coverage neural network for predicting RNA-seq and chromatin tracks across 524 kb DNA windows at 32 bp resolution.

Disclaimer

This is an UNOFFICIAL implementation of "Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation" by Johannes Linder, Divyanshi Srivastava, Han Yuan, et al.

The OFFICIAL repository of Borzoi is at calico/borzoi.

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released Borzoi did not write a model card for it; this model card was written by the MultiMolecule team.

Model Details

Borzoi is the successor to Enformer. It extends the Enformer recipe (convolution stem + Transformer trunk + binned multi-track output) to a 524,288 bp input window with 32 bp output bins, and adds a U-Net style upsampling tail so that the binned positional axis supports a higher-resolution coverage prediction.

A 524 kb DNA window passes through the following stages in order:

  • a convolution stem and a width-growing residual convolution tower, which downsample the sequence;
  • a U-Net bottleneck, which projects the features to 1,536 channels;
  • 8 Transformer blocks with Transformer-XL style relative positional encoding;
  • two skip-connected U-Net upsampling stages with depthwise-separable convolutions;
  • a center crop to 6,144 bins;
  • a projection to per-species coverage tracks with a softplus activation.

The output is binned: it has shape (batch_size, target_length, num_tracks), where each bin summarizes 32 bp of sequence and num_tracks is the number of genomic coverage experiments for the selected species. Borzoi was trained jointly on RNA-seq, CAGE, ATAC-seq, DNase-seq, and ChIP-seq tracks. Please refer to the Training Details section for more information on the training process.
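The window geometry follows from the figures quoted above, and a few lines of arithmetic make the relationship between input length, bin size, and cropped output explicit. This is a minimal sketch rather than library code. It assumes that the center crop is symmetric and that each of the two upsampling stages doubles the resolution; neither detail is stated in this card.

```python
# Window geometry implied by the model description (illustrative only).
INPUT_LENGTH = 524_288   # bp in the input window
BIN_SIZE = 32            # bp summarized by each output bin
TARGET_LENGTH = 6_144    # bins kept after center-cropping

# Number of 32 bp bins spanning the full window before cropping.
uncropped_bins = INPUT_LENGTH // BIN_SIZE

# Assumption: the crop is symmetric, so each flank loses the same number of bins.
cropped_per_side = (uncropped_bins - TARGET_LENGTH) // 2

# Sequence span that actually receives predictions, and the flank on each side.
predicted_bp = TARGET_LENGTH * BIN_SIZE
flank_bp = cropped_per_side * BIN_SIZE

# Assumption: each of the two U-Net upsampling stages doubles the resolution,
# which would place the Transformer trunk at a coarser bin size.
UPSAMPLING_STAGES = 2
trunk_bin_size = BIN_SIZE * 2 ** UPSAMPLING_STAGES
trunk_positions = INPUT_LENGTH // trunk_bin_size

print(uncropped_bins, cropped_per_side, predicted_bp, flank_bp)
print(trunk_bin_size, trunk_positions)
```

Under these assumptions, only the central 196,608 bp of each window receive predictions. The remaining 163,840 bp on each side serve as sequence context.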

Variants

Borzoi releases separate human and mouse checkpoints for the corresponding species track sets.

Model Specification

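| Input Length (bp) | Bin Size (bp) | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) | FLOPs (P) | MACs (P) |
| ----------------- | ------------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ | --------- | -------- |
| 524288            | 32            | 6144        | 1536        | 8      | 8     | 7611       | 185.90             | 13.57     | 6.76     |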

The table reports the human checkpoint. The mouse checkpoint predicts 2,608 tracks. FLOPs and MACs are measured on the canonical 524,288 bp Borzoi input window.

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

Genomic Coverage Prediction

You can use this model to predict binned RNA-seq and chromatin coverage tracks from a DNA sequence:

>>> import torch
>>> from multimolecule import BorzoiConfig, BorzoiForTokenPrediction

>>> # Tiny, randomly initialized configuration for a quick shape check;
>>> # it does not load the pretrained 524,288 bp checkpoint.
>>> config = BorzoiConfig(
...     sequence_length=512, hidden_size=16, num_hidden_layers=1, num_attention_heads=2,
...     attention_head_size=4, attention_value_size=4, num_rel_pos_features=4,
...     stem_channels=8, conv_tower_channels=[12], head_hidden_size=8, target_length=16,
...     num_labels=4,
... )
>>> model = BorzoiForTokenPrediction(config)
>>> # One random sequence of 512 token ids; the output has
>>> # target_length=16 bins and num_labels=4 tracks.
>>> output = model(torch.randint(config.vocab_size, (1, 512)))
>>> output.logits.shape
torch.Size([1, 16, 4])

The binned positional axis is treated as the "token" axis: each output position corresponds to one genomic bin rather than a single nucleotide. The species configuration option selects the human (7,611 tracks) or mouse (2,608 tracks) species track set for the converted checkpoint.
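Because each output position is a genomic bin, it is often useful to map a bin index back to the coordinates it covers. The helper below is a hypothetical sketch and is not part of the multimolecule API. It assumes 0-based, half-open coordinates and a symmetric center crop of the full-size 524,288 bp window.

```python
from typing import Tuple

INPUT_LENGTH = 524_288   # bp in the input window
BIN_SIZE = 32            # bp per output bin
TARGET_LENGTH = 6_144    # output bins after center-cropping


def bin_to_interval(window_start: int, bin_index: int) -> Tuple[int, int]:
    """Return the half-open [start, end) interval covered by one output bin.

    Hypothetical helper: assumes 0-based coordinates and a symmetric crop.
    """
    if not 0 <= bin_index < TARGET_LENGTH:
        raise IndexError(f"bin_index must be in [0, {TARGET_LENGTH})")
    # Bases skipped on the left flank because of the center crop.
    crop_offset = (INPUT_LENGTH - TARGET_LENGTH * BIN_SIZE) // 2
    start = window_start + crop_offset + bin_index * BIN_SIZE
    return start, start + BIN_SIZE


first = bin_to_interval(0, 0)
last = bin_to_interval(0, TARGET_LENGTH - 1)
print(first, last)
```

With a window starting at coordinate 0, the first bin begins 163,840 bp into the window, and the last bin ends the same distance before the window's end.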

Interface

  • Input length: fixed 524,288 bp DNA window
  • Output binning: 32 bp per output bin; 6,144 output bins per window (after center-cropping the U-Net upsampling tail)
  • Species track set: select human (7,611 tracks) or mouse (2,608 tracks) via the species config option
  • Output: (batch_size, target_length, num_tracks)
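Since the input length is fixed, a sequence of a different length has to be cropped or padded before it can fill a full window. The sketch below shows one way to do that. The helper name `fit_to_window` and the use of `N` as the padding base are assumptions made for illustration; this card does not say how the tokenizer or the model treats padded positions, and predictions over padded regions should not be read as meaningful.

```python
INPUT_LENGTH = 524_288  # fixed window length in bp


def fit_to_window(sequence: str, length: int = INPUT_LENGTH, pad: str = "N") -> str:
    """Center-crop or center-pad `sequence` to exactly `length` bases.

    Hypothetical helper: padding with "N" is an assumption, not documented behavior.
    """
    sequence = sequence.upper()
    excess = len(sequence) - length
    if excess >= 0:
        # Too long (or exact): keep the central `length` bases.
        start = excess // 2
        return sequence[start:start + length]
    # Too short: split the missing bases between the two flanks.
    missing = -excess
    left = missing // 2
    return pad * left + sequence + pad * (missing - left)


window = fit_to_window("ACGT")
print(len(window))
```

When the number of missing bases is odd, the extra padding base goes on the right flank.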

Training Details

Borzoi was trained to predict bulk RNA-seq coverage together with chromatin tracks (DNase-seq, ATAC-seq, ChIP-seq) and CAGE from the human and mouse reference genomes.

Training Data

The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. Each genome was divided into 524 kb windows; for each window, the coverage of every experiment in 32 bp bins served as the regression target. The training set is dominated by RNA-seq coverage, the modality that Borzoi adds relative to Enformer; the remaining tracks cover the chromatin and CAGE modalities already used by Enformer.
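The per-32-bp targets described above can be pictured as a simple binning step over per-base coverage. The sketch below sums coverage within fixed-width bins. The helper name `bin_coverage` and the choice of summation rather than averaging are assumptions for illustration, and the original data pipeline may apply further scaling or clipping that is not described here.

```python
from typing import List, Sequence


def bin_coverage(per_base: Sequence[float], bin_size: int = 32) -> List[float]:
    """Aggregate per-base coverage into consecutive bins of `bin_size` bases.

    Assumption: bins are aggregated by summation.
    """
    if len(per_base) % bin_size != 0:
        raise ValueError("coverage length must be a multiple of bin_size")
    return [
        sum(per_base[i:i + bin_size])
        for i in range(0, len(per_base), bin_size)
    ]


# Two bins of uniform coverage 1.0 per base.
targets = bin_coverage([1.0] * 64)
print(targets)
```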

Training Procedure

Pre-training

The model was trained to minimize a Poisson-multinomial regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative. Training used the Adam optimizer with a warmup schedule and global gradient-norm clipping; reverse-complement and small genomic-shift data augmentations were applied during training.
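To make the objective concrete, the sketch below implements one common form of a Poisson-multinomial loss for a single track in plain Python. A Poisson term scores the total predicted coverage, and a multinomial term scores how that coverage is distributed across bins. The function names, the `total_weight` parameter, the epsilon, and the normalization by the number of bins are assumptions for illustration; the exact weighting used to train Borzoi is not given in this card.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_multinomial_loss(
    pred: Sequence[float],
    target: Sequence[float],
    total_weight: float = 1.0,  # assumed weight on the total-count term
    eps: float = 1e-6,          # assumed guard against log(0)
) -> float:
    """Loss for one track over its bins (illustrative, not the authors' code)."""
    n = len(pred)
    pred = [p + eps for p in pred]
    target = [t + eps for t in target]
    pred_total = sum(pred)
    target_total = sum(target)
    # Poisson negative log-likelihood of the total count (constant term dropped).
    poisson_term = pred_total - target_total * math.log(pred_total)
    # Multinomial negative log-likelihood of the per-bin shape.
    multinomial_term = -sum(
        t * math.log(p / pred_total) for p, t in zip(pred, target)
    )
    return (multinomial_term + total_weight * poisson_term) / n


observed = [1.0, 2.0, 3.0]
matched = poisson_multinomial_loss(observed, observed)
wrong_shape = poisson_multinomial_loss([3.0, 2.0, 1.0], observed)
wrong_total = poisson_multinomial_loss([2.0, 4.0, 6.0], observed)
print(matched < wrong_shape, matched < wrong_total)
```

Separating the two terms lets the model be penalized independently for predicting the wrong overall level of coverage and for placing coverage in the wrong bins.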

Citation

@article{linder2025predicting,
  author    = {Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R.},
  title     = {Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation},
  journal   = {Nature Genetics},
  year      = 2025,
  volume    = 57,
  number    = 4,
  pages     = {949--961},
  doi       = {10.1038/s41588-024-02053-6},
  publisher = {Nature Publishing Group}
}

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the Borzoi paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later