Instructions to use multimolecule/borzoi-human with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MultiMolecule
How to use multimolecule/borzoi-human with MultiMolecule:
pip install multimolecule
from multimolecule import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("multimolecule/borzoi-human") model = AutoModel.from_pretrained("multimolecule/borzoi-human") inputs = tokenizer("ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG", return_tensors="pt") outputs = model(**inputs) embeddings = outputs.last_hidden_state - Notebooks
- Google Colab
- Kaggle
datasets:
- multimolecule/encode
- multimolecule/fantom5
- multimolecule/gtex
language: dna
library_name: multimolecule
license: agpl-3.0
pipeline: regulatory-track
pipeline_tag: other
tags:
- Biology
- DNA
widget:
- example_title: tumor protein p53
pipeline_tag: regulatory-track
sequence_type: DNA
task: regulatory-track
text: >-
ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG
- example_title: BRCA1 DNA repair associated
pipeline_tag: regulatory-track
sequence_type: DNA
task: regulatory-track
text: >-
TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG
- example_title: hemoglobin subunit beta
pipeline_tag: regulatory-track
sequence_type: DNA
task: regulatory-track
text: >-
CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
- example_title: CF transmembrane conductance regulator
pipeline_tag: regulatory-track
sequence_type: DNA
task: regulatory-track
text: >-
ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
- example_title: telomerase reverse transcriptase
pipeline_tag: regulatory-track
sequence_type: DNA
task: regulatory-track
text: >-
CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG
- example_title: KRAS proto-oncogene
pipeline_tag: regulatory-track
sequence_type: DNA
task: regulatory-track
text: >-
GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG
- example_title: prion protein (Kanno blood group)
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC
- example_title: interleukin 10
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC
- example_title: Zaire ebolavirus
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: >-
AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT
- example_title: SARS coronavirus
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: >-
ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT
- example_title: insulin
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: >-
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG
- example_title: cyclin dependent kinase inhibitor 2A
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: >-
ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA
- example_title: human papillomavirus type 16 E6
pipeline_tag: regulatory-track
sequence_type: cDNA
task: regulatory-track
text: >-
ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA
Borzoi
Sequence-to-coverage neural network for predicting RNA-seq and chromatin tracks across 524 kb DNA windows at 32 bp resolution.
Disclaimer
This is an UNOFFICIAL implementation of Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation by Johannes Linder, Divyanshi Srivastava, Han Yuan, et al.
The OFFICIAL repository of Borzoi is at calico/borzoi.
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing Borzoi did not write this model card for this model so this model card has been written by the MultiMolecule team.
Model Details
Borzoi is the successor of Enformer. It extends the Enformer recipe (convolution stem + Transformer trunk + binned multi-track output) to a 524,288 bp input window and 32 bp output bins, and adds a U-Net style upsampling tail so the binned positional axis matches a higher-resolution coverage prediction. A long DNA window of 524 kb is downsampled by a convolution stem and a width-growing residual convolution tower, projected to 1,536 channels by a U-Net bottleneck, processed by 8 Transformer blocks with Transformer-XL style relative positional encoding, then upsampled by two skip-connected U-Net stages with depthwise-separable convolutions, center-cropped to 6,144 bins, and projected to per-species coverage tracks with a softplus activation. The output is binned: it has shape (batch_size, target_length, num_tracks) where each bin summarizes 32 bp of sequence and num_tracks is the number of genomic coverage experiments for the selected species. Borzoi was trained jointly on RNA-seq, CAGE, ATAC-seq, DNase-seq, and ChIP-seq tracks. Please refer to the Training Details section for more information on the training process.
Variants
Borzoi releases separate human and mouse checkpoints for the corresponding species track sets.
- multimolecule/borzoi-human: human checkpoint with 7,611 human genomic coverage tracks.
- multimolecule/borzoi-mouse: mouse checkpoint with 2,608 mouse genomic coverage tracks.
Model Specification
| Input Length | Bin Size | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) | FLOPs (P) | MACs (P) |
|---|---|---|---|---|---|---|---|---|---|
| 524288 | 32 | 6144 | 1536 | 8 | 8 | 7611 | 185.90 | 13.57 | 6.76 |
The table reports the human checkpoint. The mouse checkpoint predicts 2,608 tracks. FLOPs and MACs are measured on the canonical 524,288 bp Borzoi input window.
Links
- Code: multimolecule.borzoi
- Data: ENCODE, GTEx, FANTOM5 RNA-seq / CAGE / ATAC-seq / DNase-seq / ChIP-seq tracks aligned to human and mouse genomes
- Paper: Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation
- Developed by: Johannes Linder, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, David R. Kelley
- Model type: Convolutional stem followed by Transformer trunk and U-Net upsampling tail for binned multi-track RNA-seq and chromatin coverage prediction
- Original Repository: calico/borzoi
Usage
The model file depends on the multimolecule library. You can install it using pip:
pip install multimolecule
Direct Use
Genomic Coverage Prediction
You can use this model to predict binned RNA-seq and chromatin coverage tracks from a DNA sequence:
>>> import torch
>>> from multimolecule import DnaTokenizer, BorzoiConfig, BorzoiForTokenPrediction
>>> config = BorzoiConfig(
... sequence_length=512, hidden_size=16, num_hidden_layers=1, num_attention_heads=2,
... attention_head_size=4, attention_value_size=4, num_rel_pos_features=4,
... stem_channels=8, conv_tower_channels=[12], head_hidden_size=8, target_length=16,
... num_labels=4,
... )
>>> model = BorzoiForTokenPrediction(config)
>>> output = model(torch.randint(config.vocab_size, (1, 512)))
>>> output.logits.shape
torch.Size([1, 16, 4])
The binned positional axis is treated as the "token" axis: each output position corresponds to one
genomic bin rather than a single nucleotide. The species configuration option selects the
human (7,611 tracks) or mouse (2,608 tracks) species track set for the converted checkpoint.
Interface
- Input length: fixed 524,288 bp DNA window
- Output binning: 32 bp per output bin; 6,144 output bins per window (after center-cropping the U-Net upsampling tail)
- Species track set: select
human(7,611 tracks) ormouse(2,608 tracks) via thespeciesconfig option - Output:
(batch_size, target_length, num_tracks)
Training Details
Borzoi was trained to predict bulk RNA-seq coverage together with chromatin tracks (DNase-seq, ATAC-seq, ChIP-seq) and CAGE from the human and mouse reference genomes.
Training Data
The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into 524 kb windows; for each window the per-32-bp coverage of every experiment served as the regression target. The training set is dominated by RNA-seq coverage (the modality Borzoi extends over Enformer); the remaining tracks include the chromatin and CAGE modalities used by Enformer.
Training Procedure
Pre-training
The model was trained to minimize a Poisson-multinomial regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative. Training used the Adam optimizer with a warmup schedule and global gradient-norm clipping; reverse-complement and small genomic-shift data augmentations were applied during training.
Citation
@article{linder2025predicting,
author = {Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R.},
title = {Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation},
journal = {Nature Genetics},
year = 2025,
volume = 57,
number = 4,
pages = {949--961},
doi = {10.1038/s41588-024-02053-6},
publisher = {Nature Publishing Group}
}
The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Borzoi paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
SPDX-License-Identifier: AGPL-3.0-or-later