Instructions to use multimolecule/borzoi-human with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MultiMolecule
How to use multimolecule/borzoi-human with MultiMolecule:
pip install multimolecule
from multimolecule import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("multimolecule/borzoi-human") model = AutoModel.from_pretrained("multimolecule/borzoi-human") inputs = tokenizer("ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG", return_tensors="pt") outputs = model(**inputs) embeddings = outputs.last_hidden_state - Notebooks
- Google Colab
- Kaggle
| datasets: | |
| - multimolecule/encode | |
| - multimolecule/fantom5 | |
| - multimolecule/gtex | |
| language: dna | |
| library_name: multimolecule | |
| license: agpl-3.0 | |
| pipeline: regulatory-track | |
| pipeline_tag: other | |
| tags: | |
| - Biology | |
| - DNA | |
| widget: | |
| - example_title: tumor protein p53 | |
| pipeline_tag: regulatory-track | |
| sequence_type: DNA | |
| task: regulatory-track | |
| text: ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG | |
| - example_title: BRCA1 DNA repair associated | |
| pipeline_tag: regulatory-track | |
| sequence_type: DNA | |
| task: regulatory-track | |
| text: TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG | |
| - example_title: hemoglobin subunit beta | |
| pipeline_tag: regulatory-track | |
| sequence_type: DNA | |
| task: regulatory-track | |
| text: CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG | |
| - example_title: CF transmembrane conductance regulator | |
| pipeline_tag: regulatory-track | |
| sequence_type: DNA | |
| task: regulatory-track | |
| text: ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG | |
| - example_title: telomerase reverse transcriptase | |
| pipeline_tag: regulatory-track | |
| sequence_type: DNA | |
| task: regulatory-track | |
| text: CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG | |
| - example_title: KRAS proto-oncogene | |
| pipeline_tag: regulatory-track | |
| sequence_type: DNA | |
| task: regulatory-track | |
| text: GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG | |
| - example_title: prion protein (Kanno blood group) | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC | |
| - example_title: interleukin 10 | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC | |
| - example_title: Zaire ebolavirus | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT | |
| - example_title: SARS coronavirus | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT | |
| - example_title: insulin | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG | |
| - example_title: cyclin dependent kinase inhibitor 2A | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA | |
| - example_title: human papillomavirus type 16 E6 | |
| pipeline_tag: regulatory-track | |
| sequence_type: cDNA | |
| task: regulatory-track | |
| text: ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA | |
| # Borzoi | |
| Sequence-to-coverage neural network for predicting RNA-seq and chromatin tracks across 524 kb DNA windows at 32 bp resolution. | |
| ## Disclaimer | |
| This is an UNOFFICIAL implementation of [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1038/s41588-024-02053-6) by Johannes Linder, Divyanshi Srivastava, Han Yuan, et al. | |
| The OFFICIAL repository of Borzoi is at [calico/borzoi](https://github.com/calico/borzoi). | |
| > [!TIP] | |
| > The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation. | |
| **The team releasing Borzoi did not write this model card for this model so this model card has been written by the MultiMolecule team.** | |
| ## Model Details | |
| Borzoi is the successor of [Enformer](https://huggingface.co/multimolecule/enformer). It extends the Enformer recipe (convolution stem + Transformer trunk + binned multi-track output) to a 524,288 bp input window and 32 bp output bins, and adds a U-Net style upsampling tail so the binned positional axis matches a higher-resolution coverage prediction. A long DNA window of 524 kb is downsampled by a convolution stem and a width-growing residual convolution tower, projected to 1,536 channels by a U-Net bottleneck, processed by 8 Transformer blocks with Transformer-XL style relative positional encoding, then upsampled by two skip-connected U-Net stages with depthwise-separable convolutions, center-cropped to 6,144 bins, and projected to per-species coverage tracks with a softplus activation. The output is **binned**: it has shape `(batch_size, target_length, num_tracks)` where each bin summarizes 32 bp of sequence and `num_tracks` is the number of genomic coverage experiments for the selected species. Borzoi was trained jointly on RNA-seq, CAGE, ATAC-seq, DNase-seq, and ChIP-seq tracks. Please refer to the [Training Details](#training-details) section for more information on the training process. | |
| ### Variants | |
| Borzoi releases separate human and mouse checkpoints for the corresponding species track sets. | |
| - **[multimolecule/borzoi-human](https://huggingface.co/multimolecule/borzoi-human)**: human checkpoint with 7,611 human genomic coverage tracks. | |
| - **[multimolecule/borzoi-mouse](https://huggingface.co/multimolecule/borzoi-mouse)**: mouse checkpoint with 2,608 mouse genomic coverage tracks. | |
| ### Model Specification | |
| | Input Length | Bin Size | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) | FLOPs (P) | MACs (P) | | |
| | ------------ | -------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ | --------- | -------- | | |
| | 524288 | 32 | 6144 | 1536 | 8 | 8 | 7611 | 185.90 | 13.57 | 6.76 | | |
| The table reports the human checkpoint. The mouse checkpoint predicts 2,608 tracks. | |
| FLOPs and MACs are measured on the canonical 524,288 bp Borzoi input window. | |
| ### Links | |
| - **Code**: [multimolecule.borzoi](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/borzoi) | |
| - **Data**: ENCODE, GTEx, FANTOM5 RNA-seq / CAGE / ATAC-seq / DNase-seq / ChIP-seq tracks aligned to human and mouse genomes | |
| - **Paper**: [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1038/s41588-024-02053-6) | |
| - **Developed by**: Johannes Linder, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, David R. Kelley | |
| - **Model type**: Convolutional stem followed by Transformer trunk and U-Net upsampling tail for binned multi-track RNA-seq and chromatin coverage prediction | |
| - **Original Repository**: [calico/borzoi](https://github.com/calico/borzoi) | |
| ## Usage | |
| The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: | |
| ```bash | |
| pip install multimolecule | |
| ``` | |
| ### Direct Use | |
| #### Genomic Coverage Prediction | |
| You can use this model to predict binned RNA-seq and chromatin coverage tracks from a DNA sequence: | |
| ```python | |
| >>> import torch | |
| >>> from multimolecule import DnaTokenizer, BorzoiConfig, BorzoiForTokenPrediction | |
| >>> config = BorzoiConfig( | |
| ... sequence_length=512, hidden_size=16, num_hidden_layers=1, num_attention_heads=2, | |
| ... attention_head_size=4, attention_value_size=4, num_rel_pos_features=4, | |
| ... stem_channels=8, conv_tower_channels=[12], head_hidden_size=8, target_length=16, | |
| ... num_labels=4, | |
| ... ) | |
| >>> model = BorzoiForTokenPrediction(config) | |
| >>> output = model(torch.randint(config.vocab_size, (1, 512))) | |
| >>> output.logits.shape | |
| torch.Size([1, 16, 4]) | |
| ``` | |
| The binned positional axis is treated as the "token" axis: each output position corresponds to one | |
| genomic bin rather than a single nucleotide. The `species` configuration option selects the | |
| `human` (7,611 tracks) or `mouse` (2,608 tracks) species track set for the converted checkpoint. | |
| ### Interface | |
| - **Input length**: fixed 524,288 bp DNA window | |
| - **Output binning**: 32 bp per output bin; 6,144 output bins per window (after center-cropping the U-Net upsampling tail) | |
| - **Species track set**: select `human` (7,611 tracks) or `mouse` (2,608 tracks) via the `species` config option | |
| - **Output**: `(batch_size, target_length, num_tracks)` | |
| ## Training Details | |
| Borzoi was trained to predict bulk RNA-seq coverage together with chromatin tracks (DNase-seq, ATAC-seq, ChIP-seq) and CAGE from the human and mouse reference genomes. | |
| ### Training Data | |
| The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into 524 kb windows; for each window the per-32-bp coverage of every experiment served as the regression target. The training set is dominated by RNA-seq coverage (the modality Borzoi extends over Enformer); the remaining tracks include the chromatin and CAGE modalities used by Enformer. | |
| ### Training Procedure | |
| #### Pre-training | |
| The model was trained to minimize a Poisson-multinomial regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative. Training used the Adam optimizer with a warmup schedule and global gradient-norm clipping; reverse-complement and small genomic-shift data augmentations were applied during training. | |
| ## Citation | |
| ```bibtex | |
| @article{linder2025predicting, | |
| author = {Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R.}, | |
| title = {Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation}, | |
| journal = {Nature Genetics}, | |
| year = 2025, | |
| volume = 57, | |
| number = 4, | |
| pages = {949--961}, | |
| doi = {10.1038/s41588-024-02053-6}, | |
| publisher = {Nature Publishing Group} | |
| } | |
| ``` | |
| > [!NOTE] | |
| > The artifacts distributed in this repository are part of the MultiMolecule project. | |
| > If MultiMolecule supports your research, please cite the MultiMolecule project as follows: | |
| ```bibtex | |
| @software{chen_2024_12638419, | |
| author = {Chen, Zhiyuan and Zhu, Sophia Y.}, | |
| title = {MultiMolecule}, | |
| doi = {10.5281/zenodo.12638419}, | |
| publisher = {Zenodo}, | |
| url = {https://doi.org/10.5281/zenodo.12638419}, | |
| year = 2024, | |
| month = may, | |
| day = 4 | |
| } | |
| ``` | |
| ## Contact | |
| Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. | |
| Please contact the authors of the [Borzoi paper](https://doi.org/10.1038/s41588-024-02053-6) for questions or comments on the paper/model. | |
| ## License | |
| This model implementation is licensed under the [GNU Affero General Public License](license.md). | |
| For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). | |
| ```spdx | |
| SPDX-License-Identifier: AGPL-3.0-or-later | |
| ``` |