---
datasets:
- multimolecule/encode
- multimolecule/fantom5
- multimolecule/gtex
language: dna
library_name: multimolecule
license: agpl-3.0
pipeline: regulatory-track
pipeline_tag: other
tags:
- Biology
- DNA
widget:
- example_title: tumor protein p53
  pipeline_tag: regulatory-track
  sequence_type: DNA
  task: regulatory-track
  text: ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG
- example_title: BRCA1 DNA repair associated
  pipeline_tag: regulatory-track
  sequence_type: DNA
  task: regulatory-track
  text: TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG
- example_title: hemoglobin subunit beta
  pipeline_tag: regulatory-track
  sequence_type: DNA
  task: regulatory-track
  text: CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
- example_title: CF transmembrane conductance regulator
  pipeline_tag: regulatory-track
  sequence_type: DNA
  task: regulatory-track
  text: ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
- example_title: telomerase reverse transcriptase
  pipeline_tag: regulatory-track
  sequence_type: DNA
  task: regulatory-track
  text: CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG
- example_title: KRAS proto-oncogene
  pipeline_tag: regulatory-track
  sequence_type: DNA
  task: regulatory-track
  text: GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG
- example_title: prion protein (Kanno blood group)
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC
- example_title: interleukin 10
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC
- example_title: Zaire ebolavirus
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT
- example_title: SARS coronavirus
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT
- example_title: insulin
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG
- example_title: cyclin dependent kinase inhibitor 2A
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA
- example_title: human papillomavirus type 16 E6
  pipeline_tag: regulatory-track
  sequence_type: cDNA
  task: regulatory-track
  text: ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA
---

# Borzoi

Sequence-to-coverage neural network for predicting RNA-seq and chromatin tracks across 524 kb DNA windows at 32 bp resolution.

## Disclaimer

This is an UNOFFICIAL implementation of [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1038/s41588-024-02053-6) by Johannes Linder, Divyanshi Srivastava, Han Yuan, et al.

The OFFICIAL repository of Borzoi is at [calico/borzoi](https://github.com/calico/borzoi).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team that released Borzoi did not write a model card for this model, so this model card was written by the MultiMolecule team.**

## Model Details

Borzoi is the successor to [Enformer](https://huggingface.co/multimolecule/enformer). It extends the Enformer recipe (convolution stem + Transformer trunk + binned multi-track output) to a 524,288 bp input window and 32 bp output bins, and adds a U-Net style upsampling tail so that the positional axis is restored to the finer 32 bp output resolution after the trunk.

A 524 kb DNA window is downsampled by a convolution stem and a width-growing residual convolution tower, projected to 1,536 channels by a U-Net bottleneck, and processed by 8 Transformer blocks with Transformer-XL style relative positional encoding. It is then upsampled by two skip-connected U-Net stages with depthwise-separable convolutions, center-cropped to 6,144 bins, and projected to per-species coverage tracks with a softplus activation.

The output is **binned**: it has shape `(batch_size, target_length, num_tracks)`, where each bin summarizes 32 bp of sequence and `num_tracks` is the number of genomic coverage experiments for the selected species. Borzoi was trained jointly on RNA-seq, CAGE, ATAC-seq, DNase-seq, and ChIP-seq tracks. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Variants

Borzoi releases separate human and mouse checkpoints for the corresponding species track sets.

- **[multimolecule/borzoi-human](https://huggingface.co/multimolecule/borzoi-human)**: human checkpoint with 7,611 human genomic coverage tracks.
- **[multimolecule/borzoi-mouse](https://huggingface.co/multimolecule/borzoi-mouse)**: mouse checkpoint with 2,608 mouse genomic coverage tracks.

### Model Specification

| Input Length | Bin Size | Output Bins | Hidden Size | Layers | Heads | Num Labels | Num Parameters (M) | FLOPs (P) | MACs (P) |
| ------------ | -------- | ----------- | ----------- | ------ | ----- | ---------- | ------------------ | --------- | -------- |
| 524288       | 32       | 6144        | 1536        | 8      | 8     | 7611       | 185.90             | 13.57     | 6.76     |

The table reports the human checkpoint. The mouse checkpoint predicts 2,608 tracks.
FLOPs and MACs are measured on the canonical 524,288 bp Borzoi input window.
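
The figures in the table are related by simple arithmetic, which the following minimal sketch spells out. It only uses numbers stated in this card; the symmetric crop, the `float32` output dtype, and the variable names are assumptions made for illustration.

```python
# Values taken from the specification table (human checkpoint).
INPUT_LENGTH = 524_288   # bp per input window
BIN_SIZE = 32            # bp per output bin
OUTPUT_BINS = 6_144      # bins kept after center-cropping
NUM_TRACKS = 7_611       # human coverage tracks

# Number of 32 bp bins spanning the full window before cropping.
total_bins = INPUT_LENGTH // BIN_SIZE

# Length of sequence that actually receives predictions.
predicted_bp = OUTPUT_BINS * BIN_SIZE

# Bins discarded on each side, assuming the crop is symmetric.
cropped_per_side = (total_bins - OUTPUT_BINS) // 2

# Size of one output tensor of shape (1, OUTPUT_BINS, NUM_TRACKS),
# assuming float32 values (4 bytes each).
output_mib = OUTPUT_BINS * NUM_TRACKS * 4 / 2**20

print(total_bins, predicted_bp, cropped_per_side, round(output_mib, 1))
```

In other words, only the central 196,608 bp of each window is predicted, and a single human prediction at full precision occupies roughly 178 MiB, which is worth keeping in mind when batching.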

### Links

- **Code**: [multimolecule.borzoi](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/borzoi)
- **Data**: ENCODE, GTEx, FANTOM5 RNA-seq / CAGE / ATAC-seq / DNase-seq / ChIP-seq tracks aligned to human and mouse genomes
- **Paper**: [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1038/s41588-024-02053-6)
- **Developed by**: Johannes Linder, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, David R. Kelley
- **Model type**: Convolutional stem followed by Transformer trunk and U-Net upsampling tail for binned multi-track RNA-seq and chromatin coverage prediction
- **Original Repository**: [calico/borzoi](https://github.com/calico/borzoi)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Genomic Coverage Prediction

You can use this model to predict binned RNA-seq and chromatin coverage tracks from a DNA sequence:

```python
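>>> import torch
>>> from multimolecule import DnaTokenizer, BorzoiConfig, BorzoiForTokenPrediction

>>> # A deliberately tiny, randomly initialized configuration for a quick shape check;
>>> # the released checkpoints are far larger (see Model Specification).
>>> config = BorzoiConfig(
...     sequence_length=512, hidden_size=16, num_hidden_layers=1, num_attention_heads=2,
...     attention_head_size=4, attention_value_size=4, num_rel_pos_features=4,
...     stem_channels=8, conv_tower_channels=[12], head_hidden_size=8, target_length=16,
...     num_labels=4,
... )
>>> model = BorzoiForTokenPrediction(config)
>>> # Random token IDs stand in for a tokenized DNA sequence of length 512.
>>> output = model(torch.randint(config.vocab_size, (1, 512)))
>>> # The output shape is (batch_size, target_length, num_labels).
>>> output.logits.shape
torch.Size([1, 16, 4])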
```

The binned positional axis is treated as the "token" axis: each output position corresponds to one
genomic bin rather than a single nucleotide. The `species` configuration option selects the
`human` (7,611 tracks) or `mouse` (2,608 tracks) species track set for the converted checkpoint.

### Interface

- **Input length**: fixed 524,288 bp DNA window
- **Output binning**: 32 bp per output bin; 6,144 output bins per window (after center-cropping the U-Net upsampling tail)
- **Species track set**: select `human` (7,611 tracks) or `mouse` (2,608 tracks) via the `species` config option
- **Output**: `(batch_size, target_length, num_tracks)`
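
Because the output is center-cropped, output bin 0 does not start at the first base of the input window. The sketch below shows one way to map between output bins and window coordinates. It assumes the crop is symmetric, and the helper names `bin_to_interval` and `position_to_bin` are hypothetical, not part of the `multimolecule` API.

```python
INPUT_LENGTH = 524_288   # bp per input window
BIN_SIZE = 32            # bp per output bin
OUTPUT_BINS = 6_144      # bins per window after cropping

# Offset of the first predicted base, assuming a symmetric center crop.
CROP_BP = (INPUT_LENGTH - OUTPUT_BINS * BIN_SIZE) // 2


def bin_to_interval(bin_index, window_start=0):
    """Return the half-open [start, end) interval covered by an output bin."""
    if not 0 <= bin_index < OUTPUT_BINS:
        raise IndexError(f"bin_index must be in [0, {OUTPUT_BINS})")
    start = window_start + CROP_BP + bin_index * BIN_SIZE
    return start, start + BIN_SIZE


def position_to_bin(position, window_start=0):
    """Return the output bin containing a position, or None if it was cropped."""
    offset = position - window_start - CROP_BP
    if 0 <= offset < OUTPUT_BINS * BIN_SIZE:
        return offset // BIN_SIZE
    return None


print(bin_to_interval(0))        # first predicted bin
print(position_to_bin(262_144))  # bin containing the window midpoint
```

Passing the genomic start of the window as `window_start` turns the window-relative intervals into chromosome coordinates.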

## Training Details

Borzoi was trained to predict bulk RNA-seq coverage together with chromatin tracks (DNase-seq, ATAC-seq, ChIP-seq) and CAGE from the human and mouse reference genomes.

### Training Data

The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. Each genome was divided into 524 kb windows; for each window, the coverage of every experiment at 32 bp resolution served as the regression target. The training set is dominated by RNA-seq coverage (the modality Borzoi adds relative to Enformer); the remaining tracks cover the chromatin and CAGE modalities used by Enformer.
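
To make the notion of a per-32-bp target concrete, the following sketch aggregates a per-base coverage vector into fixed-width bins. Summing within each bin is an assumption chosen for illustration; this card does not state which aggregation or value transforms the original pipeline applies, and `bin_coverage` is a hypothetical helper.

```python
def bin_coverage(per_base, bin_size=32, reduce=sum):
    """Aggregate per-base coverage into consecutive bins of `bin_size` bases.

    `reduce` is applied to each bin; `sum` is an illustrative default,
    not necessarily the aggregation used to build the training targets.
    """
    if len(per_base) % bin_size != 0:
        raise ValueError("coverage length must be a multiple of bin_size")
    return [
        reduce(per_base[start:start + bin_size])
        for start in range(0, len(per_base), bin_size)
    ]


# Two bins of uniform coverage 1.0 give two bins with total coverage 32.
print(bin_coverage([1.0] * 64))
```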

### Training Procedure

#### Pre-training

The model was trained to minimize a Poisson-multinomial regression loss between predicted and observed coverage, using a softplus output activation to keep the predicted coverage non-negative. Training used the Adam optimizer with a warmup schedule and global gradient-norm clipping; reverse-complement and small genomic-shift data augmentations were applied during training.
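
As a rough illustration of the objective described above, the sketch below computes a Poisson-multinomial loss for a single track in plain Python. It splits the loss into a Poisson term on the total coverage and a multinomial term on how that coverage is distributed across bins. The `total_weight` and `eps` parameters, the normalization by the number of bins, and the function names are assumptions for illustration; this is not the original training code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_multinomial(y_true, y_pred, total_weight=1.0, eps=1e-6):
    """Poisson-multinomial loss for one track given per-bin coverage lists."""
    num_bins = len(y_true)
    # A small epsilon guards against log(0).
    y_true = [value + eps for value in y_true]
    y_pred = [value + eps for value in y_pred]
    s_true, s_pred = sum(y_true), sum(y_pred)
    # Poisson negative log-likelihood of the total (constant term dropped).
    poisson_term = s_pred - s_true * math.log(s_pred)
    # Multinomial negative log-likelihood of the per-bin distribution.
    multinomial_term = -sum(
        t * math.log(p / s_pred) for t, p in zip(y_true, y_pred)
    )
    return (multinomial_term + total_weight * poisson_term) / num_bins


observed = [1.0, 2.0, 3.0, 4.0]
# A prediction with the right total but the wrong shape is penalized.
print(poisson_multinomial(observed, observed))
print(poisson_multinomial(observed, observed[::-1]))
```

The split lets the total amount of coverage and its positional shape be weighted separately, while applying `softplus` to raw model outputs keeps every predicted value positive so the logarithms are defined.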

## Citation

```bibtex
@article{linder2025predicting,
  author    = {Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R.},
  title     = {Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation},
  journal   = {Nature Genetics},
  year      = 2025,
  volume    = 57,
  number    = 4,
  pages     = {949--961},
  doi       = {10.1038/s41588-024-02053-6},
  publisher = {Nature Publishing Group}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) GitHub issues for any questions or comments on the model card.

Please contact the authors of the [Borzoi paper](https://doi.org/10.1038/s41588-024-02053-6) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```