---
language: dna
tags:
- Biology
- DNA
license: agpl-3.0
datasets:
- multimolecule/gencode
library_name: multimolecule
---
# Basenji
Deep convolutional neural network for predicting genomic coverage tracks across chromosomes.
## Disclaimer
This is an UNOFFICIAL implementation of [Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks](https://doi.org/10.1101/gr.227819.117) by David R. Kelley, Yakir A. Reshef et al.
The OFFICIAL repository of Basenji is at [calico/basenji](https://github.com/calico/basenji).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
**The team releasing Basenji did not write a model card for this model, so this model card has been written by the MultiMolecule team.**
## Model Details
Basenji is a deep convolutional neural network trained to predict genomic regulatory activity from long DNA sequences. It consumes a long DNA window (~131 kb), passes it through a convolution + pooling stem that downsamples the sequence, and then through a tower of dilated residual convolutional blocks that expand the receptive field. A pointwise output head predicts a vector of genomic coverage tracks for each output bin. Because the stem downsamples the input, the prediction is **binned**: the output has shape `(batch_size, num_bins, num_tracks)` where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments.
### Model Specification
- Input window: 131,072 bp
- Bin size: 128 bp (`stem_pool_size ** num_pool_layers`, 7 pooling stages)
- Pre-crop bins: 1,024
- `Cropping1D`: 64 bins per side
- Output bins: 896
- Stem channels: 288
- Convolution tower channels (growing): 339 -> 399 -> 470 -> 554 -> 652 -> 768
- Dilated residual stream: 768 channels with a 384-channel bottleneck
- Dilated residual blocks: 11 (dilation 1, 2, 3, 4, 6, 9, 14, 21, 32, 48, 72)
- Final pointwise block: 1,536 channels
- Activation: tanh-approximation GELU (`gelu_new`); output activation: `softplus`
- Coverage tracks (`num_labels`): 5,313 (default; the human track set released with Basenji2)
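
The bin counts and the dilation schedule in the list above follow from simple arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, the pool size of 2 is inferred from 128 = 2^7, and the multiply-by-1.5-then-round recurrence is an assumption that happens to reproduce the listed dilations. It is not a statement about how the library computes them.

```python
# Illustrative arithmetic only; these names are hypothetical, not library API.
INPUT_WINDOW = 131_072   # bp, from the specification above
POOL_SIZE = 2            # inferred from a 128 bp bin over 7 pooling stages
NUM_POOL_STAGES = 7
CROP_BINS = 64           # bins removed from each side by Cropping1D

bin_size = POOL_SIZE ** NUM_POOL_STAGES        # 128 bp per bin
pre_crop_bins = INPUT_WINDOW // bin_size       # 1,024 bins before cropping
output_bins = pre_crop_bins - 2 * CROP_BINS    # 896 bins after cropping


def dilation_schedule(num_blocks, rate_mult=1.5):
    """Start at 1 and repeatedly multiply by rate_mult, rounding each step.

    Python's round() rounds halves to the nearest even integer,
    so 1.5 -> 2, 4.5 -> 4, 13.5 -> 14 and 31.5 -> 32.
    """
    rates, rate = [], 1
    for _ in range(num_blocks):
        rates.append(rate)
        rate = round(rate * rate_mult)
    return rates


print(bin_size, pre_crop_bins, output_bins)
print(dilation_schedule(11))
```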
### Links
- **Code**: [multimolecule.basenji](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basenji)
- **Weights**: [multimolecule/basenji](https://huggingface.co/multimolecule/basenji)
- **Paper**: [Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks](https://doi.org/10.1101/gr.227819.117)
- **Developed by**: David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, Jasper Snoek
- **Original Repository**: [calico/basenji](https://github.com/calico/basenji)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
You can use this model to predict binned genomic coverage tracks from a DNA sequence:
```python
>>> import torch
>>> from multimolecule import DnaTokenizer, BasenjiConfig, BasenjiForTokenPrediction
>>> config = BasenjiConfig(
... sequence_length=256, stem_channels=8, conv_tower_channels=[8],
... stem_pool_size=2, head_hidden_size=8, crop_bins=2, num_labels=4,
... blocks={"num_blocks": 1, "kernel_size": 3, "bottleneck_size": 4},
... )
>>> model = BasenjiForTokenPrediction(config)
>>> output = model(torch.randint(config.vocab_size, (1, 256)))
>>> output.logits.shape
torch.Size([1, 60, 4])
```
The binned positional axis is treated as the "token" axis: each output position corresponds to one
genomic bin rather than a single nucleotide.
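
To relate an output position back to the input sequence, the bin index can be shifted by the cropped bins and scaled by the bin size. The helper below is a minimal sketch, not part of `multimolecule`: `bin_to_input_span` is a hypothetical name, and it assumes symmetric cropping with bins aligned to the start of the input window. The span is the bin's own footprint, not the model's receptive field.

```python
def bin_to_input_span(bin_index, bin_size=128, crop_bins=64):
    """Return the half-open [start, end) input offsets covered by one output bin.

    Hypothetical helper: assumes `crop_bins` bins were removed from each side
    and that bin 0 of the uncropped output starts at input offset 0.
    """
    if bin_index < 0:
        raise ValueError("bin_index must be non-negative")
    start = (bin_index + crop_bins) * bin_size
    return start, start + bin_size


# Full-size model: 131,072 bp window, 128 bp bins, 64 bins cropped per side.
print(bin_to_input_span(0))
print(bin_to_input_span(895))

# Toy configuration from the example above, assuming a bin size of 4
# (two pooling stages of size 2) and crop_bins=2.
print(bin_to_input_span(59, bin_size=4, crop_bins=2))
```

Under these assumptions, the first and last output bins of the full-size model cover offsets 8,192 to 8,320 and 122,752 to 122,880, so the outer 8,192 bp on each side of the window receive no prediction.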
## Training Details
Basenji was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from
the human and mouse reference genomes.
### Training Data
The model was trained on a large compendium of functional genomics experiments aligned to the human
(hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each
window the per-128-bp coverage of every experiment served as the regression target.
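
As a rough illustration of how a base-pair-resolution coverage signal could be turned into per-bin targets, the sketch below aggregates a coverage list into fixed-size bins and optionally drops cropped bins. The function name is hypothetical, and summing within each bin is an assumption: the text above only states that per-128-bp coverage was used, and the original pipeline may aggregate, clip, or scale the signal differently.

```python
def bin_coverage(coverage, bin_size=128, crop_bins=0):
    """Aggregate per-base coverage into fixed-size bins (assumed: sum per bin).

    `coverage` is a sequence of per-base values whose length must be a
    multiple of `bin_size`. `crop_bins` bins are dropped from each side so
    the targets line up with a cropped model output.
    """
    if len(coverage) % bin_size != 0:
        raise ValueError("coverage length must be a multiple of bin_size")
    bins = [
        sum(coverage[start:start + bin_size])
        for start in range(0, len(coverage), bin_size)
    ]
    return bins[crop_bins:len(bins) - crop_bins]


# Toy example with 4 bp bins: 16 positions give 4 bins, 2 remain after cropping.
print(bin_coverage(list(range(16)), bin_size=4, crop_bins=1))
```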
### Training Procedure
The model was trained to minimize a Poisson regression loss between predicted and observed coverage.
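
For readers unfamiliar with Poisson regression, the following dependency-free sketch shows the general form of such a loss on non-negative predictions, with `softplus` keeping predictions positive. The reduction (a plain mean), the `eps` value, and the omission of the `log(observed!)` term are assumptions for illustration; the exact upstream training loss is not specified here.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood, up to a term constant in `predicted`.

    Per element: predicted - observed * log(predicted + eps).
    The log(observed!) term is dropped because it does not affect the gradient.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    total = sum(p - o * math.log(p + eps) for p, o in zip(predicted, observed))
    return total / len(predicted)


# Raw head outputs are mapped to positive rates before computing the loss.
raw_outputs = [-1.0, 0.0, 2.0]
rates = [softplus(x) for x in raw_outputs]
print(poisson_nll(rates, [0.0, 1.0, 2.0]))
```

For a single element, this loss is minimized when the predicted rate equals the observed count. In PyTorch, `torch.nn.PoissonNLLLoss(log_input=False)` computes a comparable quantity.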
## Known Limitations
- This implementation targets the upstream Basenji2 human graph
(`https://storage.googleapis.com/basenji_barnyard2/model_human.h5`;
`manuscripts/cross2020/params_human.json`): pre-activation convolution blocks
(GELU → Conv → BatchNorm), bias-free convolutions, a growing-width convolution tower
(288 → 339 → 399 → 470 → 554 → 652 → 768), dilated residual blocks on a 768-channel stream with a
384-channel bottleneck, a `Cropping1D(64)`, a final 1,536-channel pointwise block, and a
`Dense (1536, 5313)` track head. The converter reads the weights directly as raw `h5py`
datasets, transposes convolution kernels and `Dense` weights from TensorFlow layout to PyTorch,
and reorders the first convolution's input channels into MultiMolecule DNA token order.
- **`softplus` placement**: upstream applies `softplus` as the activation of the final `Dense`
layer. The shared `TokenPredictionHead` computes the unactivated `Dense (1536 → 5313)` projection
and `BasenjiForTokenPrediction.forward` applies `config.head_act` (`softplus` by default) to the
head output as the model's output transform.
## Citation
```bibtex
@article{kelley2018sequential,
author = {Kelley, David R. and Reshef, Yakir A. and Bileschi, Maxwell and Belanger, David and McLean, Cory Y. and Snoek, Jasper},
title = {Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks},
journal = {Genome Research},
year = 2018,
volume = 28,
number = 5,
pages = {739--750},
doi = {10.1101/gr.227819.117},
publisher = {Cold Spring Harbor Laboratory}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [Basenji paper](https://doi.org/10.1101/gr.227819.117) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```