---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# Basenji

Deep convolutional neural network for predicting genomic coverage tracks across chromosomes.

## Disclaimer

This is an UNOFFICIAL implementation of [Sequential regulatory activity prediction across chromosomes with convolutional neural networks](https://doi.org/10.1101/gr.227819.117) by David R. Kelley, Yakir A. Reshef et al.

The OFFICIAL repository of Basenji is at [calico/basenji](https://github.com/calico/basenji).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing Basenji did not write a model card for this model, so this model card was written by the MultiMolecule team.**

## Model Details

Basenji is a deep convolutional neural network trained to predict genomic regulatory activity from long DNA sequences. It consumes a long DNA window (~131 kb), passes it through a convolution + pooling stem that downsamples the sequence, and then through a tower of dilated residual convolutional blocks that expand the receptive field. A pointwise output head predicts a vector of genomic coverage tracks for each output bin. Because the stem downsamples the input, the prediction is **binned**: the output has shape `(batch_size, num_bins, num_tracks)` where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments.

### Model Specification

- Input window: 131,072 bp
- Bin size: 128 bp (`stem_pool_size ** num_pool_layers`, 7 pooling stages)
- Pre-crop bins: 1,024
- `Cropping1D`: 64 bins per side
- Output bins: 896
- Stem channels: 288
- Convolution tower channels (growing): 339 -> 399 -> 470 -> 554 -> 652 -> 768
- Dilated residual stream: 768 channels with a 384-channel bottleneck
- Dilated residual blocks: 11 (dilation 1, 2, 3, 4, 6, 9, 14, 21, 32, 48, 72)
- Final pointwise block: 1,536 channels
- Activation: tanh-approximation GELU (`gelu_new`); output activation: `softplus`
- Coverage tracks (`num_labels`): 5,313 (default; the human track set released with Basenji2)
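
The bin counts in the list above follow from simple arithmetic. The sketch below reproduces them; it assumes a pooling factor of 2 per stage (inferred from the 128 bp bin size and the 7 pooling stages), and the constant names are illustrative rather than configuration fields.

```python
# Values copied from the specification list above.
INPUT_WINDOW_BP = 131_072
POOL_SIZE = 2             # assumed pooling factor per stage (2 ** 7 == 128)
NUM_POOL_STAGES = 7       # pooling stages, as stated in the list
CROP_BINS_PER_SIDE = 64   # Cropping1D value

# Each pooling stage halves the sequence axis, so one bin spans POOL_SIZE ** stages bp.
bin_size_bp = POOL_SIZE ** NUM_POOL_STAGES

# Number of bins before and after symmetric cropping.
pre_crop_bins = INPUT_WINDOW_BP // bin_size_bp
output_bins = pre_crop_bins - 2 * CROP_BINS_PER_SIDE

# Length of sequence that actually receives predictions.
predicted_span_bp = output_bins * bin_size_bp

print(bin_size_bp, pre_crop_bins, output_bins, predicted_span_bp)
# 128 1024 896 114688
```

Under these assumptions, only the central 114,688 bp of the 131,072 bp window receive predictions; the 8,192 bp on each flank serve as context.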

### Links

- **Code**: [multimolecule.basenji](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basenji)
- **Weights**: [multimolecule/basenji](https://huggingface.co/multimolecule/basenji)
- **Paper**: [Sequential regulatory activity prediction across chromosomes with convolutional neural networks](https://doi.org/10.1101/gr.227819.117)
- **Developed by**: David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, Jasper Snoek
- **Original Repository**: [calico/basenji](https://github.com/calico/basenji)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model to predict binned genomic coverage tracks from a DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, BasenjiConfig, BasenjiForTokenPrediction

>>> config = BasenjiConfig(
...     sequence_length=256, stem_channels=8, conv_tower_channels=[8],
...     stem_pool_size=2, head_hidden_size=8, crop_bins=2, num_labels=4,
...     blocks={"num_blocks": 1, "kernel_size": 3, "bottleneck_size": 4},
... )
>>> model = BasenjiForTokenPrediction(config)
>>> output = model(torch.randint(config.vocab_size, (1, 256)))
>>> output.logits.shape
torch.Size([1, 60, 4])
```

The binned positional axis is treated as the "token" axis: each output position corresponds to one
genomic bin rather than a single nucleotide.
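
To relate an output position back to the input sequence, a minimal sketch is given below. It assumes that bins tile the input window in order and that cropping is symmetric; `bin_to_input_span` is a hypothetical helper, not part of the `multimolecule` API.

```python
def bin_to_input_span(bin_index, bin_size=128, crop_bins=64):
    """Return the half-open [start, end) offsets within the input window
    that one output bin summarizes.

    Assumes bins tile the window left to right and that `crop_bins` bins
    were removed from each side before the output was produced.
    """
    if bin_index < 0:
        raise ValueError("bin_index must be non-negative")
    start = (bin_index + crop_bins) * bin_size
    return start, start + bin_size


# Full-size model: 896 output bins over a 131,072 bp window.
print(bin_to_input_span(0))    # (8192, 8320)
print(bin_to_input_span(895))  # (122752, 122880)

# Toy config from the example above: 4 bp bins, 2 bins cropped per side.
print(bin_to_input_span(0, bin_size=4, crop_bins=2))   # (8, 12)
print(bin_to_input_span(59, bin_size=4, crop_bins=2))  # (244, 248)
```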

## Training Details

Basenji was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from
the human and mouse reference genomes.

### Training Data

The model was trained on a large compendium of functional genomics experiments aligned to the human
(hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each
window the per-128-bp coverage of every experiment served as the regression target.
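
As an illustration of how a per-bin target can be derived from base-resolution coverage, the sketch below sums coverage within each 128 bp bin. Summation is an assumption made for this example; the upstream pipeline may use a different pooling statistic or additional normalization. `bin_coverage` is a hypothetical helper.

```python
def bin_coverage(per_base_coverage, bin_size=128):
    """Aggregate base-resolution coverage into fixed-width bins by summation.

    Summation is an assumption for illustration; it is not confirmed to match
    the upstream preprocessing.
    """
    if len(per_base_coverage) % bin_size != 0:
        raise ValueError("coverage length must be a multiple of bin_size")
    return [
        sum(per_base_coverage[start:start + bin_size])
        for start in range(0, len(per_base_coverage), bin_size)
    ]


# Toy track covering three bins: uniform signal, no signal, half signal.
toy_track = [1.0] * 128 + [0.0] * 128 + [0.5] * 128
targets = bin_coverage(toy_track)
print(targets)  # [128.0, 0.0, 64.0]
```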

### Training Procedure

The model was trained to minimize a Poisson regression loss between predicted and observed coverage.
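
For reference, a Poisson regression loss can be sketched in plain Python as follows. This is a generic formulation that drops the `log(observed!)` term, which does not depend on the prediction; it is not taken from the upstream training code, and the `eps` stabilizer is an assumption.

```python
import math


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood over paired values.

    Omits the log(observed!) term, which is constant with respect to the
    prediction. `eps` guards against log(0) and is an illustrative choice.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    total = 0.0
    for rate, count in zip(predicted, observed):
        total += rate - count * math.log(rate + eps)
    return total / len(predicted)


observed = [2.0, 5.0]
exact = poisson_nll([2.0, 5.0], observed)  # prediction matches the target
off = poisson_nll([1.0, 8.0], observed)    # prediction misses the target
print(exact < off)  # True
```

Each term `rate - count * log(rate)` is minimized when `rate` equals `count`, which is why the matching prediction scores lower.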

## Known Limitations

- This implementation targets the upstream Basenji2 human graph
  (`https://storage.googleapis.com/basenji_barnyard2/model_human.h5`;
  `manuscripts/cross2020/params_human.json`): pre-activation convolution blocks
  (GELU → Conv → BatchNorm), bias-free convolutions, a growing-width convolution tower
  (288 → 339 → 399 → 470 → 554 → 652 → 768), dilated residual blocks on a 768-channel stream with a
  384-channel bottleneck, a `Cropping1D(64)`, a final 1,536-channel pointwise block, and a
  `Dense (1536, 5313)` track head. The converter reads the weights directly as raw `h5py`
  datasets, transposes convolution kernels and `Dense` weights from TensorFlow layout to PyTorch,
  and reorders the first convolution's input channels into MultiMolecule DNA token order.
- **`softplus` placement**: upstream applies `softplus` as the activation of the final `Dense`
  layer. The shared `TokenPredictionHead` computes the unactivated `Dense (1536 → 5313)` projection
  and `BasenjiForTokenPrediction.forward` applies `config.head_act` (`softplus` by default) to the
  head output as the model's output transform.
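
The output transform described in the last item can be illustrated with a standalone `softplus`. This is a generic, numerically stable formulation in plain Python, not the code used by `BasenjiForTokenPrediction`; `apply_output_transform` is a hypothetical helper.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def apply_output_transform(head_outputs):
    """Map unactivated head outputs to non-negative coverage predictions."""
    return [softplus(value) for value in head_outputs]


raw = [-1000.0, -2.0, 0.0, 3.0, 50.0]
coverage = apply_output_transform(raw)

# Every prediction is non-negative, as a coverage value should be.
print(all(value >= 0.0 for value in coverage))  # True
```

Applying `softplus` after the projection keeps predictions non-negative, which a Poisson-style loss on coverage requires.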

## Citation

```bibtex
@article{kelley2018sequential,
  author    = {Kelley, David R. and Reshef, Yakir A. and Bileschi, Maxwell and Belanger, David and McLean, Cory Y. and Snoek, Jasper},
  title     = {Sequential regulatory activity prediction across chromosomes with convolutional neural networks},
  journal   = {Genome Research},
  year      = 2018,
  volume    = 28,
  number    = 5,
  pages     = {739--750},
  doi       = {10.1101/gr.227819.117},
  publisher = {Cold Spring Harbor Laboratory}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Basenji paper](https://doi.org/10.1101/gr.227819.117) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```