---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# Basenji

Deep convolutional neural network for predicting genomic coverage tracks across chromosomes.

## Disclaimer

This is an UNOFFICIAL implementation of [Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks](https://doi.org/10.1101/gr.227819.117) by David R. Kelley, Yakir A. Reshef et al.

The OFFICIAL repository of Basenji is at [calico/basenji](https://github.com/calico/basenji).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing Basenji did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

Basenji is a deep convolutional neural network trained to predict genomic regulatory activity from long DNA sequences. It consumes a long DNA window (~131 kb), passes it through a convolution + pooling stem that downsamples the sequence, and then through a tower of dilated residual convolutional blocks that expand the receptive field. A pointwise output head predicts a vector of genomic coverage tracks for each output bin.

Because the stem downsamples the input, the prediction is **binned**: the output has shape `(batch_size, num_bins, num_tracks)`, where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments.

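The binning arithmetic can be sketched in a few lines of plain Python. This is an illustrative sketch, not part of the `multimolecule` API: the helper names `bin_size`, `output_bins`, and `bin_to_input_span` are hypothetical, and the sketch assumes that every pooling stage uses the same pool size and that cropping is symmetric, using the window, pooling, and cropping figures listed under Model Specification.

```python
from typing import Tuple


def bin_size(pool_size: int, num_pool_stages: int) -> int:
    """Base pairs summarized by one bin after repeated pooling (hypothetical helper)."""
    return pool_size ** num_pool_stages


def output_bins(sequence_length: int, pool_size: int, num_pool_stages: int, crop_bins: int) -> int:
    """Number of bins left after pooling and symmetric cropping (hypothetical helper)."""
    pre_crop = sequence_length // bin_size(pool_size, num_pool_stages)
    return pre_crop - 2 * crop_bins


def bin_to_input_span(index: int, pool_size: int, num_pool_stages: int, crop_bins: int) -> Tuple[int, int]:
    """Half-open [start, end) span of input positions covered by output bin `index`.

    Assumes the same number of bins is cropped from each side.
    """
    size = bin_size(pool_size, num_pool_stages)
    start = (index + crop_bins) * size
    return start, start + size


# Full-size model: 131,072 bp window, 7 pooling stages of size 2, 64 bins cropped per side.
print(bin_size(2, 7))                  # 128
print(output_bins(131_072, 2, 7, 64))  # 896
print(bin_to_input_span(0, 2, 7, 64))  # (8192, 8320)
```

Under these assumptions, the first output bin starts 8,192 bp into the input window, so predictions cover only the central 114,688 bp of the 131,072 bp input.
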
### Model Specification

- Input window: 131,072 bp
- Bin size: 128 bp (`stem_pool_size ** num_pool_layers`, 7 pooling stages)
- Pre-crop bins: 1,024
- `Cropping1D`: 64 bins per side
- Output bins: 896
- Stem channels: 288
- Convolution tower channels (growing): 339 -> 399 -> 470 -> 554 -> 652 -> 768
- Dilated residual stream: 768 channels with a 384-channel bottleneck
- Dilated residual blocks: 11 (dilation 1, 2, 3, 4, 6, 9, 14, 21, 32, 48, 72)
- Final pointwise block: 1,536 channels
- Activation: tanh-approximation GELU (`gelu_new`); output activation: `softplus`
- Coverage tracks (`num_labels`): 5,313 (default; the human track set released with Basenji2)

### Links

- **Code**: [multimolecule.basenji](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basenji)
- **Weights**: [multimolecule/basenji](https://huggingface.co/multimolecule/basenji)
- **Paper**: [Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks](https://doi.org/10.1101/gr.227819.117)
- **Developed by**: David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, Jasper Snoek
- **Original Repository**: [calico/basenji](https://github.com/calico/basenji)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model to predict binned genomic coverage tracks from a DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, BasenjiConfig, BasenjiForTokenPrediction

>>> config = BasenjiConfig(
...     sequence_length=256, stem_channels=8, conv_tower_channels=[8],
...     stem_pool_size=2, head_hidden_size=8, crop_bins=2, num_labels=4,
...     blocks={"num_blocks": 1, "kernel_size": 3, "bottleneck_size": 4},
... )
>>> model = BasenjiForTokenPrediction(config)
>>> output = model(torch.randint(config.vocab_size, (1, 256)))
>>> output.logits.shape
torch.Size([1, 60, 4])
```

The binned positional axis is treated as the "token" axis: each output position corresponds to one genomic bin rather than a single nucleotide.

## Training Details

Basenji was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from the human and mouse reference genomes.

### Training Data

The model was trained on a large compendium of functional genomics experiments aligned to the human (hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each window the per-128-bp coverage of every experiment served as the regression target.

### Training Procedure

The model was trained to minimize a Poisson regression loss between predicted and observed coverage.

## Known Limitations

- This implementation targets the upstream Basenji2 human graph (`https://storage.googleapis.com/basenji_barnyard2/model_human.h5`; `manuscripts/cross2020/params_human.json`): pre-activation convolution blocks (GELU → Conv → BatchNorm), bias-free convolutions, a growing-width convolution tower (288 → 339 → 399 → 470 → 554 → 652 → 768), dilated residual blocks on a 768-channel stream with a 384-channel bottleneck, a `Cropping1D(64)`, a final 1,536-channel pointwise block, and a `Dense (1536, 5313)` track head. The converter reads the weights directly as raw `h5py` datasets, transposes convolution kernels and `Dense` weights from TensorFlow layout to PyTorch, and reorders the first convolution's input channels into MultiMolecule DNA token order.
- **`softplus` placement**: upstream applies `softplus` as the activation of the final `Dense` layer.
  The shared `TokenPredictionHead` computes the unactivated `Dense (1536 → 5313)` projection and `BasenjiForTokenPrediction.forward` applies `config.head_act` (`softplus` by default) to the head output as the model's output transform.

## Citation

```bibtex
@article{kelley2018sequential,
  author    = {Kelley, David R. and Reshef, Yakir A. and Bileschi, Maxwell and Belanger, David and McLean, Cory Y. and Snoek, Jasper},
  title     = {Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks},
  journal   = {Genome Research},
  year      = 2018,
  volume    = 28,
  number    = 5,
  pages     = {739--750},
  doi       = {10.1101/gr.227819.117},
  publisher = {Cold Spring Harbor Laboratory}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Basenji paper](https://doi.org/10.1101/gr.227819.117) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```