| language: dna | |
| tags: | |
| - Biology | |
| - DNA | |
| license: agpl-3.0 | |
| datasets: | |
| - multimolecule/gencode | |
| library_name: multimolecule | |
| # Basenji | |
| Deep convolutional neural network for predicting genomic coverage tracks across chromosomes. | |
| ## Disclaimer | |
| This is an UNOFFICIAL implementation of [Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks](https://doi.org/10.1101/gr.227819.117) by David R. Kelley, Yakir A. Reshef et al. | |
| The OFFICIAL repository of Basenji is at [calico/basenji](https://github.com/calico/basenji). | |
| > [!TIP] | |
| > The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation. | |
| **The team releasing Basenji did not write a model card for this model, so this model card has been written by the MultiMolecule team.** | |
| ## Model Details | |
| Basenji is a deep convolutional neural network trained to predict genomic regulatory activity from long DNA sequences. It consumes a long DNA window (~131 kb), passes it through a convolution + pooling stem that downsamples the sequence, and then through a tower of dilated residual convolutional blocks that expand the receptive field. A pointwise output head predicts a vector of genomic coverage tracks for each output bin. Because the stem downsamples the input, the prediction is **binned**: the output has shape `(batch_size, num_bins, num_tracks)` where each bin summarizes 128 bp of sequence and `num_tracks` is the number of genomic coverage experiments. | |
| ### Model Specification | |
| - Input window: 131,072 bp | |
| - Bin size: 128 bp (`stem_pool_size ** num_pool_layers`, 7 pooling stages) | |
| - Pre-crop bins: 1,024 | |
| - `Cropping1D`: 64 bins per side | |
| - Output bins: 896 | |
| - Stem channels: 288 | |
| - Convolution tower channels (growing): 339 -> 399 -> 470 -> 554 -> 652 -> 768 | |
| - Dilated residual stream: 768 channels with a 384-channel bottleneck | |
| - Dilated residual blocks: 11 (dilation 1, 2, 3, 4, 6, 9, 14, 21, 32, 48, 72) | |
| - Final pointwise block: 1,536 channels | |
| - Activation: tanh-approximation GELU (`gelu_new`); output activation: `softplus` | |
| - Coverage tracks (`num_labels`): 5,313 (default; the human track set released with Basenji2) | |
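The bin arithmetic in the specification above can be checked in a few lines. This is a minimal sketch that uses only the numbers listed there. The variable names are illustrative rather than `BasenjiConfig` fields, and it assumes each of the 7 pooling stages halves the sequence length.

```python
# Values copied from the specification list; names are illustrative only.
input_window = 131_072   # input length in bp
pool_size = 2            # assumption: every pooling stage halves the length
num_pool_stages = 7
crop_bins = 64           # bins removed from each side by Cropping1D

bin_size = pool_size ** num_pool_stages        # bp summarized by one bin
pre_crop_bins = input_window // bin_size       # bins before cropping
output_bins = pre_crop_bins - 2 * crop_bins    # bins after cropping both sides
covered_bp = output_bins * bin_size            # bp of the window that receive predictions

print(bin_size, pre_crop_bins, output_bins, covered_bp)
# 128 1024 896 114688
```

Under these assumptions, only the central 114,688 bp of each 131,072 bp window receive predictions. The cropped flanks provide context only.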
| ### Links | |
| - **Code**: [multimolecule.basenji](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basenji) | |
| - **Weights**: [multimolecule/basenji](https://huggingface.co/multimolecule/basenji) | |
| - **Paper**: [Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks](https://doi.org/10.1101/gr.227819.117) | |
| - **Developed by**: David R. Kelley, Yakir A. Reshef, Maxwell Bileschi, David Belanger, Cory Y. McLean, Jasper Snoek | |
| - **Original Repository**: [calico/basenji](https://github.com/calico/basenji) | |
| ## Usage | |
| The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: | |
| ```bash | |
| pip install multimolecule | |
| ``` | |
| ### Direct Use | |
| You can use this model to predict binned genomic coverage tracks from a DNA sequence: | |
| ```python | |
| >>> import torch | |
| >>> from multimolecule import DnaTokenizer, BasenjiConfig, BasenjiForTokenPrediction | |
| >>> config = BasenjiConfig( | |
| ... sequence_length=256, stem_channels=8, conv_tower_channels=[8], | |
| ... stem_pool_size=2, head_hidden_size=8, crop_bins=2, num_labels=4, | |
| ... blocks={"num_blocks": 1, "kernel_size": 3, "bottleneck_size": 4}, | |
| ... ) | |
| >>> model = BasenjiForTokenPrediction(config) | |
| >>> output = model(torch.randint(config.vocab_size, (1, 256))) | |
| >>> output.logits.shape | |
| torch.Size([1, 60, 4]) | |
| ``` | |
| The binned positional axis is treated as the "token" axis: each output position corresponds to one | |
| genomic bin rather than a single nucleotide. | |
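To relate an output position back to the input sequence, the sketch below maps a bin index to the span of input bases it summarizes. The helper `bin_to_window_span` is hypothetical and not part of `multimolecule`. It assumes the default geometry (128 bp bins, 64 bins cropped per side, 896 output bins) and that bins are aligned to the start of the input window.

```python
def bin_to_window_span(bin_index, bin_size=128, crop_bins=64, output_bins=896):
    """Return the half-open [start, end) span, in bp from the start of the
    input window, that a given output bin summarizes.

    Hypothetical helper: assumes a symmetric crop and bins aligned to the
    first base of the window.
    """
    if not 0 <= bin_index < output_bins:
        raise IndexError(f"bin_index must be in [0, {output_bins})")
    # Shift by the bins removed on the left, then scale to base pairs.
    start = (bin_index + crop_bins) * bin_size
    return start, start + bin_size


first_span = bin_to_window_span(0)
last_span = bin_to_window_span(895)
print(first_span, last_span)
# (8192, 8320) (122752, 122880)
```

To obtain genomic coordinates, add the chromosome position of the window's first base to both ends of the span.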
| ## Training Details | |
| Basenji was trained to predict genomic coverage tracks (DNase-seq, ATAC-seq, ChIP-seq and CAGE) from | |
| the human and mouse reference genomes. | |
| ### Training Data | |
| The model was trained on a large compendium of functional genomics experiments aligned to the human | |
| (hg38) and mouse (mm10) reference genomes. The genome was divided into overlapping windows; for each | |
| window the per-128-bp coverage of every experiment served as the regression target. | |
| ### Training Procedure | |
| The model was trained to minimize a Poisson regression loss between predicted and observed coverage. | |
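As an illustration of that objective, the sketch below computes a mean Poisson negative log-likelihood in plain Python. It is not the original training code: the function name, the `eps` stabilizer, and the toy numbers are assumptions, and the constant `log(observed!)` term is dropped because it does not depend on the prediction.

```python
import math


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood of observed coverage given
    predicted (non-negative) rates, omitting the log(observed!) constant."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    total = 0.0
    for rate, count in zip(predicted, observed):
        # Per-element loss: rate - count * log(rate); eps guards log(0).
        total += rate - count * math.log(rate + eps)
    return total / len(predicted)


observed = [0.0, 2.0, 5.0]                             # toy per-bin coverage
close_loss = poisson_nll([0.1, 2.0, 5.0], observed)    # rates near the targets
far_loss = poisson_nll([3.0, 0.5, 1.0], observed)      # rates far from the targets
print(close_loss < far_loss)
# True
```

In PyTorch, `torch.nn.PoissonNLLLoss(log_input=False)` evaluates the same per-element expression. The `log_input=False` setting fits a model whose `softplus` output is already a rate rather than a log-rate.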
| ## Known Limitations | |
| - This implementation targets the upstream Basenji2 human graph | |
| (`https://storage.googleapis.com/basenji_barnyard2/model_human.h5`; | |
| `manuscripts/cross2020/params_human.json`): pre-activation convolution blocks | |
| (GELU → Conv → BatchNorm), bias-free convolutions, a growing-width convolution tower | |
| (288 → 339 → 399 → 470 → 554 → 652 → 768), dilated residual blocks on a 768-channel stream with a | |
| 384-channel bottleneck, a `Cropping1D(64)`, a final 1,536-channel pointwise block, and a | |
| `Dense (1536, 5313)` track head. The converter reads the weights directly as raw `h5py` | |
| datasets, transposes convolution kernels and `Dense` weights from TensorFlow layout to PyTorch, | |
| and reorders the first convolution's input channels into MultiMolecule DNA token order. | |
| - **`softplus` placement**: upstream applies `softplus` as the activation of the final `Dense` | |
| layer. The shared `TokenPredictionHead` computes the unactivated `Dense (1536 → 5313)` projection | |
| and `BasenjiForTokenPrediction.forward` applies `config.head_act` (`softplus` by default) to the | |
| head output as the model's output transform. | |
| ## Citation | |
| ```bibtex | |
| @article{kelley2018sequential, | |
| author = {Kelley, David R. and Reshef, Yakir A. and Bileschi, Maxwell and Belanger, David and McLean, Cory Y. and Snoek, Jasper}, | |
| title = {Sequential regulatory activity prediction across chromosomes with deep convolutional and recurrent neural networks}, | |
| journal = {Genome Research}, | |
| year = 2018, | |
| volume = 28, | |
| number = 5, | |
| pages = {739--750}, | |
| doi = {10.1101/gr.227819.117}, | |
| publisher = {Cold Spring Harbor Laboratory} | |
| } | |
| ``` | |
| > [!NOTE] | |
| > The artifacts distributed in this repository are part of the MultiMolecule project. | |
| > If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows: | |
| ```bibtex | |
| @software{chen_2024_12638419, | |
| author = {Chen, Zhiyuan and Zhu, Sophia Y.}, | |
| title = {MultiMolecule}, | |
| doi = {10.5281/zenodo.12638419}, | |
| publisher = {Zenodo}, | |
| url = {https://doi.org/10.5281/zenodo.12638419}, | |
| year = 2024, | |
| month = may, | |
| day = 4 | |
| } | |
| ``` | |
| ## Contact | |
| Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. | |
| Please contact the authors of the [Basenji paper](https://doi.org/10.1101/gr.227819.117) for questions or comments on the paper/model. | |
| ## License | |
| This model implementation is licensed under the [GNU Affero General Public License](license.md). | |
| For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). | |
| ```spdx | |
| SPDX-License-Identifier: AGPL-3.0-or-later | |
| ``` | |