Instructions to use multimolecule/basset with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MultiMolecule
How to use multimolecule/basset with MultiMolecule:
pip install multimolecule
from multimolecule import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("multimolecule/basset") model = AutoModel.from_pretrained("multimolecule/basset") - Notebooks
- Google Colab
- Kaggle
| language: dna | |
| tags: | |
| - Biology | |
| - DNA | |
| license: agpl-3.0 | |
| datasets: | |
| - multimolecule/encode | |
| library_name: multimolecule | |
| # Basset | |
| Deep convolutional neural network for predicting chromatin accessibility (DNase I hypersensitivity) from DNA sequence. | |
| ## Disclaimer | |
| This is an UNOFFICIAL implementation of [Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks](https://doi.org/10.1101/gr.200535.115) by David R. Kelley et al. | |
| The OFFICIAL repository of Basset is at [davek44/Basset](https://github.com/davek44/Basset). | |
| > [!TIP] | |
| > The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation. | |
| **The team releasing Basset did not write this model card for this model so this model card has been written by the MultiMolecule team.** | |
| ## Model Details | |
| Basset is a convolutional neural network (CNN) trained to predict the chromatin accessibility (DNase I hypersensitivity) of a DNA sequence across 164 cell types. The model consumes a fixed-length 600 bp one-hot encoded DNA sequence and applies three convolutional blocks (convolution, batch normalization, ReLU, and max pooling) followed by two fully-connected blocks before a multi-label binary classification head. Please refer to the [Training Details](#training-details) section for more information on the training process. | |
| ### Model Specification | |
| - Architecture: 3 convolutional layers + 2 fully-connected layers | |
| - Convolution filters: 300, 200, 200 | |
| - Convolution kernel sizes: 19, 11, 7 | |
| - Max-pool sizes: 3, 4, 4 | |
| - Fully-connected sizes: 1000, 1000 | |
| - Input length: 600 bp | |
| - Number of labels: 164 (DNase I hypersensitivity, multi-label binary) | |
| | Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens | | |
| | --------------- | ------------- | ----------- | ------------------ | --------- | -------- | -------------- | | |
| | 3 | 2 | 1000 | 4.14 | 0.30 | 0.15 | 600 | | |
| ### Links | |
| - **Code**: [multimolecule.basset](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basset) | |
| - **Weights**: [multimolecule/basset](https://huggingface.co/multimolecule/basset) | |
| - **Paper**: [Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks](https://doi.org/10.1101/gr.200535.115) | |
| - **Developed by**: David R. Kelley, Jasper Snoek, John L. Rinn | |
| - **Original Repository**: [davek44/Basset](https://github.com/davek44/Basset) | |
| ## Usage | |
| The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: | |
| ```bash | |
| pip install multimolecule | |
| ``` | |
| ### Direct Use | |
| #### Chromatin Accessibility Prediction | |
| You can use this model directly to predict the DNase I hypersensitivity of a DNA sequence: | |
| ```python | |
| >>> import torch | |
| >>> from multimolecule import DnaTokenizer, BassetForSequencePrediction | |
| >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/basset") | |
| >>> model = BassetForSequencePrediction.from_pretrained("multimolecule/basset") | |
| >>> input = tokenizer("ACGT" * 150, return_tensors="pt") | |
| >>> output = model(**input) | |
| >>> output.logits.shape | |
| torch.Size([1, 164]) | |
| ``` | |
| ## Training Details | |
| Basset was trained to predict the chromatin accessibility of DNA sequences across a panel of cell types. | |
| ### Training Data | |
| Basset was trained on DNase I hypersensitivity peaks from [ENCODE](https://www.encodeproject.org) and the [Roadmap Epigenomics](https://www.roadmapepigenomics.org) project, covering 164 cell types. | |
| Each 600 bp genomic interval is labeled with a binary vector indicating which of the 164 cell types show an accessibility peak overlapping that interval. | |
| ### Training Procedure | |
| #### Pre-training | |
| The model was trained to minimize a multi-label binary cross-entropy loss, comparing its predicted per-cell-type accessibility probabilities against the observed DNase I hypersensitivity labels. | |
| - Optimizer: RMSprop | |
| - Loss: Multi-label binary cross-entropy | |
| - Regularization: Batch normalization and dropout | |
| ## Citation | |
| ```bibtex | |
| @article{kelley2016basset, | |
| author = {Kelley, David R. and Snoek, Jasper and Rinn, John L.}, | |
| title = {Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks}, | |
| journal = {Genome Research}, | |
| volume = 26, | |
| number = 7, | |
| pages = {990--999}, | |
| year = 2016, | |
| publisher = {Cold Spring Harbor Laboratory Press}, | |
| doi = {10.1101/gr.200535.115} | |
| } | |
| ``` | |
| > [!NOTE] | |
| > The artifacts distributed in this repository are part of the MultiMolecule project. | |
| > If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows: | |
| ```bibtex | |
| @software{chen_2024_12638419, | |
| author = {Chen, Zhiyuan and Zhu, Sophia Y.}, | |
| title = {MultiMolecule}, | |
| doi = {10.5281/zenodo.12638419}, | |
| publisher = {Zenodo}, | |
| url = {https://doi.org/10.5281/zenodo.12638419}, | |
| year = 2024, | |
| month = may, | |
| day = 4 | |
| } | |
| ``` | |
| ## Contact | |
| Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. | |
| Please contact the authors of the [Basset paper](https://doi.org/10.1101/gr.200535.115) for questions or comments on the paper/model. | |
| ## License | |
| This model implementation is licensed under the [GNU Affero General Public License](license.md). | |
| For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). | |
| ```spdx | |
| SPDX-License-Identifier: AGPL-3.0-or-later | |
| ``` | |