---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
datasets:
  - multimolecule/encode
library_name: multimolecule
---

# Basset

Deep convolutional neural network for predicting chromatin accessibility (DNase I hypersensitivity) from DNA sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks](https://doi.org/10.1101/gr.200535.115) by David R. Kelley et al.

The OFFICIAL repository of Basset is at [davek44/Basset](https://github.com/davek44/Basset).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing Basset did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

Basset is a convolutional neural network (CNN) trained to predict the chromatin accessibility (DNase I hypersensitivity) of a DNA sequence across 164 cell types. The model consumes a fixed-length 600 bp one-hot encoded DNA sequence and applies three convolutional blocks (convolution, batch normalization, ReLU, and max pooling) followed by two fully-connected blocks before a multi-label binary classification head. Please refer to the [Training Details](#training-details) section for more information on the training process.
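
As a rough illustration of the input representation, the sketch below one-hot encodes a DNA string into four channels. The channel order (A, C, G, T), the all-zero column for other characters such as `N`, and the helper name `one_hot_dna` are assumptions made for illustration; the MultiMolecule tokenizer and model may represent the input differently.

```python
# Illustrative only: channel order and handling of non-ACGT characters are
# assumptions, not the encoding used by the MultiMolecule implementation.
CHANNELS = "ACGT"


def one_hot_dna(sequence: str) -> list[list[int]]:
    """Return a 4 x len(sequence) matrix with one row per nucleotide channel."""
    sequence = sequence.upper()
    return [[1 if base == channel else 0 for base in sequence] for channel in CHANNELS]


encoded = one_hot_dna("ACGT" * 150)  # 600 bp, the model's fixed input length
print(len(encoded), len(encoded[0]))  # 4 600
```

Each column holds a single 1 for A, C, G, or T, so a 600 bp sequence becomes a 4 x 600 matrix that a one-dimensional convolution can scan along its length.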

### Model Specification

- Architecture: 3 convolutional layers + 2 fully-connected layers
- Convolution filters: 300, 200, 200
- Convolution kernel sizes: 19, 11, 7
- Max-pool sizes: 3, 4, 4
- Fully-connected sizes: 1000, 1000
- Input length: 600 bp
- Number of labels: 164 (DNase I hypersensitivity, multi-label binary)

| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| --------------- | ------------- | ----------- | ------------------ | --------- | -------- | -------------- |
| 3               | 2             | 1000        | 4.14               | 0.30      | 0.15     | 600            |
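
The figures in the table can be roughly reproduced from the hyperparameters listed above. The sketch below is back-of-the-envelope arithmetic, not code from the library. This card does not state the padding scheme or where batch normalization is applied, so the sketch assumes four input channels, unpadded stride-1 convolutions, non-overlapping pooling with floor division, and batch normalization (one scale and one shift per unit) after every convolutional and hidden fully-connected layer.

```python
# Hyperparameters from the specification above.
INPUT_LENGTH = 600
INPUT_CHANNELS = 4  # assumption: one channel per nucleotide (A, C, G, T)
CONV_FILTERS = [300, 200, 200]
KERNEL_SIZES = [19, 11, 7]
POOL_SIZES = [3, 4, 4]
FC_SIZES = [1000, 1000]
NUM_LABELS = 164

length, channels = INPUT_LENGTH, INPUT_CHANNELS
params = macs = 0
pooled_lengths = []

for filters, kernel, pool in zip(CONV_FILTERS, KERNEL_SIZES, POOL_SIZES):
    conv_length = length - kernel + 1  # assumption: no padding, stride 1
    weights = channels * kernel * filters
    params += weights + filters        # convolution weights + biases
    params += 2 * filters              # assumption: batch-norm scale + shift
    macs += conv_length * weights      # one multiply-accumulate per weight per position
    length = conv_length // pool       # assumption: non-overlapping pooling, floor
    channels = filters
    pooled_lengths.append(length)

features = length * channels           # flattened input to the first FC layer
for width in FC_SIZES:
    params += features * width + width  # weights + biases
    params += 2 * width                 # assumption: batch-norm scale + shift
    macs += features * width
    features = width

params += features * NUM_LABELS + NUM_LABELS  # classification head
macs += features * NUM_LABELS

print(pooled_lengths)          # [194, 46, 10]
print(round(params / 1e6, 2))  # 4.14
print(round(macs / 1e9, 2))    # 0.15
```

Under these assumptions the totals round to the parameter and MAC counts in the table, and counting each multiply-accumulate as two floating-point operations gives the 0.30 G FLOPs figure. The agreement is suggestive but does not confirm the exact layer layout of the implementation.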

### Links

- **Code**: [multimolecule.basset](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basset)
- **Weights**: [multimolecule/basset](https://huggingface.co/multimolecule/basset)
- **Paper**: [Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks](https://doi.org/10.1101/gr.200535.115)
- **Developed by**: David R. Kelley, Jasper Snoek, John L. Rinn
- **Original Repository**: [davek44/Basset](https://github.com/davek44/Basset)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Chromatin Accessibility Prediction

You can use this model directly to predict the DNase I hypersensitivity of a DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, BassetForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/basset")
>>> model = BassetForSequencePrediction.from_pretrained("multimolecule/basset")
>>> # "ACGT" * 150 is a 600 bp sequence, matching the model's fixed input length
>>> inputs = tokenizer("ACGT" * 150, return_tensors="pt")
>>> output = model(**inputs)

>>> output.logits.shape
torch.Size([1, 164])
```
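
The logits are raw scores, one per cell type. Assuming they feed independent binary outputs, as the multi-label head suggests, a sigmoid turns each score into an accessibility probability. The sketch below shows the arithmetic in plain Python on made-up logits; the 0.5 threshold is a common default chosen for illustration, not a value from the paper.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical logits for three of the 164 labels (not real model output).
logits = [-2.0, 0.0, 3.0]
probabilities = [sigmoid(z) for z in logits]

THRESHOLD = 0.5  # assumption: illustrative cut-off, tune per application
accessible = [p >= THRESHOLD for p in probabilities]

print([round(p, 3) for p in probabilities])  # [0.119, 0.5, 0.953]
print(accessible)                            # [False, True, True]
```

On the tensor returned by the model, `torch.sigmoid(output.logits)` applies the same transformation to all 164 labels at once.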

## Training Details

Basset was trained to predict, from sequence alone, whether a DNA interval is accessible in each of 164 cell types.

### Training Data

Basset was trained on DNase I hypersensitivity peaks from [ENCODE](https://www.encodeproject.org) and the [Roadmap Epigenomics](https://www.roadmapepigenomics.org) project, covering 164 cell types.
Each 600 bp genomic interval is labeled with a binary vector indicating which of the 164 cell types show an accessibility peak overlapping that interval.
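
The labeling rule can be sketched as an interval-overlap test. The peak coordinates, cell-type names, and helper functions below are hypothetical, and the sketch treats any overlap of at least one base as a positive label; the original preprocessing pipeline may merge peaks or apply a different overlap criterion.

```python
# Hypothetical peak calls: cell type -> list of (start, end) half-open intervals
# on one chromosome. Names and coordinates are made up for illustration.
peaks = {
    "cell_a": [(100, 400), (2000, 2300)],
    "cell_b": [(650, 900)],
    "cell_c": [],
}


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def label_interval(interval: tuple[int, int]) -> list[int]:
    """Binary vector with one entry per cell type, in dictionary order."""
    return [
        int(any(overlaps(interval, peak) for peak in cell_peaks))
        for cell_peaks in peaks.values()
    ]


print(label_interval((0, 600)))     # [1, 0, 0]
print(label_interval((600, 1200)))  # [0, 1, 0]
```

With 164 cell types instead of three, the same procedure yields the 164-dimensional binary target for each 600 bp interval.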

### Training Procedure

#### Training

The model was trained to minimize a multi-label binary cross-entropy loss, comparing its predicted per-cell-type accessibility probabilities against the observed DNase I hypersensitivity labels.

- Optimizer: RMSprop
- Loss: Multi-label binary cross-entropy
- Regularization: Batch normalization and dropout
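
To make the loss concrete, the sketch below computes a multi-label binary cross-entropy from raw logits in plain Python, using a numerically stable form. Averaging over labels is an assumption made for illustration; this card does not state which reduction was used in training, and the logits and targets are made up.

```python
import math


def bce_with_logits(logits: list[float], targets: list[int]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical logits and binary labels for three cell types.
loss = bce_with_logits([-2.0, 0.0, 3.0], [0, 1, 1])
print(f"{loss:.4f}")  # 0.2896
```

Each label contributes independently, so a sequence can be penalized for missing accessibility in one cell type while being rewarded for a correct call in another. In PyTorch, `torch.nn.BCEWithLogitsLoss` computes the same quantity on tensors.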

## Citation

```bibtex
@article{kelley2016basset,
  author    = {Kelley, David R. and Snoek, Jasper and Rinn, John L.},
  title     = {Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks},
  journal   = {Genome Research},
  volume    = 26,
  number    = 7,
  pages     = {990--999},
  year      = 2016,
  publisher = {Cold Spring Harbor Laboratory Press},
  doi       = {10.1101/gr.200535.115}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Basset paper](https://doi.org/10.1101/gr.200535.115) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```