File size: 5,668 Bytes
e93b3bf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
datasets:
  - multimolecule/encode
library_name: multimolecule
---

# Basset

Deep convolutional neural network for predicting chromatin accessibility (DNase I hypersensitivity) from DNA sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks](https://doi.org/10.1101/gr.200535.115) by David R. Kelley et al.

The OFFICIAL repository of Basset is at [davek44/Basset](https://github.com/davek44/Basset).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing Basset did not write this model card for this model so this model card has been written by the MultiMolecule team.**

## Model Details

Basset is a convolutional neural network (CNN) trained to predict the chromatin accessibility (DNase I hypersensitivity) of a DNA sequence across 164 cell types. The model consumes a fixed-length 600 bp one-hot encoded DNA sequence and applies three convolutional blocks (convolution, batch normalization, ReLU, and max pooling) followed by two fully-connected blocks before a multi-label binary classification head. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

- Architecture: 3 convolutional layers + 2 fully-connected layers
- Convolution filters: 300, 200, 200
- Convolution kernel sizes: 19, 11, 7
- Max-pool sizes: 3, 4, 4
- Fully-connected sizes: 1000, 1000
- Input length: 600 bp
- Number of labels: 164 (DNase I hypersensitivity, multi-label binary)

| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| --------------- | ------------- | ----------- | ------------------ | --------- | -------- | -------------- |
| 3               | 2             | 1000        | 4.14               | 0.30      | 0.15     | 600            |

### Links

- **Code**: [multimolecule.basset](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/basset)
- **Weights**: [multimolecule/basset](https://huggingface.co/multimolecule/basset)
- **Paper**: [Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks](https://doi.org/10.1101/gr.200535.115)
- **Developed by**: David R. Kelley, Jasper Snoek, John L. Rinn
- **Original Repository**: [davek44/Basset](https://github.com/davek44/Basset)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Chromatin Accessibility Prediction

You can use this model directly to predict the DNase I hypersensitivity of a DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, BassetForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/basset")
>>> model = BassetForSequencePrediction.from_pretrained("multimolecule/basset")
>>> input = tokenizer("ACGT" * 150, return_tensors="pt")
>>> output = model(**input)

>>> output.logits.shape
torch.Size([1, 164])
```

## Training Details

Basset was trained to predict the chromatin accessibility of DNA sequences across a panel of cell types.

### Training Data

Basset was trained on DNase I hypersensitivity peaks from [ENCODE](https://www.encodeproject.org) and the [Roadmap Epigenomics](https://www.roadmapepigenomics.org) project, covering 164 cell types.
Each 600 bp genomic interval is labeled with a binary vector indicating which of the 164 cell types show an accessibility peak overlapping that interval.

### Training Procedure

#### Pre-training

The model was trained to minimize a multi-label binary cross-entropy loss, comparing its predicted per-cell-type accessibility probabilities against the observed DNase I hypersensitivity labels.

- Optimizer: RMSprop
- Loss: Multi-label binary cross-entropy
- Regularization: Batch normalization and dropout

## Citation

```bibtex
@article{kelley2016basset,
  author    = {Kelley, David R. and Snoek, Jasper and Rinn, John L.},
  title     = {Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks},
  journal   = {Genome Research},
  volume    = 26,
  number    = 7,
  pages     = {990--999},
  year      = 2016,
  publisher = {Cold Spring Harbor Laboratory Press},
  doi       = {10.1101/gr.200535.115}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Basset paper](https://doi.org/10.1101/gr.200535.115) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```