---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
datasets:
  - multimolecule/deepstarr
library_name: multimolecule
---

# DeepSTARR

Convolutional neural network for predicting enhancer activity directly from DNA sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers](https://doi.org/10.1038/s41588-022-01048-5) by Bernardo P. de Almeida, Franziska Reiter, et al.

The OFFICIAL repository of DeepSTARR is at [bernardo-de-almeida/DeepSTARR](https://github.com/bernardo-de-almeida/DeepSTARR).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing DeepSTARR did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

DeepSTARR is a convolutional neural network (CNN) trained to quantitatively predict enhancer activity from 249 bp DNA sequences. The model was trained on genome-wide STARR-seq data from _Drosophila melanogaster_ S2 cells and predicts two regression outputs: developmental and housekeeping enhancer activity. The architecture consists of four convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool) followed by two fully-connected layers. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

- Architecture: 4 convolutional layers + 2 fully-connected layers
- Convolution filters: 256, 60, 60, 120
- Convolution kernel sizes: 7, 3, 5, 3
- Max-pool size: 2
- Fully-connected sizes: 256, 256
- Input length: 249 bp
- Number of labels: 2 (developmental and housekeeping enhancer activity, regression)

| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
| --------------- | ------------- | ----------- | ------------------ | --------- | -------- | -------------- |
| 4               | 2             | 256         | 0.62               | 21.03     | 10.26    | 249            |
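
As a rough sanity check, the parameter count in the table can be approximated from the layer sizes listed above. The sketch below is a back-of-the-envelope calculation, not the MultiMolecule implementation. It assumes "same" padding, floor-division pooling, BatchNorm after every convolutional and fully-connected layer, and a hypothetical `in_channels` argument for the number of one-hot input channels (4 for A/C/G/T; the packaged tokenizer may expose more symbols).

```python
# Back-of-the-envelope parameter count for the layer sizes in this card.
# Assumptions (not stated in the card): "same" padding, floor-division
# pooling, BatchNorm (2 trainable values per channel) after every conv
# and fully-connected layer, and `in_channels` one-hot input channels.

CONV_FILTERS = (256, 60, 60, 120)
CONV_KERNELS = (7, 3, 5, 3)
POOL_SIZE = 2
FC_SIZES = (256, 256)
INPUT_LENGTH = 249
NUM_LABELS = 2


def feature_lengths(length=INPUT_LENGTH):
    """Sequence length after each conv block (same padding, then max-pool)."""
    lengths = []
    for _ in CONV_FILTERS:
        length //= POOL_SIZE
        lengths.append(length)
    return lengths


def count_parameters(in_channels=4):
    """Trainable parameters under the assumptions stated above."""
    total = 0
    channels = in_channels
    for filters, kernel in zip(CONV_FILTERS, CONV_KERNELS):
        total += channels * kernel * filters + filters  # conv weights + bias
        total += 2 * filters  # BatchNorm scale and shift
        channels = filters
    features = channels * feature_lengths()[-1]  # flattened conv output
    for size in FC_SIZES:
        total += features * size + size  # dense weights + bias
        total += 2 * size  # BatchNorm scale and shift
        features = size
    total += features * NUM_LABELS + NUM_LABELS  # regression head
    return total


print(feature_lengths())   # [124, 62, 31, 15]
print(count_parameters())  # 622722, i.e. about 0.62 M
```

Under these assumptions the total rounds to 0.62 M, in line with the table. Most of the parameters sit in the first fully-connected layer, which consumes the flattened 120 × 15 feature map.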

### Links

- **Code**: [multimolecule.deepstarr](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/deepstarr)
- **Weights**: [multimolecule/deepstarr](https://huggingface.co/multimolecule/deepstarr)
- **Paper**: [DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers](https://doi.org/10.1038/s41588-022-01048-5)
- **Developed by**: Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark
- **Original Repository**: [bernardo-de-almeida/DeepSTARR](https://github.com/bernardo-de-almeida/DeepSTARR)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Enhancer Activity Prediction

You can use this model directly to predict the developmental and housekeeping enhancer activity of a 249 bp DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, DeepStarrForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
>>> model = DeepStarrForSequencePrediction.from_pretrained("multimolecule/deepstarr")
>>> sequence = "ACGT" * 62 + "A"
>>> output = model(**tokenizer(sequence, return_tensors="pt"))

>>> output.logits.shape
torch.Size([1, 2])
```
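
Because the model expects exactly 249 bp, longer sequences need to be trimmed before tokenization. The helper below is a minimal sketch of one way to do that. `center_window` and the choice of a centered window are hypothetical and not part of `multimolecule`, and accepting `N` as a valid base is an assumption.

```python
# Hypothetical helper, not part of multimolecule: cut or reject sequences
# so that exactly 249 bp reach the tokenizer.

INPUT_LENGTH = 249
ALPHABET = set("ACGTN")  # assumption: N marks unknown bases


def center_window(sequence, length=INPUT_LENGTH):
    """Return the central `length` bases of `sequence`, upper-cased."""
    sequence = sequence.strip().upper()
    unknown = set(sequence) - ALPHABET
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    if len(sequence) < length:
        raise ValueError(f"need at least {length} bp, got {len(sequence)}")
    start = (len(sequence) - length) // 2
    return sequence[start:start + length]


window = center_window("acgt" * 100)  # 400 bp input
print(len(window))  # 249
```

The returned string can be passed to the tokenizer in place of `sequence` in the example above. Sequences shorter than 249 bp are rejected rather than padded, since the card does not say how padding should be handled.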

## Training Details

DeepSTARR was trained to predict quantitative enhancer activity from DNA sequence.

### Training Data

DeepSTARR was trained on genome-wide UMI-STARR-seq data from _Drosophila melanogaster_ S2 cells, measuring enhancer activity under two transcriptional programs: a developmental program (driven by a developmental core promoter) and a housekeeping program (driven by a housekeeping core promoter).

Each training example is a 249 bp genomic sequence with two continuous activity values (developmental and housekeeping, log2 enrichment over input).
Chromosomes were split into training, validation, and test sets to avoid sequence leakage.
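
A chromosome-level split can be sketched as follows. The record layout, the function name `split_by_chromosome`, and the held-out chromosome names are placeholders for illustration; they are not the assignment used by the DeepSTARR authors, and the activity values are made up.

```python
# Illustrative chromosome-level split. The held-out chromosome names are
# placeholders, not the assignment used by the DeepSTARR authors.


def split_by_chromosome(records, valid_chroms, test_chroms):
    """Partition records so no chromosome appears in more than one split."""
    valid_chroms, test_chroms = set(valid_chroms), set(test_chroms)
    if valid_chroms & test_chroms:
        raise ValueError("validation and test chromosomes must be disjoint")
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        if record["chrom"] in test_chroms:
            splits["test"].append(record)
        elif record["chrom"] in valid_chroms:
            splits["validation"].append(record)
        else:
            splits["train"].append(record)
    return splits


# Toy records with made-up log2 activity values (dev = developmental,
# hk = housekeeping); real records would also carry the 249 bp sequence.
records = [
    {"chrom": "chr2L", "dev": 1.2, "hk": -0.3},
    {"chrom": "chr2L", "dev": 0.1, "hk": 0.4},
    {"chrom": "chr3R", "dev": -0.5, "hk": 2.0},
    {"chrom": "chrX", "dev": 0.8, "hk": 0.9},
]

splits = split_by_chromosome(records, valid_chroms={"chr3R"}, test_chroms={"chrX"})
print({name: len(items) for name, items in splits.items()})
# {'train': 2, 'validation': 1, 'test': 1}
```

Splitting on whole chromosomes, rather than on shuffled windows, keeps overlapping or neighbouring windows from landing on both sides of a split.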

### Training Procedure

#### Training

The model was trained to minimize a mean-squared-error loss between predicted and measured enhancer activities.

- Optimizer: Adam
- Learning rate: 2e-3
- Loss: Mean Squared Error
- Input length: 249 bp
- Early stopping on validation loss
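
The loss and the stopping rule listed above can be illustrated in plain Python. This is a sketch under stated assumptions, not the authors' training code: `mse_loss` averages over both outputs jointly, and the `EarlyStopping` class and its `patience` value are hypothetical, since this card does not give the patience used.

```python
def mse_loss(predictions, targets):
    """Mean squared error over all samples and both outputs."""
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(predictions, targets)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(squared) / len(squared)


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):  # patience value is an assumption
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, valid_loss):
        """Record one epoch; return True when training should stop."""
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Two samples, each with (developmental, housekeeping) values.
loss = mse_loss([[1.0, 0.0], [2.0, 1.0]], [[0.0, 0.0], [2.0, 3.0]])
print(loss)  # 1.25

# Made-up validation losses: training stops once two epochs pass
# without beating the best value seen so far.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, valid_loss in enumerate([0.9, 0.8, 0.85, 0.81, 0.7]):
    if stopper.step(valid_loss):
        stopped_at = epoch
        break
print(stopped_at)  # 3
```

In a real training loop the weights from the best epoch would typically be restored after stopping; that step is omitted here.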

## Citation

```bibtex
@article{deAlmeida2022deepstarr,
  author    = {de Almeida, Bernardo P. and Reiter, Franziska and Pagani, Michaela and Stark, Alexander},
  journal   = {Nature Genetics},
  month     = may,
  number    = 5,
  pages     = {613--624},
  publisher = {Springer Science and Business Media LLC},
  title     = {{DeepSTARR} predicts enhancer activity from {DNA} sequence and enables the de novo design of synthetic enhancers},
  volume    = 54,
  year      = 2022
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please open a GitHub issue on [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [DeepSTARR paper](https://doi.org/10.1038/s41588-022-01048-5) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```