---
language: dna
tags:
- Biology
- DNA
license: agpl-3.0
datasets:
- multimolecule/deepstarr
library_name: multimolecule
---

# DeepSTARR

Convolutional neural network for predicting enhancer activity directly from DNA sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers](https://doi.org/10.1038/s41588-022-01048-5) by Bernardo P. de Almeida, Franziska Reiter, et al.

The OFFICIAL repository of DeepSTARR is at [bernardo-de-almeida/DeepSTARR](https://github.com/bernardo-de-almeida/DeepSTARR).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing DeepSTARR did not write a model card for this model, so this model card was written by the MultiMolecule team.**

## Model Details

DeepSTARR is a convolutional neural network (CNN) trained to quantitatively predict enhancer activity from 249 bp DNA sequences. The model was trained on genome-wide STARR-seq data from _Drosophila melanogaster_ S2 cells and predicts two regression outputs: developmental and housekeeping enhancer activity. The architecture consists of four convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool) followed by two fully-connected layers. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

- Architecture: 4 convolutional layers + 2 fully-connected layers
- Convolution filters: 256, 60, 60, 120
- Convolution kernel sizes: 7, 3, 5, 3
- Max-pool size: 2
- Fully-connected sizes: 256, 256
- Input length: 249 bp
- Number of labels: 2 (developmental and housekeeping enhancer activity, regression)

| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens |
| --------------- | ------------- | ----------- | ------------------ | --------- | -------- | -------------- |
| 4               | 2             | 256         | 0.62               | 21.03     | 10.26    | 249            |

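As a sanity check on the specification, the parameter count can be worked out by hand. The sketch below assumes a 4-channel one-hot input, length-preserving ("same") convolution padding, BatchNorm only after the convolutions, and a single 2-unit output layer. None of these details are stated in this card, so treat the result as approximate.

```python
# Hyperparameters copied from the specification list.
FILTERS = (256, 60, 60, 120)
KERNELS = (7, 3, 5, 3)
FC_SIZES = (256, 256)
POOL = 2
INPUT_LENGTH = 249
NUM_LABELS = 2
IN_CHANNELS = 4  # assumption: one-hot encoding of A, C, G, T


def count_parameters():
    """Return (flattened feature size, trainable parameter count)."""
    length, channels, total = INPUT_LENGTH, IN_CHANNELS, 0
    for filters, kernel in zip(FILTERS, KERNELS):
        total += channels * kernel * filters + filters  # Conv1D weights + biases
        total += 2 * filters  # BatchNorm scale and shift
        length //= POOL  # assumed "same" padding keeps length; pooling halves it
        channels = filters
    features = length * channels  # size of the flattened convolution output
    width = features
    for size in FC_SIZES:
        total += width * size + size  # fully-connected weights + biases
        width = size
    total += width * NUM_LABELS + NUM_LABELS  # assumed single regression head
    return features, total


features, total = count_parameters()
print(features, total, round(total / 1e6, 2))  # 1800 621698 0.62
```

Under these assumptions the count comes to roughly 0.62 M, in line with the table; the released implementation may differ in small details.
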
### Links

- **Code**: [multimolecule.deepstarr](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/deepstarr)
- **Weights**: [multimolecule/deepstarr](https://huggingface.co/multimolecule/deepstarr)
- **Paper**: [DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers](https://doi.org/10.1038/s41588-022-01048-5)
- **Developed by**: Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark
- **Original Repository**: [bernardo-de-almeida/DeepSTARR](https://github.com/bernardo-de-almeida/DeepSTARR)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Enhancer Activity Prediction

You can use this model directly to predict the developmental and housekeeping enhancer activity of a 249 bp DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, DeepStarrForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepstarr")
>>> model = DeepStarrForSequencePrediction.from_pretrained("multimolecule/deepstarr")
>>> sequence = "ACGT" * 62 + "A"  # 249 bp example input
>>> output = model(**tokenizer(sequence, return_tensors="pt"))

>>> output.logits.shape  # one row per sequence, one column per activity
torch.Size([1, 2])
```

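Because the model takes 249 bp inputs, scoring a longer region usually means cutting it into windows first. The helper below is a minimal sketch of one way to do that. `tile_windows` and its `stride` parameter are hypothetical names, not part of `multimolecule`, and how the tokenizer treats sequences of other lengths is not covered here.

```python
def tile_windows(sequence, window=249, stride=249):
    """Yield (start, subsequence) for every full-length window in `sequence`.

    Hypothetical helper: trailing bases that do not fill a whole window
    are skipped rather than padded.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]


region = "ACGT" * 200  # toy 800 bp region, not real genomic sequence
windows = list(tile_windows(region, stride=100))
print(len(windows), windows[-1][0])  # 6 500
```

Each yielded subsequence can then be passed to the tokenizer and model in the same way as the single sequence in the example above.
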
## Training Details

DeepSTARR was trained to predict quantitative enhancer activity from DNA sequence.

### Training Data

DeepSTARR was trained on genome-wide UMI-STARR-seq data from _Drosophila melanogaster_ S2 cells, measuring enhancer activity under two transcriptional programs: a developmental program (driven by a developmental core promoter) and a housekeeping program (driven by a housekeeping core promoter).

Each training example is a 249 bp genomic sequence with two continuous activity values (developmental and housekeeping, each a log2 enrichment over input).
Chromosomes were split into training, validation, and test sets to avoid sequence leakage.

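A chromosome-level split of this kind can be sketched as follows. The record layout (`chrom`, `dev`, `hk`), the `split_by_chromosome` helper, and the hold-out assignment are all hypothetical: this card does not say which chromosomes went to which split, and the activity values are toy numbers.

```python
from collections import defaultdict

# Hypothetical hold-out assignment, for illustration only.
HELD_OUT = {"chr3R": "validation", "chrX": "test"}


def split_by_chromosome(examples, held_out=HELD_OUT):
    """Group examples into splits so that no chromosome spans two splits."""
    splits = defaultdict(list)
    for example in examples:
        # Chromosomes not listed in `held_out` go to the training set.
        splits[held_out.get(example["chrom"], "train")].append(example)
    return dict(splits)


# Toy records, not real measurements.
examples = [
    {"chrom": "chr2L", "dev": 1.2, "hk": -0.3},
    {"chrom": "chr3R", "dev": 0.4, "hk": 2.1},
    {"chrom": "chrX", "dev": -0.8, "hk": 0.5},
    {"chrom": "chr2L", "dev": 3.0, "hk": 0.1},
]
splits = split_by_chromosome(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting by genomic location rather than at random keeps overlapping or neighboring windows from landing in both the training and evaluation sets.
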
### Training Procedure

#### Training

The model was trained to minimize the mean squared error (MSE) between predicted and measured enhancer activities.

- Optimizer: Adam
- Learning rate: 2e-3
- Loss: mean squared error
- Input length: 249 bp
- Early stopping on validation loss

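The loss and stopping rule listed above can be illustrated in a few lines of plain Python. This is a sketch rather than the training code: `mean_squared_error` and `EarlyStopping` are hypothetical names, and the patience value is an assumption because this card does not state one.

```python
def mean_squared_error(predicted, measured):
    """Average squared difference over all (example, output) pairs."""
    squared = [
        (p - m) ** 2
        for p_row, m_row in zip(predicted, measured)
        for p, m in zip(p_row, m_row)
    ]
    return sum(squared) / len(squared)


class EarlyStopping:
    """Stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):  # assumed value, not taken from this card
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, validation_loss):
        """Record one epoch and return True when training should stop."""
        if validation_loss < self.best:
            self.best = validation_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# One example with (developmental, housekeeping) predictions and targets.
print(mean_squared_error([[1.0, 2.0]], [[0.0, 4.0]]))  # 2.5

stopper = EarlyStopping(patience=2)
losses = [0.9, 0.7, 0.8, 0.75, 0.6]  # toy validation losses
stopped_at = next((i for i, loss in enumerate(losses) if stopper.step(loss)), None)
print(stopped_at)  # 3
```

In the toy run, the loss fails to beat its best value of 0.7 for two consecutive epochs, so the loop stops at epoch index 3 and never sees the later improvement; a larger patience trades longer training for tolerance of such plateaus.
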
## Citation

```bibtex
@article{deAlmeida2022deepstarr,
  author    = {de Almeida, Bernardo P. and Reiter, Franziska and Pagani, Michaela and Stark, Alexander},
  journal   = {Nature Genetics},
  month     = may,
  number    = 5,
  pages     = {613--624},
  publisher = {Springer Science and Business Media LLC},
  title     = {{DeepSTARR} predicts enhancer activity from {DNA} sequence and enables the de novo design of synthetic enhancers},
  volume    = 54,
  year      = 2022
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [DeepSTARR paper](https://doi.org/10.1038/s41588-022-01048-5) for questions or comments on the paper or the model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```