| language: dna | |
| library_name: multimolecule | |
| license: agpl-3.0 | |
| pipeline: regulatory-variant-effect | |
| pipeline_tag: other | |
| tags: | |
| - Biology | |
| - DNA | |
| widget: | |
| - example_title: tumor protein p53 | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: DNA | |
| task: regulatory-variant-effect | |
| text: ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG | |
| - example_title: BRCA1 DNA repair associated | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: DNA | |
| task: regulatory-variant-effect | |
| text: TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG | |
| - example_title: hemoglobin subunit beta | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: DNA | |
| task: regulatory-variant-effect | |
| text: CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG | |
| - example_title: CF transmembrane conductance regulator | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: DNA | |
| task: regulatory-variant-effect | |
| text: ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG | |
| - example_title: telomerase reverse transcriptase | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: DNA | |
| task: regulatory-variant-effect | |
| text: CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG | |
| - example_title: KRAS proto-oncogene | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: DNA | |
| task: regulatory-variant-effect | |
| text: GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG | |
| - example_title: prion protein (Kanno blood group) | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC | |
| - example_title: interleukin 10 | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC | |
| - example_title: Zaire ebolavirus | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT | |
| - example_title: SARS coronavirus | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT | |
| - example_title: insulin | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG | |
| - example_title: cyclin dependent kinase inhibitor 2A | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA | |
| - example_title: human papillomavirus type 16 E6 | |
| pipeline_tag: regulatory-variant-effect | |
| sequence_type: cDNA | |
| task: regulatory-variant-effect | |
| text: ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA | |
# DeepSEA

Deep convolutional neural network that predicts noncoding chromatin features (DNase I hypersensitivity, transcription-factor binding, and histone marks) from DNA sequence, used to score the regulatory impact of noncoding variants.

## Disclaimer

This is an UNOFFICIAL implementation of [Predicting effects of noncoding variants with deep learning-based sequence model](https://doi.org/10.1038/nmeth.3547) by Jian Zhou and Olga G. Troyanskaya.

The OFFICIAL repository of DeepSEA is at [DeepSEA](http://deepsea.princeton.edu).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing DeepSEA did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

DeepSEA is a convolutional neural network (CNN) trained to predict 919 chromatin features (DNase I hypersensitivity peaks, transcription-factor binding peaks, and histone-mark peaks) across multiple human cell types from a fixed-length 1000 bp DNA sequence. The model applies three convolutional blocks (convolution, ReLU, max pooling, and dropout) followed by a single fully-connected layer and a multi-label sigmoid output. The sequence-prediction model averages the probabilities predicted for the forward strand and for its reverse complement. The trained model is then used to score the regulatory impact of noncoding single-nucleotide variants by computing the difference between reference- and alternate-allele predictions. Please refer to the [Training Details](#training-details) section for more information on the training process.
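
The strand averaging and the reference/alternate comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: `predict` is a hypothetical callable that maps a DNA string to a list of per-feature probabilities (it is not a MultiMolecule API), and the sign convention (alternate minus reference) is a choice made here, not something this card specifies.

```python
from typing import Callable, List, Sequence

# Complement table for DNA; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def strand_averaged(predict: Callable[[str], Sequence[float]], sequence: str) -> List[float]:
    """Average per-feature probabilities over the forward and reverse-complement strands."""
    forward = predict(sequence)
    reverse = predict(reverse_complement(sequence))
    return [(f + r) / 2 for f, r in zip(forward, reverse)]


def variant_effect(
    predict: Callable[[str], Sequence[float]],
    reference: str,
    position: int,
    alternate: str,
) -> List[float]:
    """Per-feature P(alternate) - P(reference) for a single-nucleotide variant.

    `position` is a 0-based index into `reference`; the sign convention is an assumption.
    """
    alternate_sequence = reference[:position] + alternate + reference[position + 1:]
    ref_probs = strand_averaged(predict, reference)
    alt_probs = strand_averaged(predict, alternate_sequence)
    return [a - r for a, r in zip(alt_probs, ref_probs)]
```

If the predictor already averages both strands internally, as this card states for the sequence-prediction model, the `strand_averaged` step is redundant and the predictor can be called directly.
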

### Model Specification

| Num Conv Layers | Num FC Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| --------------- | ------------- | ----------- | ------------------ | --------- | -------- | -------------- |
| 3               | 1             | 925         | 52.84              | 1.10      | 0.55     | 1000           |
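
As a sanity check on the parameter count in the table, the arithmetic below works through one plausible layer layout. The layer sizes are assumptions recalled from the original DeepSEA architecture and are not listed in this card: three unpadded width-8 convolutions with 320, 480, and 960 kernels, width-4 max pooling after the first two, a 925-unit fully-connected layer, and a 919-unit output layer. The MultiMolecule implementation may differ in detail.

```python
# Assumed layer sizes (NOT stated in this card; see the lead-in above).
SEQUENCE_LENGTH = 1000
IN_CHANNELS = 4                  # one-hot A, C, G, T
CONV_CHANNELS = (320, 480, 960)
KERNEL_SIZE = 8
POOL_SIZE = 4
HIDDEN_SIZE = 925
NUM_FEATURES = 919

length, channels, num_parameters = SEQUENCE_LENGTH, IN_CHANNELS, 0
for index, out_channels in enumerate(CONV_CHANNELS):
    # Convolution weights plus one bias per kernel.
    num_parameters += channels * KERNEL_SIZE * out_channels + out_channels
    length = length - KERNEL_SIZE + 1        # unpadded ("valid") convolution
    if index < len(CONV_CHANNELS) - 1:
        length //= POOL_SIZE                 # non-overlapping max pooling (first two blocks only)
    channels = out_channels

flattened = length * channels
num_parameters += flattened * HIDDEN_SIZE + HIDDEN_SIZE        # fully-connected layer
num_parameters += HIDDEN_SIZE * NUM_FEATURES + NUM_FEATURES    # output layer

print(flattened, num_parameters, round(num_parameters / 1e6, 2))
```

Under these assumptions the total is 52,843,119 parameters, which rounds to the 52.84 M reported in the table; almost all of them sit in the fully-connected layer that follows the flattened convolutional output.
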

### Links

- **Code**: [multimolecule.deepsea](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/deepsea)
- **Data**: ENCODE and Roadmap Epigenomics chromatin-feature peak compendium covering 690 transcription-factor binding profiles, 125 DNase I hypersensitivity profiles, and 104 histone-mark profiles (919 chromatin features in total)
- **Paper**: [Predicting effects of noncoding variants with deep learning-based sequence model](https://doi.org/10.1038/nmeth.3547)
- **Developed by**: Jian Zhou, Olga G. Troyanskaya
- **Model type**: Three-layer 1D CNN over 1000 bp DNA for multi-task chromatin-feature prediction
- **Original Repository**: [DeepSEA](http://deepsea.princeton.edu)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Chromatin Feature Prediction

You can use this model directly to predict the chromatin features of a DNA sequence:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, DeepSeaForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/deepsea")
>>> model = DeepSeaForSequencePrediction.from_pretrained("multimolecule/deepsea")
>>> # "ACGT" * 250 is a 1000-nucleotide placeholder that matches the fixed input window
>>> inputs = tokenizer("ACGT" * 250, return_tensors="pt")
>>> output = model(**inputs)

>>> # One logit per chromatin feature
>>> output.logits.shape
torch.Size([1, 919])
```

### Interface

- **Input length**: fixed 1000 bp DNA window
- **Output**: 919 chromatin-feature logits (multi-label binary), covering DNase I hypersensitivity, transcription-factor binding, and histone-mark peaks across multiple cell types
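
Because the input window is fixed at 1000 bp, sequences of other lengths need to be adjusted before prediction. The helper below is a minimal sketch of one way to do that. The function name `fit_to_window` is hypothetical, centering the region of interest is a design choice made here, and padding with `N` assumes the tokenizer accepts the unknown-base symbol, which this card does not confirm.

```python
WINDOW = 1000  # fixed input length stated in this card


def fit_to_window(sequence: str, window: int = WINDOW, pad: str = "N") -> str:
    """Center-crop a longer sequence or symmetrically pad a shorter one to `window` bases."""
    sequence = sequence.upper()
    excess = len(sequence) - window
    if excess >= 0:
        # Too long (or exact): keep the central `window` bases.
        start = excess // 2
        return sequence[start:start + window]
    # Too short: split the padding between both ends, extra base on the right.
    left = (-excess) // 2
    right = -excess - left
    return pad * left + sequence + pad * right
```

For variant scoring, it is usually more meaningful to extract real flanking sequence from a reference genome so that the variant sits at the center of the window, rather than padding.
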

## Training Details

DeepSEA was trained to predict the chromatin features of DNA sequences across a panel of human cell types and then used to score the regulatory impact of noncoding variants.

### Training Data

DeepSEA was trained on chromatin profiling data from [ENCODE](https://www.encodeproject.org) and the [Roadmap Epigenomics](https://www.roadmapepigenomics.org) project, comprising 690 transcription-factor ChIP-seq profiles, 125 DNase I hypersensitivity profiles, and 104 histone-mark ChIP-seq profiles for a total of 919 chromatin features. Each 1000 bp genomic interval centered on a 200 bp bin is labeled with a binary vector indicating which of the 919 chromatin features have a peak overlapping the central bin.
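
The labeling scheme can be illustrated with a small sketch. The 200 bp bin and 1000 bp window come from the description above; the function names, the half-open coordinate convention, and the `min_overlap` threshold are assumptions made for illustration, and the original pipeline may apply a stricter overlap criterion.

```python
from typing import List, Sequence, Tuple

BIN_SIZE = 200      # central bin, from the card
WINDOW_SIZE = 1000  # full input window, from the card
FLANK = (WINDOW_SIZE - BIN_SIZE) // 2  # 400 bp of context on each side

Peak = Tuple[int, int]  # half-open [start, end) interval on one chromosome


def window_for_bin(bin_start: int) -> Tuple[int, int]:
    """Half-open coordinates of the 1000 bp window centered on a 200 bp bin."""
    return bin_start - FLANK, bin_start + BIN_SIZE + FLANK


def label_bin(
    bin_start: int,
    peaks_per_feature: Sequence[Sequence[Peak]],
    min_overlap: int = 1,  # assumed threshold: any overlapping base counts
) -> List[int]:
    """Return one 0/1 label per feature for the bin starting at `bin_start`."""
    bin_end = bin_start + BIN_SIZE
    labels = []
    for peaks in peaks_per_feature:
        # Largest overlap (in bp) between the bin and any peak of this feature.
        overlap = max(
            (min(bin_end, end) - max(bin_start, start) for start, end in peaks),
            default=0,
        )
        labels.append(int(overlap >= min_overlap))
    return labels
```

With 919 entries in `peaks_per_feature`, `label_bin` yields the 919-dimensional binary target vector for the window returned by `window_for_bin`.
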

### Training Procedure

#### Pre-training

The model was trained to minimize a multi-label binary cross-entropy loss, comparing its predicted per-feature probabilities against the observed chromatin-feature labels.

- Optimizer: Stochastic gradient descent with momentum
- Loss: Multi-label binary cross-entropy
- Regularization: Dropout (0.2 after the first two convolutions, 0.5 after the third convolution) and L2 weight decay
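
The multi-label binary cross-entropy loss named above can be written out explicitly. The following is a dependency-free sketch for a single example, computed from raw logits in a numerically stable form; averaging over features is an assumption about the reduction, and the function name is chosen here for illustration.

```python
import math
from typing import Sequence


def binary_cross_entropy_with_logits(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over features, computed stably from raw logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Algebraically equal to -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))],
        # rearranged so that exp() never receives a large positive argument.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

In a PyTorch training loop the same quantity is typically obtained with `torch.nn.BCEWithLogitsLoss`, applied to the 919 logits and the binary label vector.
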

## Citation

```bibtex
@article{zhou2015deepsea,
  author    = {Zhou, Jian and Troyanskaya, Olga G.},
  title     = {Predicting effects of noncoding variants with deep learning-based sequence model},
  journal   = {Nature Methods},
  volume    = 12,
  number    = 10,
  pages     = {931--934},
  year      = 2015,
  publisher = {Nature Publishing Group},
  doi       = {10.1038/nmeth.3547}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [DeepSEA paper](https://doi.org/10.1038/nmeth.3547) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```