--- language: dna library_name: multimolecule license: agpl-3.0 pipeline: methylation pipeline_tag: other tags: - Biology - DNA widget: - example_title: tumor protein p53 pipeline_tag: methylation sequence_type: DNA task: methylation text: ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG - example_title: BRCA1 DNA repair associated pipeline_tag: methylation sequence_type: DNA task: methylation text: TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG - example_title: hemoglobin subunit beta pipeline_tag: methylation sequence_type: DNA task: methylation text: CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG - example_title: CF transmembrane conductance regulator pipeline_tag: methylation sequence_type: DNA task: methylation text: ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG - example_title: telomerase reverse transcriptase pipeline_tag: methylation sequence_type: DNA task: methylation text: CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG - example_title: KRAS proto-oncogene pipeline_tag: methylation sequence_type: DNA task: methylation text: GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG - example_title: prion protein (Kanno blood group) pipeline_tag: methylation sequence_type: cDNA task: methylation text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC - example_title: interleukin 10 pipeline_tag: methylation sequence_type: cDNA task: methylation text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC - example_title: Zaire ebolavirus pipeline_tag: methylation sequence_type: cDNA task: methylation text: AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT - example_title: SARS coronavirus pipeline_tag: methylation sequence_type: cDNA task: methylation text: ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT - example_title: insulin pipeline_tag: methylation sequence_type: cDNA task: methylation text: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG - example_title: cyclin dependent kinase inhibitor 2A pipeline_tag: methylation sequence_type: cDNA task: methylation text: ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA - example_title: human papillomavirus type 16 E6 pipeline_tag: methylation sequence_type: cDNA task: methylation text: ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA --- # DeepCpG-DNA DNA-only convolutional neural network from DeepCpG for predicting per-cell single-cell DNA methylation states from a CpG-centered sequence window. ## Disclaimer This is an UNOFFICIAL implementation of [DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning](https://doi.org/10.1186/s13059-017-1189-z) by Christof Angermueller, et al. The OFFICIAL repository of DeepCpG is at [cangermueller/deepcpg](https://github.com/cangermueller/deepcpg). > [!TIP] > The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation. **The team releasing DeepCpG-DNA did not write this model card for this model so this model card has been written by the MultiMolecule team.** ## Model Details DeepCpG-DNA is the DNA submodule of the DeepCpG joint model. It is a 1D convolutional neural network that predicts the per-cell methylation state of a CpG site from a fixed-length 1001 bp DNA window centered on the site. The model consumes a one-hot encoded sequence and applies `valid`-padded convolutional blocks (Conv1D + ReLU + MaxPool) followed by a dense bottleneck and one binary classification head per single cell in the training dataset. Please refer to the [Training Details](#training-details) section for more information on the training process. The full DeepCpG model combines this DNA submodule with a recurrent CpG-context submodule and a joint head; this model card covers the DNA submodule only. ### Variants The DeepCpG-DNA module is trained per single-cell dataset, so each variant predicts a different number of output cells. | Dataset | Architecture | Cells | Hub repository | | ------------------------- | ------------ | ----- | ------------------------------------------------------------------------------------------------------- | | Smallwood 2014 serum mESC | CnnL2h128 | 18 | [`deepcpgdna-smallwood2014-serum`](https://huggingface.co/multimolecule/deepcpgdna-smallwood2014-serum) | | Smallwood 2014 2i mESC | CnnL3h128 | 12 | [`deepcpgdna-smallwood2014-2i`](https://huggingface.co/multimolecule/deepcpgdna-smallwood2014-2i) | | Hou 2016 HCC | CnnL2h128 | 25 | [`deepcpgdna-hou2016-hcc`](https://huggingface.co/multimolecule/deepcpgdna-hou2016-hcc) | | Hou 2016 HepG2 | CnnL3h128 | 6 | [`deepcpgdna-hou2016-hepg2`](https://huggingface.co/multimolecule/deepcpgdna-hou2016-hepg2) | | Hou 2016 mESC | CnnL2h128 | 6 | [`deepcpgdna-hou2016-mesc`](https://huggingface.co/multimolecule/deepcpgdna-hou2016-mesc) | ### Model Specification
Architecture Num Conv Layers Hidden Size Num Cells Num Parameters (M) FLOPs (M) MACs (M) Max Num Tokens
CnnL2h128 2 128 18 4.11 70.63 35.06 1001
CnnL3h128 3 12 4.43 165.02 82.18
### Links - **Code**: [multimolecule.deepcpgdna](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/deepcpgdna) - **Data**: scBS-seq (Smallwood 2014) and scRRBS-seq (Hou 2016) single-cell bisulfite sequencing datasets - **Paper**: [DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning](https://doi.org/10.1186/s13059-017-1189-z) - **Developed by**: Christof Angermueller, Heather J. Lee, Wolf Reik, Oliver Stegle - **Model type**: Two- or three-layer 1D CNN over a 1001 bp CpG-centered DNA window for per-cell binary methylation prediction - **Original Repository**: [cangermueller/deepcpg](https://github.com/cangermueller/deepcpg) ## Usage The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: ```bash pip install multimolecule ``` ### Direct Use #### Single-Cell Methylation Prediction You can use this model directly to predict the per-cell methylation state of a 1001 bp DNA window centered on a CpG site: ```python >>> from multimolecule import DnaTokenizer, DeepCpgDnaForSequencePrediction >>> model_id = "multimolecule/deepcpgdna-hou2016-mesc" >>> tokenizer = DnaTokenizer.from_pretrained(model_id) >>> model = DeepCpgDnaForSequencePrediction.from_pretrained(model_id) >>> input = tokenizer("ACGT" * 250 + "A", return_tensors="pt") >>> output = model(**input) >>> output.logits.shape torch.Size([1, 18]) ``` Each logit is a per-cell methylation score for one of the single cells in the chosen training dataset; apply a sigmoid to obtain methylation probabilities. ### Interface - **Input length**: fixed 1001 bp DNA window centered on a CpG site - **Padding**: not supported; pad or crop genomic windows so they match `sequence_length` exactly - **Alphabet**: DNA (`A`, `C`, `G`, `T`); `N` is encoded as an all-zero channel - **Output**: per-cell methylation logits; the number of cells is dataset-specific (see Variants table) ## Training Details DeepCpG-DNA was trained to predict the per-cell methylation state of CpG sites from their flanking DNA context. ### Training Data DeepCpG-DNA was trained on single-cell bisulfite sequencing datasets: - **Smallwood 2014**: scBS-seq profiles of mouse embryonic stem cells, with 18 serum and 12 2i mESCs (excluding two serum cells whose methylation pattern deviated strongly from the remainder). - **Hou 2016**: scRRBS-seq profiles of 25 human hepatocellular carcinoma (HCC) cells, 6 human heptoplastoma-derived (HepG2) cells, and 6 mESCs, restricted to CpG sites covered by at least four reads. Each training example is a 1001 bp DNA window centered on a CpG site, with a per-cell binary methylation label (methylated, unmethylated, or missing). Chromosomes were split into training, validation, and test sets to avoid sequence leakage. ### Training Procedure #### Pre-training The model was trained to minimize a per-cell binary cross-entropy loss, comparing its predicted per-cell methylation probabilities (sigmoid of the per-cell logits) against the observed single-cell bisulfite labels. Missing labels are masked out during training. - Optimizer: Adam - Loss: Per-cell binary cross-entropy - Regularization: Dropout and L2 weight decay ## Citation ```bibtex @article{angermueller2017deepcpg, author = {Angermueller, Christof and Lee, Heather J. and Reik, Wolf and Stegle, Oliver}, title = {{DeepCpG}: accurate prediction of single-cell {DNA} methylation states using deep learning}, journal = {Genome Biology}, volume = 18, number = 1, pages = {67}, year = 2017, publisher = {BioMed Central}, doi = {10.1186/s13059-017-1189-z} } ``` > [!NOTE] > The artifacts distributed in this repository are part of the MultiMolecule project. > If MultiMolecule supports your research, please cite the MultiMolecule project as follows: ```bibtex @software{chen_2024_12638419, author = {Chen, Zhiyuan and Zhu, Sophia Y.}, title = {MultiMolecule}, doi = {10.5281/zenodo.12638419}, publisher = {Zenodo}, url = {https://doi.org/10.5281/zenodo.12638419}, year = 2024, month = may, day = 4 } ``` ## Contact Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. Please contact the authors of the [DeepCpG paper](https://doi.org/10.1186/s13059-017-1189-z) for questions or comments on the paper/model. ## License This model implementation is licensed under the [GNU Affero General Public License](license.md). For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). ```spdx SPDX-License-Identifier: AGPL-3.0-or-later ```