CpGPT Human Dependencies
DNA sequence embeddings and metadata required to run CpGPT on human Illumina methylation arrays.
Contents
dna_embeddings/
homo_sapiens/
nucleotide-transformer-v2-500m-multi-species/
2001bp_dna_embeddings.mmap # ~5 GB, pre-computed DNA sequence embeddings
ensembl_metadata.db # ~930 MB, Ensembl genome annotations
illumina_metadata.db # ~30 MB, Illumina array probe metadata
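Once the files are on disk, a quick check that all three are present can save a confusing failure later. The following is a minimal sketch, not part of CpGPT: `find_missing_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the search is recursive by file name so that it does not assume any particular directory nesting.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = (
    "2001bp_dna_embeddings.mmap",
    "ensembl_metadata.db",
    "illumina_metadata.db",
)


def find_missing_files(root, expected=EXPECTED_FILES):
    """Return the expected file names not found anywhere under `root`.

    Hypothetical helper: it matches on file name only and searches
    recursively, so it makes no assumption about directory nesting.
    """
    root = Path(root)
    present = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in expected if name not in present]
```

For example, `find_missing_files("dependencies/human")` returns an empty list when all three files are present under that directory.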
Download
# Install huggingface_hub
pip install huggingface_hub
# Download all human dependencies
huggingface-cli download lucascamillomd/cpgpt-human-dependencies --local-dir dependencies/human
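The same download can be done from Python with `snapshot_download` from `huggingface_hub`. This is a sketch of one way to do it, not the project's own script: the wrapper name `download_human_dependencies` is hypothetical, and the optional `allow_patterns` filter is included only to show how a subset of files (for instance the small Illumina metadata file) could be fetched.

```python
from pathlib import Path

# Repository ID and target directory taken from the CLI command above.
REPO_ID = "lucascamillomd/cpgpt-human-dependencies"
DEFAULT_LOCAL_DIR = "dependencies/human"


def download_human_dependencies(local_dir=DEFAULT_LOCAL_DIR, allow_patterns=None):
    """Download the repository snapshot into `local_dir` and return its path.

    Hypothetical wrapper around huggingface_hub.snapshot_download.
    `allow_patterns` (e.g. ["illumina_metadata.db"]) restricts the download
    to matching files; None downloads everything (several GB).
    """
    # Imported lazily so the module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=allow_patterns,
    )
    return Path(path)
```

Calling `download_human_dependencies()` fetches everything; `download_human_dependencies(allow_patterns=["illumina_metadata.db"])` fetches only the probe metadata.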
What are these files?
- DNA embeddings: Pre-computed 1024-dimensional embeddings from the Nucleotide Transformer v2 (500M) model for 2001bp windows centered on each CpG site in the human genome. These provide the sequence context for CpGPT.
- Ensembl metadata: Genome annotations including gene locations, CpG island boundaries, and regulatory elements.
- Illumina metadata: Probe-level metadata for Illumina methylation arrays (27k, 450k, EPIC, EPICv2, MSA).
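To inspect the embeddings outside of CpGPT, the `.mmap` file could be opened lazily with `numpy.memmap`. The sketch below rests on assumptions this card does not confirm: that the file is a flat, header-less, row-major array of `float32` values with 1024 columns (only the dimension comes from the description above). The function names are hypothetical, the mapping from row index to CpG site is not covered here, and the CpGPT code remains the authoritative loader.

```python
import os

EMBEDDING_DIM = 1024       # stated in the description above
ASSUMED_DTYPE = "float32"  # ASSUMPTION: the card does not state the dtype


def infer_num_rows(file_size_bytes, dim=EMBEDDING_DIM, itemsize=4):
    """Infer the row count of a flat (rows, dim) array from its size in bytes."""
    row_bytes = dim * itemsize
    if file_size_bytes % row_bytes != 0:
        raise ValueError(
            "File size is not a multiple of one row; "
            "the assumed dtype or dimension is probably wrong."
        )
    return file_size_bytes // row_bytes


def open_embeddings(path, dim=EMBEDDING_DIM, dtype=ASSUMED_DTYPE):
    """Open the embeddings read-only without loading them into memory.

    ASSUMPTION: raw row-major layout with no header.
    """
    import numpy as np  # imported lazily; only needed to open the file

    itemsize = np.dtype(dtype).itemsize
    rows = infer_num_rows(os.path.getsize(path), dim, itemsize)
    return np.memmap(path, dtype=dtype, mode="r", shape=(rows, dim))
```

If the size check raises, the assumed dtype or layout does not match the file, and the result should not be trusted.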
Related Repositories
- Model weights: lucascamillomd/cpgpt-models
- Mammalian dependencies: lucascamillomd/cpgpt-mammalian-dependencies
- Code & tutorials: CpGPT GitHub
Citation
@article{camillo2024cpgpt,
title={CpGPT: A Foundation Model for DNA Methylation},
author={de Lima Camillo, Lucas Paulo and others},
journal={bioRxiv},
year={2024},
doi={10.1101/2024.10.24.619766}
}
License
MIT License; see the GitHub repository for details.