CpGPT Human Dependencies

DNA sequence embeddings and metadata required to run CpGPT on human Illumina methylation arrays.

Contents

dna_embeddings/
  homo_sapiens/
    nucleotide-transformer-v2-500m-multi-species/
      2001bp_dna_embeddings.mmap    # ~5 GB — pre-computed DNA sequence embeddings
ensembl_metadata.db                  # ~930 MB — Ensembl genome annotations
illumina_metadata.db                 # ~30 MB — Illumina array probe metadata
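
A quick way to confirm that a local copy matches this layout is to check for the three files. The sketch below is not part of CpGPT: the function name `missing_dependencies` is hypothetical, and the relative paths are copied from the listing above.

```python
from pathlib import Path

# Relative paths copied from the Contents listing.
EXPECTED_FILES = [
    "dna_embeddings/homo_sapiens/nucleotide-transformer-v2-500m-multi-species/"
    "2001bp_dna_embeddings.mmap",
    "ensembl_metadata.db",
    "illumina_metadata.db",
]


def missing_dependencies(root):
    """Return the expected relative paths that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_dependencies("dependencies/human")` after a download should return an empty list if every file arrived.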

Download

# Install huggingface_hub
pip install huggingface_hub

# Download all human dependencies
huggingface-cli download lucascamillomd/cpgpt-human-dependencies --local-dir dependencies/human
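
The same download can likely be done from Python with `snapshot_download` from `huggingface_hub`. This is a minimal sketch rather than an official CpGPT entry point; the wrapper name `download_human_dependencies` is hypothetical, and the default directory mirrors the command above.

```python
def download_human_dependencies(local_dir="dependencies/human"):
    """Download the whole repository into `local_dir` and return the local path."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="lucascamillomd/cpgpt-human-dependencies",
        local_dir=local_dir,
    )
```

Call `download_human_dependencies()` once; expect roughly 6 GB of traffic given the file sizes listed under Contents.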

What are these files?

DNA embeddings: Pre-computed 1024-dimensional embeddings from the Nucleotide Transformer v2 (500M, multi-species) model, one for each 2001 bp window centered on a CpG site in the human genome. These provide the DNA sequence context for CpGPT.
Ensembl metadata: Genome annotations including gene locations, CpG island boundaries, and regulatory elements.
Illumina metadata: Probe-level metadata for Illumina methylation arrays (27k, 450k, EPIC, EPICv2, MSA).
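
For a first look at the downloaded files, the following sketch uses only the Python standard library. It rests on assumptions this card does not confirm: that the `.db` files are SQLite databases, and that the `.mmap` file is a flat array of 1024-dimensional vectors with no header. The storage dtype is not stated here, so the caller supplies the bytes per value. Both function names are hypothetical.

```python
import os
import sqlite3
from pathlib import Path

EMBEDDING_DIM = 1024  # dimensionality stated in the description above


def list_tables(db_path):
    """Return the table names in a database file, assuming it is SQLite."""
    # Open read-only so a mistyped path raises instead of creating an empty file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def embedding_rows(mmap_path, bytes_per_value):
    """Number of 1024-dimensional vectors implied by the file size.

    `bytes_per_value` is an assumption supplied by the caller
    (2 for float16, 4 for float32); it is not documented in this card.
    """
    size = os.path.getsize(mmap_path)
    row_bytes = EMBEDDING_DIM * bytes_per_value
    if size % row_bytes:
        raise ValueError("file size is not a whole number of rows for this dtype")
    return size // row_bytes
```

If `embedding_rows` raises for one candidate dtype but not another, that is a hint about the storage format, not proof. The CpGPT code in the GitHub repository remains the authority on how these files are read.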

Related Repositories

Model weights: lucascamillomd/cpgpt-models
Mammalian dependencies: lucascamillomd/cpgpt-mammalian-dependencies
Code & tutorials: CpGPT GitHub

Citation

@article{camillo2024cpgpt,
  title={CpGPT: A Foundation Model for DNA Methylation},
  author={de Lima Camillo, Lucas Paulo and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.24.619766}
}

License

MIT License. See the GitHub repository for details.
