CpGPT Human Dependencies

DNA sequence embeddings and metadata required to run CpGPT on human Illumina methylation arrays.

Contents

dna_embeddings/
  homo_sapiens/
    nucleotide-transformer-v2-500m-multi-species/
      2001bp_dna_embeddings.mmap    # ~5 GB, pre-computed DNA sequence embeddings
ensembl_metadata.db                  # ~930 MB, Ensembl genome annotations
illumina_metadata.db                 # ~30 MB, Illumina array probe metadata

Download

# Install huggingface_hub
pip install huggingface_hub

# Download all human dependencies
huggingface-cli download lucascamillomd/cpgpt-human-dependencies --local-dir dependencies/human
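
The same download can be scripted from Python, followed by a quick check that the files listed under Contents actually arrived. This is a minimal sketch, not part of CpGPT itself: `download_dependencies` and `missing_files` are hypothetical helper names, and the expected paths are copied from the listing above.

```python
from pathlib import Path

# Relative paths copied from the "Contents" listing.
EXPECTED_FILES = (
    "dna_embeddings/homo_sapiens/"
    "nucleotide-transformer-v2-500m-multi-species/2001bp_dna_embeddings.mmap",
    "ensembl_metadata.db",
    "illumina_metadata.db",
)


def download_dependencies(local_dir="dependencies/human"):
    """Fetch the repository snapshot into local_dir (needs network access)."""
    # Imported lazily so missing_files() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="lucascamillomd/cpgpt-human-dependencies",
        local_dir=local_dir,
    )


def missing_files(local_dir="dependencies/human"):
    """Return the expected relative paths that are absent under local_dir."""
    root = Path(local_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `download_dependencies()` and then `missing_files()` should return an empty list once the roughly 6 GB transfer has completed. A non-empty list points to an interrupted or partial download.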

What are these files?

  • DNA embeddings: Pre-computed 1024-dimensional embeddings from the Nucleotide Transformer v2 (500M) model for 2001 bp windows centered on each CpG site in the human genome. These provide the sequence context for CpGPT.
  • Ensembl metadata: Genome annotations including gene locations, CpG island boundaries, and regulatory elements.
  • Illumina metadata: Probe-level metadata for Illumina methylation arrays (27k, 450k, EPIC, EPICv2, MSA).
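
To get a feel for the files before handing them to CpGPT, the sketch below inspects them with the Python standard library only. It rests on assumptions this README does not confirm: that the `.db` files are SQLite databases, and that the `.mmap` file is a headerless row-major array of 1024-dimensional vectors stored at 4 bytes per value (float32). `list_tables` and `embedding_rows` are hypothetical helper names, not part of the CpGPT API.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return table names from a database file.

    ASSUMPTION: the file is a SQLite database. It is opened read-only.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def embedding_rows(mmap_path, dim=1024, bytes_per_value=4):
    """Number of embedding vectors implied by the file size.

    ASSUMPTION: no header, row-major layout, and 4 bytes per value (float32).
    Only dim=1024 comes from the README; the storage type is a guess.
    """
    size = os.path.getsize(mmap_path)
    row_bytes = dim * bytes_per_value
    if size % row_bytes != 0:
        raise ValueError(
            f"{size} bytes is not a multiple of {row_bytes}; "
            "the assumed dtype or dimension is probably wrong"
        )
    return size // row_bytes
```

If `embedding_rows` raises, the storage type is likely different from the guess (for example 2 bytes per value). Under the same assumptions, `numpy.memmap` with `mode="r"` and `shape=(rows, 1024)` would give a read-only view of the vectors without loading the whole file into memory.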

Related Repositories

Citation

@article{camillo2024cpgpt,
  title={CpGPT: A Foundation Model for DNA Methylation},
  author={de Lima Camillo, Lucas Paulo and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.24.619766}
}

License

MIT License. See the GitHub repository for details.
