CpGPT Mammalian Dependencies

Pre-computed DNA sequence embeddings for 309 species, required to run CpGPT on Horvath Mammalian Methylation Array data and in cross-species analyses.

Contents

dna_embeddings/
  {species_name}/
    nucleotide-transformer-v2-500m-multi-species/
      2001bp_dna_embeddings.mmap    # Per-species DNA sequence embeddings
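
The layout above can be turned into a small path helper. The sketch below assumes the files sit under a local root such as `dependencies/mammalian`. The names `embedding_path` and `list_species` are illustrative and are not part of CpGPT.

```python
from pathlib import Path

# Directory and file names copied from the layout above.
MODEL_DIR = "nucleotide-transformer-v2-500m-multi-species"
EMBEDDING_FILE = "2001bp_dna_embeddings.mmap"


def embedding_path(root, species):
    """Return the expected embedding file path for one species."""
    return Path(root) / "dna_embeddings" / species / MODEL_DIR / EMBEDDING_FILE


def list_species(root):
    """Return sorted species names whose embedding file exists under root."""
    base = Path(root) / "dna_embeddings"
    if not base.is_dir():
        return []
    return sorted(
        d.name
        for d in base.iterdir()
        if d.is_dir() and embedding_path(root, d.name).is_file()
    )
```

Calling `list_species("dependencies/mammalian")` after a partial download is a quick way to see which species are available locally.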

Each species directory holds pre-computed 1024-dimensional embeddings from the Nucleotide Transformer v2 (500M) model, one per 2001 bp window centered on a CpG site covered by the mammalian methylation array.
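
As a rough sketch of how such a file might be opened, the example below memory-maps it with NumPy and infers the number of CpG sites from the file size. The on-disk dtype and the headerless, row-major `(n_sites, 1024)` layout are assumptions (neither is stated here), so `dtype` is a parameter and should be checked against the CpGPT code. `open_embeddings` is a hypothetical helper name.

```python
import os

import numpy as np

EMBEDDING_DIM = 1024  # embedding width stated above


def open_embeddings(path, dtype=np.float32, dim=EMBEDDING_DIM):
    """Memory-map an embedding file as a read-only (n_sites, dim) array.

    ASSUMPTION: the file is raw, headerless, row-major values of `dtype`.
    """
    row_bytes = dim * np.dtype(dtype).itemsize
    size = os.path.getsize(path)
    if size == 0 or size % row_bytes != 0:
        raise ValueError(
            f"{path}: {size} bytes is not a whole number of {row_bytes}-byte rows; "
            "the assumed dtype or dim is probably wrong"
        )
    # mode="r" keeps the file read-only; rows are paged in on access.
    return np.memmap(path, dtype=dtype, mode="r", shape=(size // row_bytes, dim))
```

The size check only catches gross mismatches. A file that divides evenly under the wrong dtype would still open, so the inferred row count is worth comparing with the expected number of CpG sites for that species.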

Total size: ~37 GB across 309 species.

Download

# Install huggingface_hub
pip install huggingface_hub

# Download all mammalian dependencies (~37 GB)
huggingface-cli download lucascamillomd/cpgpt-mammalian-dependencies --local-dir dependencies/mammalian

# Or download a specific species
huggingface-cli download lucascamillomd/cpgpt-mammalian-dependencies --include "dna_embeddings/homo_sapiens/*" --local-dir dependencies/mammalian
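
The same selective download can be scripted from Python. This sketch assumes `snapshot_download` from `huggingface_hub` with its `allow_patterns` filter, and it relies on the default repository type, as the CLI commands above do. `species_pattern` and `download_species` are illustrative names.

```python
REPO_ID = "lucascamillomd/cpgpt-mammalian-dependencies"


def species_pattern(species):
    """Glob matching every file under one species directory."""
    return f"dna_embeddings/{species}/*"


def download_species(species_names, local_dir="dependencies/mammalian"):
    """Download embeddings for the given species only; returns the local path."""
    # Imported here so the pattern helper stays usable without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[species_pattern(s) for s in species_names],
        local_dir=local_dir,
    )
```

For example, `download_species(["homo_sapiens"])` mirrors the single-species CLI command. Species names must match the directory names in the repository.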

Related Repositories

Citation

@article{camillo2024cpgpt,
  title={CpGPT: A Foundation Model for DNA Methylation},
  author={de Lima Camillo, Lucas Paulo and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.24.619766}
}

License

MIT License. See the GitHub repository for details.
