| --- |
| license: mit |
| --- |
| |
| # ProtCompass Embeddings |
|
|
| Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs. |
|
|
| ## Dataset Structure |
|
|
| ``` |
| embeddings/ # Compressed embeddings (~150GB) |
| ├── contact_prediction/ # Per-encoder compressed (300GB → ~60GB) |
| │   ├── esm2.tar.gz |
| │   ├── gearnet.tar.gz |
| │   └── ... (36 encoders) |
| ├── secondary_structure/ # Per-encoder compressed (129GB → ~30GB) |
| ├── ppi_site/ # Per-encoder compressed (80GB → ~20GB) |
| ├── metal_binding/ # Per-encoder compressed (41GB → ~10GB) |
| ├── mutation_effect.tar.gz # Per-task compressed (27GB → ~7GB) |
| ├── go_bp.tar.gz # Per-task compressed (7.9GB → ~2GB) |
| ├── stability.tar.gz # Per-task compressed (4.1GB → ~1GB) |
| ├── solubility.tar.gz # Per-task compressed (3.6GB → ~900MB) |
| ├── go_mf.tar.gz # Per-task compressed (3.1GB → ~800MB) |
| ├── fluorescence.tar.gz # Per-task compressed (3.0GB → ~800MB) |
| ├── ec_classification.tar.gz # Per-task compressed (1.9GB → ~500MB) |
| ├── subcellular_localization.tar.gz # Per-task compressed (1.4GB → ~400MB) |
| ├── membrane_soluble.tar.gz # Per-task compressed (1.4GB → ~400MB) |
| ├── remote_homology.tar.gz # Per-task compressed (805MB → ~200MB) |
| └── ppi_affinity.tar.gz # Per-task compressed (169MB → ~50MB) |
| |
| probing_IF/ # Probing results (2.8GB) |
| ├── probing_embeddings/ # Invariant family embeddings (12 encoders) |
| └── probing_results_architecture_full/ # Full probing results (195 files) |
| |
| results/ # Evaluation results (6.9MB) |
| └── {encoder}/{task}/ # Per-encoder, per-task results |
| |
| outputs/ # Analysis outputs (12MB) |
| ├── alignment_analysis/ # Alignment analysis figures |
| ├── paper_figures_v12/ # Final paper figures |
| └── uncertainty_appendix/ # Uncertainty analysis |
| ``` |
|
|
| ## Decompression Instructions |
|
|
| All embeddings are packaged as gzip-compressed tar archives (`.tar.gz`). Extract them before use: |
|
|
| ### Large Tasks (per-encoder compression) |
| For `contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`: |
|
|
| ```bash |
| # Decompress all encoders in a task |
| cd embeddings/contact_prediction/ |
| for f in *.tar.gz; do tar -xzf "$f"; done |
| |
| # Or decompress a specific encoder |
| tar -xzf esm2.tar.gz |
| ``` |
|
|
| ### Medium/Small Tasks (per-task compression) |
| For all other tasks: |
|
|
| ```bash |
| # Decompress entire task |
| cd embeddings/ |
| tar -xzf mutation_effect.tar.gz |
| tar -xzf go_bp.tar.gz |
| # etc. |
| |
| # Or decompress all tasks at once |
| for f in *.tar.gz; do tar -xzf "$f"; done |
| ``` |
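The shell loops above can also be run from Python, which may be convenient on systems without `tar`. The following is a minimal sketch, not part of the released tooling: the helper name `extract_archives` is hypothetical, and it assumes every archive in the directory should be extracted into one destination folder.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir, dest_dir=None):
    """Extract every *.tar.gz in src_dir into dest_dir (defaults to src_dir)."""
    src = Path(src_dir)
    dest = Path(dest_dir) if dest_dir is not None else src
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(src.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Python versions with extraction filters: reject absolute
                # paths and members that would escape the destination.
                tar.extractall(path=dest, filter="data")
            else:
                tar.extractall(path=dest)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives("embeddings/contact_prediction")` would unpack all per-encoder archives of that task in place, and `extract_archives("embeddings")` would unpack the per-task archives.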
|
|
| ## File Format |
|
|
| After decompression, each encoder directory contains: |
| - `train_embeddings.npy`: Training set embeddings (N × D) |
| - `test_embeddings.npy`: Test set embeddings (M × D) |
| - `train_labels.npy`: Training labels |
| - `test_labels.npy`: Test labels |
| - `train_ids.txt`: Protein IDs for training set |
| - `test_ids.txt`: Protein IDs for test set |
| - `meta.json`: Metadata (encoder name, dimensions, dataset info) |
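The file list above can be read into memory with a small loader. This is a sketch under stated assumptions: the function name `load_encoder_dir` is hypothetical, the ID files are assumed to hold one protein ID per line, and the arrays are assumed to be regular fixed-size NumPy arrays with one embedding row per ID. Residue-level tasks may be stored differently, so check `meta.json` before relying on the row-count check.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir):
    """Load both splits and the metadata from one extracted encoder directory."""
    root = Path(encoder_dir)
    data = {"meta": json.loads((root / "meta.json").read_text())}
    for split in ("train", "test"):
        emb = np.load(root / f"{split}_embeddings.npy")
        labels = np.load(root / f"{split}_labels.npy")
        text = (root / f"{split}_ids.txt").read_text()
        # Assumption: one protein ID per non-empty line.
        ids = [line.strip() for line in text.splitlines() if line.strip()]
        # Assumption: one embedding row per protein ID.
        if len(ids) != emb.shape[0]:
            raise ValueError(
                f"{split}: {len(ids)} IDs but {emb.shape[0]} embedding rows"
            )
        data[split] = {"embeddings": emb, "labels": labels, "ids": ids}
    return data
```

A call such as `load_encoder_dir("embeddings/mutation_effect/esm2")` would return a dictionary with `train`, `test`, and `meta` entries.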
|
|
| ## Usage |
|
|
| ```python |
| import numpy as np |
| from huggingface_hub import hf_hub_download |
| import tarfile |
| |
| # Download and decompress embeddings |
| tar_path = hf_hub_download( |
|     repo_id="Anonymoususer2223/ProtCompass_Embeddings", |
|     filename="embeddings/mutation_effect.tar.gz", |
|     repo_type="dataset", |
| ) |
| |
| # Extract |
| with tarfile.open(tar_path, 'r:gz') as tar: |
|     tar.extractall(path="./embeddings/") |
| |
| # Load embeddings |
| train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy") |
| test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy") |
| train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy") |
| test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy") |
| |
| # Use for downstream tasks |
| from sklearn.linear_model import Ridge |
| model = Ridge() |
| model.fit(train_emb, train_labels) |
| score = model.score(test_emb, test_labels) |
| print(f"Test R²: {score:.3f}") |
| ``` |
|
|
| ## Encoders Included |
|
|
| ### Sequence Encoders (8) |
| ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh |
|
|
| ### Structure Encoders (50+) |
| GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more |
|
|
| ### Multimodal Encoders (5) |
| SaProt, ESM-IF, FoldVision |
|
|
| ### Baselines (5) |
| Random, Length, Torsion, One-hot, BLOSUM |
|
|
| ## Dataset Statistics |
|
|
| - **Compressed size**: ~150GB |
| - **Uncompressed size**: ~600GB |
| - **Total encoders**: 70+ |
| - **Total tasks**: 15 |
| - **Total proteins**: ~500K across all tasks |
| - **Compression ratio**: ~4x (gzip) |
|
|
| ## Compression Details |
|
|
| - **Large tasks** (>30GB uncompressed): Per-encoder compression for flexibility |
| - Users can download only specific encoders |
| - Enables parallel decompression |
| |
| - **Medium/Small tasks** (<30GB uncompressed): Per-task compression |
| - Single archive per task |
| - Faster download for complete task data |
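Because the two layouts use different archive paths, a small helper can map a task (and, for large tasks, an encoder) to the file to download. This is an illustrative sketch: `archive_filename` and `PER_ENCODER_TASKS` are hypothetical names, and the path patterns are taken from the directory tree in this README.

```python
# Tasks that the directory tree lists as per-encoder archives.
PER_ENCODER_TASKS = {
    "contact_prediction",
    "secondary_structure",
    "ppi_site",
    "metal_binding",
}


def archive_filename(task, encoder=None):
    """Return the repo-relative archive path for a task (and encoder, if needed)."""
    if task in PER_ENCODER_TASKS:
        if encoder is None:
            raise ValueError(f"{task} is stored per encoder; pass an encoder name")
        return f"embeddings/{task}/{encoder}.tar.gz"
    # Medium/small tasks ship as a single archive covering all encoders.
    return f"embeddings/{task}.tar.gz"
```

The returned string can be passed as the `filename` argument of `hf_hub_download`, together with the `repo_id` and `repo_type="dataset"` used in the Usage section.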
|
|
| ## Citation |
|
|
| If you use these embeddings, please cite: |
|
|
| ```bibtex |
| @article{protcompass2026, |
| title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders}, |
| author={Your Name et al.}, |
| journal={NeurIPS}, |
| year={2026} |
| } |
| ``` |
|
|
| ## Related Resources |
|
|
| - **Code Repository**: [GitHub](https://github.com/yourusername/protcompass) |
| - **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv) |
| - **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx) |
|
|
| ## License |
|
|
| MIT License |
|
|
| ## Contact |
|
|
| For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass). |
|
|