ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.

Dataset Structure

embeddings/                          # Compressed embeddings (~150GB)
├── contact_prediction/              # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/             # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/                        # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/                   # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz           # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz                     # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz                 # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz                # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz                     # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz              # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz         # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz  # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz          # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz           # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz              # Per-task compressed (169MB → ~50MB)

probing_IF/                          # Probing results (2.8GB)
├── probing_embeddings/              # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/  # Full probing results (195 files)

results/                             # Evaluation results (6.9MB)
└── {encoder}/{task}/                # Per-encoder, per-task results

outputs/                             # Analysis outputs (12MB)
├── alignment_analysis/              # Alignment analysis figures
├── paper_figures_v12/               # Final paper figures
└── uncertainty_appendix/            # Uncertainty analysis

Decompression Instructions

All embeddings are stored as gzip-compressed tar archives (.tar.gz) and must be extracted before use.

Large Tasks (per-encoder compression)

For contact_prediction, secondary_structure, ppi_site, metal_binding:

# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done

# Or decompress specific encoder
tar -xzf esm2.tar.gz

Medium/Small Tasks (per-task compression)

For all other tasks:

# Decompress entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf go_bp.tar.gz
# etc.

# Or decompress all per-task archives at once
# (per-encoder tasks live in subdirectories and are not touched by this loop)
for f in *.tar.gz; do tar -xzf "$f"; done
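
The two archive layouts described above can be captured in a small helper that maps a task (and, for the large tasks, an encoder) to the archive path inside the repository. This is a minimal sketch: the function name `archive_filename` and the constant `PER_ENCODER_TASKS` are hypothetical, and the paths simply follow the directory tree shown in Dataset Structure.

```python
# Tasks that are split into one archive per encoder (taken from the tree above).
PER_ENCODER_TASKS = {
    "contact_prediction",
    "secondary_structure",
    "ppi_site",
    "metal_binding",
}


def archive_filename(task, encoder=None):
    """Return the in-repo path of the archive that holds `task` (hypothetical helper)."""
    if task in PER_ENCODER_TASKS:
        if encoder is None:
            raise ValueError(f"{task} is split per encoder; pass an encoder name")
        return f"embeddings/{task}/{encoder}.tar.gz"
    # All other tasks ship as a single archive containing every encoder.
    return f"embeddings/{task}.tar.gz"


print(archive_filename("contact_prediction", "esm2"))
print(archive_filename("mutation_effect"))
```

The returned string is intended to be passed as the `filename` argument of `hf_hub_download` (with `repo_type="dataset"`), so that only the archives actually needed are downloaded.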

File Format

After decompression, each encoder directory contains:

  • train_embeddings.npy: Training set embeddings (N × D)
  • test_embeddings.npy: Test set embeddings (M × D)
  • train_labels.npy: Training labels
  • test_labels.npy: Test labels
  • train_ids.txt: Protein IDs for training set
  • test_ids.txt: Protein IDs for test set
  • meta.json: Metadata (encoder name, dimensions, dataset info)
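
A loader for the files listed above might look like the following sketch. It makes several assumptions that are not stated in this card: the ID files hold one protein ID per line, the number of IDs equals the number of embedding rows, and `meta.json` is a plain JSON object (no specific keys are relied on). The function name `load_encoder_dir` is hypothetical. Labels are loaded without shape checks because their layout may differ between tasks.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir):
    """Load one extracted encoder directory and run basic consistency checks."""
    root = Path(encoder_dir)
    data = {"meta": json.loads((root / "meta.json").read_text())}

    for split in ("train", "test"):
        emb = np.load(root / f"{split}_embeddings.npy")
        labels = np.load(root / f"{split}_labels.npy")
        # Assumption: one protein ID per line, blank lines ignored.
        lines = (root / f"{split}_ids.txt").read_text().splitlines()
        ids = [line.strip() for line in lines if line.strip()]
        # Assumption: one embedding row per protein ID.
        if len(ids) != emb.shape[0]:
            raise ValueError(
                f"{split}: {len(ids)} ids but {emb.shape[0]} embedding rows"
            )
        data[split] = {"embeddings": emb, "labels": labels, "ids": ids}

    # Train and test embeddings should share the feature dimension D.
    train_dim = data["train"]["embeddings"].shape[1:]
    test_dim = data["test"]["embeddings"].shape[1:]
    if train_dim != test_dim:
        raise ValueError(f"feature dimensions differ: {train_dim} vs {test_dim}")
    return data
```

For example, `load_encoder_dir("embeddings/mutation_effect/esm2")` would return a dictionary with `meta`, `train`, and `test` entries once that archive has been extracted.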

Usage

import numpy as np
from huggingface_hub import hf_hub_download
import tarfile

# Download and decompress embeddings
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset"
)

# Extract
with tarfile.open(tar_path, 'r:gz') as tar:
    tar.extractall(path="./embeddings/")

# Load embeddings
train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")

# Use for downstream tasks
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test R²: {score:.3f}")
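
To make explicit what the linear probe above computes, here is a NumPy-only sketch of ridge regression with an intercept, scored by R² on the test split. It is meant as an illustration for 1-D regression labels, not as a replacement for scikit-learn; the function name `ridge_probe_r2` is hypothetical and the demo data at the bottom is synthetic, not taken from the dataset.

```python
import numpy as np


def ridge_probe_r2(train_emb, train_labels, test_emb, test_labels, alpha=1.0):
    """Fit ridge regression in closed form and return R^2 on the test split."""
    x_train = np.asarray(train_emb, dtype=np.float64)
    y_train = np.asarray(train_labels, dtype=np.float64)
    x_test = np.asarray(test_emb, dtype=np.float64)
    y_test = np.asarray(test_labels, dtype=np.float64)

    # Centering the data handles the intercept without penalizing it.
    x_mean = x_train.mean(axis=0)
    y_mean = y_train.mean(axis=0)
    xc = x_train - x_mean
    yc = y_train - y_mean

    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector w.
    dim = xc.shape[1]
    weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(dim), xc.T @ yc)

    # R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the test-label mean.
    pred = (x_test - x_mean) @ weights + y_mean
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic demo data (not from the dataset): a noisy linear target.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 16))
y = x @ rng.normal(size=16) + 0.1 * rng.normal(size=300)
print(f"Test R²: {ridge_probe_r2(x[:200], y[:200], x[200:], y[200:]):.3f}")
```

Note that R² is only meaningful for regression targets; classification tasks would need a different probe and metric.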

Encoders Included

Sequence Encoders (8)

ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh

Structure Encoders (50+)

GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more

Multimodal Encoders (5)

SaProt, ESM-IF, FoldVision

Baselines (5)

Random, Length, Torsion, One-hot, BLOSUM

Dataset Statistics

  • Compressed size: ~150GB
  • Uncompressed size: ~600GB
  • Total encoders: 70+
  • Total tasks: 15
  • Total proteins: ~500K across all tasks
  • Compression ratio: ~4x (gzip)

Compression Details

  • Large tasks (>30GB): Per-encoder compression for flexibility

    • Users can download only specific encoders
    • Enables parallel decompression
  • Medium/Small tasks (<30GB): Per-task compression

    • Single archive per task
    • Faster download for complete task data

Citation

If you use these embeddings, please cite:

@article{protcompass2026,
  title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}

Related Resources

License

MIT License

Contact

For questions or issues, please open an issue on the GitHub repository.
