ProtCompass Embeddings
Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.
Dataset Structure
embeddings/                              # Compressed embeddings (~150GB)
├── contact_prediction/                  # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/                 # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/                            # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/                       # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz               # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz                         # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz                     # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz                    # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz                         # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz                  # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz             # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz      # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz              # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz               # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz                  # Per-task compressed (169MB → ~50MB)
probing_IF/                              # Probing results (2.8GB)
├── probing_embeddings/                  # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/   # Full probing results (195 files)
results/                                 # Evaluation results (6.9MB)
└── {encoder}/{task}/                    # Per-encoder, per-task results
outputs/                                 # Analysis outputs (12MB)
├── alignment_analysis/                  # Alignment analysis figures
├── paper_figures_v12/                   # Final paper figures
└── uncertainty_appendix/                # Uncertainty analysis
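To browse the exact file layout before downloading anything, you can list the repository contents with huggingface_hub. A minimal sketch (the contact_prediction filter is just an example):

from huggingface_hub import HfApi

api = HfApi()
# List every file in the dataset repo (no download happens here)
files = api.list_repo_files(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    repo_type="dataset",
)

# Example: show only the per-encoder archives for contact_prediction
for path in files:
    if path.startswith("embeddings/contact_prediction/"):
        print(path)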
Decompression Instructions
All embeddings are distributed as gzip-compressed tar archives (.tar.gz). Decompress before use:
Large Tasks (per-encoder compression)
For contact_prediction, secondary_structure, ppi_site, metal_binding:
# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done
# Or decompress specific encoder
tar -xzf esm2.tar.gz
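The same per-encoder download can be done from Python with huggingface_hub instead of the shell; a minimal sketch for a single contact_prediction encoder (the archive path follows the tree above, and the extraction target mirrors the shell commands):

import tarfile
from huggingface_hub import hf_hub_download

# Download only one encoder's archive for a large task
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/contact_prediction/esm2.tar.gz",
    repo_type="dataset",
)

# Extract into the task directory, as the shell example does
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path="./embeddings/contact_prediction/")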
Medium/Small Tasks (per-task compression)
For all other tasks:
# Decompress entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf go_bp.tar.gz
# etc.
# Or decompress all tasks at once
for f in *.tar.gz; do tar -xzf "$f"; done
File Format
After decompression, each encoder directory contains:
- train_embeddings.npy: Training set embeddings (N × D)
- test_embeddings.npy: Test set embeddings (M × D)
- train_labels.npy: Training labels
- test_labels.npy: Test labels
- train_ids.txt: Protein IDs for training set
- test_ids.txt: Protein IDs for test set
- meta.json: Metadata (encoder name, dimensions, dataset info)
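A quick way to confirm a decompressed directory is intact is to check that the arrays and ID lists line up; a minimal sketch using the mutation_effect/esm2 paths from the Usage section below (the exact keys inside meta.json are not listed here, so it is simply printed):

import json
import numpy as np

enc_dir = "embeddings/mutation_effect/esm2"  # any decompressed encoder directory

# Load arrays and IDs, then check they have matching lengths
train_emb = np.load(f"{enc_dir}/train_embeddings.npy")
train_labels = np.load(f"{enc_dir}/train_labels.npy")
with open(f"{enc_dir}/train_ids.txt") as fh:
    train_ids = [line.strip() for line in fh]
assert len(train_emb) == len(train_labels) == len(train_ids)

# meta.json carries the encoder name, dimensions, and dataset info
with open(f"{enc_dir}/meta.json") as fh:
    meta = json.load(fh)
print(meta)
print("embedding matrix:", train_emb.shape)  # (N, D)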
Usage
import numpy as np
from huggingface_hub import hf_hub_download
import tarfile
# Download and decompress embeddings
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset"
)
# Extract
with tarfile.open(tar_path, 'r:gz') as tar:
    tar.extractall(path="./embeddings/")
# Load embeddings
train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")
# Use for downstream tasks
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test RΒ²: {score:.3f}")
Encoders Included
Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh
Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more
Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision
Baselines (5)
Random, Length, Torsion, One-hot, BLOSUM
Dataset Statistics
- Compressed size: ~150GB
- Uncompressed size: ~600GB
- Total encoders: 70+
- Total tasks: 15
- Total proteins: ~500K across all tasks
- Compression ratio: ~4x (gzip)
Compression Details
Large tasks (>30GB): Per-encoder compression for flexibility
- Users can download only specific encoders
- Enables parallel decompression (see the sketch at the end of this section)
Medium/Small tasks (<30GB): Per-task compression
- Single archive per task
- Faster download for complete task data
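A minimal Python sketch of parallel decompression for a per-encoder task (the directory, worker count, and process-based pooling are illustrative choices, not part of the dataset tooling):

import tarfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def extract(archive: Path) -> str:
    # Extract one archive into the directory it lives in
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)
    return archive.name

if __name__ == "__main__":
    archives = sorted(Path("embeddings/contact_prediction").glob("*.tar.gz"))
    # Decompress several encoder archives at once
    with ProcessPoolExecutor(max_workers=4) as pool:
        for name in pool.map(extract, archives):
            print("done:", name)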
Citation
If you use these embeddings, please cite:
@article{protcompass2026,
title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
author={Your Name et al.},
journal={NeurIPS},
year={2026}
}
Related Resources
- Code Repository: GitHub
- Raw Datasets: ProtEnv on HuggingFace
- Paper: arXiv
License
MIT License
Contact
For questions or issues, please open an issue on the GitHub repository.