ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.

Dataset Structure

embeddings/                          # Compressed embeddings (~150GB)
├── contact_prediction/              # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/             # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/                        # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/                   # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz           # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz                     # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz                 # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz                # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz                     # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz              # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz         # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz  # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz          # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz           # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz              # Per-task compressed (169MB → ~50MB)

probing_IF/                          # Probing results (2.8GB)
├── probing_embeddings/              # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/  # Full probing results (195 files)

results/                             # Evaluation results (6.9MB)
└── {encoder}/{task}/                # Per-encoder, per-task results

outputs/                             # Analysis outputs (12MB)
├── alignment_analysis/              # Alignment analysis figures
├── paper_figures_v12/               # Final paper figures
└── uncertainty_appendix/            # Uncertainty analysis

Decompression Instructions

All embeddings are stored as gzip-compressed tar archives (.tar.gz). Extract them before use.

Large Tasks (per-encoder compression)

For contact_prediction, secondary_structure, ppi_site, metal_binding:

# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done

# Or decompress a specific encoder
tar -xzf esm2.tar.gz
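
The same per-encoder extraction can be scripted in Python when only a few encoders are needed. The following is a minimal sketch using only the standard library; the helper name `extract_encoders` is hypothetical, and it assumes the archives sit in the task directory as `<encoder>.tar.gz`, as shown in the tree above.

```python
import tarfile
from pathlib import Path


def extract_encoders(task_dir, encoders=None):
    """Extract per-encoder archives found directly inside task_dir.

    encoders: iterable of encoder names (the archive name without
    ".tar.gz", e.g. "esm2"). None extracts every archive in the directory.
    Returns the list of archive paths that were extracted.
    """
    task_dir = Path(task_dir)
    wanted = None if encoders is None else set(encoders)
    extracted = []
    for archive in sorted(task_dir.glob("*.tar.gz")):
        name = archive.name[: -len(".tar.gz")]
        if wanted is not None and name not in wanted:
            continue  # skip encoders that were not requested
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=task_dir)
        extracted.append(archive)
    return extracted
```

For example, `extract_encoders("embeddings/contact_prediction", ["esm2", "gearnet"])` unpacks only those two archives and leaves the rest compressed, which keeps disk usage down for the largest task.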

Medium/Small Tasks (per-task compression)

For all other tasks, each stored as a single archive directly under embeddings/:

# Decompress entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf go_bp.tar.gz
# etc.

# Or decompress all per-task archives at once
for f in *.tar.gz; do tar -xzf "$f"; done

File Format

After decompression, each encoder directory contains:

train_embeddings.npy: Training set embeddings (N × D)
test_embeddings.npy: Test set embeddings (M × D)
train_labels.npy: Training labels
test_labels.npy: Test labels
train_ids.txt: Protein IDs for training set
test_ids.txt: Protein IDs for test set
meta.json: Metadata (encoder name, dimensions, dataset info)
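
A small loader can read these files together and check that they line up. The sketch below is illustrative only: the function name `load_encoder_dir` is hypothetical, and it assumes that the ID files hold one protein ID per line and that embeddings, labels, and IDs share the same first dimension. Residue-level tasks may be stored differently and are not covered here.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir):
    """Load both splits and the metadata from one extracted encoder directory."""
    encoder_dir = Path(encoder_dir)
    data = {}
    for split in ("train", "test"):
        emb = np.load(encoder_dir / f"{split}_embeddings.npy")
        labels = np.load(encoder_dir / f"{split}_labels.npy")
        # Assumption: one protein ID per line; blank lines are ignored.
        text = (encoder_dir / f"{split}_ids.txt").read_text()
        ids = [line.strip() for line in text.splitlines() if line.strip()]
        if not (len(emb) == len(labels) == len(ids)):
            raise ValueError(
                f"{split}: {len(emb)} embeddings, {len(labels)} labels, "
                f"{len(ids)} ids do not match"
            )
        data[split] = {"embeddings": emb, "labels": labels, "ids": ids}
    data["meta"] = json.loads((encoder_dir / "meta.json").read_text())
    return data
```

Calling `load_encoder_dir("embeddings/mutation_effect/esm2")` after extraction returns a dictionary with `train`, `test`, and `meta` entries. The length check catches partially extracted or mismatched directories early.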

Usage

import tarfile

import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

# Download one per-task archive from the dataset repo
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset",
)

# Extract into ./embeddings/
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path="./embeddings/")

# Load the ESM2 embeddings and labels for this task
enc_dir = "embeddings/mutation_effect/esm2"
train_emb = np.load(f"{enc_dir}/train_embeddings.npy")
test_emb = np.load(f"{enc_dir}/test_embeddings.npy")
train_labels = np.load(f"{enc_dir}/train_labels.npy")
test_labels = np.load(f"{enc_dir}/test_labels.npy")

# Fit a linear probe and report the test R²
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test R²: {score:.3f}")
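
To compare several encoders on one regression task, the Ridge probe above can be run over every extracted encoder directory. The sketch below is one possible way to do that, not the benchmark's evaluation protocol. To stay dependency-light it swaps in a closed-form ridge solver written in NumPy in place of scikit-learn's `Ridge`. The names `ridge_r2` and `compare_encoders` are hypothetical, and labels are assumed to be a 1-D array of real values.

```python
from pathlib import Path

import numpy as np


def ridge_r2(train_x, train_y, test_x, test_y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept; returns test R^2."""
    x_mean, y_mean = train_x.mean(axis=0), train_y.mean(axis=0)
    xc, yc = train_x - x_mean, train_y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc on the centered training data.
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)
    pred = (test_x - x_mean) @ weights + y_mean
    ss_res = np.sum((test_y - pred) ** 2)
    ss_tot = np.sum((test_y - test_y.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def compare_encoders(task_dir, alpha=1.0):
    """Return {encoder_name: test R^2} for each encoder directory in task_dir."""
    scores = {}
    for enc_dir in sorted(Path(task_dir).iterdir()):
        if not (enc_dir.is_dir() and (enc_dir / "train_embeddings.npy").exists()):
            continue  # not an extracted encoder directory
        arrays = [
            np.load(enc_dir / name).astype(np.float64)
            for name in (
                "train_embeddings.npy",
                "train_labels.npy",
                "test_embeddings.npy",
                "test_labels.npy",
            )
        ]
        scores[enc_dir.name] = ridge_r2(*arrays, alpha=alpha)
    return scores
```

`compare_encoders("embeddings/mutation_effect")` then yields one score per encoder, which can be sorted to rank them. Classification tasks would need a different probe and metric.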

Encoders Included

Sequence Encoders (8)

ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh

Structure Encoders (50+)

GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more

Multimodal Encoders (5)

SaProt, ESM-IF, FoldVision, and more

Baselines (5)

Random, Length, Torsion, One-hot, BLOSUM

Dataset Statistics

Compressed size: ~150GB
Uncompressed size: ~600GB
Total encoders: 70+
Total tasks: 15
Total proteins: ~500K across all tasks
Compression ratio: ~4x (gzip)

Compression Details

Large tasks (>30GB uncompressed): Per-encoder compression for flexibility
- Users can download only the encoders they need
- Enables parallel decompression
Medium/Small tasks (<30GB uncompressed): Per-task compression
- Single archive per task
- Faster download for complete task data
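
Because the two layouts use different paths inside the repository, a small helper can map a task (and, for large tasks, an encoder) to the archive to download. This is a sketch under the layout shown in the Dataset Structure tree. `PER_ENCODER_TASKS`, `archive_filename`, and `download_archive` are hypothetical names, and `huggingface_hub` is imported lazily so the path logic works without it installed.

```python
PER_ENCODER_TASKS = {
    "contact_prediction",
    "secondary_structure",
    "ppi_site",
    "metal_binding",
}

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"


def archive_filename(task, encoder=None):
    """Return the in-repo path of the archive for a task (and encoder, if split)."""
    if task in PER_ENCODER_TASKS:
        if encoder is None:
            raise ValueError(f"{task} is split per encoder; pass an encoder name")
        return f"embeddings/{task}/{encoder}.tar.gz"
    # Per-task archives hold every encoder, so the encoder argument is ignored.
    return f"embeddings/{task}.tar.gz"


def download_archive(task, encoder=None, repo_id=REPO_ID):
    """Download one archive from the dataset repo and return its local path."""
    from huggingface_hub import hf_hub_download  # lazy import

    return hf_hub_download(
        repo_id=repo_id,
        filename=archive_filename(task, encoder),
        repo_type="dataset",
    )
```

For instance, `download_archive("contact_prediction", "esm2")` fetches a single encoder archive instead of the whole task, while `download_archive("stability")` fetches the one per-task archive.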

Citation

If you use these embeddings, please cite:

@article{protcompass2026,
  title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}

Related Resources

Code Repository: GitHub
Raw Datasets: ProtEnv on HuggingFace
Paper: arXiv

License

MIT License

Contact

For questions or issues, please open an issue on the GitHub repository.
