---
license: mit
---

# ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.

## Dataset Structure

```
embeddings/                            # Compressed embeddings (~150GB)
├── contact_prediction/                # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/               # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/                          # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/                     # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz             # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz                       # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz                   # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz                  # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz                       # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz                # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz           # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz    # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz            # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz             # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz                # Per-task compressed (169MB → ~50MB)

probing_IF/                            # Probing results (2.8GB)
├── probing_embeddings/                # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/ # Full probing results (195 files)

results/                               # Evaluation results (6.9MB)
└── {encoder}/{task}/                  # Per-encoder, per-task results

outputs/                               # Analysis outputs (12MB)
├── alignment_analysis/                # Alignment analysis figures
├── paper_figures_v12/                 # Final paper figures
└── uncertainty_appendix/              # Uncertainty analysis
```

## Decompression Instructions

All embeddings are compressed with gzip. Decompress before use.

### Large Tasks (per-encoder compression)

For `contact_prediction`, `secondary_structure`, `ppi_site`, and `metal_binding`:

```bash
# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done

# Or decompress a specific encoder
tar -xzf esm2.tar.gz
```

### Medium/Small Tasks (per-task compression)

For all other tasks:

```bash
# Decompress an entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf stability.tar.gz
# etc.

# Or decompress all tasks at once
for f in *.tar.gz; do tar -xzf "$f"; done
```

## File Format

After decompression, each encoder directory contains:

- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for the training set
- `test_ids.txt`: Protein IDs for the test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)

## Usage

```python
import tarfile

import numpy as np
from huggingface_hub import hf_hub_download

# Download the compressed embeddings for one task
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset",
)

# Extract
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path="./embeddings/")

# Load embeddings and labels
train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")

# Use for downstream tasks
from sklearn.linear_model import Ridge

model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test R²: {score:.3f}")
```
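The same workflow applies to classification tasks; the labels are class indices and a linear probe replaces the regressor. A minimal sketch, assuming the `subcellular_localization` archive has been downloaded and extracted as above and that it contains an `esm2` subdirectory (both names are taken from the structure listed above; check `meta.json` for the exact layout of each task):

```python
import json

import numpy as np
from sklearn.linear_model import LogisticRegression

# Path assumes the subcellular_localization archive was extracted to ./embeddings/
task_dir = "embeddings/subcellular_localization/esm2"

# meta.json records the encoder name, embedding dimensionality, and dataset info
with open(f"{task_dir}/meta.json") as f:
    meta = json.load(f)
print(meta)

train_emb = np.load(f"{task_dir}/train_embeddings.npy")
test_emb = np.load(f"{task_dir}/test_embeddings.npy")
train_labels = np.load(f"{task_dir}/train_labels.npy")
test_labels = np.load(f"{task_dir}/test_labels.npy")

# Linear probe on frozen embeddings
clf = LogisticRegression(max_iter=1000)
clf.fit(train_emb, train_labels)
print(f"Test accuracy: {clf.score(test_emb, test_labels):.3f}")
```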
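For the large per-encoder tasks (`contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`), you can also fetch a single encoder's archive instead of the whole task directory. A sketch, assuming the per-encoder archives follow the `embeddings/<task>/<encoder>.tar.gz` layout shown in the dataset structure above:

```python
import tarfile

from huggingface_hub import hf_hub_download

# Download one encoder's archive from a per-encoder-compressed task
# (path assumed from the dataset structure above)
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/contact_prediction/esm2.tar.gz",
    repo_type="dataset",
)

# Extract alongside the other contact_prediction encoders
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path="./embeddings/contact_prediction/")
```

If you are unsure which archives exist, `huggingface_hub.list_repo_files(repo_id, repo_type="dataset")` lists every file in the repository.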
## Encoders Included

### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh

### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more

### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision

### Baselines (5)
Random, Length, Torsion, One-hot, BLOSUM

## Dataset Statistics

- **Compressed size**: ~150GB
- **Uncompressed size**: ~600GB
- **Total encoders**: 70+
- **Total tasks**: 15
- **Total proteins**: ~500K across all tasks
- **Compression ratio**: ~4x (gzip)

## Compression Details

- **Large tasks** (>30GB): per-encoder compression for flexibility
  - Users can download only specific encoders
  - Enables parallel decompression
- **Medium/Small tasks** (<30GB): per-task compression
  - Single archive per task
  - Faster download for complete task data

## Citation

If you use these embeddings, please cite:

```bibtex
@inproceedings{protcompass2026,
  title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
  author={Your Name et al.},
  booktitle={NeurIPS},
  year={2026}
}
```

## Related Resources

- **Code Repository**: [GitHub](https://github.com/yourusername/protcompass)
- **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv)
- **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx)

## License

MIT License

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass).