---
license: mit
---
# ProtCompass Embeddings
Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.
## Dataset Structure
```
embeddings/  # Compressed embeddings (~150GB)
├── contact_prediction/  # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/  # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/  # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/  # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz  # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz  # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz  # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz  # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz  # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz  # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz  # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz  # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz  # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz  # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz  # Per-task compressed (169MB → ~50MB)
probing_IF/  # Probing results (2.8GB)
├── probing_embeddings/  # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/  # Full probing results (195 files)
results/  # Evaluation results (6.9MB)
└── {encoder}/{task}/  # Per-encoder, per-task results
outputs/  # Analysis outputs (12MB)
├── alignment_analysis/  # Alignment analysis figures
├── paper_figures_v12/  # Final paper figures
└── uncertainty_appendix/  # Uncertainty analysis
```
## Decompression Instructions
All embeddings are stored as gzip-compressed tar archives (`.tar.gz`). Extract them before use:
### Large Tasks (per-encoder compression)
For `contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`:
```bash
# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done
# Or decompress specific encoder
tar -xzf esm2.tar.gz
```
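Because the large tasks are archived per encoder, a single encoder can be fetched without downloading the whole task. The following is a minimal sketch, assuming the archives live at `embeddings/{task}/{encoder}.tar.gz` as the tree above suggests and that they should be extracted inside the task directory, as in the shell commands. The helper names `encoder_archive_path` and `fetch_encoder` are hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"
# Tasks that this README lists as compressed per encoder.
LARGE_TASKS = {"contact_prediction", "secondary_structure", "ppi_site", "metal_binding"}


def encoder_archive_path(task: str, encoder: str) -> str:
    """Build the in-repo path of a per-encoder archive (assumed layout)."""
    if task not in LARGE_TASKS:
        raise ValueError(f"{task!r} is archived per task, not per encoder")
    return f"embeddings/{task}/{encoder}.tar.gz"


def fetch_encoder(task: str, encoder: str, out_dir: str = "./embeddings") -> Path:
    """Download one encoder archive and extract it under out_dir/task/."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    tar_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=encoder_archive_path(task, encoder),
        repo_type="dataset",
    )
    target = Path(out_dir) / task
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=target)
    return target
```

For example, `fetch_encoder("contact_prediction", "esm2")` would download only `esm2.tar.gz` rather than the full task.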
### Medium/Small Tasks (per-task compression)
For all other tasks:
```bash
# Decompress entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf go_bp.tar.gz
# etc.
# Or decompress all tasks at once
for f in *.tar.gz; do tar -xzf "$f"; done
```
## File Format
After decompression, each encoder directory contains:
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
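
The files above can be loaded together with a small helper. This is a sketch under stated assumptions: the function name `load_encoder_dir` is hypothetical, the ID files are assumed to hold one protein ID per line, and the row-count check assumes one embedding row per protein, which may not hold for per-residue tasks (pass `check=False` in that case). Setting `mmap=True` memory-maps the embedding arrays instead of reading them fully into RAM, which can help with the larger tasks.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir, mmap=False, check=True):
    """Load embeddings, labels, IDs and metadata from one encoder directory."""
    d = Path(encoder_dir)
    mode = "r" if mmap else None  # "r" = read-only memory map
    data = {}
    for split in ("train", "test"):
        emb = np.load(d / f"{split}_embeddings.npy", mmap_mode=mode)
        labels = np.load(d / f"{split}_labels.npy")
        # Assumption: one ID per line; blank lines are ignored.
        lines = (d / f"{split}_ids.txt").read_text().splitlines()
        ids = [line.strip() for line in lines if line.strip()]
        # Assumption: one embedding row per protein ID.
        if check and len(ids) != emb.shape[0]:
            raise ValueError(
                f"{split}: {len(ids)} IDs but {emb.shape[0]} embedding rows"
            )
        data[split] = {"embeddings": emb, "labels": labels, "ids": ids}
    data["meta"] = json.loads((d / "meta.json").read_text())
    return data
```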
## Usage
```python
import tarfile

import numpy as np
from huggingface_hub import hf_hub_download
from sklearn.linear_model import Ridge

# Download the per-task archive from the Hub
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset",
)

# Extract into ./embeddings/
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path="./embeddings/")

# Load embeddings and labels for one encoder
train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")

# Fit a linear probe and report R² on the test set
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test R²: {score:.3f}")
```
## Encoders Included
### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh
### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more
### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision
### Baselines (5)
Random, Length, Torsion, One-hot, BLOSUM
## Dataset Statistics
- **Compressed size**: ~150GB
- **Uncompressed size**: ~600GB
- **Total encoders**: 70+
- **Total tasks**: 15
- **Total proteins**: ~500K across all tasks
- **Compression ratio**: ~4x (gzip)
## Compression Details
- **Large tasks** (>30GB): Per-encoder compression for flexibility
  - Users can download only specific encoders
  - Enables parallel decompression
- **Medium/Small tasks** (<30GB): Per-task compression
  - Single archive per task
  - Faster download for complete task data
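
The parallel decompression mentioned for large tasks could look like the sketch below. It assumes the per-encoder archives have already been downloaded into one task directory and extracts each one in place; `extract_one` and `extract_all` are hypothetical helper names. A thread pool is used for simplicity, and a process pool may be a better fit if extraction turns out to be CPU-bound on your machine.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(archive):
    """Extract a single .tar.gz next to where it is stored."""
    archive = Path(archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)
    return archive.name


def extract_all(task_dir, workers=4):
    """Extract every *.tar.gz in task_dir using a pool of worker threads."""
    archives = sorted(Path(task_dir).glob("*.tar.gz"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the sorted archives.
        return list(pool.map(extract_one, archives))
```

For example, `extract_all("embeddings/contact_prediction", workers=8)` would unpack all encoder archives of that task concurrently.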
## Citation
If you use these embeddings, please cite:
```bibtex
@article{protcompass2026,
title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
author={Your Name et al.},
journal={NeurIPS},
year={2026}
}
```
## Related Resources
- **Code Repository**: [GitHub](https://github.com/yourusername/protcompass)
- **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv)
- **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx)
## License
MIT License
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass).