---
license: mit
---
# ProtCompass Embeddings
Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.
## Dataset Structure
```
embeddings/                          # Compressed embeddings (~150GB)
├── contact_prediction/              # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/             # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/                        # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/                   # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz           # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz                     # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz                 # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz                # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz                     # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz              # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz         # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz  # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz          # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz           # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz              # Per-task compressed (169MB → ~50MB)
probing_IF/                          # Probing results (2.8GB)
├── probing_embeddings/              # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/  # Full probing results (195 files)
results/                             # Evaluation results (6.9MB)
└── {encoder}/{task}/                # Per-encoder, per-task results
outputs/                             # Analysis outputs (12MB)
├── alignment_analysis/              # Alignment analysis figures
├── paper_figures_v12/               # Final paper figures
└── uncertainty_appendix/            # Uncertainty analysis
```
## Decompression Instructions
All embeddings are shipped as gzip-compressed tar archives (`.tar.gz`). Decompress before use:
### Large Tasks (per-encoder compression)
For `contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`:
```bash
# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done
# Or decompress specific encoder
tar -xzf esm2.tar.gz
```
### Medium/Small Tasks (per-task compression)
For all other tasks:
```bash
# Decompress entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf go_bp.tar.gz
# etc.
# Or decompress all tasks at once
for f in *.tar.gz; do tar -xzf "$f"; done
```
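The same extraction can be scripted in Python with the standard-library `tarfile` module. A minimal sketch, demonstrated on a synthetic archive rather than the real downloads (the `extract_all` helper and the demo paths are illustrative, not part of the dataset):

```python
import os
import tarfile
import tempfile

def extract_all(directory):
    """Extract every .tar.gz archive found in `directory`, in place."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".tar.gz"):
            with tarfile.open(os.path.join(directory, name), "r:gz") as tar:
                tar.extractall(path=directory)

# Demo: build a synthetic archive mimicking one encoder directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "esm2")
os.makedirs(src)
with open(os.path.join(src, "meta.json"), "w") as f:
    f.write("{}")
with tarfile.open(os.path.join(workdir, "esm2.tar.gz"), "w:gz") as tar:
    tar.add(src, arcname="esm2")
os.rename(src, src + "_orig")  # keep only the archive before extracting

extract_all(workdir)
print(os.path.exists(os.path.join(workdir, "esm2", "meta.json")))  # True
```

The same loop applies unchanged to the per-encoder archives inside a large-task directory.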
## File Format
After decompression, each encoder directory contains:
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
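A quick consistency check after decompression can catch truncated downloads early. A minimal sketch, with synthetic stand-ins for the arrays that `np.load` and `train_ids.txt` would provide (the `check_split` helper and the example dimensions are illustrative):

```python
import numpy as np

def check_split(emb, labels, ids):
    """Verify that embeddings, labels, and protein IDs of one split line up."""
    if emb.ndim != 2:
        raise ValueError(f"expected an N x D matrix, got shape {emb.shape}")
    if not (emb.shape[0] == len(labels) == len(ids)):
        raise ValueError("embeddings, labels, and IDs must have the same row count")
    return emb.shape

# Synthetic stand-ins for one split's files
emb = np.zeros((100, 1280), dtype=np.float32)  # 1280-dim embeddings, for example
labels = np.zeros(100)
ids = [f"P{i:05d}" for i in range(100)]
print(check_split(emb, labels, ids))  # (100, 1280)
```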
## Usage
```python
import numpy as np
from huggingface_hub import hf_hub_download
import tarfile

# Download the compressed embeddings for one task
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset",
)

# Extract
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path="./embeddings/")

# Load embeddings and labels for one encoder
train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")

# Fit a simple probe for the downstream task
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test R²: {score:.3f}")
```
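For classification tasks (e.g. `subcellular_localization`), the same recipe applies with a classifier probe in place of `Ridge`. A sketch using synthetic stand-ins for the arrays that `np.load` would return (shapes and class count are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for arrays loaded with np.load as above
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(200, 64))
train_labels = rng.integers(0, 10, size=200)
test_emb = rng.normal(size=(50, 64))
test_labels = rng.integers(0, 10, size=50)

clf = LogisticRegression(max_iter=1000)
clf.fit(train_emb, train_labels)
acc = clf.score(test_emb, test_labels)
print(f"Test accuracy: {acc:.3f}")
```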
## Encoders Included
### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh
### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more
### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision
### Baselines (5)
Random, Length, Torsion, One-hot, BLOSUM
## Dataset Statistics
- **Compressed size**: ~150GB
- **Uncompressed size**: ~600GB
- **Total encoders**: 70+
- **Total tasks**: 15
- **Total proteins**: ~500K across all tasks
- **Compression ratio**: ~4x (gzip)
## Compression Details
- **Large tasks** (>30GB): Per-encoder compression for flexibility
- Users can download only specific encoders
- Enables parallel decompression
- **Medium/Small tasks** (<30GB): Per-task compression
- Single archive per task
- Faster download for complete task data
## Citation
If you use these embeddings, please cite:
```bibtex
@inproceedings{protcompass2026,
  title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
  author={Your Name et al.},
  booktitle={NeurIPS},
  year={2026}
}
```
## Related Resources
- **Code Repository**: [GitHub](https://github.com/yourusername/protcompass)
- **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv)
- **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx)
## License
MIT License
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass).