---
license: mit
---

# ProtCompass Embeddings

Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.

## Dataset Structure

```
embeddings/                          # Compressed embeddings (~150GB)
├── contact_prediction/              # Per-encoder compressed (300GB → ~60GB)
│   ├── esm2.tar.gz
│   ├── gearnet.tar.gz
│   └── ... (36 encoders)
├── secondary_structure/             # Per-encoder compressed (129GB → ~30GB)
├── ppi_site/                        # Per-encoder compressed (80GB → ~20GB)
├── metal_binding/                   # Per-encoder compressed (41GB → ~10GB)
├── mutation_effect.tar.gz           # Per-task compressed (27GB → ~7GB)
├── go_bp.tar.gz                     # Per-task compressed (7.9GB → ~2GB)
├── stability.tar.gz                 # Per-task compressed (4.1GB → ~1GB)
├── solubility.tar.gz                # Per-task compressed (3.6GB → ~900MB)
├── go_mf.tar.gz                     # Per-task compressed (3.1GB → ~800MB)
├── fluorescence.tar.gz              # Per-task compressed (3.0GB → ~800MB)
├── ec_classification.tar.gz         # Per-task compressed (1.9GB → ~500MB)
├── subcellular_localization.tar.gz  # Per-task compressed (1.4GB → ~400MB)
├── membrane_soluble.tar.gz          # Per-task compressed (1.4GB → ~400MB)
├── remote_homology.tar.gz           # Per-task compressed (805MB → ~200MB)
└── ppi_affinity.tar.gz              # Per-task compressed (169MB → ~50MB)

probing_IF/                          # Probing results (2.8GB)
├── probing_embeddings/              # Invariant family embeddings (12 encoders)
└── probing_results_architecture_full/  # Full probing results (195 files)

results/                             # Evaluation results (6.9MB)
└── {encoder}/{task}/                # Per-encoder, per-task results

outputs/                             # Analysis outputs (12MB)
├── alignment_analysis/              # Alignment analysis figures
├── paper_figures_v12/               # Final paper figures
└── uncertainty_appendix/            # Uncertainty analysis
```

## Decompression Instructions

All embeddings are stored as gzip-compressed tar archives (`.tar.gz`). Extract them before use:

### Large Tasks (per-encoder compression)
For `contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`:

```bash
# Decompress all encoders in a task
cd embeddings/contact_prediction/
for f in *.tar.gz; do tar -xzf "$f"; done

# Or decompress specific encoder
tar -xzf esm2.tar.gz
```

### Medium/Small Tasks (per-task compression)
For all other tasks:

```bash
# Decompress entire task
cd embeddings/
tar -xzf mutation_effect.tar.gz
tar -xzf go_bp.tar.gz
# etc.

# Or decompress all tasks at once
for f in *.tar.gz; do tar -xzf "$f"; done
```
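
The two archive layouts can also be handled from Python. The following is a minimal sketch rather than part of the released tooling: `archive_filename` and `fetch_and_extract` are hypothetical helpers, the set of per-encoder tasks is copied from the list above, and the extraction directories mirror the `cd` targets of the shell commands on the assumption that the archives unpack the same way.

```python
import tarfile
from pathlib import Path

REPO_ID = "Anonymoususer2223/ProtCompass_Embeddings"

# Tasks stored as one archive per encoder (taken from the list above).
PER_ENCODER_TASKS = {
    "contact_prediction", "secondary_structure", "ppi_site", "metal_binding",
}


def archive_filename(task, encoder=None):
    """Return the in-repo path of the archive for a task (hypothetical helper)."""
    if task in PER_ENCODER_TASKS:
        if encoder is None:
            raise ValueError(f"{task} is stored per encoder; pass an encoder name")
        return f"embeddings/{task}/{encoder}.tar.gz"
    return f"embeddings/{task}.tar.gz"


def fetch_and_extract(task, encoder=None, dest="./embeddings"):
    """Download one archive and extract it where the shell commands would."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    tar_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=archive_filename(task, encoder),
        repo_type="dataset",
    )
    # Assumed layout: per-encoder archives unpack inside the task directory,
    # per-task archives unpack inside embeddings/.
    out_dir = Path(dest) / task if task in PER_ENCODER_TASKS else Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=out_dir)
    return out_dir
```

For example, `fetch_and_extract("contact_prediction", encoder="esm2")` would download only the ESM2 archive instead of the whole task.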

## File Format

After decompression, each encoder directory contains:
- `train_embeddings.npy`: Training set embeddings (N × D)
- `test_embeddings.npy`: Test set embeddings (M × D)
- `train_labels.npy`: Training labels
- `test_labels.npy`: Test labels
- `train_ids.txt`: Protein IDs for training set
- `test_ids.txt`: Protein IDs for test set
- `meta.json`: Metadata (encoder name, dimensions, dataset info)
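
A minimal sketch of reading one such directory and checking that the files agree with each other. `load_encoder_dir` is a hypothetical helper, not part of the dataset. It assumes one ID per line in the `*_ids.txt` files and one embedding row per ID, which may not hold for per-residue tasks, and it treats `meta.json` as an opaque dictionary because its exact keys are not documented here.

```python
import json
from pathlib import Path

import numpy as np


def load_encoder_dir(encoder_dir):
    """Load one decompressed encoder directory into a dict (hypothetical helper)."""
    d = Path(encoder_dir)
    data = {"meta": json.loads((d / "meta.json").read_text())}
    for split in ("train", "test"):
        emb = np.load(d / f"{split}_embeddings.npy")
        labels = np.load(d / f"{split}_labels.npy")
        ids = [line.strip() for line in
               (d / f"{split}_ids.txt").read_text().splitlines() if line.strip()]
        # Assumption: one embedding row and one label per protein ID.
        if not (len(ids) == emb.shape[0] == labels.shape[0]):
            raise ValueError(
                f"{split}: {len(ids)} ids, {emb.shape[0]} embeddings, "
                f"{labels.shape[0]} labels"
            )
        data[split] = {"embeddings": emb, "labels": labels, "ids": ids}
    # Train and test embeddings should share the feature dimension D.
    if data["train"]["embeddings"].shape[1:] != data["test"]["embeddings"].shape[1:]:
        raise ValueError("train and test embedding dimensions differ")
    return data
```

Checking the row counts up front tends to surface a partial extraction or a mismatched file earlier than a downstream model error would.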

## Usage

```python
import numpy as np
from huggingface_hub import hf_hub_download
import tarfile

# Download and decompress embeddings
tar_path = hf_hub_download(
    repo_id="Anonymoususer2223/ProtCompass_Embeddings",
    filename="embeddings/mutation_effect.tar.gz",
    repo_type="dataset"
)

# Extract
with tarfile.open(tar_path, 'r:gz') as tar:
    tar.extractall(path="./embeddings/")

# Load embeddings
train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")

# Use for downstream tasks
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(train_emb, train_labels)
score = model.score(test_emb, test_labels)
print(f"Test R²: {score:.3f}")
```

## Encoders Included

### Sequence Encoders (8)
ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh

### Structure Encoders (50+)
GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more

### Multimodal Encoders (5)
SaProt, ESM-IF, FoldVision, and more

### Baselines (5)
Random, Length, Torsion, One-hot, BLOSUM

## Dataset Statistics

- **Compressed size**: ~150GB
- **Uncompressed size**: ~600GB
- **Total encoders**: 70+
- **Total tasks**: 15
- **Total proteins**: ~500K across all tasks
- **Compression ratio**: ~4x (gzip)

## Compression Details

- **Large tasks** (>30GB): Per-encoder compression for flexibility
  - Users can download only specific encoders
  - Enables parallel decompression
  
- **Medium/Small tasks** (<30GB): Per-task compression
  - Single archive per task
  - Faster download for complete task data

## Citation

If you use these embeddings, please cite:

```bibtex
@article{protcompass2026,
  title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
  author={Your Name et al.},
  journal={NeurIPS},
  year={2026}
}
```

## Related Resources

- **Code Repository**: [GitHub](https://github.com/yourusername/protcompass)
- **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv)
- **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx)

## License

MIT License

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass).