Anonymous Researcher committed on
Commit
a4845c4
·
1 Parent(s): b68e2a2

update readme

Files changed (1)
  1. README.md +176 -105
README.md CHANGED
@@ -1,105 +1,176 @@
- ---
- license: mit
- ---
-
- # ProtCompass Embeddings
-
- Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
-
- ## Dataset Structure
-
- ```
- embeddings/
- ├── secondary_structure/ # CB513 dataset (29 GB)
- ├── mutation_effect/ # ProteinGym DMS assays (4.5 GB)
- ├── contact_prediction/ # ProteinNet (2.9 GB)
- ├── stability/ # TAPE stability (1.6 GB)
- ├── ppi_site/ # PPI site prediction (1.4 GB)
- ├── fluorescence/ # GFP fluorescence (841 MB)
- ├── metal_binding/ # Metal binding sites (570 MB)
- ├── go_bp/ # GO Biological Process (214 MB)
- ├── go_mf/ # GO Molecular Function (68 MB)
- ├── remote_homology/ # SCOPe fold classification (20 MB)
- ├── ec_classification/ # Enzyme classification (18 MB)
- ├── membrane_soluble/ # Membrane/soluble (17 MB)
- └── subcellular_localization/ # Subcellular location (17 MB)
- ```
-
- ## File Format
-
- Each encoder directory contains:
- - `train_embeddings.npy`: Training set embeddings (N × D)
- - `test_embeddings.npy`: Test set embeddings (M × D)
- - `train_labels.npy`: Training labels
- - `test_labels.npy`: Test labels
- - `train_ids.txt`: Protein IDs for training set
- - `test_ids.txt`: Protein IDs for test set
- - `meta.json`: Metadata (encoder name, dimensions, dataset info)
-
- ## Usage
-
- ```python
- import numpy as np
- from huggingface_hub import hf_hub_download
-
- # Download specific encoder embeddings
- train_emb = np.load(hf_hub_download(
-     repo_id="Anonymoususer2223/ProtCompass_Embeddings",
-     filename="embeddings/mutation_effect/esm2/train_embeddings.npy",
-     repo_type="dataset"
- ))
-
- test_emb = np.load(hf_hub_download(
-     repo_id="Anonymoususer2223/ProtCompass_Embeddings",
-     filename="embeddings/mutation_effect/esm2/test_embeddings.npy",
-     repo_type="dataset"
- ))
-
- # Use for downstream tasks
- from sklearn.linear_model import Ridge
- model = Ridge()
- model.fit(train_emb, train_labels)
- score = model.score(test_emb, test_labels)
- ```
-
- ## Encoders Included
-
- ### Sequence Encoders (8)
- ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
-
- ### Structure Encoders (50+)
- GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF
-
- ### Multimodal Encoders (5)
- SaProt, ESM-IF, FoldVision
-
- ### Baselines
- Random, Length, Torsion, One-hot, BLOSUM
-
- ## Dataset Statistics
-
- - **Total size**: 41 GB
- - **Total encoders**: 70+
- - **Total tasks**: 13
- - **Total proteins**: ~500K across all tasks
-
- ## Citation
-
- If you use these embeddings, please cite:
-
- ```bibtex
- @article{protcompass2026,
-   title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
-   author={Your Name et al.},
-   journal={NeurIPS},
-   year={2026}
- }
- ```
-
- ## License
-
- MIT License
-
- ## Contact
-
- For questions or issues, please open an issue on the repository.
+ ---
+ license: mit
+ ---
+
+ # ProtCompass Embeddings
+
+ Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.
+
+ ## Dataset Structure
+
+ ```
+ embeddings/ # Compressed embeddings (~150GB)
+ ├── contact_prediction/ # Per-encoder compressed (300GB → ~60GB)
+ │   ├── esm2.tar.gz
+ │   ├── gearnet.tar.gz
+ │   └── ... (36 encoders)
+ ├── secondary_structure/ # Per-encoder compressed (129GB → ~30GB)
+ ├── ppi_site/ # Per-encoder compressed (80GB → ~20GB)
+ ├── metal_binding/ # Per-encoder compressed (41GB → ~10GB)
+ ├── mutation_effect.tar.gz # Per-task compressed (27GB → ~7GB)
+ ├── go_bp.tar.gz # Per-task compressed (7.9GB → ~2GB)
+ ├── stability.tar.gz # Per-task compressed (4.1GB → ~1GB)
+ ├── solubility.tar.gz # Per-task compressed (3.6GB → ~900MB)
+ ├── go_mf.tar.gz # Per-task compressed (3.1GB → ~800MB)
+ ├── fluorescence.tar.gz # Per-task compressed (3.0GB → ~800MB)
+ ├── ec_classification.tar.gz # Per-task compressed (1.9GB → ~500MB)
+ ├── subcellular_localization.tar.gz # Per-task compressed (1.4GB → ~400MB)
+ ├── membrane_soluble.tar.gz # Per-task compressed (1.4GB → ~400MB)
+ ├── remote_homology.tar.gz # Per-task compressed (805MB → ~200MB)
+ └── ppi_affinity.tar.gz # Per-task compressed (169MB → ~50MB)
+
+ probing_IF/ # Probing results (2.8GB)
+ ├── probing_embeddings/ # Invariant family embeddings (12 encoders)
+ └── probing_results_architecture_full/ # Full probing results (195 files)
+
+ results/ # Evaluation results (6.9MB)
+ └── {encoder}/{task}/ # Per-encoder, per-task results
+
+ outputs/ # Analysis outputs (12MB)
+ ├── alignment_analysis/ # Alignment analysis figures
+ ├── paper_figures_v12/ # Final paper figures
+ └── uncertainty_appendix/ # Uncertainty analysis
+ ```
+
+ ## Decompression Instructions
+
+ All embeddings are compressed with gzip. Decompress before use:
+
+ ### Large Tasks (per-encoder compression)
+ For `contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`:
+
+ ```bash
+ # Decompress all encoders in a task
+ cd embeddings/contact_prediction/
+ for f in *.tar.gz; do tar -xzf "$f"; done
+
+ # Or decompress a specific encoder
+ tar -xzf esm2.tar.gz
+ ```
+
+ ### Medium/Small Tasks (per-task compression)
+ For all other tasks:
+
+ ```bash
+ # Decompress an entire task
+ cd embeddings/
+ tar -xzf mutation_effect.tar.gz
+ tar -xzf stability.tar.gz
+ # etc.
+
+ # Or decompress all tasks at once
+ for f in *.tar.gz; do tar -xzf "$f"; done
+ ```
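The per-encoder bash loop above can also be mirrored in cross-platform Python with the standard `tarfile` module. This is a sketch: `extract_all` is a hypothetical helper (not part of the dataset tooling), and the demo builds a tiny synthetic archive instead of downloading a real encoder archive.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_all(task_dir):
    """Extract every .tar.gz archive found directly inside task_dir."""
    extracted = []
    for archive in sorted(Path(task_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=task_dir)
        extracted.append(archive.name)
    return extracted

# Demo on a synthetic archive (a stand-in for e.g. esm2.tar.gz)
work = Path(tempfile.mkdtemp())
member = Path(tempfile.mkdtemp()) / "esm2" / "meta.json"
member.parent.mkdir()
member.write_text("{}")
with tarfile.open(work / "esm2.tar.gz", "w:gz") as tar:
    tar.add(member.parent, arcname="esm2")

names = extract_all(work)
print(names)  # ['esm2.tar.gz']
```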
+
+ ## File Format
+
+ After decompression, each encoder directory contains:
+ - `train_embeddings.npy`: Training set embeddings (N × D)
+ - `test_embeddings.npy`: Test set embeddings (M × D)
+ - `train_labels.npy`: Training labels
+ - `test_labels.npy`: Test labels
+ - `train_ids.txt`: Protein IDs for training set
+ - `test_ids.txt`: Protein IDs for test set
+ - `meta.json`: Metadata (encoder name, dimensions, dataset info)
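A quick sanity check on a decompressed encoder directory is to confirm the embedding and label row counts line up. The snippet below builds a synthetic directory in the layout listed above; the shapes and the `dim`/`encoder` keys in `meta.json` are illustrative assumptions, since the exact metadata schema is not specified here.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Synthetic stand-in for one decompressed encoder directory
# (illustrative shapes: N=6 train, M=4 test, D=8 dims).
enc_dir = Path(tempfile.mkdtemp()) / "esm2"
enc_dir.mkdir()
rng = np.random.default_rng(0)
np.save(enc_dir / "train_embeddings.npy", rng.normal(size=(6, 8)))
np.save(enc_dir / "test_embeddings.npy", rng.normal(size=(4, 8)))
np.save(enc_dir / "train_labels.npy", rng.normal(size=6))
np.save(enc_dir / "test_labels.npy", rng.normal(size=4))
(enc_dir / "meta.json").write_text(json.dumps({"encoder": "esm2", "dim": 8}))

# Basic consistency checks before fitting anything
train_emb = np.load(enc_dir / "train_embeddings.npy")
train_labels = np.load(enc_dir / "train_labels.npy")
meta = json.loads((enc_dir / "meta.json").read_text())
assert train_emb.shape[0] == train_labels.shape[0]  # N rows align
assert train_emb.shape[1] == meta["dim"]            # D matches metadata
```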
+
+ ## Usage
+
+ ```python
+ import numpy as np
+ from huggingface_hub import hf_hub_download
+ import tarfile
+
+ # Download and decompress embeddings
+ tar_path = hf_hub_download(
+     repo_id="Anonymoususer2223/ProtCompass_Embeddings",
+     filename="embeddings/mutation_effect.tar.gz",
+     repo_type="dataset"
+ )
+
+ # Extract
+ with tarfile.open(tar_path, 'r:gz') as tar:
+     tar.extractall(path="./embeddings/")
+
+ # Load embeddings
+ train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
+ test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
+ train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
+ test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")
+
+ # Use for downstream tasks
+ from sklearn.linear_model import Ridge
+ model = Ridge()
+ model.fit(train_emb, train_labels)
+ score = model.score(test_emb, test_labels)
+ print(f"Test R²: {score:.3f}")
+ ```
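The Ridge probe above suits regression tasks like `mutation_effect`; for classification tasks such as `remote_homology` or `ec_classification`, the same pattern works with a linear classifier. A sketch on synthetic, well-separated embeddings (real use would substitute the loaded `.npy` arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for classification-task embeddings: two
# well-separated Gaussian clusters in a 16-dim embedding space.
rng = np.random.default_rng(0)
train_emb = np.vstack([rng.normal(-2, 0.5, (50, 16)), rng.normal(2, 0.5, (50, 16))])
train_labels = np.array([0] * 50 + [1] * 50)
test_emb = np.vstack([rng.normal(-2, 0.5, (10, 16)), rng.normal(2, 0.5, (10, 16))])
test_labels = np.array([0] * 10 + [1] * 10)

# Linear probe: fit on train embeddings, score accuracy on test
clf = LogisticRegression(max_iter=1000)
clf.fit(train_emb, train_labels)
acc = clf.score(test_emb, test_labels)  # near 1.0 on separable clusters
```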
+
+ ## Encoders Included
+
+ ### Sequence Encoders (8)
+ ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh
+
+ ### Structure Encoders (50+)
+ GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more
+
+ ### Multimodal Encoders (5)
+ SaProt, ESM-IF, FoldVision, and more
+
+ ### Baselines (5)
+ Random, Length, Torsion, One-hot, BLOSUM
+
+ ## Dataset Statistics
+
+ - **Compressed size**: ~150GB
+ - **Uncompressed size**: ~600GB
+ - **Total encoders**: 70+
+ - **Total tasks**: 15
+ - **Total proteins**: ~500K across all tasks
+ - **Compression ratio**: ~4x (gzip)
141
+ ## Compression Details
142
+
143
+ - **Large tasks** (>30GB): Per-encoder compression for flexibility
144
+ - Users can download only specific encoders
145
+ - Enables parallel decompression
146
+
147
+ - **Medium/Small tasks** (<30GB): Per-task compression
148
+ - Single archive per task
149
+ - Faster download for complete task data
150
+
+
+ ## Citation
+
+ If you use these embeddings, please cite:
+
+ ```bibtex
+ @inproceedings{protcompass2026,
+   title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
+   author={Your Name et al.},
+   booktitle={NeurIPS},
+   year={2026}
+ }
+ ```
+
+ ## Related Resources
+
+ - **Code Repository**: [GitHub](https://github.com/yourusername/protcompass)
+ - **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv)
+ - **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx)
+
+ ## License
+
+ MIT License
+
+ ## Contact
+
+ For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass).