Anonymous Researcher committed on
Commit
a4845c4
·
1 Parent(s): b68e2a2

update readme

Files changed (1)
  1. README.md +176 -105
README.md CHANGED
@@ -1,105 +1,176 @@
- ---
- license: mit
- ---
-
- # ProtCompass Embeddings
-
- Pre-computed protein embeddings from 70+ encoders across 13 downstream tasks.
-
- ## Dataset Structure
-
- ```
- embeddings/
- ├── secondary_structure/ # CB513 dataset (29 GB)
- ├── mutation_effect/ # ProteinGym DMS assays (4.5 GB)
- ├── contact_prediction/ # ProteinNet (2.9 GB)
- ├── stability/ # TAPE stability (1.6 GB)
- ├── ppi_site/ # PPI site prediction (1.4 GB)
- ├── fluorescence/ # GFP fluorescence (841 MB)
- ├── metal_binding/ # Metal binding sites (570 MB)
- ├── go_bp/ # GO Biological Process (214 MB)
- ├── go_mf/ # GO Molecular Function (68 MB)
- ├── remote_homology/ # SCOPe fold classification (20 MB)
- ├── ec_classification/ # Enzyme classification (18 MB)
- ├── membrane_soluble/ # Membrane/soluble (17 MB)
- └── subcellular_localization/ # Subcellular location (17 MB)
- ```
-
- ## File Format
-
- Each encoder directory contains:
- - `train_embeddings.npy`: Training set embeddings (N × D)
- - `test_embeddings.npy`: Test set embeddings (M × D)
- - `train_labels.npy`: Training labels
- - `test_labels.npy`: Test labels
- - `train_ids.txt`: Protein IDs for training set
- - `test_ids.txt`: Protein IDs for test set
- - `meta.json`: Metadata (encoder name, dimensions, dataset info)
-
- ## Usage
-
- ```python
- import numpy as np
- from huggingface_hub import hf_hub_download
-
- # Download specific encoder embeddings
- train_emb = np.load(hf_hub_download(
-     repo_id="Anonymoususer2223/ProtCompass_Embeddings",
-     filename="embeddings/mutation_effect/esm2/train_embeddings.npy",
-     repo_type="dataset"
- ))
-
- test_emb = np.load(hf_hub_download(
-     repo_id="Anonymoususer2223/ProtCompass_Embeddings",
-     filename="embeddings/mutation_effect/esm2/test_embeddings.npy",
-     repo_type="dataset"
- ))
-
- # Use for downstream tasks
- from sklearn.linear_model import Ridge
- model = Ridge()
- model.fit(train_emb, train_labels)
- score = model.score(test_emb, test_labels)
- ```
-
- ## Encoders Included
-
- ### Sequence Encoders (8)
- ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProST-T5, ProteinBERT-BFD, Ankh
-
- ### Structure Encoders (50+)
- GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMP, dMaSIF
-
- ### Multimodal Encoders (5)
- SaProt, ESM-IF, FoldVision
-
- ### Baselines
- Random, Length, Torsion, One-hot, BLOSUM
-
- ## Dataset Statistics
-
- - **Total size**: 41 GB
- - **Total encoders**: 70+
- - **Total tasks**: 13
- - **Total proteins**: ~500K across all tasks
-
- ## Citation
-
- If you use these embeddings, please cite:
-
- ```bibtex
- @article{protcompass2026,
-   title={ProtCompass: Systematic Evaluation of Protein Structure Encoders},
-   author={Your Name et al.},
-   journal={NeurIPS},
-   year={2026}
- }
- ```
-
- ## License
-
- MIT License
-
- ## Contact
-
- For questions or issues, please open an issue on the repository.
+ ---
+ license: mit
+ ---
+
+ # ProtCompass Embeddings
+
+ Pre-computed protein embeddings from 70+ encoders across 15 downstream tasks, plus probing results and evaluation outputs.
+
+ ## Dataset Structure
+
+ ```
+ embeddings/ # Compressed embeddings (~150GB)
+ ├── contact_prediction/ # Per-encoder compressed (300GB → ~60GB)
+ │   ├── esm2.tar.gz
+ │   ├── gearnet.tar.gz
+ │   └── ... (36 encoders)
+ ├── secondary_structure/ # Per-encoder compressed (129GB → ~30GB)
+ ├── ppi_site/ # Per-encoder compressed (80GB → ~20GB)
+ ├── metal_binding/ # Per-encoder compressed (41GB → ~10GB)
+ ├── mutation_effect.tar.gz # Per-task compressed (27GB → ~7GB)
+ ├── go_bp.tar.gz # Per-task compressed (7.9GB → ~2GB)
+ ├── stability.tar.gz # Per-task compressed (4.1GB → ~1GB)
+ ├── solubility.tar.gz # Per-task compressed (3.6GB → ~900MB)
+ ├── go_mf.tar.gz # Per-task compressed (3.1GB → ~800MB)
+ ├── fluorescence.tar.gz # Per-task compressed (3.0GB → ~800MB)
+ ├── ec_classification.tar.gz # Per-task compressed (1.9GB → ~500MB)
+ ├── subcellular_localization.tar.gz # Per-task compressed (1.4GB → ~400MB)
+ ├── membrane_soluble.tar.gz # Per-task compressed (1.4GB → ~400MB)
+ ├── remote_homology.tar.gz # Per-task compressed (805MB → ~200MB)
+ └── ppi_affinity.tar.gz # Per-task compressed (169MB → ~50MB)
+
+ probing_IF/ # Probing results (2.8GB)
+ ├── probing_embeddings/ # Invariant family embeddings (12 encoders)
+ └── probing_results_architecture_full/ # Full probing results (195 files)
+
+ results/ # Evaluation results (6.9MB)
+ └── {encoder}/{task}/ # Per-encoder, per-task results
+
+ outputs/ # Analysis outputs (12MB)
+ ├── alignment_analysis/ # Alignment analysis figures
+ ├── paper_figures_v12/ # Final paper figures
+ └── uncertainty_appendix/ # Uncertainty analysis
+ ```
+
+ ## Decompression Instructions
+
+ All embeddings are compressed with gzip. Decompress before use:
+
+ ### Large Tasks (per-encoder compression)
+ For `contact_prediction`, `secondary_structure`, `ppi_site`, `metal_binding`:
+
+ ```bash
+ # Decompress all encoders in a task
+ cd embeddings/contact_prediction/
+ for f in *.tar.gz; do tar -xzf "$f"; done
+
+ # Or decompress a specific encoder
+ tar -xzf esm2.tar.gz
+ ```
+
+ ### Medium/Small Tasks (per-task compression)
+ For all other tasks:
+
+ ```bash
+ # Decompress an entire task
+ cd embeddings/
+ tar -xzf mutation_effect.tar.gz
+ tar -xzf stability.tar.gz
+ # etc.
+
+ # Or decompress all tasks at once
+ for f in *.tar.gz; do tar -xzf "$f"; done
+ ```
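The per-encoder bash loop above can also be mirrored in cross-platform Python with the standard `tarfile` module. This is a sketch: `extract_all` is a hypothetical helper (not part of the dataset tooling), and the demo builds a tiny synthetic archive instead of downloading a real encoder archive.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_all(task_dir):
    """Extract every .tar.gz archive found directly inside task_dir."""
    extracted = []
    for archive in sorted(Path(task_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=task_dir)
        extracted.append(archive.name)
    return extracted

# Demo on a synthetic archive (a stand-in for e.g. esm2.tar.gz)
work = Path(tempfile.mkdtemp())
member = Path(tempfile.mkdtemp()) / "esm2" / "meta.json"
member.parent.mkdir()
member.write_text("{}")
with tarfile.open(work / "esm2.tar.gz", "w:gz") as tar:
    tar.add(member.parent, arcname="esm2")

names = extract_all(work)
print(names)  # ['esm2.tar.gz']
```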
+
+ ## File Format
+
+ After decompression, each encoder directory contains:
+ - `train_embeddings.npy`: Training set embeddings (N × D)
+ - `test_embeddings.npy`: Test set embeddings (M × D)
+ - `train_labels.npy`: Training labels
+ - `test_labels.npy`: Test labels
+ - `train_ids.txt`: Protein IDs for training set
+ - `test_ids.txt`: Protein IDs for test set
+ - `meta.json`: Metadata (encoder name, dimensions, dataset info)
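A quick sanity check on a decompressed encoder directory is to confirm the embedding and label row counts line up. The snippet below builds a synthetic directory in the layout listed above; the shapes and the `dim`/`encoder` keys in `meta.json` are illustrative assumptions, since the exact metadata schema is not specified here.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Synthetic stand-in for one decompressed encoder directory
# (illustrative shapes: N=6 train, M=4 test, D=8 dims).
enc_dir = Path(tempfile.mkdtemp()) / "esm2"
enc_dir.mkdir()
rng = np.random.default_rng(0)
np.save(enc_dir / "train_embeddings.npy", rng.normal(size=(6, 8)))
np.save(enc_dir / "test_embeddings.npy", rng.normal(size=(4, 8)))
np.save(enc_dir / "train_labels.npy", rng.normal(size=6))
np.save(enc_dir / "test_labels.npy", rng.normal(size=4))
(enc_dir / "meta.json").write_text(json.dumps({"encoder": "esm2", "dim": 8}))

# Basic consistency checks before fitting anything
train_emb = np.load(enc_dir / "train_embeddings.npy")
train_labels = np.load(enc_dir / "train_labels.npy")
meta = json.loads((enc_dir / "meta.json").read_text())
assert train_emb.shape[0] == train_labels.shape[0]  # N rows align
assert train_emb.shape[1] == meta["dim"]            # D matches metadata
```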
+
+ ## Usage
+
+ ```python
+ import numpy as np
+ from huggingface_hub import hf_hub_download
+ import tarfile
+
+ # Download and decompress embeddings
+ tar_path = hf_hub_download(
+     repo_id="Anonymoususer2223/ProtCompass_Embeddings",
+     filename="embeddings/mutation_effect.tar.gz",
+     repo_type="dataset"
+ )
+
+ # Extract
+ with tarfile.open(tar_path, 'r:gz') as tar:
+     tar.extractall(path="./embeddings/")
+
+ # Load embeddings
+ train_emb = np.load("embeddings/mutation_effect/esm2/train_embeddings.npy")
+ test_emb = np.load("embeddings/mutation_effect/esm2/test_embeddings.npy")
+ train_labels = np.load("embeddings/mutation_effect/esm2/train_labels.npy")
+ test_labels = np.load("embeddings/mutation_effect/esm2/test_labels.npy")
+
+ # Use for downstream tasks
+ from sklearn.linear_model import Ridge
+ model = Ridge()
+ model.fit(train_emb, train_labels)
+ score = model.score(test_emb, test_labels)
+ print(f"Test R²: {score:.3f}")
+ ```
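The Ridge probe above suits regression tasks like `mutation_effect`; for classification tasks such as `remote_homology` or `ec_classification`, the same pattern works with a linear classifier. A sketch on synthetic, well-separated embeddings (real use would substitute the loaded `.npy` arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for classification-task embeddings: two
# well-separated Gaussian clusters in a 16-dim embedding space.
rng = np.random.default_rng(0)
train_emb = np.vstack([rng.normal(-2, 0.5, (50, 16)), rng.normal(2, 0.5, (50, 16))])
train_labels = np.array([0] * 50 + [1] * 50)
test_emb = np.vstack([rng.normal(-2, 0.5, (10, 16)), rng.normal(2, 0.5, (10, 16))])
test_labels = np.array([0] * 10 + [1] * 10)

# Linear probe: fit on train embeddings, score accuracy on test
clf = LogisticRegression(max_iter=1000)
clf.fit(train_emb, train_labels)
acc = clf.score(test_emb, test_labels)  # near 1.0 on separable clusters
```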
+
+ ## Encoders Included
+
+ ### Sequence Encoders (8)
+ ESM2 (650M, 150M, 35M), ESM1b, ESM3, ProtTrans, ProstT5, ProteinBERT-BFD, Ankh
+
+ ### Structure Encoders (50+)
+ GearNet, GCPNet, EGNN, GVP, IPA, TFN, SchNet, DimeNet, MACE, CDConv, ProteinMPNN, PottsMPNN, dMaSIF, and more
+
+ ### Multimodal Encoders (5)
+ SaProt, ESM-IF, FoldVision, and more
+
+ ### Baselines (5)
+ Random, Length, Torsion, One-hot, BLOSUM
+
+ ## Dataset Statistics
+
+ - **Compressed size**: ~150GB
+ - **Uncompressed size**: ~600GB
+ - **Total encoders**: 70+
+ - **Total tasks**: 15
+ - **Total proteins**: ~500K across all tasks
+ - **Compression ratio**: ~4x (gzip)
141
+ ## Compression Details
142
+
143
+ - **Large tasks** (>30GB): Per-encoder compression for flexibility
144
+ - Users can download only specific encoders
145
+ - Enables parallel decompression
146
+
147
+ - **Medium/Small tasks** (<30GB): Per-task compression
148
+ - Single archive per task
149
+ - Faster download for complete task data
150
+
+
+ ## Citation
+
+ If you use these embeddings, please cite:
+
+ ```bibtex
+ @inproceedings{protcompass2026,
+   title={ProtCompass: Interpretable Benchmarking and Task-Aware Evaluation of Protein Encoders},
+   author={Your Name et al.},
+   booktitle={NeurIPS},
+   year={2026}
+ }
+ ```
+
+ ## Related Resources
+
+ - **Code Repository**: [GitHub](https://github.com/yourusername/protcompass)
+ - **Raw Datasets**: [ProtEnv on HuggingFace](https://huggingface.co/datasets/Anonymoususer2223/ProtEnv)
+ - **Paper**: [arXiv](https://arxiv.org/abs/xxxx.xxxxx)
+
+ ## License
+
+ MIT License
+
+ ## Contact
+
+ For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/protcompass).