---
license: cc-by-4.0
task_categories:
- feature-extraction
language:
- en
tags:
- protein
- biology
- embeddings
- esmc
- human-proteome
- transformer
- protein-language-model
- evolutionary-scale
- competition
size_categories:
- 100K<n<1M
pretty_name: Human Proteome ESMC Embeddings
---

# Human Proteome ESMC Embeddings

<div align="center">

**Complete layer-wise protein embeddings for 236,252 human proteins using ESMC models**

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![ESMC Model](https://img.shields.io/badge/Model-ESMC%20by%20EvolutionaryScale-green)](https://github.com/evolutionaryscale/esm)
[![BioLM.ai](https://img.shields.io/badge/BioLM.ai-Dataset-orange)](https://biolm.ai)

</div>

## 📊 Dataset Summary

This dataset provides **pre-computed protein sequence embeddings** for the complete human proteome (Homo sapiens GRCh38, Ensembl), generated with EvolutionaryScale's ESMC protein language models. The embeddings capture evolutionary and structural information useful for protein function prediction, similarity search, and transfer learning, and they are **ready to use without running expensive inference**.

**Created by [BioLM.ai](https://biolm.ai)** to support computational biology research and ML competitions.

**Key Features:**
- 🧬 **236,252 human proteins** from the Ensembl GRCh38 reference genome
- 🤖 **Two model sizes:** ESMC 300M (30 layers, 960 dims) and ESMC 600M (36 layers, 1152 dims)
- 📐 **Layer-wise embeddings:** Mean-pooled representations from all transformer layers
- ✨ **High quality:** Invalid sequences filtered, data integrity verified
- 🚀 **Ready to use:** No inference needed; load directly for downstream tasks
- 📦 **Efficient format:** Sharded parquet files with snappy compression (~26 GB total)
- ⚡ **Optimized loading:** Files sharded to ~3.5 GB each for fast streaming and parallel loading

## 🎯 Use Cases

- **Protein function prediction:** Train classifiers for GO terms, localization, interactions
- **Similarity search:** Find proteins with similar structure/function
- **Transfer learning:** Use as pre-computed features for any protein task
- **Competition features:** Drop-in features for computational biology competitions
- **Visualization:** Explore protein space with dimensionality reduction (see the sketch after this list)
- **Benchmark datasets:** Evaluate protein representation methods
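
As a quick illustration of the visualization use case, here is a minimal sketch (assuming the 600M shard files have been downloaded locally, as shown in the Quick Start below) that projects last-layer embeddings to 2-D with PCA:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Load the last layer from every 600M shard (local files assumed)
df = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)
df = df[df['layer_idx'] == 35]

X = np.array(df['mean_embedding'].tolist())    # (236252, 1152)
coords = PCA(n_components=2).fit_transform(X)  # 2-D coordinates for a scatter plot
```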

## 🗂️ Dataset Structure

### Files

**ESMC 300M Embeddings** (3 shards, 3.43 GB each):
- `esmc_300m_embeddings-train-0000-of-0003.parquet`
- `esmc_300m_embeddings-train-0001-of-0003.parquet`
- `esmc_300m_embeddings-train-0002-of-0003.parquet`

**ESMC 600M Embeddings** (4 shards, 3.71 GB each):
- `esmc_600m_embeddings-train-0000-of-0004.parquet`
- `esmc_600m_embeddings-train-0001-of-0004.parquet`
- `esmc_600m_embeddings-train-0002-of-0004.parquet`
- `esmc_600m_embeddings-train-0003-of-0004.parquet`

**Supporting Files**:
- `sequences.parquet` (32 MB) - Source protein sequences & metadata
- `skipped_sequences.txt` (2.7 MB) - Log of filtered sequences

| Dataset | Shards | Size per Shard | Total Size | Total Rows |
|---------|--------|----------------|------------|------------|
| ESMC 300M | 3 | ~3.43 GB | ~10.3 GB | 7,087,560 |
| ESMC 600M | 4 | ~3.71 GB | ~14.8 GB | 8,505,072 |
| Sequences | 1 | 32 MB | 32 MB | 236,252 |
| **Total** | **8** | - | **~25.7 GB** | - |

### Why Sharded?

Files are split into ~3.5 GB shards for better performance:
- ✅ **Faster downloads:** Shards can be fetched in parallel
- ✅ **Memory efficient:** Stream one shard at a time (see the streaming sketch after this list)
- ✅ **HuggingFace optimized:** Automatic shard handling with the `datasets` library
- ✅ **Resumable transfers:** A failed download can resume at the individual shard
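
As a concrete example of the memory-efficient point, here is a hedged sketch of shard streaming with the `datasets` library; rows are fetched lazily rather than materialized in RAM:

```python
from itertools import islice
from datasets import load_dataset

# streaming=True decodes shards lazily instead of downloading everything up front
ds = load_dataset(
    'biolm/human-proteome-esmc-embeddings',
    data_files='esmc_600m_embeddings-train-*.parquet',
    split='train',
    streaming=True,
)

# Peek at the first few rows without loading any full shard
for row in islice(ds, 3):
    print(row['sequence_id'], row['layer_idx'], len(row['mean_embedding']))
```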

### Schema

**Embeddings files** (long format: one row per sequence-layer pair):
```python
{
    'sequence_id': str,             # e.g., "ENSP00000269305.4" (TP53)
    'layer_idx': int,               # 0-29 (300M) or 0-35 (600M)
    'mean_embedding': List[float],  # 960-dim (300M) or 1152-dim (600M)
    'sequence_length': int          # Length in amino acids
}
```

**Sequences file:**
```python
{
    'sequence_id': str,      # Ensembl protein ID
    'sequence': str,         # Amino acid sequence (20 standard AAs)
    'sequence_length': int,  # Length in amino acids
    'description': str       # Full FASTA header with gene metadata
}
```

## 🚀 Quick Start

### Option 1: HuggingFace Datasets Library (Recommended)

The `datasets` library handles the sharded files automatically:

```python
from datasets import load_dataset

# Load the 600M embeddings (all shards are picked up by the glob pattern)
ds = load_dataset(
    'biolm/human-proteome-esmc-embeddings',
    data_files='esmc_600m_embeddings-train-*.parquet',
)

# Access as a pandas DataFrame
df = ds['train'].to_pandas()

# Filter to the last layer only
last_layer = df[df['layer_idx'] == 35]
print(f"Loaded {len(last_layer):,} proteins × 1152 dims")
```

### Option 2: PyArrow (Memory Efficient)

Load specific shards, or filter while reading:

```python
import pyarrow.parquet as pq
import pandas as pd
from glob import glob

# Load only the last layer from all 600M shards
dfs = []
for shard_file in glob('esmc_600m_embeddings-train-*.parquet'):
    table = pq.read_table(
        shard_file,
        filters=[('layer_idx', '==', 35)]  # last layer only
    )
    dfs.append(table.to_pandas())

df = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(df):,} protein embeddings")  # 236,252 proteins
```

### Option 3: Polars (Fastest)

```python
import polars as pl

# Lazily scan all 600M shards with a glob pattern
df = pl.scan_parquet('esmc_600m_embeddings-train-*.parquet')

# Filter and collect efficiently
last_layer = df.filter(pl.col('layer_idx') == 35).collect()
print(f"Shape: {last_layer.shape}")  # (236252, 4)
```
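
The PyArrow, Polars, and pandas examples read the parquet files from the local working directory. One way to fetch them first is `huggingface_hub.snapshot_download`; the `local_dir` value below is an assumption to adjust as needed:

```python
from huggingface_hub import snapshot_download

# Download only the 600M shards plus the sequence metadata
snapshot_download(
    repo_id='biolm/human-proteome-esmc-embeddings',
    repo_type='dataset',
    allow_patterns=['esmc_600m_embeddings-train-*.parquet', 'sequences.parquet'],
    local_dir='.',  # assumption: place the files in the current directory
)
```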

### Load Specific Proteins

```python
import numpy as np
import pandas as pd

# Load all shards, then filter to specific proteins
df = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)

# Get the TP53 tumor suppressor embeddings (all 36 layers)
tp53_data = df[df['sequence_id'] == 'ENSP00000269305.4'].sort_values('layer_idx')
tp53_embeddings = np.array(tp53_data['mean_embedding'].tolist())
print(f"TP53 shape: {tp53_embeddings.shape}")  # (36, 1152)
```

### Train a Classifier (Last Layer Only)

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Load only the last layer from all shards
dfs = []
for i in range(4):  # 4 shards for 600M
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

embeddings_df = pd.concat(dfs, ignore_index=True)

# Extract features
X = np.array(embeddings_df['mean_embedding'].tolist())  # (236252, 1152)
# y = your_labels  # e.g., GO terms, subcellular localization

clf = RandomForestClassifier()
clf.fit(X, y)  # supply your own labels y before fitting
```

### Protein Similarity Search

```python
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

# Load the last layer from all shards
dfs = []
for i in range(4):
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

df = pd.concat(dfs, ignore_index=True)

# Query: find proteins similar to TP53
query_emb = df[df['sequence_id'] == 'ENSP00000269305.4']['mean_embedding'].iloc[0]
all_embs = np.array(df['mean_embedding'].tolist())

similarities = cosine_similarity([query_emb], all_embs)[0]
# Top 10 hits, skipping the highest score (the query matching itself)
top_10_indices = similarities.argsort()[-11:-1][::-1]

print("Top 10 proteins similar to TP53:")
for idx in top_10_indices:
    seq_id = df.iloc[idx]['sequence_id']
    sim = similarities[idx]
    print(f"  {seq_id}: {sim:.4f}")
```

### Join with Sequences

```python
import pandas as pd

# Load the embeddings, then keep the last layer only
embeddings = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)
embeddings = embeddings[embeddings['layer_idx'] == 35]

# Load the sequence metadata
sequences = pd.read_parquet('sequences.parquet')

# Merge on sequence_id
merged = embeddings.merge(sequences, on='sequence_id', how='left')
print(f"Merged shape: {merged.shape}")
print(f"Columns: {merged.columns.tolist()}")
```

## 📈 Dataset Statistics

### Coverage
- **Source:** Homo sapiens GRCh38 peptide sequences from [Ensembl](https://www.ensembl.org/)
- **Total in source:** 245,535 sequences
- **Processed:** 236,252 sequences (96.2%)
- **Filtered:** 9,283 sequences (3.8%, containing ambiguous or invalid amino acids)

### Sequence Characteristics
- **Length range:** 1 - 35,991 amino acids
- **Mean length:** ~460 AA
- **Median length:** ~282 AA
- **Valid amino acids:** 20 standard (ACDEFGHIKLMNPQRSTVWY)

### Model Comparison

| Model | Params | Layers | Embed Dim | Shards | Total Size | Total Rows |
|-------|--------|--------|-----------|--------|------------|------------|
| ESMC 300M | 300M | 30 | 960 | 3 | 10.3 GB | 7,087,560 |
| ESMC 600M | 600M | 36 | 1152 | 4 | 14.8 GB | 8,505,072 |

## 🔬 Generation Details

### Models
- **ESMC 300M:** `EvolutionaryScale/esmc-300m-2024-12` (revision: `a19d363`)
- **ESMC 600M:** `EvolutionaryScale/esmc-600m-2024-12` (revision: `d11cc14`)
- **Library:** ESMC v3.1.3 from [EvolutionaryScale](https://github.com/evolutionaryscale/esm)

### Processing Pipeline
1. ✅ Tokenize sequences with BOS/EOS tokens
2. ✅ Forward pass through all layers (`model.eval()`, `torch.no_grad()`)
3. ✅ Remove BOS/EOS tokens from the outputs
4. ✅ Mean pool across the sequence length dimension
5. ✅ Extract to CPU as float32 (see the pooling sketch after this list)
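
Steps 3-5 can be illustrated with a small torch sketch; the `hidden_states` tensor below is a hypothetical stand-in for a model's per-layer outputs, not the exact ESM SDK API:

```python
import torch

# Hypothetical per-layer outputs for one 100-residue sequence:
# shape (num_layers, seq_len_with_special_tokens, embed_dim)
hidden_states = torch.randn(36, 100 + 2, 1152)  # +2 for BOS/EOS

residues = hidden_states[:, 1:-1, :]             # step 3: drop BOS/EOS positions -> (36, 100, 1152)
mean_embeddings = residues.mean(dim=1)           # step 4: mean pool over sequence length -> (36, 1152)
mean_embeddings = mean_embeddings.cpu().float()  # step 5: to CPU as float32 for storage
```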

### Configuration
- **Batch size:** Adaptive (8 for sequences ≤4096 AA, 1 for longer sequences)
- **Max length:** 50,000 amino acids
- **Random seed:** 42 (reproducible)
- **Hardware:** NVIDIA RTX A6000 (48 GB VRAM)
- **Quality checks:** ✅ No missing values, ✅ Correct layer counts, ✅ No duplicates
- **Sharding:** Split into ~3.5 GB shards for HuggingFace compatibility

## ❓ FAQ

**Q: Which layer should I use?**
A: The **last layer** (29 for 300M, 35 for 600M) typically works best for downstream tasks. Some applications benefit from intermediate layers or from combining multiple layers (a sketch follows).
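
For example, a hedged sketch of concatenating the last two layers per protein, assuming a long-format DataFrame `df` loaded as in the Quick Start:

```python
import numpy as np

# Per-protein embeddings for the last two layers of the 600M model
last = df[df['layer_idx'] == 35].set_index('sequence_id')['mean_embedding']
prev = df[df['layer_idx'] == 34].set_index('sequence_id')['mean_embedding']

# Align on sequence_id and concatenate per protein: (n_proteins, 2304)
ids = last.index.intersection(prev.index)
X = np.hstack([
    np.array(last.loc[ids].tolist()),
    np.array(prev.loc[ids].tolist()),
])
```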

**Q: How do I load all shards at once?**
A: Use glob patterns with pandas/polars:
```python
from glob import glob
import pandas as pd

df = pd.concat([
    pd.read_parquet(f) for f in glob('esmc_600m_embeddings-train-*.parquet')
], ignore_index=True)
```
Or use the HuggingFace `datasets` library, which handles shards automatically.

**Q: Can I load just one shard?**
A: Yes! Each shard is independent and contains a subset of proteins. This is useful for memory-constrained environments or parallel processing.

**Q: 300M vs 600M - which to use?**
A: **600M** is larger and may capture more nuanced patterns; **300M** is faster to work with. We recommend trying both!

**Q: Are embeddings normalized?**
A: No, these are raw mean-pooled embeddings. Apply L2 normalization if needed for cosine similarity, as in the sketch below.
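
A minimal sketch, assuming `X` is the feature matrix built as in the classifier example above:

```python
import numpy as np

norms = np.linalg.norm(X, axis=1, keepdims=True)  # per-row L2 norms
X_unit = X / np.clip(norms, 1e-12, None)          # unit vectors: dot product == cosine similarity
```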

**Q: What sequences were filtered out?**
A: 9,283 sequences (3.8%) containing non-standard amino acids (the counts below overlap where a sequence contains more than one type):
- X (ambiguous): 9,049 sequences
- \* (stop codon): 152 sequences
- U (selenocysteine): 89 sequences

**Q: Can I use this commercially?**
A: **Yes!** Under the CC BY 4.0 license it is free to use with attribution to BioLM.ai.

**Q: How are proteins distributed across shards?**
A: Proteins are split sequentially (by row order) across shards. To get all layers for a protein, you may need to check every shard, though typically all of a protein's layers land in the same shard.

**Q: Which shard contains a specific protein?**
A: Load the `sequences.parquet` file to see all sequence IDs, then search each shard (see the sketch below). Or use the HuggingFace `datasets` library, which handles this automatically.
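
A cheap way to locate a protein without reading any embeddings is to scan only the `sequence_id` column of each shard; a sketch assuming local files:

```python
import pyarrow.parquet as pq

target = 'ENSP00000269305.4'  # TP53, as in the examples above
for i in range(4):
    path = f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet'
    ids = pq.read_table(path, columns=['sequence_id'])['sequence_id'].to_pylist()
    if target in ids:
        print(f'{target} found in shard {i}')
```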

## 📚 Citation

If you use this dataset in your work, please cite:

```bibtex
@dataset{biolm_human_proteome_esmc_2025,
  title={Human Proteome ESMC Embeddings},
  author={BioLM.ai},
  year={2025},
  month={October},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/biolm/human-proteome-esmc-embeddings}
}
```

And the ESMC model:
```bibtex
@article{esmc2024,
  title={Evolutionary Scale Modeling: Protein Language Models},
  author={EvolutionaryScale},
  year={2024},
  url={https://github.com/evolutionaryscale/esm}
}
```

## 📄 License

**CC BY 4.0** - free to use with attribution to BioLM.ai.

- **Source data** (Ensembl): Freely available
- **ESMC models**: Apache 2.0
- **This dataset**: CC BY 4.0

## 🙏 Acknowledgments

- **EvolutionaryScale** for developing and open-sourcing the ESMC models
- **Ensembl** for curating and maintaining the human proteome reference
- **HuggingFace** for hosting and serving this dataset

## 📞 Contact & Support

- **Organization:** [BioLM.ai](https://biolm.ai)
- **Python SDK:** [py-biolm](https://github.com/BioLM/py-biolm) - run inference on ESMC and many other biosequence models via API
- **HuggingFace Discussions:** Use the Community tab for questions and feedback
- **Issues:** Report problems via HuggingFace Discussions

---

**Version:** 1.0.0 | **Last updated:** October 2025 | **Dataset size:** ~26 GB (8 sharded files)