---
license: cc-by-4.0
task_categories:
- feature-extraction
language:
- en
tags:
- protein
- biology
- embeddings
- esmc
- human-proteome
- transformer
- protein-language-model
- evolutionary-scale
- competition
size_categories:
- 100K<n<1M
pretty_name: Human Proteome ESMC Embeddings
---

# Human Proteome ESMC Embeddings

<div align="center">

**Complete layer-wise protein embeddings for 236,252 human proteins using ESMC models**

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
[![ESMC Model](https://img.shields.io/badge/Model-ESMC%20by%20EvolutionaryScale-green)](https://github.com/evolutionaryscale/esm)
[![BioLM.ai](https://img.shields.io/badge/BioLM.ai-Dataset-orange)](https://biolm.ai)

</div>

## 📊 Dataset Summary

This dataset provides **pre-computed protein sequence embeddings** for the complete human proteome (Homo sapiens GRCh38, Ensembl), generated with EvolutionaryScale's ESMC protein language models. The embeddings capture evolutionary and structural information useful for protein function prediction, similarity search, and transfer learning, and they are **ready to use without running expensive inference**.

**Created by [BioLM.ai](https://biolm.ai)** to support computational biology research and ML competitions.

**Key Features:**
- 🧬 **236,252 human proteins** from the Ensembl GRCh38 reference genome
- 🤖 **Two model sizes:** ESMC 300M (30 layers, 960 dims) and ESMC 600M (36 layers, 1152 dims)
- 📐 **Layer-wise embeddings:** Mean-pooled representations from all transformer layers
- ✨ **High quality:** Invalid sequences filtered, data integrity verified
- 🚀 **Ready to use:** No inference needed; load directly for downstream tasks
- 📦 **Efficient format:** Sharded parquet files with snappy compression (~26 GB total)
- ⚡ **Optimized loading:** Files sharded to ~3.5 GB each for fast streaming and parallel loading

## 🎯 Use Cases

- **Protein function prediction:** Train classifiers for GO terms, localization, interactions
- **Similarity search:** Find proteins with similar structure/function
- **Transfer learning:** Use as pre-computed features for any protein task
- **Competition features:** Drop-in features for computational biology competitions
- **Visualization:** Explore protein space with dimensionality reduction (see the sketch after this list)
- **Benchmark datasets:** Evaluate protein representation methods
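
As a quick illustration of the visualization use case, here is a minimal sketch (assuming the 600M shard files have been downloaded locally, as shown in the Quick Start below) that projects last-layer embeddings to 2-D with PCA:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Load the last layer from every 600M shard (local files assumed)
df = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)
df = df[df['layer_idx'] == 35]

X = np.array(df['mean_embedding'].tolist())    # (236252, 1152)
coords = PCA(n_components=2).fit_transform(X)  # 2-D coordinates for a scatter plot
```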

## 🗂️ Dataset Structure

### Files

**ESMC 300M Embeddings** (3 shards, 3.43 GB each):
- `esmc_300m_embeddings-train-0000-of-0003.parquet`
- `esmc_300m_embeddings-train-0001-of-0003.parquet`
- `esmc_300m_embeddings-train-0002-of-0003.parquet`

**ESMC 600M Embeddings** (4 shards, 3.71 GB each):
- `esmc_600m_embeddings-train-0000-of-0004.parquet`
- `esmc_600m_embeddings-train-0001-of-0004.parquet`
- `esmc_600m_embeddings-train-0002-of-0004.parquet`
- `esmc_600m_embeddings-train-0003-of-0004.parquet`

**Supporting Files**:
- `sequences.parquet` (32 MB) - Source protein sequences & metadata
- `skipped_sequences.txt` (2.7 MB) - Log of filtered sequences

| Dataset | Shards | Size per Shard | Total Size | Total Rows |
|---------|--------|----------------|------------|------------|
| ESMC 300M | 3 | ~3.43 GB | ~10.3 GB | 7,087,560 |
| ESMC 600M | 4 | ~3.71 GB | ~14.8 GB | 8,505,072 |
| Sequences | 1 | 32 MB | 32 MB | 236,252 |
| **Total** | **8** | - | **~25.7 GB** | - |

### Why Sharded?

Files are split into ~3.5 GB shards for better performance:
- ✅ **Faster downloads:** Shards can be fetched in parallel
- ✅ **Memory efficient:** Stream one shard at a time (see the streaming sketch after this list)
- ✅ **HuggingFace optimized:** Automatic shard handling with the `datasets` library
- ✅ **Resumable transfers:** A failed download can resume at the individual shard
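
As a concrete example of the memory-efficient point, here is a hedged sketch of shard streaming with the `datasets` library; rows are fetched lazily rather than materialized in RAM:

```python
from itertools import islice
from datasets import load_dataset

# streaming=True decodes shards lazily instead of downloading everything up front
ds = load_dataset(
    'biolm/human-proteome-esmc-embeddings',
    data_files='esmc_600m_embeddings-train-*.parquet',
    split='train',
    streaming=True,
)

# Peek at the first few rows without loading any full shard
for row in islice(ds, 3):
    print(row['sequence_id'], row['layer_idx'], len(row['mean_embedding']))
```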

### Schema

**Embeddings files** (long format: one row per sequence-layer pair):
```python
{
    'sequence_id': str,             # e.g., "ENSP00000269305.4" (TP53)
    'layer_idx': int,               # 0-29 (300M) or 0-35 (600M)
    'mean_embedding': List[float],  # 960-dim (300M) or 1152-dim (600M)
    'sequence_length': int          # Length in amino acids
}
```

**Sequences file:**
```python
{
    'sequence_id': str,      # Ensembl protein ID
    'sequence': str,         # Amino acid sequence (20 standard AAs)
    'sequence_length': int,  # Length in amino acids
    'description': str       # Full FASTA header with gene metadata
}
```

## 🚀 Quick Start

### Option 1: HuggingFace Datasets Library (Recommended)

The `datasets` library handles the sharded files automatically:

```python
from datasets import load_dataset

# Load the 600M embeddings (all shards are picked up by the glob pattern)
ds = load_dataset(
    'biolm/human-proteome-esmc-embeddings',
    data_files='esmc_600m_embeddings-train-*.parquet',
)

# Access as a pandas DataFrame
df = ds['train'].to_pandas()

# Filter to the last layer only
last_layer = df[df['layer_idx'] == 35]
print(f"Loaded {len(last_layer):,} proteins × 1152 dims")
```

### Option 2: PyArrow (Memory Efficient)

Load specific shards, or filter while reading:

```python
import pyarrow.parquet as pq
import pandas as pd
from glob import glob

# Load only the last layer from all 600M shards
dfs = []
for shard_file in glob('esmc_600m_embeddings-train-*.parquet'):
    table = pq.read_table(
        shard_file,
        filters=[('layer_idx', '==', 35)]  # last layer only
    )
    dfs.append(table.to_pandas())

df = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(df):,} protein embeddings")  # 236,252 proteins
```

### Option 3: Polars (Fastest)

```python
import polars as pl

# Lazily scan all 600M shards with a glob pattern
df = pl.scan_parquet('esmc_600m_embeddings-train-*.parquet')

# Filter and collect efficiently
last_layer = df.filter(pl.col('layer_idx') == 35).collect()
print(f"Shape: {last_layer.shape}")  # (236252, 4)
```
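
The PyArrow, Polars, and pandas examples read the parquet files from the local working directory. One way to fetch them first is `huggingface_hub.snapshot_download`; the `local_dir` value below is an assumption to adjust as needed:

```python
from huggingface_hub import snapshot_download

# Download only the 600M shards plus the sequence metadata
snapshot_download(
    repo_id='biolm/human-proteome-esmc-embeddings',
    repo_type='dataset',
    allow_patterns=['esmc_600m_embeddings-train-*.parquet', 'sequences.parquet'],
    local_dir='.',  # assumption: place the files in the current directory
)
```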

### Load Specific Proteins

```python
import numpy as np
import pandas as pd

# Load all shards, then filter to specific proteins
df = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)

# Get the TP53 tumor suppressor embeddings (all 36 layers)
tp53_data = df[df['sequence_id'] == 'ENSP00000269305.4'].sort_values('layer_idx')
tp53_embeddings = np.array(tp53_data['mean_embedding'].tolist())
print(f"TP53 shape: {tp53_embeddings.shape}")  # (36, 1152)
```

### Train a Classifier (Last Layer Only)

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Load only the last layer from all shards
dfs = []
for i in range(4):  # 4 shards for 600M
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

embeddings_df = pd.concat(dfs, ignore_index=True)

# Extract features
X = np.array(embeddings_df['mean_embedding'].tolist())  # (236252, 1152)
# y = your_labels  # e.g., GO terms, subcellular localization

clf = RandomForestClassifier()
clf.fit(X, y)  # supply your own labels y before fitting
```

### Protein Similarity Search

```python
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

# Load the last layer from all shards
dfs = []
for i in range(4):
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

df = pd.concat(dfs, ignore_index=True)

# Query: find proteins similar to TP53
query_emb = df[df['sequence_id'] == 'ENSP00000269305.4']['mean_embedding'].iloc[0]
all_embs = np.array(df['mean_embedding'].tolist())

similarities = cosine_similarity([query_emb], all_embs)[0]
# Top 10 hits, skipping the highest score (the query matching itself)
top_10_indices = similarities.argsort()[-11:-1][::-1]

print("Top 10 proteins similar to TP53:")
for idx in top_10_indices:
    seq_id = df.iloc[idx]['sequence_id']
    sim = similarities[idx]
    print(f"  {seq_id}: {sim:.4f}")
```

### Join with Sequences

```python
import pandas as pd

# Load the embeddings, then keep the last layer only
embeddings = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)
embeddings = embeddings[embeddings['layer_idx'] == 35]

# Load the sequence metadata
sequences = pd.read_parquet('sequences.parquet')

# Merge on sequence_id
merged = embeddings.merge(sequences, on='sequence_id', how='left')
print(f"Merged shape: {merged.shape}")
print(f"Columns: {merged.columns.tolist()}")
```

## 📈 Dataset Statistics

### Coverage
- **Source:** Homo sapiens GRCh38 peptide sequences from [Ensembl](https://www.ensembl.org/)
- **Total in source:** 245,535 sequences
- **Processed:** 236,252 sequences (96.2%)
- **Filtered:** 9,283 sequences (3.8%, containing ambiguous or invalid amino acids)

### Sequence Characteristics
- **Length range:** 1 - 35,991 amino acids
- **Mean length:** ~460 AA
- **Median length:** ~282 AA
- **Valid amino acids:** 20 standard (ACDEFGHIKLMNPQRSTVWY)

### Model Comparison

| Model | Params | Layers | Embed Dim | Shards | Total Size | Total Rows |
|-------|--------|--------|-----------|--------|------------|------------|
| ESMC 300M | 300M | 30 | 960 | 3 | 10.3 GB | 7,087,560 |
| ESMC 600M | 600M | 36 | 1152 | 4 | 14.8 GB | 8,505,072 |

## 🔬 Generation Details

### Models
- **ESMC 300M:** `EvolutionaryScale/esmc-300m-2024-12` (revision: `a19d363`)
- **ESMC 600M:** `EvolutionaryScale/esmc-600m-2024-12` (revision: `d11cc14`)
- **Library:** ESMC v3.1.3 from [EvolutionaryScale](https://github.com/evolutionaryscale/esm)

### Processing Pipeline
1. ✅ Tokenize sequences with BOS/EOS tokens
2. ✅ Forward pass through all layers (`model.eval()`, `torch.no_grad()`)
3. ✅ Remove BOS/EOS tokens from the outputs
4. ✅ Mean pool across the sequence length dimension
5. ✅ Extract to CPU as float32 (see the pooling sketch after this list)
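
Steps 3-5 can be illustrated with a small torch sketch; the `hidden_states` tensor below is a hypothetical stand-in for a model's per-layer outputs, not the exact ESM SDK API:

```python
import torch

# Hypothetical per-layer outputs for one 100-residue sequence:
# shape (num_layers, seq_len_with_special_tokens, embed_dim)
hidden_states = torch.randn(36, 100 + 2, 1152)  # +2 for BOS/EOS

residues = hidden_states[:, 1:-1, :]             # step 3: drop BOS/EOS positions -> (36, 100, 1152)
mean_embeddings = residues.mean(dim=1)           # step 4: mean pool over sequence length -> (36, 1152)
mean_embeddings = mean_embeddings.cpu().float()  # step 5: to CPU as float32 for storage
```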

### Configuration
- **Batch size:** Adaptive (8 for sequences ≤4096 AA, 1 for longer sequences)
- **Max length:** 50,000 amino acids
- **Random seed:** 42 (reproducible)
- **Hardware:** NVIDIA RTX A6000 (48 GB VRAM)
- **Quality checks:** ✅ No missing values, ✅ Correct layer counts, ✅ No duplicates
- **Sharding:** Split into ~3.5 GB shards for HuggingFace compatibility

## ❓ FAQ

**Q: Which layer should I use?**
A: The **last layer** (29 for 300M, 35 for 600M) typically works best for downstream tasks. Some applications benefit from intermediate layers or from combining multiple layers (a sketch follows).
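
For example, a hedged sketch of concatenating the last two layers per protein, assuming a long-format DataFrame `df` loaded as in the Quick Start:

```python
import numpy as np

# Per-protein embeddings for the last two layers of the 600M model
last = df[df['layer_idx'] == 35].set_index('sequence_id')['mean_embedding']
prev = df[df['layer_idx'] == 34].set_index('sequence_id')['mean_embedding']

# Align on sequence_id and concatenate per protein: (n_proteins, 2304)
ids = last.index.intersection(prev.index)
X = np.hstack([
    np.array(last.loc[ids].tolist()),
    np.array(prev.loc[ids].tolist()),
])
```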

**Q: How do I load all shards at once?**
A: Use glob patterns with pandas/polars:
```python
from glob import glob
import pandas as pd

df = pd.concat([
    pd.read_parquet(f) for f in glob('esmc_600m_embeddings-train-*.parquet')
], ignore_index=True)
```
Or use the HuggingFace `datasets` library, which handles shards automatically.

**Q: Can I load just one shard?**
A: Yes! Each shard is independent and contains a subset of proteins. This is useful for memory-constrained environments or parallel processing.

**Q: 300M vs 600M - which to use?**
A: **600M** is larger and may capture more nuanced patterns; **300M** is faster to work with. We recommend trying both!

**Q: Are embeddings normalized?**
A: No, these are raw mean-pooled embeddings. Apply L2 normalization if needed for cosine similarity, as in the sketch below.
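
A minimal sketch, assuming `X` is the feature matrix built as in the classifier example above:

```python
import numpy as np

norms = np.linalg.norm(X, axis=1, keepdims=True)  # per-row L2 norms
X_unit = X / np.clip(norms, 1e-12, None)          # unit vectors: dot product == cosine similarity
```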

**Q: What sequences were filtered out?**
A: 9,283 sequences (3.8%) containing non-standard amino acids (the counts below overlap where a sequence contains more than one type):
- X (ambiguous): 9,049 sequences
- \* (stop codon): 152 sequences
- U (selenocysteine): 89 sequences

**Q: Can I use this commercially?**
A: **Yes!** Under the CC BY 4.0 license it is free to use with attribution to BioLM.ai.

**Q: How are proteins distributed across shards?**
A: Proteins are split sequentially (by row order) across shards. To get all layers for a protein, you may need to check every shard, though typically all of a protein's layers land in the same shard.

**Q: Which shard contains a specific protein?**
A: Load the `sequences.parquet` file to see all sequence IDs, then search each shard (see the sketch below). Or use the HuggingFace `datasets` library, which handles this automatically.
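
A cheap way to locate a protein without reading any embeddings is to scan only the `sequence_id` column of each shard; a sketch assuming local files:

```python
import pyarrow.parquet as pq

target = 'ENSP00000269305.4'  # TP53, as in the examples above
for i in range(4):
    path = f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet'
    ids = pq.read_table(path, columns=['sequence_id'])['sequence_id'].to_pylist()
    if target in ids:
        print(f'{target} found in shard {i}')
```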

## 📚 Citation

If you use this dataset in your work, please cite:

```bibtex
@dataset{biolm_human_proteome_esmc_2025,
  title={Human Proteome ESMC Embeddings},
  author={BioLM.ai},
  year={2025},
  month={October},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/biolm/human-proteome-esmc-embeddings}
}
```

And the ESMC model:
```bibtex
@article{esmc2024,
  title={Evolutionary Scale Modeling: Protein Language Models},
  author={EvolutionaryScale},
  year={2024},
  url={https://github.com/evolutionaryscale/esm}
}
```

## 📄 License

**CC BY 4.0** - free to use with attribution to BioLM.ai.

- **Source data** (Ensembl): Freely available
- **ESMC models**: Apache 2.0
- **This dataset**: CC BY 4.0

## 🙏 Acknowledgments

- **EvolutionaryScale** for developing and open-sourcing the ESMC models
- **Ensembl** for curating and maintaining the human proteome reference
- **HuggingFace** for hosting and serving this dataset

## 📞 Contact & Support

- **Organization:** [BioLM.ai](https://biolm.ai)
- **Python SDK:** [py-biolm](https://github.com/BioLM/py-biolm) - run inference on ESMC and many other biosequence models via API
- **HuggingFace Discussions:** Use the Community tab for questions and feedback
- **Issues:** Report problems via HuggingFace Discussions

---

**Version:** 1.0.0 | **Last updated:** October 2025 | **Dataset size:** ~26 GB (8 sharded files)