---
license: mit
language:
- en
library_name: pytorch
tags:
- scRNA-seq
- single-cell
- self-supervised-learning
- JEPA
- biology
datasets:
- vevotx/Tahoe-100M
pretty_name: GeneJEPA (Perceiver JEPA for scRNA-seq)
pipeline_tag: feature-extraction
---

# GeneJEPA — A Perceiver-style JEPA for scRNA-seq

**GeneJEPA** is a Joint-Embedding Predictive Architecture (JEPA) trained for self-supervised representation learning on single-cell RNA-seq.
It uses a Perceiver-style encoder to handle sparse, high-dimensional gene count vectors and is trained with masked block prediction—no labels required.

> **Why?** It produces compact cell embeddings you can use for clustering, transfer learning, linear probes, and other downstream biological tasks.

---

## Repository contents

This model repo intentionally contains **artifacts only** (no training code):

- **`genejepa-epoch=49.ckpt`** — final PyTorch Lightning checkpoint (student encoder, predictor, EMA state, etc.).
- **`gene_metadata.parquet`** — mapping between foundation token IDs and the gene identifiers used to build the embedding vocabulary.
- **`global_stats.json`** — global `log1p(counts)` normalization stats (`mean`, `std`) computed over a large sample of the training data; see the usage sketch below.

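
As a rough illustration of how these artifacts fit together, the sketch below reads the gene metadata and the global stats and standardizes a raw count vector. It assumes the files have been downloaded locally (see the download snippet further down), that `global_stats.json` holds the `mean`/`std` values described above, and that the parquet has one row per vocabulary gene; the dummy Poisson counts are purely illustrative, not the repo's actual schema or pipeline.

```python
# Minimal sketch (not the reference pipeline): standardize log1p(counts)
# with the shipped global statistics.
import json

import numpy as np
import pandas as pd

gene_meta = pd.read_parquet("gene_metadata.parquet")  # token ID <-> gene identifier mapping
with open("global_stats.json") as f:
    stats = json.load(f)  # assumed layout: {"mean": ..., "std": ...}

def normalize_counts(counts: np.ndarray) -> np.ndarray:
    """Standardize log1p-transformed counts with the global training stats."""
    x = np.log1p(counts.astype(np.float32))
    return (x - stats["mean"]) / stats["std"]

# Dummy cell: one count value per gene in the vocabulary (assumes one row per gene).
raw_counts = np.random.poisson(1.0, size=len(gene_meta))
x_norm = normalize_counts(raw_counts)
print(gene_meta.columns.tolist(), x_norm.shape)
```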
---

## Model summary

- **Backbone:** Perceiver-style encoder over tokenized genes (gene-identity embedding + Fourier features of the expression value)
- **Latents:** 512
- **Dimensionality:** 768
- **Blocks:** 24 transformer blocks on the latent array
- **Heads:** 12
- **Masking:** stochastic block-wise targets; the context is the complement of the target blocks
- **Predictor:** BYOL-style MLP head
- **EMA teacher:** maintained during training to provide prediction targets

> Default tokenizer Fourier settings: `N_f=64`, `min_freq=0.1`, `max_freq=100.0`, `freq_scale=1.0`.

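
To make these settings concrete, here is one plausible reading of the value tokenizer: each normalized expression value is expanded into sine/cosine features at `N_f` log-spaced frequencies between `min_freq` and `max_freq`, scaled by `freq_scale`, and combined with the gene-identity embedding. The exact construction in the training code may differ (e.g. frequency spacing or a 2π factor), so treat this as an illustrative sketch rather than the reference implementation.

```python
import math

import torch

def fourier_features(x: torch.Tensor,
                     n_f: int = 64,
                     min_freq: float = 0.1,
                     max_freq: float = 100.0,
                     freq_scale: float = 1.0) -> torch.Tensor:
    """Sketch of the per-gene value embedding: sin/cos at log-spaced frequencies.

    x: (..., n_genes) normalized expression values.
    Returns: (..., n_genes, 2 * n_f) Fourier features.
    """
    freqs = torch.logspace(math.log10(min_freq), math.log10(max_freq), steps=n_f) * freq_scale
    angles = x.unsqueeze(-1) * freqs              # (..., n_genes, n_f)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

values = torch.randn(1, 2048)                     # one cell, 2048 genes (dummy values)
value_feats = fourier_features(values)            # (1, 2048, 128)
print(value_feats.shape)
```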
## Download artifacts

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="<your-username>/<your-model-id>",
                            filename="genejepa-epoch=49.ckpt")
meta_path = hf_hub_download(repo_id="<your-username>/<your-model-id>",
                            filename="gene_metadata.parquet")
stats_path = hf_hub_download(repo_id="<your-username>/<your-model-id>",
                             filename="global_stats.json")
```
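
Because the repo ships artifacts only, restoring the full model requires the GeneJEPA training code. You can still open the checkpoint and inspect what it contains; the layout assumed below (a top-level `state_dict` plus Lightning metadata) is the standard PyTorch Lightning convention and has not been verified against this exact file.

```python
import torch

# Inspect the Lightning checkpoint on CPU. weights_only=False because the file
# holds Lightning metadata in addition to raw tensors.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

print(list(ckpt.keys()))            # e.g. ['state_dict', 'hyper_parameters', ...] (assumed layout)
state_dict = ckpt["state_dict"]
print(len(state_dict), "parameter tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name:60s} {tuple(tensor.shape)}")
```

From here, producing cell embeddings means instantiating the student encoder with the architecture listed above (512 latents, dimensionality 768, 24 blocks, 12 heads) and loading the matching weights from this state dict.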
## Contact

elonlit@biostate.ai