Commit e4df113 (verified) by elonlit · Parent: 50aebf2 · Create README.md
---
license: mit
language:
- en
library_name: pytorch
tags:
- scRNA-seq
- single-cell
- self-supervised-learning
- JEPA
- biology
datasets:
- vevotx/Tahoe-100M
pretty_name: GeneJEPA (Perceiver JEPA for scRNA-seq)
pipeline_tag: feature-extraction
---

# GeneJEPA — A Perceiver-style JEPA for scRNA-seq

**GeneJEPA** is a Joint-Embedding Predictive Architecture (JEPA) trained for self-supervised representation learning on single-cell RNA-seq. It uses a Perceiver-style encoder to handle sparse, high-dimensional gene count vectors and learns by masked block prediction, so no labels are required.

> **Why?** GeneJEPA produces compact cell embeddings you can use for clustering, transfer learning, linear probes, and downstream biological tasks.

---

## Repository contents

This model repo intentionally contains **artifacts only** (no training code):

- **`genejepa-epoch=49.ckpt`** — final PyTorch Lightning checkpoint (student encoder + predictor + EMA state, etc.)
- **`gene_metadata.parquet`** — mapping between foundation token IDs and gene identifiers used to build the embedding vocab.
- **`global_stats.json`** — global `log1p(counts)` normalization stats (`mean`, `std`) computed over a large sample of training data.

---
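As a minimal sketch, the global stats can be applied to raw counts as below. The `mean`/`std` values and the `normalize_counts` helper are illustrative assumptions, not the shipped statistics or training code:

```python
import json
import math

# Illustrative placeholder for the contents of global_stats.json;
# the real values ship in the repo.
stats = json.loads('{"mean": 0.85, "std": 1.10}')

def normalize_counts(counts):
    """Z-score log1p-transformed counts using the global stats."""
    return [(math.log1p(c) - stats["mean"]) / stats["std"] for c in counts]

normalized = normalize_counts([0, 1, 5, 120])
```
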

## Model summary

- **Backbone:** Perceiver-style encoder over tokenized genes (gene identity + Fourier features of the expression value)
- **Latents:** 512
- **Dimensionality:** 768
- **Blocks:** 24 transformer blocks on the latent array
- **Heads:** 12
- **Masking:** stochastic, block-wise target masks; the context is the complement of the targets
- **Predictor:** BYOL-style MLP head
- **EMA teacher:** maintained during training to provide prediction targets
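The EMA teacher update above can be sketched in a few lines; the decay value here is a typical assumed setting, not one read from this checkpoint:

```python
def ema_update(teacher_params, student_params, decay=0.996):
    """Move each teacher parameter a small step toward its student counterpart."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student)  # teacher slowly tracks the student
```
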

> Default tokenizer Fourier settings: `N_f=64`, `min_freq=0.1`, `max_freq=100.0`, `freq_scale=1.0`.
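A plausible sketch of a featurization matching these settings is below. The exact function used in training is not included in this repo; the log-spaced frequency schedule and sin/cos pairing are assumptions:

```python
import math

def fourier_features(x, n_f=64, min_freq=0.1, max_freq=100.0, freq_scale=1.0):
    """Map a scalar expression value to sin/cos pairs over log-spaced frequencies."""
    feats = []
    for i in range(n_f):
        # Interpolate frequencies geometrically between min_freq and max_freq.
        t = i / (n_f - 1)
        freq = min_freq * (max_freq / min_freq) ** t
        feats.append(math.sin(freq_scale * freq * x))
        feats.append(math.cos(freq_scale * freq * x))
    return feats

vec = fourier_features(math.log1p(7.0))  # 2 * N_f features per value
```
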

## Download artifacts

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="<your-username>/<your-model-id>",
                            filename="genejepa-epoch=49.ckpt")
meta_path = hf_hub_download(repo_id="<your-username>/<your-model-id>",
                            filename="gene_metadata.parquet")
stats_path = hf_hub_download(repo_id="<your-username>/<your-model-id>",
                             filename="global_stats.json")
```

## Contact

elonlit@biostate.ai