# DNA-Bacteria-JEPA
Self-supervised Joint Embedding Predictive Architecture (JEPA) pretrained on 20 bacterial genomes (301K × 500bp fragments).
## Architecture
- Encoder: SparseTransformer (6L × 384D × 6H, 8.5M params)
- Predictor: Transformer bottleneck (4L × 192D × 6H, 1.5M params)
- Training: Multi-block masking + VICReg (token + sequence level) + EMA target encoder
- Curriculum: Mask ratio 15% → 50%, block length 3 → 15 over 200 epochs (see the sketch below)
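A minimal PyTorch-style sketch of these training components. The linear curriculum ramp, the EMA momentum value, the VICReg weights, and all function names are illustrative assumptions, not the repository's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def curriculum(epoch: int, total_epochs: int = 200):
    """Masking curriculum: mask ratio 15% -> 50%, block length 3 -> 15 tokens.
    A linear ramp is assumed; the actual schedule shape may differ."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    mask_ratio = 0.15 + t * (0.50 - 0.15)
    block_len = round(3 + t * (15 - 3))
    return mask_ratio, block_len

def sample_block_mask(seq_len: int, mask_ratio: float, block_len: int) -> torch.Tensor:
    """Multi-block masking: hide contiguous spans until the target ratio is reached."""
    mask = torch.zeros(seq_len, dtype=torch.bool)
    target = int(seq_len * mask_ratio)
    while int(mask.sum()) < target:
        start = torch.randint(0, max(seq_len - block_len, 1), (1,)).item()
        mask[start:start + block_len] = True
    return mask

@torch.no_grad()
def ema_update(target_enc: nn.Module, online_enc: nn.Module, momentum: float = 0.996):
    """EMA target encoder: target <- m * target + (1 - m) * online."""
    for t_p, o_p in zip(target_enc.parameters(), online_enc.parameters()):
        t_p.mul_(momentum).add_(o_p, alpha=1.0 - momentum)

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg: invariance (MSE) + variance hinge + off-diagonal covariance penalty.
    Applied to (batch, dim) embeddings; the token- and sequence-level variants
    differ only in what gets flattened into the batch dimension."""
    sim = F.mse_loss(z1, z2)
    var = (F.relu(1.0 - torch.sqrt(z1.var(dim=0) + eps)).mean()
           + F.relu(1.0 - torch.sqrt(z2.var(dim=0) + eps)).mean())
    n, d = z1.shape
    z1c, z2c = z1 - z1.mean(0), z2 - z2.mean(0)
    off = lambda m: m - torch.diag(torch.diag(m))
    cov = (off(z1c.T @ z1c / (n - 1)).pow(2).sum() / d
           + off(z2c.T @ z2c / (n - 1)).pow(2).sum() / d)
    return sim_w * sim + var_w * var + cov_w * cov
```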
## Training Data
20 bacterial species spanning 6 phyla (GC: 28.6%–72.1%): Proteobacteria (9), Firmicutes (6), Actinobacteria (2), Deinococcota (2), Tenericutes (1), Spirochaetes (1)
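The card states only the fragment count and length. A plausible preprocessing sketch, assuming plain FASTA input and non-overlapping 500 bp windows (both assumptions, not confirmed by the card):

```python
def read_fasta(path):
    """Minimal FASTA reader: yields (header, sequence) pairs."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def fragment(seq, frag_len=500):
    """Non-overlapping frag_len-bp windows; the trailing remainder is dropped."""
    return [seq[i:i + frag_len] for i in range(0, len(seq) - frag_len + 1, frag_len)]

def gc_content(seq):
    """Fraction of G/C bases, used for the GC-level probe in the results below."""
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)
```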
## Results (Epoch 200)
| Level | Classes | Random baseline | kNN accuracy | Lift |
|---|---|---|---|---|
| Species | 20 | 5.0% | 13.0% | 2.6× |
| Class | 9 | 11.1% | 32.8% | 3.0× |
| Phylum | 6 | 16.7% | 46.1% | 2.8× |
| GC Content | 3 | 33.3% | 50.7% | 1.5× |
The model captures hierarchical taxonomic structure without supervision.
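The accuracies above are probes on frozen embeddings, with no fine-tuning. A minimal sketch of such a kNN probe using scikit-learn; the value of k and the cross-validation setup are assumptions, as the card reports only the resulting accuracies:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_probe(embeddings: np.ndarray, labels: np.ndarray, k: int = 10, folds: int = 5) -> float:
    """Mean cross-validated kNN accuracy on frozen sequence embeddings."""
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, embeddings, labels, cv=folds).mean()

# Lift = kNN accuracy / random-guess accuracy (1 / number of classes),
# e.g. species level: 0.130 / 0.050 = 2.6x.
```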
## Training Health
- RankMe: 380/384 (98% dimension utilization)
- GC decorrelation: |r| = 0.000
- No representational collapse (diagnostics sketched below)
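A sketch of how these diagnostics can be computed from a matrix of fragment embeddings. RankMe follows the standard singular-value-entropy definition; the GC check shown here (maximum per-dimension |Pearson r| against fragment GC content) is only one plausible reading of the reported metric:

```python
import numpy as np

def rankme(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """RankMe effective rank: entropy of normalized singular values.
    A value near the embedding width (here 380 of 384) means the encoder
    uses nearly all dimensions, i.e. no dimensional collapse."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / (s.sum() + eps) + eps
    return float(np.exp(-(p * np.log(p)).sum()))

def gc_correlation(embeddings: np.ndarray, gc: np.ndarray) -> float:
    """Largest |Pearson r| between any single embedding dimension and GC content;
    a value near 0 indicates the representation is not just tracking GC."""
    rs = [abs(np.nan_to_num(np.corrcoef(embeddings[:, d], gc)[0, 1]))
          for d in range(embeddings.shape[1])]
    return max(rs)
```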
## Author
Valentin Uzan – GitHub