DNA-Bacteria-JEPA

Self-supervised Joint Embedding Predictive Architecture (JEPA) pretrained on 20 bacterial genomes (301K × 500 bp fragments).

Architecture

  • Encoder: SparseTransformer (6L × 384D × 6H, 8.5M params)
  • Predictor: Transformer bottleneck (4L × 192D × 6H, 1.5M params)
  • Training: Multi-block masking + VICReg (token + sequence level) + EMA target encoder
  • Curriculum: Mask ratio 15% → 50%, block length 3 → 15 over 200 epochs
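The curriculum ramps the mask ratio and block length over 200 epochs. A minimal sketch, assuming a linear schedule (the exact interpolation is not stated in this card):

```python
def curriculum(epoch, total_epochs=200,
               mask_ratio=(0.15, 0.50), block_len=(3, 15)):
    """Linearly interpolate masking hyperparameters over training.

    Returns (mask_ratio, block_length) for the given epoch.
    Assumption: a simple linear ramp from the start to the end values.
    """
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    ratio = mask_ratio[0] + t * (mask_ratio[1] - mask_ratio[0])
    length = round(block_len[0] + t * (block_len[1] - block_len[0]))
    return ratio, length
```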

Training Data

20 bacterial species spanning 6 phyla (GC: 28.6%–72.1%): Proteobacteria (9), Firmicutes (6), Actinobacteria (2), Deinococcota (2), Tenericutes (1), Spirochaetes (1)
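The 301K training examples come from slicing genomes into 500 bp fragments. A minimal sketch of the fragmentation step, assuming non-overlapping windows with the trailing remainder dropped (stride and filtering details are assumptions, not from the card):

```python
def fragment(genome: str, size: int = 500):
    """Split a genome string into non-overlapping fixed-size fragments,
    dropping any trailing remainder shorter than `size`."""
    return [genome[i:i + size]
            for i in range(0, len(genome) - size + 1, size)]
```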

Results (Epoch 200)

Level        Classes   Random   kNN Acc   Lift
Species      20        5.0%     13.0%     2.6×
Class        9         11.1%    32.8%     3.0×
Phylum       6         16.7%    46.1%     2.8×
GC Content   3         33.3%    50.7%     1.5×
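In the table, Random is the uniform-chance baseline (1 / number of classes) and Lift is kNN accuracy divided by that baseline; for example:

```python
def lift(knn_acc: float, n_classes: int) -> float:
    """Accuracy lift over the uniform-chance baseline (1 / n_classes)."""
    return knn_acc / (1.0 / n_classes)

# Species level: 0.130 / 0.050 = 2.6
print(round(lift(0.130, 20), 1))
```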

The model captures hierarchical taxonomic structure without supervision: lift is consistent across taxonomic levels, not just at the coarsest one.

Training Health

  • RankMe: 380/384 (98% dimension utilization)
  • GC decorrelation: |r| = 0.000
  • No representational collapse
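RankMe estimates the effective rank of the embedding matrix as the exponential of the Shannon entropy of its normalized singular values, so 380/384 means the embeddings occupy ~98% of the available dimensions. A minimal NumPy sketch (an assumption of the standard formulation, not the card's exact implementation):

```python
import numpy as np

def rankme(Z: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank of embedding matrix Z (N x D): exponential of the
    Shannon entropy of the normalized singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps) + eps       # normalized spectrum, smoothed
    return float(np.exp(-(p * np.log(p)).sum()))
```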

Author

Valentin Uzan (GitHub)
