DNA-Bacteria-JEPA

Self-supervised Joint Embedding Predictive Architecture (JEPA) pretrained on 20 bacterial genomes (301K × 500 bp fragments).

Architecture

  • Encoder: SparseTransformer (6L × 384D × 6H, 8.5M params)
  • Predictor: Transformer bottleneck (4L × 192D × 6H, 1.5M params)
  • Training: Multi-block masking + VICReg (token + sequence level) + EMA target encoder
  • Curriculum: Mask ratio 15% → 50%, block length 3 → 15 over 200 epochs
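The curriculum ramps the mask ratio and block length over 200 epochs. A minimal sketch, assuming a linear schedule (the exact interpolation is not stated in this card):

```python
def curriculum(epoch, total_epochs=200,
               mask_ratio=(0.15, 0.50), block_len=(3, 15)):
    """Linearly interpolate masking hyperparameters over training.

    Returns (mask_ratio, block_length) for the given epoch.
    Assumption: a simple linear ramp from the start to the end values.
    """
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    ratio = mask_ratio[0] + t * (mask_ratio[1] - mask_ratio[0])
    length = round(block_len[0] + t * (block_len[1] - block_len[0]))
    return ratio, length
```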

Training Data

20 bacterial species spanning 6 phyla (GC: 28.6%–72.1%): Proteobacteria (9), Firmicutes (6), Actinobacteria (2), Deinococcota (2), Tenericutes (1), Spirochaetes (1)
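The 301K training examples come from slicing genomes into 500 bp fragments. A minimal sketch of the fragmentation step, assuming non-overlapping windows with the trailing remainder dropped (stride and filtering details are assumptions, not from the card):

```python
def fragment(genome: str, size: int = 500):
    """Split a genome string into non-overlapping fixed-size fragments,
    dropping any trailing remainder shorter than `size`."""
    return [genome[i:i + size]
            for i in range(0, len(genome) - size + 1, size)]
```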

Results (Epoch 200)

Level        Classes   Random   kNN Acc   Lift
Species      20        5.0%     13.0%     2.6×
Class        9         11.1%    32.8%     3.0×
Phylum       6         16.7%    46.1%     2.8×
GC Content   3         33.3%    50.7%     1.5×
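In the table, Random is the uniform-chance baseline (1 / number of classes) and Lift is kNN accuracy divided by that baseline; for example:

```python
def lift(knn_acc: float, n_classes: int) -> float:
    """Accuracy lift over the uniform-chance baseline (1 / n_classes)."""
    return knn_acc / (1.0 / n_classes)

# Species level: 0.130 / 0.050 = 2.6
print(round(lift(0.130, 20), 1))
```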

The model captures hierarchical taxonomic structure without supervision: lift is consistent across taxonomic levels, not just at the coarsest one.

Training Health

  • RankMe: 380/384 (98% dimension utilization)
  • GC decorrelation: |r| = 0.000
  • No representational collapse
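RankMe estimates the effective rank of the embedding matrix as the exponential of the Shannon entropy of its normalized singular values, so 380/384 means the embeddings occupy ~98% of the available dimensions. A minimal NumPy sketch (an assumption of the standard formulation, not the card's exact implementation):

```python
import numpy as np

def rankme(Z: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank of embedding matrix Z (N x D): exponential of the
    Shannon entropy of the normalized singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps) + eps       # normalized spectrum, smoothed
    return float(np.exp(-(p * np.log(p)).sum()))
```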

Author

Valentin Uzan (GitHub)
