DeepGenopix
DNA-first Computer Vision Pipeline for Transposable Element Analysis
Adapting the DeepSeek-OCR Contextual Optical Compression paradigm into 1D genomics.
48 bp/token → 12× density improvement yielding ~144× cheaper quadratic attention.
What is DeepGenopix?
DeepGenopix treats raw DNA sequences as 1D images and applies a computer-vision-inspired pipeline (Perception Stem → Visual Compressor → Transformer Backbone) to classify and quantify transposable element (TE) families. The key innovation is a deterministic bitwise pixelization that packs 12 bases into a single RGB pixel (24 bits), giving a 48 bp/token compression rate after the 4× stride compressor.
This is not a learned tokenizer. It is a physics layer: A=0, C=1, G=2, T=3 (2 bits/base), packed MSB-first into a 24-bit integer, then split into R/G/B channels. No embeddings, no BPE vocabularies, no trainable tokenizers.
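The packing rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's `pixelizer.py`: the function names `pack_pixel` and `pixelize` are hypothetical, and padding a trailing partial chunk with `A` is an assumption, since the README does not say how tails or ambiguous bases such as `N` are handled (here an unknown base simply raises `KeyError`).

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base, as stated above
BASES_PER_PIXEL = 12  # 12 bases x 2 bits = 24 bits = one RGB pixel


def pack_pixel(chunk: str) -> tuple[int, int, int]:
    """Pack exactly 12 bases MSB-first into a 24-bit integer, then split into (R, G, B)."""
    if len(chunk) != BASES_PER_PIXEL:
        raise ValueError("expected exactly 12 bases")
    value = 0
    for base in chunk:
        # Shift left by 2 bits so the first base ends up in the most significant bits.
        value = (value << 2) | BASE_TO_BITS[base]
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def pixelize(seq: str, pad_base: str = "A") -> list[tuple[int, int, int]]:
    """Split a sequence into 12-base chunks and pack each one.

    Assumption: a trailing partial chunk is right-padded with `pad_base`.
    """
    seq = seq.upper()
    remainder = len(seq) % BASES_PER_PIXEL
    if remainder:
        seq += pad_base * (BASES_PER_PIXEL - remainder)
    return [
        pack_pixel(seq[i:i + BASES_PER_PIXEL])
        for i in range(0, len(seq), BASES_PER_PIXEL)
    ]
```

Each channel holds four bases, so `ACGT` repeated three times maps to the same byte (`0b00011011`) in R, G, and B. Converting the integer channels to the float tensor listed in the architecture table is not shown here.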
Architecture
Raw DNA ──► Pixelizer (12bp→RGB) ──► Perception Stem (Conv1D) ──►
Compressor (4× stride) ──► Transformer Backbone ──► Pooling ──► Classifier
| Stage | Component | Input → Output |
|---|---|---|
| Physics | Pixelizer | DNA string → (3, L) float tensor |
| Perception | Conv1D Stem (k=3, 64ch) | (B, 3, L) → (B, 64, L) |
| Compression | Conv1D Compressor (k=4, s=4) | (B, 64, L) → (B, 256, L/4) |
| Context | TransformerEncoder (2 layers, 4 heads) | (B, L/4, 256) → (B, L/4, 256) |
| Pooling | Masked Global Average Pool | (B, L/4, 256) → (B, 256) |
| Decision | Linear Classifier | (B, 256) → (B, num_classes) |
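To make the shapes in the table concrete, the following sketch traces them for a sequence of a given length in base pairs. It is only bookkeeping, not model code. The helper name `trace_shapes` is hypothetical, and two assumptions are baked in: the tail of the sequence is padded to a full 12-base pixel, and the compressor uses no padding, so its output length follows the standard Conv1D formula `(L - k) // s + 1` with `k = s = stride`.

```python
import math

BASES_PER_PIXEL = 12  # from the pixelizer


def trace_shapes(seq_len_bp, batch=1, stem_channels=64, latent=256,
                 stride=4, num_classes=12):
    """Return (stage, output shape) pairs for a batch of equal-length sequences."""
    # Assumption: a partial final chunk is padded to one full pixel.
    pixels = math.ceil(seq_len_bp / BASES_PER_PIXEL)
    if pixels < stride:
        raise ValueError("sequence too short for one compressor window")
    # Conv1D output length with kernel == stride and no padding.
    tokens = (pixels - stride) // stride + 1
    return [
        ("Pixelizer", (batch, 3, pixels)),
        ("Perception Stem", (batch, stem_channels, pixels)),  # length preserved
        ("Compressor", (batch, latent, tokens)),
        ("Transformer", (batch, tokens, latent)),
        ("Pooling", (batch, latent)),
        ("Classifier", (batch, num_classes)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_shapes(4800, batch=8):
        print(f"{stage:16s} {shape}")
```

Under these assumptions a 4,800 bp element becomes 400 pixels and then 100 tokens, which matches the 48 bp/token figure.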
Key design decisions:
- MaskedGlobalAvgPool clamps the denominator to ≥1 to prevent NaN on fully-masked sequences
- dynamic_collate_fn pads only to the longest sequence in each batch (not a global max), saving VRAM
- Class-weighted CrossEntropyLoss for imbalanced TE family distributions
- ReduceLROnPlateau + early stopping with F1-macro as the validation metric
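The first two decisions can be illustrated without any framework. The sketch below uses plain Python lists with one scalar feature per position; the real components presumably operate on batched tensors with 256 features, and the names `pad_batch` and `masked_mean` are hypothetical stand-ins rather than the package's API.

```python
def pad_batch(seqs, pad_value=0.0):
    """Pad variable-length sequences to the longest one in this batch only."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)  # per-batch maximum, not a global maximum
    padded = [list(s) + [pad_value] * (max_len - n) for s, n in zip(seqs, lengths)]
    return padded, lengths


def masked_mean(padded, lengths):
    """Average each row over its valid positions only.

    The valid-position count is clamped to >= 1 so that a fully masked
    (empty) row yields 0.0 instead of a 0/0 division.
    """
    means = []
    for row, n in zip(padded, lengths):
        total = sum(row[:n])             # padded positions are ignored
        means.append(total / max(n, 1))  # clamp the denominator
    return means


if __name__ == "__main__":
    batch, lengths = pad_batch([[1.0, 2.0, 3.0], [4.0], []])
    print(batch)                        # rows padded to length 3
    print(masked_mean(batch, lengths))  # [2.0, 4.0, 0.0]
```

Without the mask, the second row would average to 4/3 because the padding zeros would be counted; without the clamp, the empty row would divide by zero.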
Quick Start
# 1. Clone & install
git clone https://huggingface.co/vedatonuryilmaz/deepgenopix
cd deepgenopix
pip install -e .
# 2. Generate synthetic test data (for pipeline validation)
python scripts/generate_synthetic.py --n_per_family 500
# 3. Run ETL pipeline (pixelize + stratified split)
python scripts/etl_pipeline.py --raw_fasta data/raw/te_sequences_synthetic.fa --max_records 10000
# 4. Train baseline preset
python scripts/train.py --preset baseline_v1 --data_dir data/processed --epochs 30
Installation
Requires Python ≥3.10 and PyTorch ≥2.0.
pip install -e .
# Or with dev dependencies:
pip install -e ".[dev]"
Optional dependencies for full functionality:
- faiss-cpu: for Visual Sinkhorn-EM quantification (FAISS ANN index)
- umap-learn: for UMAP visualization of embeddings
- trackio: for experiment tracking and alerts
Usage
Inference on new sequences
from deepgenopix.inference import load_model, predict_single
model = load_model("checkpoints/best_model.pth", num_classes=12)
class_id, probs = predict_single(model, "ATGCGTACGT...")
Extract embeddings
from deepgenopix.model import DeepGenopixClassifier
model = DeepGenopixClassifier(num_classes=12)
embeddings = model.get_embeddings(pixels, lengths) # (B, 256)
Saliency maps
from deepgenopix.analysis import compute_saliency, plot_saliency
saliency = compute_saliency(model, pixels, lengths)
plot_saliency(saliency, sequence="ATGC...", output_path="saliency.png")
Phase 1: Biological Validation
Eight hyperparameter presets testing compression limits, latent capacity, depth, and perception scope:
| Preset | Stride | Latent | Stem Kernel | Layers | BP/Token |
|---|---|---|---|---|---|
| baseline_v1 | 4 | 256 | 3 | 2 | 48 |
| stride2_v1 | 2 | 256 | 3 | 2 | 24 |
| stride8_v1 | 8 | 256 | 3 | 2 | 96 |
| latent128_v1 | 4 | 128 | 3 | 2 | 48 |
| latent768_v1 | 4 | 768 | 3 | 2 | 48 |
| layers4_v1 | 4 | 256 | 3 | 4 | 48 |
| stem5_v1 | 4 | 256 | 5 | 2 | 48 |
| stem7_v1 | 4 | 256 | 7 | 2 | 48 |
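The BP/Token column follows directly from the stride: each pixel holds 12 bases and the compressor merges `stride` pixels into one token. The sketch below reproduces that column and the quadratic-attention arithmetic quoted at the top of this README. The function names are hypothetical, and the baseline behind the 12× density figure is not specified here, so `attention_cost_reduction` takes the token-count reduction as an input rather than assuming one.

```python
BASES_PER_PIXEL = 12  # from the pixelizer: 12 bases per RGB pixel

# Strides taken from the preset table; the other presets all use stride 4.
PRESET_STRIDES = {"baseline_v1": 4, "stride2_v1": 2, "stride8_v1": 8}


def bp_per_token(stride: int) -> int:
    """Base pairs covered by one token after the strided compressor."""
    return BASES_PER_PIXEL * stride


def attention_cost_reduction(token_reduction: float) -> float:
    """Self-attention cost grows with tokens squared, so an r-fold cut in
    tokens gives an r**2-fold cut in attention cost."""
    return token_reduction ** 2


if __name__ == "__main__":
    for name, stride in PRESET_STRIDES.items():
        print(f"{name}: {bp_per_token(stride)} bp/token")
    print(attention_cost_reduction(12))  # 144
```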
Run a sweep:
python scripts/pipeline.py --presets all --n_per_family 2000 --epochs 30
Phase 2: Visual Sinkhorn-EM
Resolving multi-mapping RNA-seq reads using Optimal Transport on visual embeddings instead of alignment scores.
from deepgenopix.quant import run_quantification, QuantConfig
result = run_quantification(
    read_sequences=reads,
    locus_sequences=loci,
    locus_labels=labels,
    encoder=model,
    mode="snk_em",  # or "snk_ot", "vis_em"
)
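For readers unfamiliar with the optimal-transport step, here is a generic entropic Sinkhorn iteration in plain Python. It is a textbook sketch, not the code in `quant.py`: the function name `sinkhorn_plan`, the `epsilon` and `n_iters` parameters, and the idea of using an embedding distance between each read and each locus as the cost are all assumptions made for illustration.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.1, n_iters=200):
    """Entropic optimal transport via Sinkhorn scaling.

    cost[i][j]  : cost of assigning read i to locus j (assumed: embedding distance)
    row_mass[i] : mass of read i; col_mass[j] : target mass of locus j
    Returns a plan whose rows and columns approximately match the marginals.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows, then columns, to match the requested marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Three reads, two loci: reads 0 and 1 each sit close to one locus,
    # read 2 is equidistant (a multi-mapper).
    cost = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
    plan = sinkhorn_plan(cost, row_mass=[1 / 3] * 3, col_mass=[0.5, 0.5])
    for row in plan:
        print([round(p, 4) for p in row])
```

In this toy case the ambiguous read splits its mass evenly between the two loci, while the other two reads are assigned almost entirely to their nearest locus. An EM-style variant would additionally re-estimate the locus masses from the plan between rounds; that step is not shown.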
Phase 3 & 4 (Planned)
- Knowledge Distillation: Visual encoder → GROVER embedding alignment
- Multi-Modal: Expand input channels for ATAC-seq + methylation
Testing
pytest tests/ -v
Project Structure
deepgenopix/
├── src/deepgenopix/
│   ├── __init__.py       # Package exports
│   ├── config.py         # Architecture constants
│   ├── dataset.py        # PyTorch Dataset + collate_fn
│   ├── etl.py            # FASTA → pixelized tensor pipeline
│   ├── inference.py      # Model loading, prediction, batch inference
│   ├── model.py          # Neural architecture
│   ├── pixelizer.py      # DNA → RGB bitwise encoding
│   ├── quant.py          # Visual Sinkhorn-EM quantification
│   ├── trainer.py        # Training loop with validation + Trackio
│   ├── analysis.py       # Saliency, UMAP, confusion matrices
│   └── io_utils.py       # FASTA/JSON/CSV helpers
├── scripts/
│   ├── generate_synthetic.py   # Synthetic TE data generator
│   ├── etl_pipeline.py         # Full ETL with UCSC downloads
│   ├── train.py                # Hyperparameter sweep trainer
│   └── pipeline.py             # All-in-one: generate → ETL → train
├── tests/
│   ├── test_pixelizer.py
│   └── test_model.py
├── pyproject.toml
├── LICENSE
└── README.md
Citation
Based on the DeepSeek-OCR contextual optical compression paradigm:
@article{deepseek2025ocr,
title={DeepSeek-OCR: Contexts Optical Compression},
journal={arXiv preprint arXiv:2510.18234},
year={2025}
}
For DeepGenopix specifically:
@software{deepgenopix2026,
author = {Vedat Onur Yilmaz},
title = {DeepGenopix: DNA-first Computer Vision for Transposable Element Analysis},
year = {2026},
url = {https://huggingface.co/vedatonuryilmaz/deepgenopix}
}
License
MIT. See LICENSE.