DeepGenopix

DNA-first Computer Vision Pipeline for Transposable Element Analysis

Adapting the DeepSeek-OCR "Contexts Optical Compression" paradigm to 1D genomics.
48 bp/token: a 12× density improvement, making quadratic attention roughly 144× cheaper (12² = 144).

What is DeepGenopix?

DeepGenopix treats raw DNA sequences as 1D images and applies a computer-vision-inspired pipeline (Perception Stem → Visual Compressor → Transformer Backbone) to classify and quantify transposable element (TE) families. The key innovation is a deterministic bitwise pixelization that packs 12 bases into a single 24-bit RGB pixel. The 4× stride compressor then merges four pixels into one token, for a compression rate of 12 × 4 = 48 bp/token.

This is not a learned tokenizer. It is a physics layer: A=0, C=1, G=2, T=3 (2 bits/base), packed MSB-first into a 24-bit integer, then split into R/G/B channels. No embeddings, no BPE vocabularies, no trainable tokenizers.
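
As a rough illustration of the packing described above, here is a minimal pure-Python sketch. It is not the code in pixelizer.py: the function names are hypothetical, and two behaviours are assumptions made for the example, namely right-padding a trailing partial pixel with A and rejecting any symbol other than A, C, G, T.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base, as in the text
BASES_PER_PIXEL = 12  # 12 bases x 2 bits = 24 bits = one RGB pixel


def pack_pixel(chunk: str) -> tuple[int, int, int]:
    """Pack exactly 12 bases MSB-first into a 24-bit integer, then split it into R, G, B."""
    if len(chunk) != BASES_PER_PIXEL:
        raise ValueError("expected exactly 12 bases")
    value = 0
    for base in chunk:
        # Shifting left before OR-ing means the first base ends up in the top bits.
        # Symbols outside ACGT (e.g. N) raise KeyError in this sketch.
        value = (value << 2) | BASE_TO_BITS[base]
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def pixelize(seq: str) -> list[tuple[int, int, int]]:
    """Turn a DNA string into a list of RGB pixels.

    Assumption: a trailing chunk shorter than 12 bp is right-padded with "A"
    (all-zero bits). The real pixelizer may handle this differently.
    """
    seq = seq.upper()
    remainder = len(seq) % BASES_PER_PIXEL
    if remainder:
        seq += "A" * (BASES_PER_PIXEL - remainder)
    return [
        pack_pixel(seq[i:i + BASES_PER_PIXEL])
        for i in range(0, len(seq), BASES_PER_PIXEL)
    ]
```

For example, pixelize("ACGTACGTACGT") returns [(27, 27, 27)], because each group of four bases ACGT packs to the byte 0b00011011. Converting these integer channels into the float tensor used by the model is a further step that this sketch leaves out.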

Architecture

Raw DNA ──► Pixelizer (12bp→RGB) ──► Perception Stem (Conv1D) ──►
Compressor (4× stride) ──► Transformer Backbone ──► Pooling ──► Classifier

Stage	Component	Input → Output
Physics	Pixelizer	DNA string → (3, L) float tensor (L = number of 12 bp pixels)
Perception	Conv1D Stem (k=3, 64ch)	(B, 3, L) → (B, 64, L)
Compression	Conv1D Compressor (k=4, s=4)	(B, 64, L) → (B, 256, L/4)
Context	TransformerEncoder (2 layers, 4 heads)	(B, L/4, 256) → (B, L/4, 256)
Pooling	Masked Global Average Pool	(B, L/4, 256) → (B, 256)
Decision	Linear Classifier	(B, 256) → (B, num_classes)
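
The shapes in the table follow from the standard 1D convolution length formula, floor((L + 2p - k) / s) + 1. The sketch below traces sequence length through the stages. The padding values (p = k // 2 for the stem so that length is preserved, p = 0 for the compressor) and the rounding-up of a partial final pixel are assumptions inferred from the table, not confirmed settings, and the function names are hypothetical.

```python
import math


def conv1d_out_len(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def trace_lengths(n_bases: int, stride: int = 4, stem_kernel: int = 3) -> dict[str, int]:
    """Follow one sequence from bases to pixels to transformer tokens."""
    pixels = math.ceil(n_bases / 12)  # 12 bases per RGB pixel
    # Assumed "same" padding for the stem: length is unchanged for odd kernels.
    stem = conv1d_out_len(pixels, kernel=stem_kernel, padding=stem_kernel // 2)
    # Compressor with kernel == stride and no padding: length shrinks by the stride.
    tokens = conv1d_out_len(stem, kernel=stride, stride=stride)
    return {"bases": n_bases, "pixels": pixels, "stem": stem, "tokens": tokens}


def attention_cost_ratio(density_gain: float) -> float:
    """Self-attention cost grows with tokens squared, so an n-fold shorter
    sequence costs n**2 times less."""
    return density_gain ** 2
```

With these assumptions, a 4,800 bp sequence becomes 400 pixels and then 100 tokens, which is 48 bp/token, and attention_cost_ratio(12) gives the 144 quoted in the header.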

Key design decisions:

MaskedGlobalAvgPool clamps denominator to ≥1 to prevent NaN on fully-masked sequences
dynamic_collate_fn pads only to the longest sequence in each batch (not a global maximum), saving VRAM
Class-weighted CrossEntropyLoss for imbalanced TE family distributions
ReduceLROnPlateau + early stopping with F1-macro as the validation metric
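
The first two decisions can be sketched without any framework. The code below is a plain-Python illustration, not the project's PyTorch implementation: pad_to_batch_max and masked_mean are hypothetical names, and sequences are represented as simple lists.

```python
def pad_to_batch_max(sequences: list[list[float]], pad_value: float = 0.0):
    """Pad each sequence only to the longest one in this batch.

    Returns the padded sequences and the original lengths, which the pooling
    step needs in order to ignore the padding.
    """
    lengths = [len(s) for s in sequences]
    batch_max = max(lengths, default=0)
    padded = [s + [pad_value] * (batch_max - len(s)) for s in sequences]
    return padded, lengths


def masked_mean(tokens: list[list[float]], valid_len: int) -> list[float]:
    """Average the first `valid_len` token vectors, ignoring padded positions.

    The denominator is clamped to at least 1, so a fully masked sequence
    (valid_len == 0) yields zeros instead of a division by zero.
    """
    dim = len(tokens[0]) if tokens else 0
    sums = [0.0] * dim
    for vec in tokens[:valid_len]:
        for i, x in enumerate(vec):
            sums[i] += x
    denom = max(valid_len, 1)
    return [s / denom for s in sums]
```

In a tensor framework the unclamped 0/0 would produce NaN rather than raise, and that NaN would then propagate into the loss, which is why the clamp matters even though fully masked sequences should be rare.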

Quick Start

# 1. Clone & install
git clone https://huggingface.co/vedatonuryilmaz/deepgenopix
cd deepgenopix
pip install -e .

# 2. Generate synthetic test data (for pipeline validation)
python scripts/generate_synthetic.py --n_per_family 500

# 3. Run ETL pipeline (pixelize + stratified split)
python scripts/etl_pipeline.py --raw_fasta data/raw/te_sequences_synthetic.fa --max_records 10000

# 4. Train baseline preset
python scripts/train.py --preset baseline_v1 --data_dir data/processed --epochs 30

Installation

Requires Python ≥3.10 and PyTorch ≥2.0.

pip install -e .
# Or with dev dependencies:
pip install -e ".[dev]"

Optional dependencies for full functionality:

faiss-cpu — for Visual Sinkhorn-EM quantification (FAISS ANN index)
umap-learn — for UMAP visualization of embeddings
trackio — for experiment tracking and alerts

Usage

Inference on new sequences

from deepgenopix.inference import load_model, predict_single

model = load_model("checkpoints/best_model.pth", num_classes=12)
class_id, probs = predict_single(model, "ATGCGTACGT...")

Extract embeddings

from deepgenopix.model import DeepGenopixClassifier

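# Note: a freshly constructed model has untrained weights; use load_model(...) from
# deepgenopix.inference (see above) to work with a trained checkpoint.
model = DeepGenopixClassifier(num_classes=12)
# `pixels`: padded batch of pixelized sequences, (B, 3, L); `lengths`: valid length of each sequence.
embeddings = model.get_embeddings(pixels, lengths)  # (B, 256)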

Saliency maps

from deepgenopix.analysis import compute_saliency, plot_saliency

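# `model`, `pixels`, and `lengths` are a loaded model and a pixelized batch, as in the snippets above.
saliency = compute_saliency(model, pixels, lengths)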
plot_saliency(saliency, sequence="ATGC...", output_path="saliency.png")

Phase 1: Biological Validation

Eight hyperparameter presets test compression limits (stride), latent capacity (latent width), depth (number of layers), and perception scope (stem kernel size):

Preset	Stride	Latent	Stem Kernel	Layers	BP/Token
baseline_v1	4	256	3	2	48
stride2_v1	2	256	3	2	24
stride8_v1	8	256	3	2	96
latent128_v1	4	128	3	2	48
latent768_v1	4	768	3	2	48
layers4_v1	4	256	3	4	48
stem5_v1	4	256	5	2	48
stem7_v1	4	256	7	2	48
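
The BP/Token column is derived rather than tuned: it is the 12 bp packed into each pixel multiplied by the compressor stride. The sketch below restates the table as data and recomputes that column. The Preset class and PRESETS dictionary are hypothetical names for this example and are not taken from config.py.

```python
from dataclasses import dataclass

BASES_PER_PIXEL = 12  # fixed by the pixelizer


@dataclass(frozen=True)
class Preset:
    stride: int
    latent: int
    stem_kernel: int
    layers: int

    @property
    def bp_per_token(self) -> int:
        """Bases covered by one transformer token."""
        return BASES_PER_PIXEL * self.stride


# Values copied from the preset table above.
PRESETS = {
    "baseline_v1": Preset(stride=4, latent=256, stem_kernel=3, layers=2),
    "stride2_v1": Preset(stride=2, latent=256, stem_kernel=3, layers=2),
    "stride8_v1": Preset(stride=8, latent=256, stem_kernel=3, layers=2),
    "latent128_v1": Preset(stride=4, latent=128, stem_kernel=3, layers=2),
    "latent768_v1": Preset(stride=4, latent=768, stem_kernel=3, layers=2),
    "layers4_v1": Preset(stride=4, latent=256, stem_kernel=3, layers=4),
    "stem5_v1": Preset(stride=4, latent=256, stem_kernel=5, layers=2),
    "stem7_v1": Preset(stride=4, latent=256, stem_kernel=7, layers=2),
}

for name, preset in PRESETS.items():
    print(f"{name:<14} {preset.bp_per_token:>3} bp/token")
```

Only the two stride presets change the compression rate; the other five vary one architectural setting at a time against the baseline while staying at 48 bp/token.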

Run a sweep:

python scripts/pipeline.py --presets all --n_per_family 2000 --epochs 30

Phase 2: Visual Sinkhorn-EM

Resolves multi-mapping RNA-seq reads by applying optimal transport to visual embeddings instead of alignment scores.

from deepgenopix.quant import run_quantification, QuantConfig

result = run_quantification(
    read_sequences=reads,
    locus_sequences=loci,
    locus_labels=labels,
    encoder=model,
    mode="snk_em",  # or "snk_ot", "vis_em"
)
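
For readers unfamiliar with the optimal transport step, the following is a generic textbook Sinkhorn iteration in plain Python, shown only to convey the idea. It is not the implementation in quant.py: the cosine cost, the epsilon value, the fixed marginals, and both function names are assumptions, and the EM part of the method is not covered here.

```python
import math


def cosine_cost(u: list[float], v: list[float]) -> float:
    """Cost between a read embedding and a locus embedding: 1 - cosine similarity.

    Assumes both vectors are nonzero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sinkhorn(cost, row_mass, col_mass, epsilon: float = 0.1, n_iters: int = 200):
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    cost[i][j] is the cost of assigning read i to locus j. The returned plan
    has rows summing (approximately) to row_mass and columns to col_mass.
    """
    n, m = len(row_mass), len(col_mass)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two reads, two loci, each read closest to a different locus.
reads = [[1.0, 0.0], [0.0, 1.0]]
loci = [[1.0, 0.0], [0.0, 1.0]]
cost = [[cosine_cost(r, l) for l in loci] for r in reads]
plan = sinkhorn(cost, row_mass=[0.5, 0.5], col_mass=[0.5, 0.5])
```

Each row of plan is a soft assignment of one read across loci. In the toy example nearly all of each read's mass goes to the locus with the matching embedding.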

Phase 3 & 4 (Planned)

Knowledge Distillation: Visual encoder → GROVER embedding alignment
Multi-Modal: Expand input channels for ATAC-seq + methylation

Testing

pytest tests/ -v

Project Structure

deepgenopix/
├── src/deepgenopix/
│   ├── __init__.py      # Package exports
│   ├── config.py        # Architecture constants
│   ├── dataset.py       # PyTorch Dataset + collate_fn
│   ├── etl.py           # FASTA → pixelized tensor pipeline
│   ├── inference.py     # Model loading, prediction, batch inference
│   ├── model.py         # Neural architecture
│   ├── pixelizer.py     # DNA → RGB bitwise encoding
│   ├── quant.py         # Visual Sinkhorn-EM quantification
│   ├── trainer.py       # Training loop with validation + Trackio
│   ├── analysis.py      # Saliency, UMAP, confusion matrices
│   └── io_utils.py      # FASTA/JSON/CSV helpers
├── scripts/
│   ├── generate_synthetic.py  # Synthetic TE data generator
│   ├── etl_pipeline.py        # Full ETL with UCSC downloads
│   ├── train.py               # Hyperparameter sweep trainer
│   └── pipeline.py            # All-in-one: generate → ETL → train
├── tests/
│   ├── test_pixelizer.py
│   └── test_model.py
├── pyproject.toml
├── LICENSE
└── README.md

Citation

Based on the DeepSeek-OCR contexts optical compression paradigm:

@article{deepseek2025ocr,
  title={DeepSeek-OCR: Contexts Optical Compression},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}

For DeepGenopix specifically:

@software{deepgenopix2026,
  author = {Vedat Onur Yilmaz},
  title = {DeepGenopix: DNA-first Computer Vision for Transposable Element Analysis},
  year = {2026},
  url = {https://huggingface.co/vedatonuryilmaz/deepgenopix}
}

License

MIT — see LICENSE.
