DeepGenopix

DNA-first Computer Vision Pipeline for Transposable Element Analysis

Adapting the DeepSeek-OCR Contextual Optical Compression paradigm to 1D genomics.
48 bp/token: a 12× density improvement, making quadratic attention ~144× (12²) cheaper.

What is DeepGenopix?

DeepGenopix treats raw DNA sequences as 1D images and applies a computer-vision-inspired pipeline (Perception Stem → Visual Compressor → Transformer Backbone) to classify and quantify transposable element (TE) families. The key innovation is a deterministic bitwise pixelization that packs 12 bases into a single RGB pixel (24 bits), giving a 48 bp/token compression rate after the 4× stride compressor.

This is not a learned tokenizer. It is a physics layer: A=0, C=1, G=2, T=3 (2 bits/base), packed MSB-first into a 24-bit integer, then split into R/G/B channels. No embeddings, no BPE vocabularies, no trainable tokenizers.
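The packing rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `pixelizer.py`: the function name `pixelize`, the choice of R as the most significant byte, and padding a trailing partial pixel with `A` are all assumptions, and ambiguous bases such as `N` are simply rejected. Conversion to the `(3, L)` float tensor is not shown.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base, as stated above
BASES_PER_PIXEL = 12                           # 12 bases * 2 bits = 24 bits


def pixelize(seq, pad_base="A"):
    """Pack a DNA string into a list of (R, G, B) byte triples.

    Hypothetical helper: padding with `pad_base` and the R-high byte order
    are assumptions, not confirmed behaviour of the package.
    """
    seq = seq.upper()
    remainder = len(seq) % BASES_PER_PIXEL
    if remainder:
        seq += pad_base * (BASES_PER_PIXEL - remainder)

    pixels = []
    for start in range(0, len(seq), BASES_PER_PIXEL):
        value = 0
        for base in seq[start:start + BASES_PER_PIXEL]:
            # MSB-first: earlier bases end up in the higher bits.
            # Raises KeyError on anything other than A/C/G/T.
            value = (value << 2) | BASE_CODE[base]
        # Split the 24-bit integer into three 8-bit channels.
        pixels.append(((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF))
    return pixels
```

Under these assumptions, `pixelize("ACGTACGTACGT")` returns `[(27, 27, 27)]`, because each `ACGT` quartet packs to the byte `0b00011011`.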

Architecture

Raw DNA ──► Pixelizer (12bp→RGB) ──► Perception Stem (Conv1D) ──►
Compressor (4× stride) ──► Transformer Backbone ──► Pooling ──► Classifier

| Stage | Component | Input → Output |
|---|---|---|
| Physics | Pixelizer | DNA string → (3, L) float tensor |
| Perception | Conv1D Stem (k=3, 64ch) | (B, 3, L) → (B, 64, L) |
| Compression | Conv1D Compressor (k=4, s=4) | (B, 64, L) → (B, 256, L/4) |
| Context | TransformerEncoder (2 layers, 4 heads) | (B, L/4, 256) → (B, L/4, 256) |
| Pooling | Masked Global Average Pool | (B, L/4, 256) → (B, 256) |
| Decision | Linear Classifier | (B, 256) → (B, num_classes) |
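To make the shapes in the stage table concrete, here is a rough PyTorch sketch that wires the same stages together. It is not the shipped `DeepGenopixClassifier`: the class name `SketchClassifier`, the ReLU after the stem, `padding=1` on the stem, and floor division when converting pixel lengths to token lengths are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SketchClassifier(nn.Module):
    """Hypothetical re-creation of the stage table, for shape checking only."""

    def __init__(self, num_classes, stem_ch=64, latent=256, stride=4, layers=2, heads=4):
        super().__init__()
        self.stride = stride
        # Perception: k=3 with padding=1 keeps the length at L.
        self.stem = nn.Conv1d(3, stem_ch, kernel_size=3, padding=1)
        # Compression: kernel == stride, so each token covers `stride` pixels.
        self.compressor = nn.Conv1d(stem_ch, latent, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(latent, heads, batch_first=True)
        # Nested-tensor fast path disabled so the output keeps its padded length.
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers,
                                              enable_nested_tensor=False)
        self.classifier = nn.Linear(latent, num_classes)

    def forward(self, pixels, lengths):
        x = torch.relu(self.stem(pixels))        # (B, 64, L)
        x = self.compressor(x).transpose(1, 2)   # (B, L/4, 256)
        # Assumed rule: token length = floor(pixel length / stride), at least 1.
        token_len = torch.div(lengths, self.stride, rounding_mode="floor").clamp(min=1)
        positions = torch.arange(x.size(1), device=x.device)
        valid = positions[None, :] < token_len[:, None]    # (B, L/4), True = real token
        x = self.backbone(x, src_key_padding_mask=~valid)  # (B, L/4, 256)
        mask = valid.unsqueeze(-1).to(x.dtype)
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)  # (B, 256)
        return self.classifier(pooled)           # (B, num_classes)
```

Calling the sketch with `pixels` of shape `(2, 3, 16)` and `lengths = torch.tensor([16, 8])` yields logits of shape `(2, num_classes)`.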

Key design decisions:

  • MaskedGlobalAvgPool clamps the denominator to ≥1 to prevent NaN on fully masked sequences
  • dynamic_collate_fn pads only to the longest sequence in each batch (not a global max), saving VRAM
  • Class-weighted CrossEntropyLoss for imbalanced TE family distributions
  • ReduceLROnPlateau + early stopping with F1-macro as the validation metric
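The first two decisions in the list above can be illustrated with a short PyTorch sketch. The helper names `collate_to_batch_max` and `masked_global_avg_pool` are hypothetical stand-ins, not the package's `dynamic_collate_fn` or `MaskedGlobalAvgPool`, and the `(pixels, label)` item layout is an assumption.

```python
import torch


def collate_to_batch_max(batch):
    """Pad each (pixels, label) item only to the longest sequence in this batch.

    Assumes `pixels` is a (3, L_i) float tensor and `label` is an int.
    """
    lengths = torch.tensor([pixels.shape[1] for pixels, _ in batch])
    max_len = int(lengths.max())
    padded = torch.zeros(len(batch), 3, max_len)
    for i, (pixels, _) in enumerate(batch):
        padded[i, :, : pixels.shape[1]] = pixels
    labels = torch.tensor([label for _, label in batch])
    return padded, lengths, labels


def masked_global_avg_pool(x, lengths):
    """Average (B, T, C) features over the first lengths[i] positions of each row."""
    positions = torch.arange(x.size(1), device=x.device)
    mask = (positions[None, :] < lengths[:, None]).unsqueeze(-1).to(x.dtype)
    # Clamping turns 0/0 (a fully masked row) into 0/1, so the result is 0, not NaN.
    denom = mask.sum(dim=1).clamp(min=1.0)
    return (x * mask).sum(dim=1) / denom
```

A function with this shape can be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument, so padding cost tracks the longest sequence in each batch rather than in the whole dataset.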

Quick Start

# 1. Clone & install
git clone https://huggingface.co/vedatonuryilmaz/deepgenopix
cd deepgenopix
pip install -e .

# 2. Generate synthetic test data (for pipeline validation)
python scripts/generate_synthetic.py --n_per_family 500

# 3. Run ETL pipeline (pixelize + stratified split)
python scripts/etl_pipeline.py --raw_fasta data/raw/te_sequences_synthetic.fa --max_records 10000

# 4. Train baseline preset
python scripts/train.py --preset baseline_v1 --data_dir data/processed --epochs 30

Installation

Requires Python ≥3.10 and PyTorch ≥2.0.

pip install -e .
# Or with dev dependencies:
pip install -e ".[dev]"

Optional dependencies for full functionality:

  • faiss-cpu: for Visual Sinkhorn-EM quantification (FAISS ANN index)
  • umap-learn: for UMAP visualization of embeddings
  • trackio: for experiment tracking and alerts

Usage

Inference on new sequences

from deepgenopix.inference import load_model, predict_single

model = load_model("checkpoints/best_model.pth", num_classes=12)
class_id, probs = predict_single(model, "ATGCGTACGT...")

Extract embeddings

from deepgenopix.model import DeepGenopixClassifier

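# A freshly constructed model has random weights; load a trained checkpoint
# (for example via load_model) before relying on the embeddings.
model = DeepGenopixClassifier(num_classes=12)
# pixels: pixelized batch, lengths: per-sequence lengths (both prepared beforehand)
embeddings = model.get_embeddings(pixels, lengths)  # (B, 256)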

Saliency maps

from deepgenopix.analysis import compute_saliency, plot_saliency

saliency = compute_saliency(model, pixels, lengths)
plot_saliency(saliency, sequence="ATGC...", output_path="saliency.png")

Phase 1: Biological Validation

Eight hyperparameter presets testing compression limits, latent capacity, depth, and perception scope:

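| Preset | Stride | Latent | Stem Kernel | Layers | BP/Token |
|---|---|---|---|---|---|
| baseline_v1 | 4 | 256 | 3 | 2 | 48 |
| stride2_v1 | 2 | 256 | 3 | 2 | 24 |
| stride8_v1 | 8 | 256 | 3 | 2 | 96 |
| latent128_v1 | 4 | 128 | 3 | 2 | 48 |
| latent768_v1 | 4 | 768 | 3 | 2 | 48 |
| layers4_v1 | 4 | 256 | 3 | 4 | 48 |
| stem5_v1 | 4 | 256 | 5 | 2 | 48 |
| stem7_v1 | 4 | 256 | 7 | 2 | 48 |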

Run a sweep:

python scripts/pipeline.py --presets all --n_per_family 2000 --epochs 30
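The BP/Token column of the preset table follows from the pixelizer (12 bases per pixel) multiplied by the compressor stride. The sketch below reproduces that arithmetic and the resulting change in pairwise attention cost relative to the baseline stride. The function names are hypothetical, and counting a partial final token through ceiling division is an assumption about how padding is handled.

```python
BASES_PER_PIXEL = 12  # 24-bit pixel / 2 bits per base


def bp_per_token(stride):
    """Bases covered by one transformer token for a given compressor stride."""
    return BASES_PER_PIXEL * stride


def num_tokens(length_bp, stride):
    """Token count for a sequence, rounding a partial final token up (assumed)."""
    return -(-length_bp // bp_per_token(stride))


def relative_attention_cost(stride, reference_stride=4):
    """Pairwise attention cost relative to the reference stride.

    Attention scales with tokens squared, so the ratio is the squared
    ratio of bp/token between the two settings.
    """
    return (bp_per_token(reference_stride) / bp_per_token(stride)) ** 2
```

For example, a 9,600 bp sequence becomes `num_tokens(9600, 4) == 200` tokens at the baseline stride, while `relative_attention_cost(2) == 4.0` and `relative_attention_cost(8) == 0.25` show the attention cost of the stride-2 and stride-8 presets relative to it.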

Phase 2: Visual Sinkhorn-EM

Resolving multi-mapping RNA-seq reads using Optimal Transport on visual embeddings instead of alignment scores.

from deepgenopix.quant import run_quantification, QuantConfig

result = run_quantification(
    read_sequences=reads,
    locus_sequences=loci,
    locus_labels=labels,
    encoder=model,
    mode="snk_em",  # or "snk_ot", "vis_em"
)
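The README does not spell out the algorithm behind the `snk_ot` and `snk_em` modes, so the following is only a generic entropy-regularised optimal transport (Sinkhorn-Knopp) sketch in plain Python, included to show the kind of computation the phase title refers to. The function name `sinkhorn_plan`, the cost matrix (for instance one minus cosine similarity between read and locus embeddings), and the `epsilon` and `n_iters` values are assumptions, not the package's implementation.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.1, n_iters=200):
    """Entropy-regularised transport plan between reads (rows) and loci (columns).

    cost[i][j] is the (assumed) embedding distance between read i and locus j;
    row_mass and col_mass are the target marginals and must each sum to 1.
    """
    n, m = len(row_mass), len(col_mass)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

In this toy setting, each row of the returned plan is a soft assignment of one read across candidate loci: a read whose embedding is equally close to two loci splits its mass evenly, while a read close to only one locus sends nearly all of its mass there.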

Phase 3 & 4 (Planned)

  • Knowledge Distillation: Visual encoder → GROVER embedding alignment
  • Multi-Modal: Expand input channels for ATAC-seq + methylation

Testing

pytest tests/ -v

Project Structure

deepgenopix/
├── src/deepgenopix/
│   ├── __init__.py      # Package exports
│   ├── config.py        # Architecture constants
│   ├── dataset.py       # PyTorch Dataset + collate_fn
│   ├── etl.py           # FASTA → pixelized tensor pipeline
│   ├── inference.py     # Model loading, prediction, batch inference
│   ├── model.py         # Neural architecture
│   ├── pixelizer.py     # DNA → RGB bitwise encoding
│   ├── quant.py         # Visual Sinkhorn-EM quantification
│   ├── trainer.py       # Training loop with validation + Trackio
│   ├── analysis.py      # Saliency, UMAP, confusion matrices
│   └── io_utils.py      # FASTA/JSON/CSV helpers
├── scripts/
│   ├── generate_synthetic.py  # Synthetic TE data generator
│   ├── etl_pipeline.py        # Full ETL with UCSC downloads
│   ├── train.py               # Hyperparameter sweep trainer
│   └── pipeline.py            # All-in-one: generate → ETL → train
├── tests/
│   ├── test_pixelizer.py
│   └── test_model.py
├── pyproject.toml
├── LICENSE
└── README.md

Citation

Based on the DeepSeek-OCR contextual optical compression paradigm:

@article{deepseek2025ocr,
  title={DeepSeek-OCR: Contexts Optical Compression},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}

For DeepGenopix specifically:

@software{deepgenopix2026,
  author = {Vedat Onur Yilmaz},
  title = {DeepGenopix: DNA-first Computer Vision for Transposable Element Analysis},
  year = {2026},
  url = {https://huggingface.co/vedatonuryilmaz/deepgenopix}
}

License

MIT; see LICENSE.
