GenoLite Hybrid — DNA Pattern Classifier

Model Overview

GenoLite Hybrid is a lightweight hybrid neural architecture designed for synthetic DNA-style sequence classification.

The model classifies 64-token nucleotide sequences into 3 categories:

| Label | Meaning |
|---|---|
| `OK` | Natural / balanced legal pair distribution |
| `MHAP` | Highly repetitive or motif-dominant structure |
| `PROBLEM` | Contains illegal or anomalous pair structures |
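
The input and output formats above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the `ACGT` alphabet, the label order, and the helper names `encode` and `decode_label` are all assumptions, since the card specifies only the sequence length and the three labels.

```python
# Assumed vocabulary and label order; the model card does not specify either.
NUCLEOTIDES = "ACGT"
TOKEN_TO_ID = {ch: i for i, ch in enumerate(NUCLEOTIDES)}
LABELS = ["OK", "MHAP", "PROBLEM"]
SEQ_LEN = 64  # fixed sequence length from the model card


def encode(sequence: str) -> list[int]:
    """Map a 64-character nucleotide string to integer token ids."""
    sequence = sequence.upper()
    if len(sequence) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} tokens, got {len(sequence)}")
    unknown = set(sequence) - set(NUCLEOTIDES)
    if unknown:
        raise ValueError(f"unknown tokens: {sorted(unknown)}")
    return [TOKEN_TO_ID[ch] for ch in sequence]


def decode_label(class_index: int) -> str:
    """Map a predicted class index back to its label string."""
    return LABELS[class_index]
```

For example, `encode("ACGT" * 16)` yields 64 integer ids, and a sequence of any other length is rejected before it reaches the model.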

The project focuses on:

sequence pattern learning,
anomaly detection,
repetition sensitivity,
hidden illegal-pair detection,
hybrid expert behavior.

Architecture

The model uses a hybrid expert-style architecture:

| Component | Role |
|---|---|
| CNN | Local motif / repetition detection |
| GRU | Sequential pattern understanding |
| Transformer | Global token relationships |
| Mamba-style block | Long-range sequence dynamics |
| Fusion Layer | Expert aggregation |
| Classifier | Final prediction |

During testing, the experts showed signs of specialization:

the CNN became highly active on repetitive sequences,
the Transformer and Mamba-style blocks contributed more strongly during hidden anomaly detection tasks.
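
The card does not describe how the fusion layer aggregates the experts. One simple possibility, shown here purely as an assumption, is that each expert emits a 3-class logit vector, the fusion step takes a weighted mean, and a softmax turns the result into class probabilities. The function names, the equal weights, and the logit values below are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(expert_logits: dict[str, list[float]],
                weights: dict[str, float]) -> list[float]:
    """Weighted mean of per-expert logits (hypothetical fusion rule)."""
    total_weight = sum(weights[name] for name in expert_logits)
    n_classes = len(next(iter(expert_logits.values())))
    fused = [0.0] * n_classes
    for name, logits in expert_logits.items():
        for i, value in enumerate(logits):
            fused[i] += weights[name] * value / total_weight
    return fused


# Made-up logits for the classes (OK, MHAP, PROBLEM), for illustration only.
expert_logits = {
    "cnn": [0.2, 2.5, 0.1],
    "gru": [0.5, 1.0, 0.3],
    "transformer": [0.4, 0.8, 0.9],
    "mamba": [0.3, 0.9, 0.7],
}
weights = {name: 1.0 for name in expert_logits}  # equal weighting assumed

probs = softmax(fuse_logits(expert_logits, weights))
```

In this toy input the CNN's strong `MHAP` logit dominates the mean, which mirrors the observation that the CNN is most active on repetitive sequences. A learned fusion layer would more likely operate on concatenated feature vectors than on logits.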

Training Setup

| Parameter | Value |
|---|---|
| Sequence Length | 64 |
| Classes | 3 |
| Dataset Size | 9,000 |
| Epochs | 3 |
| Batch Size | 3 |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Device | CPU |
| Hardware | Intel i7-4700MQ / 8GB RAM |
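
To make the scale of this run concrete, the settings can be collected in a small config object that also derives the number of optimizer steps. This is a sketch: the `TrainConfig` class is hypothetical, and the step counts assume that all 9,000 samples are used for training with no held-out split, which the card does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    seq_len: int = 64
    num_classes: int = 3
    dataset_size: int = 9_000
    epochs: int = 3
    batch_size: int = 3
    learning_rate: float = 1e-4
    optimizer: str = "AdamW"
    device: str = "cpu"

    def steps_per_epoch(self) -> int:
        # Assumes every sample is used for training and a partial last batch is kept.
        return math.ceil(self.dataset_size / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


config = TrainConfig()
print(config.steps_per_epoch(), config.total_steps())  # 3000 9000
```

Under those assumptions, a batch size of 3 gives 3,000 updates per epoch and 9,000 updates in total.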

Dataset Design

The dataset was fully synthetic and generated procedurally.

Each class included Easy, Medium, and Hard difficulty variants.

Key Dataset Features

controlled entropy variation,
repetition overlap between classes,
hidden illegal-pair injection,
motif dominance variation,
duplicate prevention,
partial sequence shuffling,
adversarial-style hard samples.

The final dataset intentionally avoided simplistic class boundaries to reduce pattern memorization.
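
A procedural generator with hidden illegal-pair injection and duplicate prevention might look roughly like the sketch below. The card does not define what makes a pair illegal, so `ILLEGAL_PAIRS` is a hypothetical rule (two forbidden adjacent token combinations), and the function names, motif lengths, and 50/50 base choice are likewise assumptions rather than the author's generator.

```python
import random

NUCLEOTIDES = "ACGT"
SEQ_LEN = 64
# Hypothetical rule: these adjacent pairs count as illegal. The real rule is not documented.
ILLEGAL_PAIRS = {"AG", "TC"}


def has_illegal_pair(seq: str) -> bool:
    return any(seq[i:i + 2] in ILLEGAL_PAIRS for i in range(len(seq) - 1))


def make_ok(rng: random.Random) -> str:
    """Random sequence that never forms an illegal adjacent pair."""
    seq = rng.choice(NUCLEOTIDES)
    while len(seq) < SEQ_LEN:
        allowed = [ch for ch in NUCLEOTIDES if seq[-1] + ch not in ILLEGAL_PAIRS]
        seq += rng.choice(allowed)
    return seq


def make_mhap(rng: random.Random) -> str:
    """Legal sequence dominated by one short repeated motif."""
    while True:
        motif_len = rng.randint(2, 8)
        motif = "".join(rng.choice(NUCLEOTIDES) for _ in range(motif_len))
        seq = (motif * (SEQ_LEN // motif_len + 1))[:SEQ_LEN]
        if not has_illegal_pair(seq):
            return seq


def make_problem(rng: random.Random) -> str:
    """Hide one illegal pair at a random position in an otherwise legal sequence."""
    base = make_mhap(rng) if rng.random() < 0.5 else make_ok(rng)
    pos = rng.randrange(SEQ_LEN - 1)
    pair = rng.choice(sorted(ILLEGAL_PAIRS))
    return base[:pos] + pair + base[pos + 2:]


def build_dataset(per_class: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate per_class unique (label, sequence) samples for each label."""
    rng = random.Random(seed)
    makers = {"OK": make_ok, "MHAP": make_mhap, "PROBLEM": make_problem}
    seen: set[str] = set()
    samples = []
    for label, make in makers.items():
        count = 0
        while count < per_class:
            seq = make(rng)
            if seq in seen:
                continue  # duplicate prevention
            seen.add(seq)
            samples.append((label, seq))
            count += 1
    return samples
```

Building half of the `PROBLEM` samples on a repetitive base is one way to create the repetition overlap between classes: the sequence looks like `MHAP` except for a single injected pair. Entropy control, partial shuffling, and difficulty tiers are left out of this sketch.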

Evaluation

The model was evaluated using:

unseen generated samples,
adversarial handcrafted sequences,
hidden illegal-pair tests,
repetition traps,
entropy-chaos tests,
human typo injections.
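
As an illustration of the last item, a typo-injection helper could perturb a clean sequence before it is passed to the model. The sketch assumes that a "typo" means substituting a different nucleotide at a random position; the card does not say whether the actual tests also used insertions, deletions, or out-of-alphabet characters. The name `inject_typos` is hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"  # assumed alphabet


def inject_typos(seq: str, n_typos: int, rng: random.Random) -> str:
    """Replace n_typos distinct positions with a different nucleotide."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_typos):
        choices = [ch for ch in NUCLEOTIDES if ch != chars[pos]]
        chars[pos] = rng.choice(choices)
    return "".join(chars)


clean = "ACGT" * 16
noisy = inject_typos(clean, n_typos=2, rng=random.Random(0))
```

Because substitutions keep the length at 64, the perturbed sequence remains a valid model input, and the number of changed positions equals `n_typos`.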

Observed Behavior

Strengths

Strong illegal-pair detection
Robust hidden anomaly detection
Good repetition awareness
Reduced false positives
Natural confidence calibration
Borderline uncertainty behavior

Example Behaviors

| Scenario | Model Behavior |
|---|---|
| Hidden illegal pair inside repetitive sequence | Detected successfully |
| Fully legal chaotic sequence | Usually classified as `OK` |
| Extremely repetitive but legal sequence | Classified as `MHAP` |
| Borderline sequences | Produced mixed confidence outputs |
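
One way to make "mixed confidence" measurable is to look at the gap between the two highest class probabilities. The helper below is a sketch under assumptions: the label order, the name `describe_prediction`, and the 0.2 margin are illustrative choices, not values taken from the project.

```python
LABELS = ["OK", "MHAP", "PROBLEM"]  # label order is an assumption


def describe_prediction(probs: list[float], margin: float = 0.2) -> tuple[str, bool]:
    """Return the top label and whether the top-two gap is below margin."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    borderline = (probs[top] - probs[runner_up]) < margin
    return LABELS[top], borderline
```

With this rule, a confident output such as `[0.05, 0.9, 0.05]` is reported as `MHAP` and not borderline, while `[0.45, 0.15, 0.40]` is reported as `OK` but flagged as borderline.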

Approximate Performance

The final model achieved approximately 98%+ accuracy on the project's custom adversarial tests and synthetic benchmark suites.

Note: This is not a formal biological benchmark and should not be interpreted as real-world genomic validation performance.

Important Disclaimer

This project is:

experimental,
educational,
synthetic-data based.

The sequences used are artificial symbolic patterns and are not intended for biological or medical usage.

This model should not be used for:

genomic research,
medical analysis,
biological decision-making,
real DNA interpretation.

Future Ideas

Possible future improvements:

variable-length sequence support,
true Mixture-of-Experts routing,
larger context windows,
contrastive representation learning,
real biological pretraining,
confidence-aware calibration,
visualization tools for expert activity.

Author Notes

This project was trained entirely on consumer hardware and evolved through iterative dataset engineering, adversarial testing, and architecture experimentation.

One of the most interesting observations was the emergence of hidden anomaly sensitivity, feature competition, and borderline confidence behavior, despite the fully synthetic nature of the dataset.
