GenoLite Hybrid: DNA Pattern Classifier
Model Overview
GenoLite Hybrid is a lightweight hybrid neural architecture designed for synthetic DNA-style sequence classification.
The model classifies 64-token nucleotide sequences into 3 categories:
| Label | Meaning |
|---|---|
| OK | Natural / balanced legal pair distribution |
| MHAP | Highly repetitive or motif-dominant structure |
| PROBLEM | Contains illegal or anomalous pair structures |
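As a minimal sketch of the input and output format, the snippet below maps a 64-character sequence to integer token IDs and a class index back to its label. The `A/C/G/T` vocabulary, the class order, and the helper names `encode` and `decode_label` are assumptions made for illustration; the model card does not specify them.

```python
# Hypothetical vocabulary: the model card does not specify the token alphabet.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}
LABELS = ["OK", "MHAP", "PROBLEM"]  # class order is an assumption
SEQ_LEN = 64


def encode(sequence: str) -> list[int]:
    """Map a 64-character nucleotide string to integer token IDs."""
    if len(sequence) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} tokens, got {len(sequence)}")
    try:
        return [VOCAB[base] for base in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unknown token: {err.args[0]!r}") from None


def decode_label(class_index: int) -> str:
    """Translate a predicted class index into its label name."""
    return LABELS[class_index]


ids = encode("ACGT" * 16)
print(ids[:8], decode_label(1))  # [0, 1, 2, 3, 0, 1, 2, 3] MHAP
```

Rejecting wrong-length or out-of-vocabulary input early keeps malformed sequences from being silently classified.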
The project focuses on:
- sequence pattern learning,
- anomaly detection,
- repetition sensitivity,
- hidden illegal-pair detection,
- hybrid expert behavior.
Architecture
The model uses a hybrid expert-style architecture:
| Component | Role |
|---|---|
| CNN | Local motif / repetition detection |
| GRU | Sequential pattern understanding |
| Transformer | Global token relationships |
| Mamba-style block | Long-range sequence dynamics |
| Fusion Layer | Expert aggregation |
| Classifier | Final prediction |
During testing, the experts showed signs of specialization:
- CNN became highly active on repetitive sequences,
- Transformer/Mamba contributed more strongly during hidden anomaly detection tasks.
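One way to look at this kind of specialization is to compare how much each expert contributes to the fused representation. The sketch below assumes fusion by simple concatenation and measures activity as each expert's share of the total L2 feature norm. Both choices, and the feature values themselves, are illustrative assumptions rather than the model's actual fusion rule or outputs.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def fuse(expert_features):
    """Concatenate per-expert feature vectors in a fixed order (assumed fusion rule)."""
    fused = []
    for name in sorted(expert_features):
        fused.extend(expert_features[name])
    return fused


def activity_share(expert_features):
    """Each expert's share of the total L2 feature norm (hypothetical activity measure)."""
    norms = {name: l2_norm(vec) for name, vec in expert_features.items()}
    total = sum(norms.values()) or 1.0
    return {name: norm / total for name, norm in norms.items()}


# Made-up feature values for illustration only; not outputs of the real model.
features = {
    "cnn": [3.0, 4.0],
    "gru": [0.0, 1.0],
    "transformer": [0.6, 0.8],
    "mamba": [0.0, 3.0],
}
shares = activity_share(features)
print(len(fuse(features)), max(shares, key=shares.get))  # 8 cnn
```

Tracking these shares per test category (repetitive inputs versus hidden anomalies) would make the specialization claim checkable.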
Training Setup
| Parameter | Value |
|---|---|
| Sequence Length | 64 |
| Classes | 3 |
| Dataset Size | 9,000 |
| Epochs | 3 |
| Batch Size | 3 |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Device | CPU |
| Hardware | Intel i7-4700MQ / 8GB RAM |
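The table can be captured as a small configuration object, which also makes the training budget easy to work out. This is a sketch only: the `TrainConfig` class and its field names are hypothetical, and the step counts assume that all 9,000 samples are used for training with no held-out split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    seq_len: int = 64
    num_classes: int = 3
    dataset_size: int = 9_000
    epochs: int = 3
    batch_size: int = 3
    learning_rate: float = 1e-4
    optimizer: str = "AdamW"
    device: str = "cpu"

    def steps_per_epoch(self) -> int:
        # Assumes the full dataset is used for training and a partial last batch is kept.
        return math.ceil(self.dataset_size / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


cfg = TrainConfig()
print(cfg.steps_per_epoch(), cfg.total_steps())  # 3000 9000
```

Under these assumptions the run amounts to 9,000 optimizer steps in total.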
Dataset Design
The dataset was fully synthetic and generated procedurally.
Each class included Easy, Medium, and Hard difficulty variants.
Key Dataset Features
- controlled entropy variation,
- repetition overlap between classes,
- hidden illegal-pair injection,
- motif dominance variation,
- duplicate prevention,
- partial sequence shuffling,
- adversarial-style hard samples.
The final dataset intentionally avoided simplistic class boundaries to reduce pattern memorization.
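The generator itself is not published here, so the following is only a sketch of how a procedural dataset with hidden illegal-pair injection and duplicate prevention could be built. The alphabet, the `ILLEGAL_PAIRS` set, and all function names are hypothetical; in particular, the card does not define which pairs count as illegal.

```python
import random

ALPHABET = "ACGT"             # assumed token alphabet
ILLEGAL_PAIRS = {"GG", "TA"}  # hypothetical rule; the card does not define illegal pairs
SEQ_LEN = 64


def has_illegal_pair(seq: str) -> bool:
    return any(seq[i:i + 2] in ILLEGAL_PAIRS for i in range(len(seq) - 1))


def legal_sequence(rng: random.Random, length: int) -> str:
    """Random sequence built token by token so that no adjacent pair is illegal."""
    seq = rng.choice(ALPHABET)
    while len(seq) < length:
        options = [b for b in ALPHABET if seq[-1] + b not in ILLEGAL_PAIRS]
        seq += rng.choice(options)
    return seq


def make_ok(rng):
    return legal_sequence(rng, SEQ_LEN)


def make_mhap(rng):
    """Repeat a short legal motif; retry until the wrap-around pair is legal too."""
    while True:
        motif = legal_sequence(rng, rng.randint(2, 6))
        seq = (motif * SEQ_LEN)[:SEQ_LEN]
        if not has_illegal_pair(seq):
            return seq


def make_problem(rng):
    """Hide one illegal pair at a random position inside a legal sequence."""
    seq = list(legal_sequence(rng, SEQ_LEN))
    pos = rng.randrange(SEQ_LEN - 1)
    seq[pos:pos + 2] = rng.choice(sorted(ILLEGAL_PAIRS))
    return "".join(seq)


def build_dataset(n_per_class: int, seed: int = 0):
    rng = random.Random(seed)
    makers = {"OK": make_ok, "MHAP": make_mhap, "PROBLEM": make_problem}
    seen, rows = set(), []
    for label, make in makers.items():
        count = 0
        while count < n_per_class:
            seq = make(rng)
            if seq in seen:  # duplicate prevention
                continue
            seen.add(seq)
            rows.append((seq, label))
            count += 1
    return rows


rows = build_dataset(n_per_class=5)
print(len(rows), rows[0][1])  # 15 OK
```

This sketch omits the difficulty tiers, entropy control, and partial shuffling listed above; those would be layered on top of the per-class generators.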
Evaluation
The model was evaluated using:
- unseen generated samples,
- adversarial handcrafted sequences,
- hidden illegal-pair tests,
- repetition traps,
- entropy-chaos tests,
- human typo injections.
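Test categories such as repetition traps and entropy-chaos tests can be characterized with two simple statistics: the Shannon entropy of the token distribution and the share of the most common k-mer. The sketch below is an assumed way to score test sequences, not the evaluation code used for this model; the function names and the default `k = 4` are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy of the token distribution in bits (0 = one symbol, 2 = uniform over 4)."""
    counts = Counter(seq)
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def motif_dominance(seq: str, k: int = 4) -> float:
    """Fraction of all k-mers taken by the single most common k-mer."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return Counter(kmers).most_common(1)[0][1] / len(kmers)


print(shannon_entropy("ACGT" * 16))  # 2.0
print(motif_dominance("A" * 64))     # 1.0
```

Note that `"ACGT" * 16` has maximal token entropy yet is fully periodic, which is why the two measures are useful together.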
Observed Behavior
Strengths
- Strong illegal-pair detection
- Robust hidden anomaly detection
- Good repetition awareness
- Reduced false positives
- Natural confidence calibration
- Borderline uncertainty behavior
Example Behaviors
| Scenario | Model Behavior |
|---|---|
| Hidden illegal pair inside repetitive sequence | Detected successfully |
| Fully legal chaotic sequence | Usually classified as OK |
| Extremely repetitive but legal sequence | Classified as MHAP |
| Borderline sequences | Produced mixed confidence outputs |
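Mixed confidence on borderline inputs can be made explicit by converting the classifier's logits to probabilities and checking the margin between the top two classes. The following is a minimal sketch: the class order, the `interpret` helper, the 0.2 margin threshold, and the example logits are all assumptions chosen for illustration.

```python
import math

LABELS = ["OK", "MHAP", "PROBLEM"]  # class order is an assumption


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits, margin_threshold=0.2):
    """Return (label, confidence, borderline). The threshold is an illustrative choice."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    borderline = probs[top] - probs[runner_up] < margin_threshold
    return LABELS[top], probs[top], borderline


print(interpret([4.0, 0.5, 0.1]))   # clear prediction
print(interpret([1.0, 0.9, -2.0]))  # top two classes nearly tied
```

A caller could route borderline cases to manual review instead of trusting the top label.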
Approximate Performance
The final model achieved approximately 98%+ practical benchmark accuracy across custom adversarial tests and synthetic benchmark suites.
Note: This is not a formal biological benchmark and should not be interpreted as real-world genomic validation performance.
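For readers who want to reproduce a comparable figure on their own test suite, accuracy and per-class recall can be computed as sketched below. The `evaluate` and `per_class_recall` helpers are hypothetical, and the toy labels are placeholders rather than this model's predictions.

```python
from collections import Counter

LABELS = ["OK", "MHAP", "PROBLEM"]


def evaluate(y_true, y_pred):
    """Return overall accuracy and a confusion Counter keyed by (true, predicted)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    confusion = Counter(zip(y_true, y_pred))
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / len(y_true), confusion


def per_class_recall(confusion):
    """Recall per label; NaN when a label has no true samples."""
    recall = {}
    for label in LABELS:
        support = sum(n for (t, _), n in confusion.items() if t == label)
        recall[label] = confusion[(label, label)] / support if support else float("nan")
    return recall


# Toy labels for illustration only; these are not the model's actual predictions.
y_true = ["OK", "OK", "MHAP", "MHAP", "PROBLEM", "PROBLEM"]
y_pred = ["OK", "OK", "MHAP", "OK", "PROBLEM", "PROBLEM"]
accuracy, confusion = evaluate(y_true, y_pred)
print(round(accuracy, 3), per_class_recall(confusion))
```

Reporting recall per class alongside the headline accuracy shows whether errors concentrate in one category, such as MHAP sequences mistaken for OK.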
Important Disclaimer
This project is:
- experimental,
- educational,
- synthetic-data based.
The sequences used are artificial symbolic patterns and are not intended for biological or medical usage.
This model should not be used for:
- genomic research,
- medical analysis,
- biological decision-making,
- real DNA interpretation.
Future Ideas
Possible future improvements:
- variable-length sequence support,
- true Mixture-of-Experts routing,
- larger context windows,
- contrastive representation learning,
- real biological pretraining,
- confidence-aware calibration,
- visualization tools for expert activity.
Author Notes
This project was trained entirely on consumer hardware and evolved through iterative dataset engineering, adversarial testing, and architecture experimentation.
Among the most interesting observations was the emergence of:
- hidden anomaly sensitivity,
- feature competition,
- borderline confidence behavior,
all despite the fully synthetic nature of the dataset.