| --- |
| license: other |
| license_name: brsx-open-license |
| license_link: https://brsxlabs.gt.tc/brsxlicense.html |
| pipeline_tag: text-classification |
| tags: |
| - DNA |
| - Biology |
| - gen |
| --- |
| |
| # GenoLite Hybrid — DNA Pattern Classifier |
|
|
| ## Model Overview |
|
|
| **GenoLite Hybrid** is a lightweight hybrid neural architecture designed for synthetic DNA-style sequence classification. |
|
|
| The model classifies 64-token nucleotide sequences into 3 categories: |
|
|
| | Label | Meaning | |
| | --------- | --------------------------------------------- | |
| | `OK` | Natural / balanced legal pair distribution | |
| | `MHAP` | Highly repetitive or motif-dominant structure | |
| | `PROBLEM` | Contains illegal or anomalous pair structures | |
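
As a rough illustration of the input and output format, the sketch below encodes a nucleotide string into 64 integer token IDs and maps a class index back to a label. The vocabulary (`A`, `C`, `G`, `T` plus a padding token), the ID assignments, and the label order are assumptions made for illustration; the model's actual tokenizer is not documented in this card.

```python
# Hypothetical vocabulary and label order; the real tokenizer may differ.
SEQ_LEN = 64
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
LABELS = ["OK", "MHAP", "PROBLEM"]


def encode(sequence: str, seq_len: int = SEQ_LEN) -> list[int]:
    """Map a nucleotide string to exactly `seq_len` token IDs.

    Raises KeyError on any character outside the assumed vocabulary.
    """
    ids = [VOCAB[base] for base in sequence.upper()[:seq_len]]
    # Right-pad short inputs so every example has the same length.
    return ids + [VOCAB["<pad>"]] * (seq_len - len(ids))


def decode_label(class_index: int) -> str:
    """Turn a predicted class index into its label string."""
    return LABELS[class_index]
```

Longer inputs are truncated and shorter ones are padded, so the encoded result always has 64 entries.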
|
|
| The project focuses on: |
|
|
| * sequence pattern learning, |
| * anomaly detection, |
| * repetition sensitivity, |
| * hidden illegal-pair detection, |
| * hybrid expert behavior. |
|
|
| --- |
|
|
| # Architecture |
|
|
| The model uses a hybrid expert-style architecture: |
|
|
| | Component | Role | |
| | ----------------- | ---------------------------------- | |
| | CNN | Local motif / repetition detection | |
| | GRU | Sequential pattern understanding | |
| | Transformer | Global token relationships | |
| | Mamba-style block | Long-range sequence dynamics | |
| | Fusion Layer | Expert aggregation | |
| | Classifier | Final prediction | |
|
|
| During testing, the experts showed signs of specialization: |
|
|
| * CNN became highly active on repetitive sequences, |
| * Transformer/Mamba contributed more strongly during hidden anomaly detection tasks. |
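
The card does not specify how the fusion layer combines the experts, so the following is only a sketch of one common approach: a softmax gate over per-expert scores followed by a weighted sum of the expert feature vectors. The function names, gate scores, and toy feature values are all hypothetical. Under this assumption, the gate weights double as a crude measure of how active each expert is for a given input.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_features: dict[str, list[float]],
         gate_scores: dict[str, float]) -> tuple[list[float], dict[str, float]]:
    """Weighted sum of expert feature vectors, plus the per-expert gate weights."""
    names = list(expert_features)
    weights = softmax([gate_scores[n] for n in names])
    dim = len(expert_features[names[0]])
    fused = [
        sum(w * expert_features[n][i] for n, w in zip(names, weights))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy example: the gate favors the CNN expert, as it might on a repetitive input.
features = {
    "cnn": [1.0, 0.0],
    "gru": [0.0, 1.0],
    "transformer": [0.5, 0.5],
    "mamba": [0.5, 0.5],
}
scores = {"cnn": 2.0, "gru": 0.0, "transformer": 0.0, "mamba": 0.0}
fused, activity = fuse(features, scores)
```

Inspecting `activity` per input is one simple way to see which expert dominates on which kind of sequence.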
|
|
| --- |
|
|
| # Training Setup |
|
|
| | Parameter | Value | |
| | --------------- | ------------------------- | |
| | Sequence Length | 64 | |
| | Classes | 3 | |
| | Dataset Size | 9,000 | |
| | Epochs | 3 | |
| | Batch Size | 3 | |
| | Learning Rate | 1e-4 | |
| | Optimizer | AdamW | |
| | Device | CPU | |
| | Hardware | Intel i7-4700MQ / 8GB RAM | |
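
For reference, the table translates into the step counts below. This sketch assumes that all 9,000 samples were used for training, with no held-out split and no dropped final batch, which the table does not state. The `TrainConfig` name is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from the training setup table."""
    seq_len: int = 64
    num_classes: int = 3
    dataset_size: int = 9_000
    epochs: int = 3
    batch_size: int = 3
    learning_rate: float = 1e-4
    optimizer: str = "AdamW"
    device: str = "cpu"

    @property
    def steps_per_epoch(self) -> int:
        # Assumes the whole dataset is iterated once per epoch.
        return math.ceil(self.dataset_size / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


config = TrainConfig()
print(config.steps_per_epoch, config.total_steps)  # 3000 9000
```

With a batch size of 3, each epoch is 3,000 optimizer steps, which helps explain why the run fits on a CPU.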
|
|
| --- |
|
|
| # Dataset Design |
|
|
| The dataset was fully synthetic and generated procedurally. |
|
|
| Each class included `Easy`, `Medium`, and `Hard` difficulty variants. |
|
|
| ## Key Dataset Features |
|
|
| * controlled entropy variation, |
| * repetition overlap between classes, |
| * hidden illegal-pair injection, |
| * motif dominance variation, |
| * duplicate prevention, |
| * partial sequence shuffling, |
| * adversarial-style hard samples. |
|
|
| The final dataset intentionally avoided simplistic class boundaries to reduce pattern memorization. |
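
Several of these features can be sketched with the standard library. The card does not define what makes a pair illegal, so this example assumes a "pair" is two adjacent tokens and uses a placeholder set `ILLEGAL_PAIRS`. The helper names and the generation procedure are likewise hypothetical and are not the author's generator.

```python
import math
import random
from collections import Counter

BASES = "ACGT"
# Placeholder rule: the card does not define which pairs are illegal.
ILLEGAL_PAIRS = {"AG", "TC"}


def shannon_entropy(seq: str) -> float:
    """Entropy of the base distribution in bits per symbol (0 to 2 for 4 bases)."""
    counts = Counter(seq)
    return -sum((c / len(seq)) * math.log2(c / len(seq)) for c in counts.values())


def has_illegal_pair(seq: str) -> bool:
    """True if any two adjacent tokens form a pair from the placeholder set."""
    return any(seq[i:i + 2] in ILLEGAL_PAIRS for i in range(len(seq) - 1))


def legal_sequence(rng: random.Random, length: int = 64) -> str:
    """Sample bases one at a time, skipping choices that would form an illegal pair."""
    seq = rng.choice(BASES)
    while len(seq) < length:
        options = [b for b in BASES if seq[-1] + b not in ILLEGAL_PAIRS]
        seq += rng.choice(options)
    return seq


def inject_illegal_pair(seq: str, rng: random.Random) -> str:
    """Overwrite two adjacent positions with an illegal pair at a random offset."""
    i = rng.randrange(len(seq) - 1)
    return seq[:i] + rng.choice(sorted(ILLEGAL_PAIRS)) + seq[i + 2:]


rng = random.Random(0)
seen: set[str] = set()  # duplicate prevention: a set keeps only unique sequences
while len(seen) < 5:
    seen.add(legal_sequence(rng))

# Hide one illegal pair inside an otherwise legal, highly repetitive sequence.
hidden = inject_illegal_pair("ACGT" * 16, rng)
```

Entropy gives a single number for controlling how balanced or motif-dominated a sample is, while the injection helper produces the kind of hard sample where a single anomaly sits inside a repetitive host.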
|
|
| --- |
|
|
| # Evaluation |
|
|
| The model was evaluated using: |
|
|
| * unseen generated samples, |
| * adversarial handcrafted sequences, |
| * hidden illegal-pair tests, |
| * repetition traps, |
| * entropy-chaos tests, |
| * human typo injections. |
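
As one possible reading of the typo-injection test, the sketch below replaces a few randomly chosen positions with a different base. Treating a typo as a single-character substitution is an assumption, and `inject_typos` is a hypothetical helper rather than the evaluation code actually used.

```python
import random

BASES = "ACGT"


def inject_typos(seq: str, n_typos: int, rng: random.Random) -> str:
    """Replace `n_typos` distinct positions with a different base."""
    chars = list(seq)
    for i in rng.sample(range(len(chars)), n_typos):
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)


rng = random.Random(42)
clean = "ACGT" * 16
noisy = inject_typos(clean, n_typos=3, rng=rng)
# Count how many positions differ from the clean sequence.
changed = sum(a != b for a, b in zip(clean, noisy))
```

Because the positions are distinct and each replacement differs from the original base, exactly `n_typos` characters change and the length stays at 64.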
|
|
| ## Observed Behavior |
|
|
| ### Strengths |
|
|
| * Strong illegal-pair detection |
| * Robust hidden anomaly detection |
| * Good repetition awareness |
| * Reduced false positives |
| * Natural confidence calibration |
| * Borderline uncertainty behavior |
|
|
| ### Example Behaviors |
|
|
| | Scenario | Model Behavior | |
| | ---------------------------------------------- | --------------------------------- | |
| | Hidden illegal pair inside repetitive sequence | Detected successfully | |
| | Fully legal chaotic sequence | Usually classified as `OK` | |
| | Extremely repetitive but legal sequence | Classified as `MHAP` | |
| | Borderline sequences | Produced mixed confidence outputs | |
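
"Mixed confidence" can be made concrete by looking at the class probabilities. The sketch below converts logits to probabilities and flags a prediction as borderline when the gap between the top two classes is small. The label order, the 0.2 margin threshold, and the example logits are all hypothetical.

```python
import math

LABELS = ["OK", "MHAP", "PROBLEM"]  # assumed label order


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, margin_threshold=0.2):
    """Return (label, top probability, is_borderline)."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    # Margin between the best and second-best class.
    margin = probs[ranked[0]] - probs[ranked[1]]
    return LABELS[ranked[0]], probs[ranked[0]], margin < margin_threshold


confident = predict([0.1, 0.2, 4.0])    # one class clearly dominates
borderline = predict([1.0, 1.1, -2.0])  # two classes are nearly tied
```

The first call yields a `PROBLEM` prediction that is not flagged, while the second yields `MHAP` with the borderline flag set.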
|
|
| --- |
|
|
| # Approximate Performance |
|
|
| The final model achieved approximately: |
|
|
| ```text |
| 98%+ practical benchmark accuracy |
| ``` |
|
|
| across custom adversarial tests and synthetic benchmark suites. |
|
|
| Note: |
| This is not a formal biological benchmark and should not be interpreted as real-world genomic validation performance. |
|
|
| --- |
|
|
| # Important Disclaimer |
|
|
| This project is: |
|
|
| * experimental, |
| * educational, |
| * synthetic-data based. |
|
|
| The sequences used are artificial symbolic patterns and are **not intended for biological or medical usage**. |
|
|
| This model should not be used for: |
|
|
| * genomic research, |
| * medical analysis, |
| * biological decision-making, |
| * real DNA interpretation. |
|
|
| --- |
|
|
| # Future Ideas |
|
|
| Possible future improvements: |
|
|
| * variable-length sequence support, |
| * true Mixture-of-Experts routing, |
| * larger context windows, |
| * contrastive representation learning, |
| * real biological pretraining, |
| * confidence-aware calibration, |
| * visualization tools for expert activity. |
|
|
| --- |
|
|
| # Author Notes |
|
|
| This project was trained entirely on consumer hardware and evolved through iterative dataset engineering, adversarial testing, and architecture experimentation. |
|
|
| Among the most interesting observations was the emergence of hidden anomaly sensitivity, feature competition, and borderline confidence behavior, despite the fully synthetic nature of the dataset. |
|
|