	---
	license: other
	license_name: brsx-open-license
	license_link: https://brsxlabs.gt.tc/brsxlicense.html
	pipeline_tag: text-classification
	tags:
	- DNA
	- Biology
	- gen
	---

	# GenoLite Hybrid — DNA Pattern Classifier

	## Model Overview

	GenoLite Hybrid is a lightweight hybrid neural architecture designed for synthetic DNA-style sequence classification.

	The model classifies 64-token nucleotide sequences into 3 categories:

	| Label     | Meaning                                       |
	| --------- | --------------------------------------------- |
	| `OK`      | Natural / balanced legal pair distribution    |
	| `MHAP`    | Highly repetitive or motif-dominant structure |
	| `PROBLEM` | Contains illegal or anomalous pair structures |
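
As a rough illustration of the input and output contract described above, the sketch below encodes a 64-token sequence into integer ids and maps a class index back to a label. The alphabet (`ACGT`), the label order, and the `UNK_ID` fallback are assumptions; the model card does not specify them.

```python
# Assumed alphabet and label order; the model card does not specify either.
ALPHABET = "ACGT"
LABELS = ["OK", "MHAP", "PROBLEM"]
SEQ_LEN = 64  # sequence length stated in the model card

TOKEN_TO_ID = {ch: i for i, ch in enumerate(ALPHABET)}
UNK_ID = len(ALPHABET)  # hypothetical id for any out-of-alphabet symbol


def encode(sequence: str) -> list[int]:
    """Map a 64-character sequence to integer token ids."""
    if len(sequence) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} tokens, got {len(sequence)}")
    return [TOKEN_TO_ID.get(ch, UNK_ID) for ch in sequence.upper()]


def decode_label(class_index: int) -> str:
    """Map a predicted class index back to its label name."""
    return LABELS[class_index]


ids = encode("ACGT" * 16)
print(ids[:8], decode_label(1))
```

Rejecting sequences of the wrong length up front keeps the fixed 64-token assumption explicit instead of silently padding or truncating.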

	The project focuses on:

	* sequence pattern learning,
	* anomaly detection,
	* repetition sensitivity,
	* hidden illegal-pair detection,
	* hybrid expert behavior.

	---

	# Architecture

	The model uses a hybrid expert-style architecture:

	| Component         | Role                               |
	| ----------------- | ---------------------------------- |
	| CNN               | Local motif / repetition detection |
	| GRU               | Sequential pattern understanding   |
	| Transformer       | Global token relationships         |
	| Mamba-style block | Long-range sequence dynamics       |
	| Fusion Layer      | Expert aggregation                 |
	| Classifier        | Final prediction                   |

	During testing, the experts showed signs of specialization:

	* CNN became highly active on repetitive sequences,
	* Transformer/Mamba contributed more strongly during hidden anomaly detection tasks.
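
One way to picture expert aggregation is a gate that weights each expert's class scores before a final softmax. The sketch below is only an illustration under that assumption: the model card does not say how the fusion layer combines experts (it may fuse hidden features instead of logits), and the `fuse` helper, the gate scores, and all numbers are hypothetical.

```python
import math

LABELS = ["OK", "MHAP", "PROBLEM"]


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_logits: dict[str, list[float]], gate_scores: dict[str, float]) -> list[float]:
    """Weighted sum of per-expert class logits, then softmax over classes.

    Assumption: each expert emits one logit per class and the experts are
    weighted by a softmax gate. The real fusion layer may work differently.
    """
    names = list(expert_logits)
    weights = softmax([gate_scores[n] for n in names])
    fused = [
        sum(w * expert_logits[n][c] for w, n in zip(weights, names))
        for c in range(len(LABELS))
    ]
    return softmax(fused)


# Made-up numbers: the CNN expert dominates the gate on a repetitive input.
logits = {
    "cnn": [0.1, 3.0, 0.2],
    "gru": [0.5, 1.0, 0.3],
    "transformer": [0.4, 0.8, 0.6],
    "mamba": [0.3, 0.9, 0.5],
}
gates = {"cnn": 2.0, "gru": 0.5, "transformer": 0.2, "mamba": 0.2}
probs = fuse(logits, gates)
print(LABELS[probs.index(max(probs))])
```

Logging the gate weights per input is one simple way to observe the kind of specialization described above.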

	---

	# Training Setup

	| Parameter       | Value                     |
	| --------------- | ------------------------- |
	| Sequence Length | 64                        |
	| Classes         | 3                         |
	| Dataset Size    | 9,000                     |
	| Epochs          | 3                         |
	| Batch Size      | 3                         |
	| Learning Rate   | 1e-4                      |
	| Optimizer       | AdamW                     |
	| Device          | CPU                       |
	| Hardware        | Intel i7-4700MQ / 8GB RAM |
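
To make the scale of this setup concrete, the sketch below derives the number of optimizer steps from the values in the table. It assumes all 9,000 samples are used for training with no held-out split, which the model card does not state; `TrainConfig` is a hypothetical container, not part of the project.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the training setup table.
    seq_len: int = 64
    num_classes: int = 3
    dataset_size: int = 9_000
    epochs: int = 3
    batch_size: int = 3
    learning_rate: float = 1e-4

    @property
    def steps_per_epoch(self) -> int:
        # Assumes every sample is used for training (no held-out split).
        return math.ceil(self.dataset_size / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


cfg = TrainConfig()
print(cfg.steps_per_epoch, cfg.total_steps)  # 3000 9000
```

Under these assumptions the run amounts to 3,000 updates per epoch and 9,000 updates in total.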

	---

	# Dataset Design

	The dataset was fully synthetic and generated procedurally.

	Each class included `Easy`, `Medium`, and `Hard` difficulty variants.

	## Key Dataset Features

	* controlled entropy variation,
	* repetition overlap between classes,
	* hidden illegal-pair injection,
	* motif dominance variation,
	* duplicate prevention,
	* partial sequence shuffling,
	* adversarial-style hard samples.

	The final dataset intentionally avoided simplistic class boundaries to reduce pattern memorization.
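
A minimal sketch of how such a procedural generator might look is shown below. It covers only three of the listed features (motif dominance, hidden illegal-pair injection, duplicate prevention). The rule for what counts as an illegal pair is not given in this model card, so `ILLEGAL_PAIRS`, the `ACGT` alphabet, and every function name here are hypothetical.

```python
import random

ALPHABET = "ACGT"  # assumed alphabet
SEQ_LEN = 64
# Hypothetical rule: these adjacent pairs count as "illegal".
ILLEGAL_PAIRS = {"CG", "TA"}


def is_legal(seq: str) -> bool:
    """True if no adjacent pair of symbols is in ILLEGAL_PAIRS."""
    return all(seq[i:i + 2] not in ILLEGAL_PAIRS for i in range(len(seq) - 1))


def make_ok(rng: random.Random) -> str:
    """Random sequence built left to right, skipping symbols that would form an illegal pair."""
    seq = rng.choice(ALPHABET)
    while len(seq) < SEQ_LEN:
        allowed = [ch for ch in ALPHABET if seq[-1] + ch not in ILLEGAL_PAIRS]
        seq += rng.choice(allowed)
    return seq


def make_mhap(rng: random.Random) -> str:
    """Tile a short motif across the whole sequence; retry until the tiling is legal."""
    while True:
        motif = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(2, 4)))
        seq = (motif * SEQ_LEN)[:SEQ_LEN]
        if is_legal(seq):
            return seq


def make_problem(rng: random.Random) -> str:
    """Overwrite two adjacent positions of a legal sequence with an illegal pair."""
    seq = make_ok(rng)
    pos = rng.randrange(SEQ_LEN - 1)
    pair = rng.choice(sorted(ILLEGAL_PAIRS))
    return seq[:pos] + pair + seq[pos + 2:]


def build_dataset(per_class: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate per_class unique (sequence, label) samples for each label."""
    rng = random.Random(seed)
    makers = {"OK": make_ok, "MHAP": make_mhap, "PROBLEM": make_problem}
    seen: set[str] = set()
    data = []
    for label, make in makers.items():
        count = 0
        while count < per_class:
            seq = make(rng)
            if seq not in seen:  # duplicate prevention
                seen.add(seq)
                data.append((seq, label))
                count += 1
    return data


data = build_dataset(per_class=20)
print(len(data), data[0][1], data[-1][1])
```

Difficulty variants could be layered on top of this, for example by shortening the motif run in `MHAP` samples or placing the illegal pair inside a repetitive region, but those choices are not documented here.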

	---

	# Evaluation

	The model was evaluated using:

	* unseen generated samples,
	* adversarial handcrafted sequences,
	* hidden illegal-pair tests,
	* repetition traps,
	* entropy-chaos tests,
	* human typo injections.
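
Reporting accuracy separately for each test suite makes it easier to see where the model is weak. The sketch below assumes predictions are collected as `(suite, true_label, predicted_label)` tuples; the suite names and the sample records are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_suite(records):
    """records: iterable of (suite_name, true_label, predicted_label)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for suite, truth, predicted in records:
        totals[suite] += 1
        hits[suite] += int(truth == predicted)
    return {suite: hits[suite] / totals[suite] for suite in totals}


# Made-up predictions, only to show the shape of the input.
records = [
    ("unseen", "OK", "OK"),
    ("unseen", "MHAP", "MHAP"),
    ("hidden_illegal_pair", "PROBLEM", "PROBLEM"),
    ("hidden_illegal_pair", "PROBLEM", "OK"),
    ("repetition_trap", "MHAP", "MHAP"),
]
print(accuracy_by_suite(records))
```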

	## Observed Behavior

	### Strengths

	* Strong illegal-pair detection
	* Robust hidden anomaly detection
	* Good repetition awareness
	* Reduced false positives
	* Natural confidence calibration
	* Lower, mixed confidence on borderline sequences

	### Example Behaviors

	| Scenario                                       | Model Behavior                    |
	| ---------------------------------------------- | --------------------------------- |
	| Hidden illegal pair inside repetitive sequence | Detected successfully             |
	| Fully legal chaotic sequence                   | Usually classified as `OK`        |
	| Extremely repetitive but legal sequence        | Classified as `MHAP`              |
	| Borderline sequences                           | Produced mixed confidence outputs |
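
Mixed confidence on borderline inputs can be surfaced explicitly by comparing the two highest class probabilities. The sketch below assumes the model returns one probability per class in the order `OK`, `MHAP`, `PROBLEM`; the `margin` threshold and the `describe` helper are hypothetical and not part of the project.

```python
LABELS = ["OK", "MHAP", "PROBLEM"]  # assumed output order


def describe(probs: list[float], margin: float = 0.2) -> str:
    """Return the top label, marked borderline when the top two classes are close.

    `margin` is a hypothetical threshold, not a value from the model card.
    """
    ranked = sorted(zip(probs, LABELS), reverse=True)
    (top_p, top_label), (second_p, _) = ranked[0], ranked[1]
    if top_p - second_p < margin:
        return f"{top_label} (borderline)"
    return top_label


print(describe([0.05, 0.90, 0.05]))  # MHAP
print(describe([0.48, 0.10, 0.42]))  # OK (borderline)
```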

	---

	# Approximate Performance

	The final model achieved approximately:

	```text
	98%+ practical benchmark accuracy
	```

	across custom adversarial tests and synthetic benchmark suites.

	Note:
	This is not a formal biological benchmark and should not be interpreted as real-world genomic validation performance.

	---

	# Important Disclaimer

	This project is:

	* experimental,
	* educational,
	* synthetic-data based.

	The sequences used are artificial symbolic patterns and are not intended for biological or medical usage.

	This model should not be used for:

	* genomic research,
	* medical analysis,
	* biological decision-making,
	* real DNA interpretation.

	---

	# Future Ideas

	Possible future improvements:

	* variable-length sequence support,
	* true Mixture-of-Experts routing,
	* larger context windows,
	* contrastive representation learning,
	* real biological pretraining,
	* confidence-aware calibration,
	* visualization tools for expert activity.

	---

	# Author Notes

	This project was trained entirely on consumer hardware and evolved through iterative dataset engineering, adversarial testing, and architecture experimentation.

	One of the most interesting observations was the emergence of:

	* hidden anomaly sensitivity,
	* feature competition,
	* and borderline confidence behavior,

	despite the fully synthetic nature of the dataset.