| ---
|
| license: apache-2.0
|
| tags:
|
| - biology
|
| - genomics
|
| - dna
|
| - mixture-of-experts
|
| - modular
|
| library_name: pytorch
|
| pipeline_tag: text-generation
|
| ---
|
|
|
| # 🌼 DaisyChain Genomics
|
|
|
| A **modular genomic mind**: four dense ~74M DNA/RNA specialists (≈295M params total,
|
| **under Carbon-500M**) behind a learned router. Instead of one monolithic foundation
|
| model, DaisyChain trains one crisp specialist per biological domain, each
|
| **distilled per-domain from [Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**,
|
| and routes each sequence to its home specialist.
|
|
|
| | Specialist | Domain | Params |
|
| |---|---|---|
|
| | `eukaryote` | Eukaryotic genomic DNA | ~74M |
|
| | `prokaryote` | Bacterial / prokaryotic DNA | ~74M |
|
| | `mrna` | Mature mRNA (coding transcript) | ~74M |
|
| | `mrna_splice` | Pre-mRNA / splice-site regions | ~74M |
|
|
|
| ## The router
|
|
|
| A small learned router reads each specialist's **surprise** (bits/base) and a PCA of its
|
| **hidden state**, then predicts the home domain, recovering the bias corrections that a plain
|
| argmin-perplexity rule misses. Held-out routing accuracy: **100.0%** (vs 87.5% for argmin).
|
| Only one ~74M specialist runs per query, so inference is ~7× cheaper per token than the
|
| 500M monolith.
|
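The argmin-perplexity baseline that the router is compared against can be sketched in a few lines. This is only an illustration, not the shipped router: it assumes each specialist has already produced per-base natural-log probabilities for the query, and the helper names `bits_per_base` and `argmin_route` and the toy numbers are hypothetical. The learned router would replace the final `min` with a classifier over these surprise values plus the hidden-state PCA features.

```python
import math

def bits_per_base(log_probs_nats):
    """Mean surprise in bits/base from per-base natural-log probabilities."""
    if not log_probs_nats:
        raise ValueError("need at least one scored base")
    mean_nll_nats = -sum(log_probs_nats) / len(log_probs_nats)
    return mean_nll_nats / math.log(2)  # convert nats to bits

def argmin_route(surprise_by_domain):
    """Baseline rule: pick the specialist with the lowest bits/base."""
    return min(surprise_by_domain, key=surprise_by_domain.get)

# Made-up per-base log-probs for one query under each specialist.
toy_scores = {
    "eukaryote":   [math.log(0.25)] * 8,  # uniform guess over 4 bases
    "prokaryote":  [math.log(0.40)] * 8,
    "mrna":        [math.log(0.30)] * 8,
    "mrna_splice": [math.log(0.28)] * 8,
}
surprise = {name: bits_per_base(lp) for name, lp in toy_scores.items()}
home = argmin_route(surprise)
print(home, round(surprise[home], 3))
```

A uniform guess over four bases costs 2 bits/base, so any specialist scoring below that has learned something about the sequence. The argmin rule trusts raw scores, which is where a systematic per-specialist bias can mislead it.
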
|
|
| ## How each specialist is built
|
|
|
| Each specialist is trained with interleaved **continued pretraining** (next-token CE on its domain) and **offline
|
| knowledge distillation** from Carbon-500M (soft-target, plus a factorized per-nucleotide
|
| variant via Carbon's FNS branch), i.e. cBTM-style domain experts, iterated per expert.
|
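The two objectives named above can be written out for a single position. This is a minimal sketch under stated assumptions: the temperature, the `T^2` scaling (the usual Hinton-style convention), the helper names, and the toy logits are illustrative and are not taken from the training code. The factorized per-nucleotide (FNS) variant is not shown. In an offline setup the teacher logits would be precomputed and loaded rather than produced on the fly.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_ce(student_logits, target):
    """Continued-pretraining term: cross-entropy against the true next token."""
    return -math.log(softmax(student_logits)[target])

def soft_target_kd(student_logits, teacher_logits, temperature=2.0):
    """Distillation term: T^2 * KL(teacher || student) on softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return temperature ** 2 * kl

# Toy logits over a 4-token vocabulary; all numbers are made up.
teacher = [2.0, 0.5, 0.1, -1.0]
student = [1.0, 0.8, 0.2, -0.5]

ce = next_token_ce(student, target=0)
kd = soft_target_kd(student, teacher)
kd_when_matched = soft_target_kd(teacher, teacher)
print(f"CE={ce:.4f}  KD={kd:.4f}  KD(matched)={kd_when_matched:.4f}")
```

The distillation term drops to zero once the student reproduces the teacher's distribution, while the cross-entropy term keeps pulling toward the domain data. How the two phases are scheduled against each other is not specified here.
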
|
|
| ## Capability vs Carbon-500M (the fair baseline)
|
|
|
| | metric | DaisyChain | Carbon-500M |
|
| |---|---|---|
|
| | likelihood, bits/base, **base-pair (FNS)** (↓) | 1.88 | 1.79 |
|
| | seq-recovery eukaryote, FNS base-level (↑) | 31.5% | 38.9% |
|
| | seq-recovery bacteria, FNS base-level (↑) | 40.9% | 54.1% |
|
|
|
| *Likelihood is the base-pair (FNS) score computed with Carbon's own `score_sequence` / `compute_bp_probs`,
|
| verified to 6e-08. The progress log below uses this same metric; the earlier 6-mer joint CE, a softer proxy, is kept there in a collapsed table.*
|
|
|
| DaisyChain is behind a 500M/1T-token monolith but within striking distance at ~15% of the active
|
| compute, and the gap keeps closing with more per-domain training (work in progress).
|
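The compute figures quoted in this card follow from parameter counts alone. A quick check, assuming per-token compute scales roughly with active parameters (a rule of thumb, not a measured FLOP count) and ignoring whatever the routing step itself costs:

```python
SPECIALIST_PARAMS = 74e6   # approximate size of one specialist
CARBON_PARAMS = 500e6      # nominal size of Carbon-500M
NUM_SPECIALISTS = 4

active_fraction = SPECIALIST_PARAMS / CARBON_PARAMS   # share of the monolith's parameters in use
cost_ratio = CARBON_PARAMS / SPECIALIST_PARAMS        # how many times larger the monolith is
total_params = NUM_SPECIALISTS * SPECIALIST_PARAMS    # all four specialists together

print(f"active fraction ~{active_fraction:.1%}")
print(f"cost ratio ~{cost_ratio:.1f}x")
print(f"total ~{total_params / 1e6:.0f}M")
```

This gives an active fraction just under 15% and a cost ratio just under 7. Four rounded 74M specialists sum to 296M, in line with the stated total of about 295M.
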
|
|
| ### Progress log: closing the gap to Carbon-500M (BASE-PAIR / FNS metric)
|
|
|
| Re-baselined on Carbon's **base-pair (FNS)** metric (`score_sequence`, verified to 6e-08), with each
|
| round's full 4-specialist set re-scored. **trailing** = `mean ours - mean Carbon` (Carbon mean 1.7870).
|
| The final row (1.8622 / +0.0752) matches the independently verified current standing.
|
|
|
| | date | update | euk | prok | mrna | splice | mean | **trailing** |
|
| |---|---|---|---|---|---|---|---|
|
| | 2026-06-22 | baseline (round-1 distill + router) | 1.965 | 1.994 | 1.910 | 1.935 | 1.9510 | +0.1640 |
|
| | 2026-06-23 | mrna 12k-distill | 1.965 | 1.994 | 1.927 | 1.935 | 1.9552 | +0.1682 |
|
| | 2026-06-23 | prokaryote round 1 | 1.965 | 1.918 | 1.927 | 1.935 | 1.9363 | +0.1493 |
|
| | 2026-06-24 | eukaryote | 1.928 | 1.918 | 1.927 | 1.935 | 1.9272 | +0.1403 |
|
| | 2026-06-25 | mrna | 1.928 | 1.918 | 1.788 | 1.935 | 1.8924 | +0.1054 |
|
| | 2026-06-25 | mrna_splice | 1.928 | 1.918 | 1.788 | 1.873 | 1.8768 | +0.0898 |
|
| | 2026-06-26 | prokaryote round 2 | 1.928 | 1.914 | 1.788 | 1.873 | 1.8758 | +0.0889 |
|
| | 2026-06-27 | eukaryote round 2 (routing 100%) | 1.924 | 1.914 | 1.788 | 1.873 | 1.8747 | +0.0878 |
|
| | 2026-06-29 | Muon passes: mrna + prokaryote | 1.924 | 1.868 | 1.784 | 1.873 | **1.8622** | **+0.0752** |
|
|
|
| (The mrna 12k-distill round *worsened* base-pair likelihood (+0.1640 → +0.1682) even though it improved
|
| 6-mer CE; the later base+distill mrna round fixed it. The soft 6-mer metric hid that.)
|
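The **mean** and **trailing** columns of the progress log can be recomputed from the per-domain columns. The sketch below does so for the final row; the function name `trailing` is hypothetical, and because the per-domain values in the table are rounded to three decimals, a recomputed figure may differ from the printed one in the last digit.

```python
CARBON_MEAN = 1.7870  # Carbon-500M mean bits/base, as stated in the progress log

def trailing(per_domain_bits, carbon_mean=CARBON_MEAN):
    """Return (mean of our specialists, gap to Carbon); a positive gap means we trail."""
    mean_ours = sum(per_domain_bits) / len(per_domain_bits)
    return mean_ours, mean_ours - carbon_mean

# Final row of the table, in column order: euk, prok, mrna, splice.
final_row = [1.924, 1.868, 1.784, 1.873]
mean_ours, gap = trailing(final_row)
print(f"mean={mean_ours:.4f}  trailing={gap:+.4f}")
```
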
|
|
| <details><summary>Earlier 6-mer joint CE trajectory (softer proxy, kept for reference)</summary>
|
|
|
| | date | update | mean DaisyChain (6-mer CE) | mean Carbon-500M (6-mer CE) | **trailing** |
|
| |---|---|---|---|---|
|
| | 2026-06-22 | baseline (94.8% routing) | 1.8644 | 1.7502 | +0.1142 |
|
| | 2026-06-23 | mrna 12k-distill (95.7%) | 1.8599 | 1.7502 | +0.1096 |
|
| | 2026-06-23 | prokaryote round 1 (96.2%) | 1.8528 | 1.7502 | +0.1026 |
|
| | 2026-06-24 | eukaryote (98.0%) | 1.8413 | 1.7502 | +0.0911 |
|
| | 2026-06-25 | mrna (98.3%) | 1.8075 | 1.7502 | +0.0573 |
|
| | 2026-06-25 | mrna_splice (99.8%) | 1.7959 | 1.7502 | +0.0457 |
|
| | 2026-06-26 | prokaryote round 2 (99.8%) | 1.7929 | 1.7502 | +0.0427 |
|
|
|
| </details>
|
|
|
| *The main progress-log table is the **base-pair (FNS)** trajectory (Carbon's actual metric), fully re-baselined.
|
| Honest current standing: mean **1.8622 vs 1.7870 (+0.0752)**, base-pair **wins ~36/100**, **no domain ahead**
|
| (splice's earlier "beats Carbon" was only on the 6-mer proxy). Recovery uses Carbon's FNS base-level argmax
|
| decoder (per-base accuracy, next-30bp, n=50, ctx=1536). The Carbon numbers are our own measurement of Carbon-500M (a draft model,
|
| explicitly not benchmark-competitive), not Carbon's published 3B figures.*
|
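Per-base sequence recovery, as reported in the capability table, is the fraction of positions where the decoded continuation matches the true one, averaged over samples. The sketch below shows only that scoring step on made-up 10-base strings. The helper names are hypothetical, and the evaluation described above additionally involves Carbon's FNS argmax decoder, 30-base continuations, 50 samples and a 1536-token context, none of which is reproduced here.

```python
def per_base_recovery(predicted, reference):
    """Fraction of positions where the predicted base equals the reference base."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have equal length")
    if not reference:
        raise ValueError("empty reference")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

def mean_recovery(pairs):
    """Average per-base recovery over (predicted, reference) pairs."""
    return sum(per_base_recovery(p, r) for p, r in pairs) / len(pairs)

# Two toy continuations: the first has 2 mismatches out of 10, the second is exact.
pairs = [
    ("ACGTACGTAC", "ACGTTCGTAA"),
    ("GGGTTTAAAC", "GGGTTTAAAC"),
]
first = per_base_recovery(*pairs[0])
overall = mean_recovery(pairs)
print(f"first={first:.1%}  mean={overall:.1%}")
```
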
|
|
| ## Usage
|
|
|
| ```python
|
| from daisychain import DaisyChain
|
| dc = DaisyChain(root=".", device="cpu")
|
| home, bits_per_base = dc.route("ACGTACGT...") # which domain?
|
| print(home, bits_per_base)
|
| print(dc.generate(home, length=180)) # sample from the home specialist
|
| ```
|
|
|
| Files: `daisychain.py` (inference), `model.py` / `specialist_presets.py` /
|
| `spike_tokenizer.py` / `registry.py` (architecture), `tokenizer.json`,
|
| `<domain>/model.safetensors` (the 4 specialists), `router2.pt` (router).
|
|
|
| > Interactive demo: the **DaisyChain Space** routes DNA in real time.
|
|
|
| ## Citation
|
|
|
| **If you use these models, please cite the author, Dean Byrne (Quazim0t0):**
|
|
|
| ```bibtex
|
| @misc{byrne2026daisychain,
|
| title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
|
| author = {Byrne, Dean},
|
| year = {2026},
|
| howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
|
| note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
|
| from Carbon-500M behind a learned router}
|
| }
|
| ```
|
|
|
| ### Built on
|
|
|
| DaisyChain stands on these works:
|
|
|
| ```bibtex
|
| @misc{carbon2025,
|
| title = {Carbon: Genomic Foundation Models},
|
| author = {{HuggingFaceBio}},
|
| year = {2025},
|
| howpublished = {\url{https://huggingface.co/HuggingFaceBio/Carbon-500M}}
|
| }
|
|
|
| @article{li2022branchtrainmerge,
|
| title = {Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models},
|
| author = {Li, Margaret and Gururangan, Suchin and Dettmers, Tim and Lewis, Mike and
|
| Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
|
| journal = {arXiv preprint arXiv:2208.03306},
|
| year = {2022}
|
| }
|
|
|
| @article{gururangan2023cbtm,
|
| title = {Scaling Expert Language Models with Unsupervised Domain Discovery},
|
| author = {Gururangan, Suchin and Li, Margaret and Lewis, Mike and Shi, Weijia and
|
| Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
|
| journal = {arXiv preprint arXiv:2303.14177},
|
| year = {2023}
|
| }
|
|
|
| @article{sukhbaatar2024btx,
|
| title = {Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM},
|
| author = {Sukhbaatar, Sainbayar and Golovneva, Olga and Sharma, Vasu and Xu, Hu and
|
| Lin, Xi Victoria and Roziere, Baptiste and Kahn, Jacob and Li, Daniel and
|
| Yih, Wen-tau and Weston, Jason and Li, Xian},
|
| journal = {arXiv preprint arXiv:2403.07816},
|
| year = {2024}
|
| }
|
|
|
| @article{hinton2015distilling,
|
| title = {Distilling the Knowledge in a Neural Network},
|
| author = {Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
|
| journal = {arXiv preprint arXiv:1503.02531},
|
| year = {2015}
|
| }
|
|
|
| @inproceedings{furlanello2018born,
|
| title = {Born-Again Neural Networks},
|
| author = {Furlanello, Tommaso and Lipton, Zachary C. and Tschannen, Michael and
|
| Itti, Laurent and Anandkumar, Anima},
|
| booktitle = {ICML},
|
| year = {2018}
|
| }
|
|
|
| @inproceedings{gururangan2020dapt,
|
| title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
|
| author = {Gururangan, Suchin and Marasovi{\'c}, Ana and Swayamdipta, Swabha and
|
| Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A.},
|
| booktitle = {ACL},
|
| year = {2020}
|
| }
|
| ```
|
|
|