metadata
license: apache-2.0
tags:
  - biology
  - genomics
  - dna
  - mixture-of-experts
  - modular
library_name: pytorch
pipeline_tag: text-generation

🌼 DaisyChain: Genomics

A modular genomic mind: four dense ~74M DNA/RNA specialists (≈295M parameters in total, fewer than Carbon-500M) behind a learned router. Instead of one monolithic foundation model, DaisyChain trains one crisp specialist per biological domain, each distilled per-domain from Carbon-500M, and routes each sequence to its home specialist.

| Specialist | Domain | Params |
|---|---|---|
| eukaryote | Eukaryotic genomic DNA | ~74M |
| prokaryote | Bacterial / prokaryotic DNA | ~74M |
| mrna | Mature mRNA (coding transcript) | ~74M |
| mrna_splice | Pre-mRNA / splice-site regions | ~74M |

The router

A small learned router reads each specialist's surprise (bits/base) and a PCA of its hidden state, then predicts the home domain, recovering the bias corrections that a plain argmin-perplexity rule misses. Held-out routing accuracy: 100.0% (vs 87.5% for argmin). Only one ~74M specialist runs per query, so inference is ~7× cheaper per token than the 500M monolith.
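The gap between the argmin rule and a bias-corrected rule can be sketched as below. This is a minimal illustration, not the shipped router: the scores are made up, the per-specialist offsets are hypothetical stand-ins for what the learned router recovers, and the PCA hidden-state features are left out entirely.

```python
# Illustrative only: the real router is learned and also uses PCA features of
# each specialist's hidden state. Scores and offsets below are hypothetical.


def route_argmin(bits_per_base):
    """Pick the specialist with the lowest surprise (bits/base)."""
    return min(bits_per_base, key=bits_per_base.get)


def route_bias_corrected(bits_per_base, offsets):
    """Subtract a per-specialist offset before taking the argmin.

    A specialist that scores systematically low on every input would otherwise
    win sequences that belong to a neighbouring domain.
    """
    corrected = {d: bits_per_base[d] - offsets.get(d, 0.0) for d in bits_per_base}
    return min(corrected, key=corrected.get)


# Made-up scores for a pre-mRNA sequence: the mrna specialist is slightly less
# surprised in absolute terms, so plain argmin picks it.
scores = {"eukaryote": 1.95, "prokaryote": 2.10, "mrna": 1.86, "mrna_splice": 1.88}
# Hypothetical offsets: mrna tends to score 0.05 bits/base lower on everything.
offsets = {"eukaryote": 0.0, "prokaryote": 0.0, "mrna": -0.05, "mrna_splice": 0.0}

print(route_argmin(scores))                   # mrna
print(route_bias_corrected(scores, offsets))  # mrna_splice
```

A fixed offset is the simplest possible correction; a learned classifier over the same features can also capture interactions between specialists that a constant shift cannot.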

How each specialist is built

Each specialist interleaves continued pretraining (next-token cross-entropy on its own domain) with offline knowledge distillation from Carbon-500M (soft targets plus a factorized per-nucleotide variant via Carbon's FNS branch). In other words, these are cBTM-style domain experts, iterated per expert.
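To make the two objectives concrete, here is a minimal sketch of a hard-label cross-entropy term and a soft-target distillation term in the style of Hinton et al. (2015). It is an assumption-laden illustration: DaisyChain interleaves the two objectives rather than summing them, the 4-way A/C/G/T vocabulary is a simplification, and `alpha` and `temperature` are illustrative defaults, not the training values. Setting `alpha=1.0` or `alpha=0.0` isolates either term.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=2.0):
    """alpha * CE(target) + (1 - alpha) * T^2 * KL(teacher || student).

    alpha and temperature are illustrative, not DaisyChain's settings.
    """
    # Hard-label term: next-token cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[target])
    # Soft-target term: match the teacher's tempered distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Toy logits over A, C, G, T; the observed next base is index 0 (A).
student = [1.0, 0.2, -0.3, 0.1]
teacher = [2.0, 0.1, -1.0, 0.0]
print(distill_loss(student, teacher, target=0))
```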

Capability vs Carbon-500M (the fair baseline)

| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| likelihood: bits/base, base-pair (FNS) (↓) | 1.88 | 1.79 |
| seq-recovery eukaryote: FNS base-level (↑) | 31.5% | 38.9% |
| seq-recovery bacteria: FNS base-level (↑) | 40.9% | 54.1% |

Likelihood is the base-pair (FNS) score, computed with Carbon's own score_sequence / compute_bp_probs and verified to 6e-08. The earlier 6-mer joint CE trajectory, a softer proxy, is kept further below for reference.
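For readers new to the unit, bits/base is the mean negative log2-probability the model assigned to each observed base. The sketch below takes those per-base probabilities as given; it does not reproduce Carbon's score_sequence or compute_bp_probs, and the function name is a hypothetical helper.

```python
import math


def bits_per_base(base_probs):
    """Mean negative log2-probability assigned to each observed base."""
    if not base_probs:
        raise ValueError("need at least one base")
    return -sum(math.log2(p) for p in base_probs) / len(base_probs)


# A model that guesses uniformly over A/C/G/T scores exactly 2 bits/base.
print(bits_per_base([0.25] * 8))  # 2.0
```

Uniform guessing is therefore the 2.0 bits/base ceiling against which the 1.88 and 1.79 figures in the table can be read.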

DaisyChain is still behind the 500M/1T-token monolith but within striking distance at ~15% of the active compute, and the gap keeps closing with more per-domain training (work in progress).
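The ~15% and ~7× figures follow from the parameter counts, under the assumption that per-token compute scales with active parameters and that the router's own cost is negligible. Both counts are the nominal rounded sizes quoted in this README.

```python
ACTIVE_PARAMS_M = 74     # one specialist (nominal, from the specialist table)
MONOLITH_PARAMS_M = 500  # nominal size of Carbon-500M

fraction = ACTIVE_PARAMS_M / MONOLITH_PARAMS_M
ratio = MONOLITH_PARAMS_M / ACTIVE_PARAMS_M
print(f"{fraction:.1%} of the monolith's parameters are active")  # 14.8%
print(f"~{ratio:.1f}x fewer active parameters per token")         # ~6.8x
```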

Progress log: closing the gap to Carbon-500M (base-pair / FNS metric)

Re-baselined on Carbon's base-pair (FNS) metric (score_sequence, verified to 6e-08), with each round's full 4-specialist set re-scored. trailing = mean ours − mean Carbon (Carbon mean 1.7870). The final row (1.8622 / +0.0752) matches the independently verified current standing.

| date | update | euk | prok | mrna | splice | mean | trailing |
|---|---|---|---|---|---|---|---|
| 2026-06-22 | baseline (round-1 distill + router) | 1.965 | 1.994 | 1.910 | 1.935 | 1.9510 | +0.1640 |
| 2026-06-23 | mrna 12k-distill | 1.965 | 1.994 | 1.927 | 1.935 | 1.9552 | +0.1682 |
| 2026-06-23 | prokaryote round 1 | 1.965 | 1.918 | 1.927 | 1.935 | 1.9363 | +0.1493 |
| 2026-06-24 | eukaryote | 1.928 | 1.918 | 1.927 | 1.935 | 1.9272 | +0.1403 |
| 2026-06-25 | mrna | 1.928 | 1.918 | 1.788 | 1.935 | 1.8924 | +0.1054 |
| 2026-06-25 | mrna_splice | 1.928 | 1.918 | 1.788 | 1.873 | 1.8768 | +0.0898 |
| 2026-06-26 | prokaryote round 2 | 1.928 | 1.914 | 1.788 | 1.873 | 1.8758 | +0.0889 |
| 2026-06-27 | eukaryote round 2 (routing 100%) | 1.924 | 1.914 | 1.788 | 1.873 | 1.8747 | +0.0878 |
| 2026-06-29 | Muon passes: mrna + prokaryote | 1.924 | 1.868 | 1.784 | 1.873 | 1.8622 | +0.0752 |

(The mrna 12k-distill round worsened base-pair likelihood (+0.1640 → +0.1682) even though it improved 6-mer CE; the later base+distill mrna round fixed it. The soft 6-mer metric hid that regression.)
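The mean and trailing columns can be recomputed from the four per-specialist scores. The sketch below does this for the final row, assuming an unweighted mean over the four specialists, which is consistent with the listed values. Because the per-specialist columns are rounded to three decimals, a recomputed value can differ from the listed one in the fourth decimal.

```python
CARBON_MEAN = 1.7870  # Carbon-500M mean bits/base, as stated in this README

# Final row of the progress log (2026-06-29).
final_row = {"eukaryote": 1.924, "prokaryote": 1.868, "mrna": 1.784, "mrna_splice": 1.873}

mean_ours = sum(final_row.values()) / len(final_row)
trailing = mean_ours - CARBON_MEAN
print(f"mean={mean_ours:.3f} trailing={trailing:+.3f}")  # mean=1.862 trailing=+0.075
```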

Earlier 6-mer joint CE trajectory (softer proxy, kept for reference)

| date | update (routing accuracy) | mean DaisyChain (6-mer CE) | mean Carbon-500M (6-mer CE) | trailing |
|---|---|---|---|---|
| 2026-06-22 | baseline (94.8% routing) | 1.8644 | 1.7502 | +0.1142 |
| 2026-06-23 | mrna 12k-distill (95.7%) | 1.8599 | 1.7502 | +0.1096 |
| 2026-06-23 | prokaryote round 1 (96.2%) | 1.8528 | 1.7502 | +0.1026 |
| 2026-06-24 | eukaryote (98.0%) | 1.8413 | 1.7502 | +0.0911 |
| 2026-06-25 | mrna (98.3%) | 1.8075 | 1.7502 | +0.0573 |
| 2026-06-25 | mrna_splice (99.8%) | 1.7959 | 1.7502 | +0.0457 |
| 2026-06-26 | prokaryote round 2 (99.8%) | 1.7929 | 1.7502 | +0.0427 |

*The progress-log table is the base-pair (FNS) trajectory: Carbon's actual metric, fully re-baselined. Honest current standing: mean 1.8622 vs 1.7870 (+0.0752), base-pair wins ~36/100, no domain ahead (splice's earlier "beats Carbon" was only on the 6-mer proxy). Recovery uses Carbon's FNS base-level argmax decoder (per-base accuracy, next 30 bp, n=50, ctx=1536). These figures are our own measurement of Carbon-500M (a draft model, explicitly not benchmark-competitive), not Carbon's published 3B figures.*
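Per-base recovery is the fraction of positions where the decoded continuation matches the true sequence. The sketch below scores one already-decoded continuation; the function name is a hypothetical helper, Carbon's FNS argmax decoder is not reproduced, and averaging over the n=50 prompts is assumed to be a plain mean of this quantity.

```python
def per_base_recovery(predicted, reference):
    """Fraction of positions where the predicted base equals the reference base."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy 10 bp example; the measurement reported here uses 30 bp continuations.
print(per_base_recovery("ACGTACGTAC", "ACGTTCGTAA"))  # 0.8
```

For scale, guessing uniformly at random over four bases recovers about 25% in expectation.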

Usage

from daisychain import DaisyChain
dc = DaisyChain(root=".", device="cpu")
home, bits_per_base = dc.route("ACGTACGT...")   # which domain?
print(home, bits_per_base)
print(dc.generate(home, length=180))            # sample from the home specialist
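To route many sequences and see how they spread across domains, a small helper along these lines may be useful. It assumes only what the snippet above shows, namely that `route(seq)` returns a `(home, bits_per_base)` pair. `tally_domains` and `StubRouter` are hypothetical names; the stub exists so the sketch runs without model weights and does not reflect how the real router decides.

```python
from collections import Counter


def tally_domains(dc, sequences):
    """Route each sequence and count how many land in each home domain.

    `dc` is any object whose route(seq) returns (home, bits_per_base).
    """
    counts = Counter()
    for seq in sequences:
        home, _bits = dc.route(seq)
        counts[home] += 1
    return counts


class StubRouter:
    """Hypothetical stand-in so the sketch runs without model weights."""

    def route(self, seq):
        # Toy rule, NOT the real router: treat U-containing input as mRNA.
        return ("mrna", 1.8) if "U" in seq else ("eukaryote", 1.9)


print(tally_domains(StubRouter(), ["ACGT", "ACGU", "GGCC"]))
```

With the real model, pass the `DaisyChain` instance in place of the stub.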

Files: daisychain.py (inference), model.py / specialist_presets.py / spike_tokenizer.py / registry.py (architecture), tokenizer.json, <domain>/model.safetensors (the 4 specialists), router2.pt (router).

Interactive demo: the DaisyChain Space routes DNA in real time.

Citation

If you use these models, please cite the author β€” Dean Byrne (Quazim0t0):

@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
                  from Carbon-500M behind a learned router}
}

Built on

DaisyChain stands on these works:

@misc{carbon2025,
  title        = {Carbon: Genomic Foundation Models},
  author       = {{HuggingFaceBio}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/HuggingFaceBio/Carbon-500M}}
}

@article{li2022branchtrainmerge,
  title   = {Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models},
  author  = {Li, Margaret and Gururangan, Suchin and Dettmers, Tim and Lewis, Mike and
             Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2208.03306},
  year    = {2022}
}

@article{gururangan2023cbtm,
  title   = {Scaling Expert Language Models with Unsupervised Domain Discovery},
  author  = {Gururangan, Suchin and Li, Margaret and Lewis, Mike and Shi, Weijia and
             Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2303.14177},
  year    = {2023}
}

@article{sukhbaatar2024btx,
  title   = {Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM},
  author  = {Sukhbaatar, Sainbayar and Golovneva, Olga and Sharma, Vasu and Xu, Hu and
             Lin, Xi Victoria and Roziere, Baptiste and Kahn, Jacob and Li, Daniel and
             Yih, Wen-tau and Weston, Jason and Li, Xian},
  journal = {arXiv preprint arXiv:2403.07816},
  year    = {2024}
}

@article{hinton2015distilling,
  title   = {Distilling the Knowledge in a Neural Network},
  author  = {Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
  journal = {arXiv preprint arXiv:1503.02531},
  year    = {2015}
}

@inproceedings{furlanello2018born,
  title     = {Born-Again Neural Networks},
  author    = {Furlanello, Tommaso and Lipton, Zachary C. and Tschannen, Michael and
               Itti, Laurent and Anandkumar, Anima},
  booktitle = {ICML},
  year      = {2018}
}

@inproceedings{gururangan2020dapt,
  title     = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
  author    = {Gururangan, Suchin and Marasovi{\'c}, Ana and Swayamdipta, Swabha and
               Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A.},
  booktitle = {ACL},
  year      = {2020}
}