license: apache-2.0
tags:
  - biology
  - genomics
  - dna
  - mixture-of-experts
  - modular
library_name: pytorch
pipeline_tag: text-generation

🌼 DaisyChain — Genomics

A modular genomic mind: four dense ~74M DNA/RNA specialists (≈295M params total, under Carbon-500M) behind a learned router. Instead of one monolithic foundation model, DaisyChain trains one crisp specialist per biological domain — each distilled per-domain from Carbon-500M — and routes each sequence to its home specialist.

Specialist	Domain	Params
`eukaryote`	Eukaryotic genomic DNA	~74M
`prokaryote`	Bacterial / prokaryotic DNA	~74M
`mrna`	Mature mRNA (coding transcript)	~74M
`mrna_splice`	Pre-mRNA / splice-site regions	~74M

The router

A small learned router reads each specialist's surprise (bits/base) and a PCA of its hidden state, then predicts the home domain, recovering the bias corrections that a plain argmin-perplexity rule misses. Held-out routing accuracy: 100.0% (vs 87.5% for argmin). Once a sequence is routed, only one ~74M specialist runs, so generation is ~7× cheaper per token than the 500M monolith.
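To illustrate what a bias correction means here, the sketch below contrasts the plain argmin rule with an argmin taken after subtracting a per-specialist offset. It is a deliberately simplified stand-in, not the shipped router: the real router is learned and also uses a PCA of hidden states. The function names, example scores, and offset values are all hypothetical.

```python
# Hypothetical illustration: none of these names or numbers come from daisychain.py.

def argmin_route(bits_per_base):
    """Baseline rule: send the sequence to the specialist with the lowest bits/base."""
    return min(bits_per_base, key=bits_per_base.get)


def bias_corrected_route(bits_per_base, offsets):
    """Subtract each specialist's typical score offset before taking the argmin.

    `offsets` stands in for what a learned router could absorb: a specialist
    that scores systematically low on everything should not win by default.
    """
    corrected = {d: b - offsets.get(d, 0.0) for d, b in bits_per_base.items()}
    return min(corrected, key=corrected.get)


# Made-up scores for one sequence (bits/base, lower = less surprised).
scores = {"eukaryote": 1.93, "prokaryote": 1.95, "mrna": 1.90, "mrna_splice": 1.94}
# Made-up offsets: here the mrna specialist tends to score 0.05 lower than the rest.
offsets = {"mrna": -0.05}

print(argmin_route(scores))                   # raw argmin picks mrna
print(bias_corrected_route(scores, offsets))  # corrected argmin picks eukaryote
```

A fixed offset table is the simplest possible correction; a learned router can make the correction depend on the sequence itself, which is presumably where the hidden-state features help.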

How each specialist is built

Each specialist is built by interleaving continued pretraining (next-token cross-entropy on its domain) with offline knowledge distillation from Carbon-500M (soft targets plus a factorized per-nucleotide variant via Carbon's FNS branch). In other words, these are cBTM-style domain experts, with the procedure iterated per expert.
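As a rough illustration of the two objectives being interleaved, the sketch below computes a next-token cross-entropy and a soft-target distillation loss in the style of Hinton et al. (2015) for a single position. It is a minimal pure-Python sketch under assumptions: the temperature, the toy 4-token vocabulary, and the function names are illustrative only, and the factorized per-nucleotide (FNS) variant is not shown.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def cross_entropy(student_logits, target_index):
    """Next-token cross-entropy (in nats) against the observed token."""
    return -log_softmax(student_logits)[target_index]


def soft_target_kd(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))
    return temperature ** 2 * kl


student = [1.0, 0.2, -0.5, 0.1]   # toy student logits over a 4-token vocabulary
teacher = [2.0, 0.0, -1.0, 0.0]   # toy teacher logits (would be precomputed offline)

ce_loss = cross_entropy(student, target_index=0)   # objective on a pretraining step
kd_loss = soft_target_kd(student, teacher)         # objective on a distillation step
print(f"CE = {ce_loss:.4f} nats, KD = {kd_loss:.4f}")
```

In an interleaved schedule, a training loop would alternate between steps that minimize one loss and steps that minimize the other, rather than summing them into a single weighted objective.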

Capability vs Carbon-500M (the fair baseline)

metric	DaisyChain	Carbon-500M
likelihood, bits/base, base-pair (FNS) (↓)	1.86	1.79
seq-recovery eukaryote, FNS base-level (↑)	31.5%	38.9%
seq-recovery bacteria, FNS base-level (↑)	40.9%	54.1%

Likelihood is the base-pair (FNS) score from Carbon's own score_sequence / compute_bp_probs, verified to 6e-08. The progress log below tracks this same base-pair metric; the earlier 6-mer joint CE trajectory, a softer proxy, is kept after it for reference.
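For readers unfamiliar with the unit, bits/base is the average negative log2-probability a model assigns to each observed nucleotide. The sketch below shows that generic definition only. It is not Carbon's score_sequence or compute_bp_probs, and the function name and example probabilities are made up.

```python
import math


def bits_per_base(true_base_probs):
    """Mean of -log2 p(true base) over a sequence of per-base probabilities."""
    return -sum(math.log2(p) for p in true_base_probs) / len(true_base_probs)


# A model that spreads probability uniformly over A/C/G/T scores exactly 2 bits/base.
print(bits_per_base([0.25] * 8))   # 2.0

# Made-up probabilities for a model that does somewhat better than uniform.
print(round(bits_per_base([0.5, 0.25, 0.3, 0.2, 0.4]), 3))
```

On this scale, 2.0 bits/base is what uniform guessing over a four-letter alphabet yields, so the scores in the table above are best read relative to that reference point.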

DaisyChain still trails the 500M, 1T-token monolith on all three metrics above, but it does so at ~15% of the active compute (one ~74M specialist vs 500M parameters). The gap keeps closing with more per-domain training (work in progress).
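The ~15% figure follows from parameter counts alone. A quick check is below, treating active parameters as a proxy for per-token compute. That proxy is an assumption: it ignores the router and the cost of scoring a sequence with more than one specialist.

```python
specialist_params = 74e6   # one specialist, approximate size from the specialist table
monolith_params = 500e6    # Carbon-500M, nominal size

fraction = specialist_params / monolith_params
ratio = monolith_params / specialist_params

print(f"active fraction ~ {fraction:.1%}")   # 14.8%
print(f"cost ratio ~ {ratio:.1f}x")          # 6.8x
```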

Progress log — closing the gap to Carbon-500M (BASE-PAIR / FNS metric)

Re-baselined on Carbon's base-pair (FNS) metric (score_sequence, verified to 6e-08) — each round's full 4-specialist set re-scored. trailing = mean ours − mean Carbon (Carbon mean 1.7870). The final row (1.8622 / +0.0752) matches the independently-verified current standing.

date	update	euk	prok	mrna	splice	mean	trailing
2026-06-22	baseline (round-1 distill + router)	1.965	1.994	1.910	1.935	1.9510	+0.1640
2026-06-23	mrna 12k-distill	1.965	1.994	1.927	1.935	1.9552	+0.1682
2026-06-23	prokaryote round 1	1.965	1.918	1.927	1.935	1.9363	+0.1493
2026-06-24	eukaryote	1.928	1.918	1.927	1.935	1.9272	+0.1403
2026-06-25	mrna	1.928	1.918	1.788	1.935	1.8924	+0.1054
2026-06-25	mrna_splice	1.928	1.918	1.788	1.873	1.8768	+0.0898
2026-06-26	prokaryote round 2	1.928	1.914	1.788	1.873	1.8758	+0.0889
2026-06-27	eukaryote round 2 (routing 100%)	1.924	1.914	1.788	1.873	1.8747	+0.0878
2026-06-29	Muon passes — mrna + prokaryote	1.924	1.868	1.784	1.873	1.8622	+0.0752

Note: the mrna 12k-distill round worsened base-pair likelihood (trailing +0.1640 → +0.1682) even though it improved 6-mer CE (1.8644 → 1.8599). The softer 6-mer metric hid that regression, which the later base+distill mrna round (2026-06-25) fixed.
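The mean and trailing columns of the base-pair progress log can be recomputed from the per-specialist values. The sketch below does so for the final row; the function and variable names are illustrative, and the inputs are the rounded three-decimal values printed in the table.

```python
CARBON_MEAN = 1.7870  # Carbon-500M mean bits/base, as stated in the log


def mean_and_trailing(per_specialist):
    """Return (mean bits/base over the specialists, gap to Carbon's mean)."""
    mean = sum(per_specialist.values()) / len(per_specialist)
    return mean, mean - CARBON_MEAN


# Final row of the log (2026-06-29), using the rounded values shown in the table.
final = {"euk": 1.924, "prok": 1.868, "mrna": 1.784, "splice": 1.873}
mean, trailing = mean_and_trailing(final)
print(f"mean = {mean:.4f}, trailing = {trailing:+.4f}")
```

Because the table shows three-decimal inputs, recomputed means can differ from the logged ones in the last digit.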

Earlier 6-mer joint CE trajectory (softer proxy — kept for reference)

date	update	mean DaisyChain (6-mer CE)	mean Carbon-500M (6-mer CE)	trailing
2026-06-22	baseline (94.8% routing)	1.8644	1.7502	+0.1142
2026-06-23	mrna 12k-distill (95.7%)	1.8599	1.7502	+0.1096
2026-06-23	prokaryote round 1 (96.2%)	1.8528	1.7502	+0.1026
2026-06-24	eukaryote (98.0%)	1.8413	1.7502	+0.0911
2026-06-25	mrna (98.3%)	1.8075	1.7502	+0.0573
2026-06-25	mrna_splice (99.8%)	1.7959	1.7502	+0.0457
2026-06-26	prokaryote round 2 (99.8%)	1.7929	1.7502	+0.0427

*The progress-log table is the base-pair (FNS) trajectory, Carbon's actual metric, fully re-baselined; the 6-mer table directly above is kept only for reference. Current standing: mean 1.8622 vs 1.7870 (+0.0752). DaisyChain wins ~36 of 100 base-pair comparisons, and no domain is ahead of Carbon (splice's earlier "beats Carbon" result held only on the 6-mer proxy). Recovery uses Carbon's FNS base-level argmax decoder (per-base accuracy over the next 30 bp, n=50, ctx=1536). The Carbon numbers are our own measurement of Carbon-500M (a draft model, explicitly not benchmark-competitive), not Carbon's published 3B figures.*
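The per-base accuracy behind the recovery numbers can be sketched as a simple position-wise comparison between a decoded continuation and the true one. The example below uses made-up 30 bp strings and a hypothetical function name; it does not call Carbon's decoder.

```python
def per_base_accuracy(predicted, reference):
    """Fraction of positions where the predicted base equals the reference base."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up 30 bp reference continuation and a made-up prediction.
reference = "ACGT" * 7 + "AC"
predicted = "ACGTACGTACGT" + "T" * 18

print(f"{per_base_accuracy(predicted, reference):.1%}")
```

A reported recovery figure would then be the average of this score over the evaluated windows.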

Usage

from daisychain import DaisyChain
dc = DaisyChain(root=".", device="cpu")
home, bits_per_base = dc.route("ACGTACGT...")   # which domain?
print(home, bits_per_base)
print(dc.generate(home, length=180))            # sample from the home specialist

Files: daisychain.py (inference), model.py / specialist_presets.py / spike_tokenizer.py / registry.py (architecture), tokenizer.json, <domain>/model.safetensors (the 4 specialists), router2.pt (router).

Interactive demo: the DaisyChain Space routes DNA in real time.

Citation

If you use these models, please cite the author — Dean Byrne (Quazim0t0):

@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
                  from Carbon-500M behind a learned router}
}

Built on

DaisyChain stands on these works:

@misc{carbon2025,
  title        = {Carbon: Genomic Foundation Models},
  author       = {{HuggingFaceBio}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/HuggingFaceBio/Carbon-500M}}
}

@article{li2022branchtrainmerge,
  title   = {Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models},
  author  = {Li, Margaret and Gururangan, Suchin and Dettmers, Tim and Lewis, Mike and
             Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2208.03306},
  year    = {2022}
}

@article{gururangan2023cbtm,
  title   = {Scaling Expert Language Models with Unsupervised Domain Discovery},
  author  = {Gururangan, Suchin and Li, Margaret and Lewis, Mike and Shi, Weijia and
             Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2303.14177},
  year    = {2023}
}

@article{sukhbaatar2024btx,
  title   = {Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM},
  author  = {Sukhbaatar, Sainbayar and Golovneva, Olga and Sharma, Vasu and Xu, Hu and
             Lin, Xi Victoria and Roziere, Baptiste and Kahn, Jacob and Li, Daniel and
             Yih, Wen-tau and Weston, Jason and Li, Xian},
  journal = {arXiv preprint arXiv:2403.07816},
  year    = {2024}
}

@article{hinton2015distilling,
  title   = {Distilling the Knowledge in a Neural Network},
  author  = {Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
  journal = {arXiv preprint arXiv:1503.02531},
  year    = {2015}
}

@inproceedings{furlanello2018born,
  title     = {Born-Again Neural Networks},
  author    = {Furlanello, Tommaso and Lipton, Zachary C. and Tschannen, Michael and
               Itti, Laurent and Anandkumar, Anima},
  booktitle = {ICML},
  year      = {2018}
}

@inproceedings{gururangan2020dapt,
  title     = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
  author    = {Gururangan, Suchin and Marasovi{\'c}, Ana and Swayamdipta, Swabha and
               Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A.},
  booktitle = {ACL},
  year      = {2020}
}