	---
	license: apache-2.0
	tags:
	- biology
	- genomics
	- dna
	- mixture-of-experts
	- modular
	library_name: pytorch
	pipeline_tag: text-generation
	---

	# 🌼 DaisyChain — Genomics

	A modular genomic mind: four dense ~74M DNA/RNA specialists (≈295M parameters in total,
	fewer than Carbon-500M) behind a learned router. Instead of one monolithic foundation
	model, DaisyChain trains one crisp specialist per biological domain, each
	distilled per-domain from [Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M),
	and routes each sequence to its home specialist.

	\| Specialist \| Domain \| Params \|
	\|---\|---\|---\|
	\| `eukaryote` \| Eukaryotic genomic DNA \| ~74M \|
	\| `prokaryote` \| Bacterial / prokaryotic DNA \| ~74M \|
	\| `mrna` \| Mature mRNA (coding transcript) \| ~74M \|
	\| `mrna_splice` \| Pre-mRNA / splice-site regions \| ~74M \|

	## The router

	A small learned router reads each specialist's surprise (bits/base) and a PCA projection of its
	hidden state, then predicts the home domain. This recovers the bias corrections that a plain
	argmin-perplexity rule misses. Held-out routing accuracy: 100.0% (vs 87.5% for argmin).
	Once a sequence is routed, only one ~74M specialist runs per query, so inference is ~7× cheaper
	per token than the 500M monolith, counting active parameters.
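The difference between the argmin rule and a bias-corrected decision can be sketched as follows. This is a minimal illustration rather than the released router: the `offsets` dictionary and the example surprise values are invented for the sketch, and the real router also reads a PCA of hidden states, which is left out here.

```python
import math

def bits_per_base(base_probs):
    """Mean surprise in bits/base from the probabilities a model gave the observed bases."""
    return -sum(math.log2(p) for p in base_probs) / len(base_probs)

def argmin_route(surprise):
    """Baseline rule: pick the specialist with the lowest bits/base."""
    return min(surprise, key=surprise.get)

def bias_corrected_route(surprise, offsets):
    """Subtract a per-specialist offset before taking the argmin.

    `offsets` is a hypothetical stand-in for what a learned router could
    pick up: a specialist that scores systematically lower on every input
    gets a negative offset, so it no longer wins by default.
    """
    corrected = {d: s - offsets.get(d, 0.0) for d, s in surprise.items()}
    return min(corrected, key=corrected.get)

# Illustrative numbers only (not measurements from this model card).
surprise = {"eukaryote": 1.93, "prokaryote": 1.99, "mrna": 1.90, "mrna_splice": 1.95}
offsets = {"eukaryote": 0.00, "prokaryote": 0.00, "mrna": -0.05, "mrna_splice": 0.00}

uniform = bits_per_base([0.25] * 8)                   # uniform guess over A/C/G/T
plain = argmin_route(surprise)                        # lowest raw surprise
corrected = bias_corrected_route(surprise, offsets)   # lowest surprise after correction
```

With these toy values the raw argmin picks `mrna`, while the corrected rule picks `eukaryote`. That is the kind of disagreement a learned router is meant to resolve.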

	## How each specialist is built

	Each specialist is trained by interleaving continued pretraining (next-token cross-entropy on its
	own domain) with **offline knowledge distillation** from Carbon-500M. Distillation uses soft targets
	plus a factorized per-nucleotide variant via Carbon's FNS branch. The result is a set of cBTM-style
	domain experts, with the procedure iterated per expert.
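The two objectives named above can be sketched in plain Python. This is a toy under stated assumptions, not the training code: the alternate-every-step schedule, the temperature of 2.0 and the four-way logits are all assumptions, and the factorized per-nucleotide (FNS) variant is not shown. "Offline" is modeled by passing in teacher logits that were computed and cached beforehand.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def next_token_ce(student_logits, target_index):
    """Continued-pretraining term: cross-entropy (nats) against the observed token."""
    return -log_softmax(student_logits)[target_index]

def soft_target_kd(student_logits, teacher_logits, temperature=2.0):
    """Soft-target term in the style of Hinton et al.: T^2 * KL(teacher || student)."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))
    return temperature ** 2 * kl

def interleaved_loss(step, student_logits, target_index, cached_teacher_logits):
    """Alternate the two objectives step by step (assumed schedule)."""
    if step % 2 == 0:
        return next_token_ce(student_logits, target_index)
    return soft_target_kd(student_logits, cached_teacher_logits)

# Toy logits over a 4-symbol vocabulary.
student = [0.2, 1.1, -0.3, 0.0]
teacher = [0.0, 2.0, -1.0, 0.5]
ce_step = interleaved_loss(0, student, 1, teacher)   # even step: cross-entropy
kd_step = interleaved_loss(1, student, 1, teacher)   # odd step: distillation
```

The distillation term is zero when student and teacher agree exactly and positive otherwise, so it pulls the specialist toward Carbon-500M's full predictive distribution, not only toward the observed base.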

	## Capability vs Carbon-500M (the fair baseline)

	\| metric \| DaisyChain \| Carbon-500M \|
	\|---\|---\|---\|
	\| likelihood, bits/base, base-pair (FNS) (↓) \| 1.86 \| 1.79 \|
	\| seq-recovery eukaryote, FNS base-level (↑) \| 31.5% \| 38.9% \|
	\| seq-recovery bacteria, FNS base-level (↑) \| 40.9% \| 54.1% \|

	*Likelihood is the base-pair (FNS) score, computed with Carbon's own `score_sequence` / `compute_bp_probs`
	and verified to 6e-08. The progress log below uses the same metric; the earlier 6-mer joint CE, a softer
	proxy, is kept there in a collapsed table for reference.*

	DaisyChain still trails the 500M-parameter, 1T-token monolith on every metric above, but it stays within
	striking distance at ~15% of the active compute. The gap keeps closing with more per-domain training
	(work in progress).
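The size and cost figures quoted in this card follow from simple arithmetic on the approximate parameter counts. The sketch below treats parameter count as a proxy for per-token compute, which is an assumption; it also ignores whatever the routing step itself costs.

```python
SPECIALIST_PARAMS_M = 74   # approximate size of one specialist, in millions
NUM_SPECIALISTS = 4
CARBON_PARAMS_M = 500      # Carbon-500M, in millions

total_params_m = SPECIALIST_PARAMS_M * NUM_SPECIALISTS    # all specialists together
active_fraction = SPECIALIST_PARAMS_M / CARBON_PARAMS_M   # share active per token
param_ratio = CARBON_PARAMS_M / SPECIALIST_PARAMS_M       # Carbon vs one specialist

print(f"total ~{total_params_m}M, active ~{active_fraction:.1%}, ratio ~{param_ratio:.1f}x")
```

Because 74M is itself a rounded figure, the computed total of 296M is consistent with the ≈295M stated at the top of the card.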

	### Progress log — closing the gap to Carbon-500M (BASE-PAIR / FNS metric)

	Re-baselined on Carbon's base-pair (FNS) metric (`score_sequence`, verified to 6e-08), with each
	round's full 4-specialist set re-scored. All values are bits/base (lower is better); trailing =
	`mean ours − mean Carbon` (Carbon mean 1.7870). The final row (1.8622 / +0.0752) matches the
	independently verified current standing.

	\| date \| update \| euk \| prok \| mrna \| splice \| mean \| trailing \|
	\|---\|---\|---\|---\|---\|---\|---\|---\|
	\| 2026-06-22 \| baseline (round-1 distill + router) \| 1.965 \| 1.994 \| 1.910 \| 1.935 \| 1.9510 \| +0.1640 \|
	\| 2026-06-23 \| mrna 12k-distill \| 1.965 \| 1.994 \| 1.927 \| 1.935 \| 1.9552 \| +0.1682 \|
	\| 2026-06-23 \| prokaryote round 1 \| 1.965 \| 1.918 \| 1.927 \| 1.935 \| 1.9363 \| +0.1493 \|
	\| 2026-06-24 \| eukaryote \| 1.928 \| 1.918 \| 1.927 \| 1.935 \| 1.9272 \| +0.1403 \|
	\| 2026-06-25 \| mrna \| 1.928 \| 1.918 \| 1.788 \| 1.935 \| 1.8924 \| +0.1054 \|
	\| 2026-06-25 \| mrna_splice \| 1.928 \| 1.918 \| 1.788 \| 1.873 \| 1.8768 \| +0.0898 \|
	\| 2026-06-26 \| prokaryote round 2 \| 1.928 \| 1.914 \| 1.788 \| 1.873 \| 1.8758 \| +0.0889 \|
	\| 2026-06-27 \| eukaryote round 2 (routing 100%) \| 1.924 \| 1.914 \| 1.788 \| 1.873 \| 1.8747 \| +0.0878 \|
	\| 2026-06-29 \| Muon passes — mrna + prokaryote \| 1.924 \| 1.868 \| 1.784 \| 1.873 \| 1.8622 \| +0.0752 \|

	(The mrna 12k-distill round worsened base-pair likelihood (trailing +0.1640→+0.1682) even though it
	improved 6-mer CE. The soft 6-mer metric hid that regression, and the later base+distill mrna round fixed it.)
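The `mean` and `trailing` columns can be re-derived from the per-domain columns. The sketch below does this for the first and last rows of the base-pair table; since the per-domain values are rounded to three decimals, agreement with the printed four-decimal means is only expected to within about 1e-3.

```python
CARBON_MEAN = 1.7870  # Carbon-500M mean bits/base, as stated in the progress log

# (euk, prok, mrna, splice) copied from the first and last table rows.
rows = {
    "2026-06-22 baseline": (1.965, 1.994, 1.910, 1.935),
    "2026-06-29 Muon passes": (1.924, 1.868, 1.784, 1.873),
}

def mean_and_trailing(per_domain, carbon_mean=CARBON_MEAN):
    """Unweighted mean over the four specialists and its gap to Carbon."""
    mean = sum(per_domain) / len(per_domain)
    return mean, mean - carbon_mean

results = {name: mean_and_trailing(values) for name, values in rows.items()}
for name, (mean, trailing) in results.items():
    print(f"{name}: mean={mean:.4f} trailing={trailing:+.4f}")
```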

	<details><summary>Earlier 6-mer joint CE trajectory (softer proxy — kept for reference)</summary>

	\| date \| update \| mean DaisyChain (6-mer CE) \| mean Carbon-500M (6-mer CE) \| trailing \|
	\|---\|---\|---\|---\|---\|
	\| 2026-06-22 \| baseline (94.8% routing) \| 1.8644 \| 1.7502 \| +0.1142 \|
	\| 2026-06-23 \| mrna 12k-distill (95.7%) \| 1.8599 \| 1.7502 \| +0.1096 \|
	\| 2026-06-23 \| prokaryote round 1 (96.2%) \| 1.8528 \| 1.7502 \| +0.1026 \|
	\| 2026-06-24 \| eukaryote (98.0%) \| 1.8413 \| 1.7502 \| +0.0911 \|
	\| 2026-06-25 \| mrna (98.3%) \| 1.8075 \| 1.7502 \| +0.0573 \|
	\| 2026-06-25 \| mrna_splice (99.8%) \| 1.7959 \| 1.7502 \| +0.0457 \|
	\| 2026-06-26 \| prokaryote round 2 (99.8%) \| 1.7929 \| 1.7502 \| +0.0427 \|

	</details>

	*The main progress-log table (above the collapsed section) is the base-pair (FNS) trajectory: Carbon's actual
	metric, fully re-baselined. Honest current standing: mean 1.8622 vs 1.7870 (+0.0752), base-pair wins ~36/100,
	no domain ahead (splice's earlier "beats Carbon" was only on the 6-mer proxy). Recovery uses Carbon's FNS
	base-level argmax decoder (per-base accuracy, next-30bp, n=50, ctx=1536). It is our own measurement of
	Carbon-500M (a draft model, explicitly not benchmark-competitive), not Carbon's published 3B figures.*
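The recovery numbers use per-base accuracy over the 30 bases that follow a 1536-base context, averaged over 50 sequences. A minimal sketch of that bookkeeping is below. The `predict_next` callable is a hypothetical stand-in for the argmax decoder, and taking the context from the start of each sequence is an assumption about how the windows are cut.

```python
def recovery_accuracy(sequence, predict_next, ctx=1536, horizon=30):
    """Fraction of the `horizon` bases after the context that the decoder reproduces."""
    context = sequence[:ctx]
    target = sequence[ctx:ctx + horizon]
    if len(target) < horizon:
        raise ValueError("sequence is shorter than ctx + horizon")
    predicted = predict_next(context, horizon)
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / horizon

def mean_recovery(sequences, predict_next, ctx=1536, horizon=30):
    """Average per-base recovery over a set of sequences."""
    scores = [recovery_accuracy(s, predict_next, ctx, horizon) for s in sequences]
    return sum(scores) / len(scores)

# Toy decoders standing in for a real model (hypothetical, for illustration).
def repeat_acgt(context, n):
    return ("ACGT" * (n // 4 + 1))[:n]

def always_a(context, n):
    return "A" * n

toy_sequences = ["ACGT" * 400 for _ in range(50)]   # 50 sequences of 1600 bases
perfect_score = mean_recovery(toy_sequences, repeat_acgt)
weak_score = mean_recovery(toy_sequences, always_a)
```

On this periodic toy data the first decoder recovers every base, while the constant decoder only matches the positions that happen to be `A`.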

	## Usage

	```python
	from daisychain import DaisyChain
	dc = DaisyChain(root=".", device="cpu")
	home, bits_per_base = dc.route("ACGTACGT...") # which domain?
	print(home, bits_per_base)
	print(dc.generate(home, length=180)) # sample from the home specialist
	```

	Files: `daisychain.py` (inference), `model.py` / `specialist_presets.py` /
	`spike_tokenizer.py` / `registry.py` (architecture), `tokenizer.json`,
	`<domain>/model.safetensors` (the 4 specialists), `router2.pt` (router).
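Before loading, it can help to confirm that a local copy has the files listed above. The helper below is a hypothetical convenience and not part of the repository; it assumes each `<domain>` directory is named after its specialist (`eukaryote`, `prokaryote`, `mrna`, `mrna_splice`).

```python
from pathlib import Path

# Assumed directory names: one per specialist, matching the table of specialists.
DOMAINS = ("eukaryote", "prokaryote", "mrna", "mrna_splice")
TOP_LEVEL = (
    "daisychain.py", "model.py", "specialist_presets.py",
    "spike_tokenizer.py", "registry.py", "tokenizer.json", "router2.pt",
)

def missing_files(root="."):
    """Return the expected paths (relative to `root`) that do not exist."""
    root = Path(root)
    expected = [root / name for name in TOP_LEVEL]
    expected += [root / domain / "model.safetensors" for domain in DOMAINS]
    return [str(path.relative_to(root)) for path in expected if not path.exists()]
```

Calling `missing_files(".")` from the repository root should return an empty list when the download is complete.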

	> Interactive demo: the DaisyChain Space routes DNA in real time.

	## Citation

	If you use these models, please cite the author — Dean Byrne (Quazim0t0):

	```bibtex
	@misc{byrne2026daisychain,
	title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
	author = {Byrne, Dean},
	year = {2026},
	howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
	note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
	from Carbon-500M behind a learned router}
	}
	```

	### Built on

	DaisyChain stands on these works:

	```bibtex
	@misc{carbon2025,
	title = {Carbon: Genomic Foundation Models},
	author = {{HuggingFaceBio}},
	year = {2025},
	howpublished = {\url{https://huggingface.co/HuggingFaceBio/Carbon-500M}}
	}

	@article{li2022branchtrainmerge,
	title = {Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models},
	author = {Li, Margaret and Gururangan, Suchin and Dettmers, Tim and Lewis, Mike and
	Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
	journal = {arXiv preprint arXiv:2208.03306},
	year = {2022}
	}

	@article{gururangan2023cbtm,
	title = {Scaling Expert Language Models with Unsupervised Domain Discovery},
	author = {Gururangan, Suchin and Li, Margaret and Lewis, Mike and Shi, Weijia and
	Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
	journal = {arXiv preprint arXiv:2303.14177},
	year = {2023}
	}

	@article{sukhbaatar2024btx,
	title = {Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM},
	author = {Sukhbaatar, Sainbayar and Golovneva, Olga and Sharma, Vasu and Xu, Hu and
	Lin, Xi Victoria and Roziere, Baptiste and Kahn, Jacob and Li, Daniel and
	Yih, Wen-tau and Weston, Jason and Li, Xian},
	journal = {arXiv preprint arXiv:2403.07816},
	year = {2024}
	}

	@article{hinton2015distilling,
	title = {Distilling the Knowledge in a Neural Network},
	author = {Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
	journal = {arXiv preprint arXiv:1503.02531},
	year = {2015}
	}

	@inproceedings{furlanello2018born,
	title = {Born-Again Neural Networks},
	author = {Furlanello, Tommaso and Lipton, Zachary C. and Tschannen, Michael and
	Itti, Laurent and Anandkumar, Anima},
	booktitle = {ICML},
	year = {2018}
	}

	@inproceedings{gururangan2020dapt,
	title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
	author = {Gururangan, Suchin and Marasovi{\'c}, Ana and Swayamdipta, Swabha and
	Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A.},
	booktitle = {ACL},
	year = {2020}
	}
	```