	---
	title: README
	emoji: 📈
	colorFrom: blue
	colorTo: indigo
	sdk: static
	pinned: false
	---

	# 🌼 DaisyChainAI

	We build capable systems by daisy-chaining a handful of
	small, sharp specialists behind a learned router — instead of training one giant model to do
	everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together,
	they behave like one model at a fraction of the active compute.

	---

	## 🔗 What "daisy-chaining" means

	A daisy chain links independent units in series so a signal can flow from one to the next,
	each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

	- Each link is one small specialist — a dense ~74M model trained on a single domain. It is
	excellent at its own data and (deliberately) surprised by everything else.
	- The router is the connector between links. When an input arrives, every specialist reports how
	surprised it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work
	to the link that's most at home with it.
	- The chain grows link by link. Because the specialists are trained separately, you can chain a
	new domain on without retraining the others — add a link, extend the router, done.
	- One link runs per query. Only the routed specialist computes, so a chain of four ~74M experts
	costs ~74M of compute per token — roughly 7× cheaper than a 500M monolith of comparable scope.

	So "DaisyChain" is both the brand and the mechanism: **a chain of specialists, connected by routing,
	that you extend one flower at a time.**
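To make the routing idea concrete, here is a minimal sketch that picks the specialist with the lowest bits/base. It is a simplification: the router described above is learned and also uses hidden states, whereas this stand-in uses surprise alone. The function names and the probabilities are hypothetical.

```python
import math


def bits_per_base(base_probs):
    """Mean negative log2-probability assigned to the observed bases."""
    return -sum(math.log2(p) for p in base_probs) / len(base_probs)


def route(surprise_by_specialist):
    """Return the name of the least-surprised specialist.

    Assumption: a plain argmin stands in for the learned router.
    """
    return min(surprise_by_specialist, key=surprise_by_specialist.get)


# Hypothetical probabilities each specialist gave to the same 4 observed bases.
observed = {
    "eukaryote": [0.25, 0.25, 0.25, 0.25],
    "prokaryote": [0.5, 0.5, 0.5, 0.5],
    "mrna": [0.25, 0.5, 0.25, 0.5],
    "mrna_splice": [0.125, 0.25, 0.25, 0.25],
}

surprise = {name: bits_per_base(probs) for name, probs in observed.items()}
chosen = route(surprise)
print(chosen, surprise[chosen])
```

Only the chosen specialist would then run on the query, which is where the per-token compute saving comes from.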

	---

	## 🛠️ How the models are built

	Each specialist is grown by interleaving two steps, per domain:

	1. Continued pretraining — next-token training on only that domain's data, so the specialist
	becomes genuinely crisp on its home distribution (and the router can tell the links apart).
	2. Per-domain distillation — the specialist is distilled from a larger teacher foundation model
	restricted to its own domain (soft-target KD, plus a factorized per-nucleotide variant where the
	teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic
	clone — the specialization is what makes routing work.

	We iterate those two steps until each link is as strong as its capacity allows, then train the
	router. In lineage, this is a Cluster-Branch-Train-Merge (c-BTM) style mixture of domain experts
	(independent experts plus perplexity-aware routing), with iterative distillation from a larger teacher.
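As an illustration of the soft-target KD step, the sketch below computes a temperature-softened KL divergence between teacher and student logits in plain Python. The temperature value and the T² scaling follow the common textbook formulation and are assumptions, not settings taken from this project; the factorized per-nucleotide variant is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: temperature=2.0 is illustrative only.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl


# Hypothetical logits over a tiny 3-token vocabulary.
matched = soft_target_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = soft_target_kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(matched, mismatched)
```

The loss is zero when the student matches the teacher and grows as the two distributions diverge.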

	---

	## 🧬 Current project — DaisyChain Genomics

	Four DNA/RNA specialists (eukaryote · prokaryote · mRNA · mRNA-splice, ~74M each, **≈295M total,
	under 500M**), each distilled per-domain from **[Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**
	behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial)
	maps one-to-one onto our four specialists.

	### Where it actually stands (measured on Carbon's own base-pair / FNS metric)

	We score likelihood the way Carbon does: marginalize each 6-mer into six per-base distributions and
	take the mean per-base log-prob (`score_sequence`). Our implementation agrees with Carbon's `compute_bp_probs`
	to within 6e-08, so the comparison is apples-to-apples.
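The marginalization described above can be sketched as follows. This is an illustration of the idea under stated assumptions, not Carbon's `compute_bp_probs` or our `score_sequence`: the helper names are hypothetical and the 6-mer distribution is a toy uniform one.

```python
import itertools
import math

BASES = "ACGT"


def marginalize_kmer(kmer_probs, k=6):
    """Collapse a distribution over k-mers into k per-position base distributions."""
    per_base = [dict.fromkeys(BASES, 0.0) for _ in range(k)]
    for kmer, prob in kmer_probs.items():
        for pos, base in enumerate(kmer):
            per_base[pos][base] += prob
    return per_base


def mean_per_base_log2_prob(per_base, true_kmer):
    """Average log2-probability of the true base at each position."""
    total = sum(math.log2(per_base[i][b]) for i, b in enumerate(true_kmer))
    return total / len(true_kmer)


# Toy input: a uniform distribution over all 4**6 = 4096 six-mers.
kmers = ["".join(t) for t in itertools.product(BASES, repeat=6)]
uniform = {kmer: 1.0 / len(kmers) for kmer in kmers}

marginals = marginalize_kmer(uniform)
bits = -mean_per_base_log2_prob(marginals, "ACGTAC")
print(bits)
```

A uniform 6-mer distribution marginalizes to 0.25 per base at every position, so it scores 2 bits/base, the no-information ceiling for a four-letter alphabet.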

	| Metric | DaisyChain | Carbon-500M |
	|---|---|---|
	| Routing accuracy (held-out) | 100.0% | n/a |
	| Likelihood, base-pair bits/base (↓) | 1.875 | 1.787 |
	| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
	| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
	| Active params / query | ~74M (one specialist) | 500M |

	Honest standing: ~+0.088 bits/base behind, and no single domain beats Carbon yet. The gap is
	concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest.
	Note Carbon-500M is itself a draft model, explicitly "not designed to be competitive on downstream
	benchmarks" — so it's a fair, achievable target, not the 3B/8B flagships.
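For readers who want to check the arithmetic behind the table, this short snippet recomputes the likelihood gap and the active-parameter ratio from the reported numbers. The per-base perplexity ratio is an extra interpretation added here for illustration; it is not a metric reported by either project.

```python
# Values copied from the table above.
daisychain_bits = 1.875
carbon_bits = 1.787

# Gap in bits/base (positive means DaisyChain is behind).
gap = daisychain_bits - carbon_bits

# Interpretation (an assumption of this sketch): 2**gap is the ratio of
# per-base perplexities implied by the two bits/base scores.
perplexity_ratio = 2 ** gap

# Active parameters per query: one ~74M specialist versus the 500M model.
active_ratio = 500 / 74

print(round(gap, 3), round(perplexity_ratio, 3), round(active_ratio, 2))
```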

	- 📦 Model: [`DaisyChainAI/daisychain-genomics`](https://huggingface.co/DaisyChainAI/daisychain-genomics)
	- 🎮 Live demo: [`Daisychain-Genomics-Demo`](https://huggingface.co/spaces/DaisyChainAI/Daisychain-Genomics-Demo) — paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.

	---

	## 📓 Build log — what we got right, and what we got wrong

	We build in the open, mistakes included. This project's honest history:

	What worked
	- Per-domain specialists + a learned router reached 100% held-out routing — one ~74M model active per query.
	- Snapshot-then-pick-best distillation: snapshot every few thousand steps, deploy the snapshot with the
	best held-out score, never the last one. This caught over-distillation (models that memorize the distill
	cache and regress on held-out data) and made every round regression-guarded.
	- Re-fitting the router after every specialist swap. Router features are coupled to the checkpoints;
	skipping the re-fit once produced a fake "regression" that was pure routing drift.
	- FNS per-base distillation targets — distilling the teacher's base-pair marginals, not the 4096-way
	6-mer distribution, gave the small students a tractable, base-pair-correct objective.
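The snapshot-then-pick-best rule can be sketched in a few lines. The step numbers and scores below are hypothetical, and the helper names are not from our codebase; the sketch only shows the selection logic, assuming lower held-out bits/base is better.

```python
def pick_best_snapshot(heldout_bits_by_step):
    """Return (step, score) for the snapshot with the lowest held-out bits/base."""
    best_step = min(heldout_bits_by_step, key=heldout_bits_by_step.get)
    return best_step, heldout_bits_by_step[best_step]


def should_deploy(candidate_bits, deployed_bits):
    """Regression guard: only swap in a snapshot that beats what is deployed."""
    return candidate_bits < deployed_bits


# Hypothetical held-out scores; the last snapshot has started to over-distill.
scores = {2000: 1.95, 4000: 1.91, 6000: 1.89, 8000: 1.93}

best_step, best_bits = pick_best_snapshot(scores)
print(best_step, best_bits, should_deploy(best_bits, deployed_bits=1.90))
```

Because router features are coupled to the checkpoints, a swap like this would be followed by a router re-fit before any end-to-end numbers are compared.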

	What we got wrong (and corrected)
	- We reported the wrong metric for days. We measured likelihood as 6-mer cross-entropy (a softer proxy)
	instead of Carbon's base-pair (FNS) score. The proxy flattered us: it showed ~+0.043 behind and even
	"splice beats Carbon." On Carbon's actual metric the gap is +0.088 and no domain is ahead. We re-baselined
	the entire project history on the real metric.
	- We measured sequence recovery with the wrong decoder (6-mer argmax) instead of Carbon's **FNS base-level
	argmax**. Re-measuring with their decoder changed the numbers (and actually *raised* our bacteria recovery).
	- An early eval had a frame-alignment bug — feeding a context length not divisible by 6 knocked our 6-mer
	model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
	- Decoding took several wrong turns before matching Carbon: greedy with no repetition control (collapsed to
	homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual
	base-pair FNS decoder (top-p at the 6-mer level → per-base selection).
	- One training round improved the proxy while regressing the real metric (an early mRNA distill-only pass)
	— invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
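The frame-alignment fix mentioned above amounts to making the context length a multiple of 6 before tokenizing. A minimal sketch follows, assuming the extra bases are trimmed from the start of the context so that the continuation begins on a 6-mer boundary; which end to trim, and both helper names, are assumptions of this example.

```python
def align_to_kmer_grid(context, k=6):
    """Drop leading bases so that len(context) is a multiple of k."""
    remainder = len(context) % k
    return context[remainder:]


def tokenize_kmers(sequence, k=6):
    """Split a sequence into non-overlapping k-mers; refuse out-of-phase input."""
    if len(sequence) % k != 0:
        raise ValueError(f"length {len(sequence)} is not a multiple of {k}")
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


# Hypothetical 14-base context: two bases too long for the 6-mer grid.
raw_context = "ACGTACGTACGTAC"
aligned = align_to_kmer_grid(raw_context)
tokens = tokenize_kmers(aligned)
print(aligned, tokens)
```

Raising on out-of-phase input, rather than silently padding or truncating, makes this class of bug fail loudly at evaluation time.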

	The lesson: measure the way the baseline measures, or you aren't comparing anything. A stricter, honest
	evaluation didn't sink the project — it pointed to exactly which domains to attack and which "wins" were illusions.

	More links on the chain — and more chains — coming. 🌼

	## Citation

	If you use these models, please cite the author — Dean Byrne (Quazim0t0):

	```bibtex
	@misc{byrne2026daisychain,
	title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
	author = {Byrne, Dean},
	year = {2026},
	howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
	note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
	from Carbon-500M behind a learned router}
	}
	```