---
title: README
emoji: 📈
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# 🌼 DaisyChainAI

We build capable systems by *daisy-chaining* a handful of small, sharp specialists behind a learned router, instead of training one giant model to do everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together, they behave like one model at a fraction of the active compute.

---

## 🔗 What "daisy-chaining" means

A **daisy chain** links independent units in series so a signal can flow from one to the next, each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

- **Each link is one small specialist**: a dense ~74M model trained on a *single* domain. It is excellent at its own data and (deliberately) surprised by everything else.
- **The router is the connector between links.** When an input arrives, every specialist reports how *surprised* it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work to the link that's most at home with it.
- **The chain grows link by link.** Because the specialists are trained *separately*, you can chain a new domain on without retraining the others: add a link, extend the router, done.
- **One link runs per query.** Only the routed specialist computes, so a chain of four ~74M experts costs ~74M of compute per token, roughly **7× cheaper** than a 500M monolith of comparable scope.

So "DaisyChain" is both the brand and the mechanism: **a chain of specialists, connected by routing, that you extend one flower at a time.**

---

## 🛠️ How the models are built

Each specialist is grown by **interleaving two steps**, per domain:

1. **Continued pretraining**: next-token training on *only* that domain's data, so the specialist becomes genuinely crisp on its home distribution (and the router can tell the links apart).
2.
   **Per-domain distillation**: the specialist is distilled from a larger teacher foundation model *restricted to its own domain* (soft-target KD, plus a factorized per-nucleotide variant where the teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic clone; the specialization is what makes routing work.

We iterate those two steps until each link is as strong as its capacity allows, then train the **router**. In lineage this is a **cluster Branch-Train-Merge (cBTM)** mixture of domain experts (independent experts plus perplexity-aware routing) with iterative distillation from a larger teacher.

---

## 🧬 Current project: DaisyChain Genomics

Four DNA/RNA specialists (**eukaryote · prokaryote · mRNA · mRNA-splice**, ~74M each, **≈295M total, under 500M**), each distilled per-domain from **[Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)** behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial) maps one-to-one onto our four specialists.

### Where it actually stands (measured on Carbon's own base-pair / FNS metric)

We score likelihood the way Carbon does: marginalizing each 6-mer into six per-base distributions and taking mean per-base log-prob (`score_sequence`). Our implementation reproduces Carbon's `compute_bp_probs` to **6e-08**, so these are apples-to-apples.

| | DaisyChain | Carbon-500M |
|---|---|---|
| **Routing accuracy** (held-out) | **99.8%** | n/a |
| **Likelihood, base-pair bits/base** (↓) | **1.876** | **1.787** |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |

**Honest standing: ~+0.089 bits/base behind, and no single domain beats Carbon yet.** The gap is concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest.
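
The base-pair scoring described above can be illustrated with a small sketch. This is not Carbon's `compute_bp_probs` or our `score_sequence`: the function names, the `ACGT` base order, and the base-4 indexing of 6-mers (first base most significant) are assumptions made for illustration only. Each 4096-way 6-mer distribution is summed down to six 4-way per-base marginals, and the score is the mean negative log2-probability of the observed bases.

```python
import math

BASES = "ACGT"  # assumed base order; the real tokenizer may differ
K = 6           # k-mer size


def marginalize_kmer_probs(p_kmer):
    """Turn one distribution over 4**K k-mers into K per-position base marginals.

    Assumes k-mer index = sum(base_i * 4**(K - 1 - i)), i.e. the first base
    is the most significant digit. Returns a K x 4 nested list.
    """
    marginals = [[0.0] * 4 for _ in range(K)]
    for index, p in enumerate(p_kmer):
        for pos in range(K):
            base = (index // 4 ** (K - 1 - pos)) % 4
            marginals[pos][base] += p
    return marginals


def bits_per_base(kmer_probs, sequence):
    """Mean negative log2-probability of the observed bases.

    kmer_probs: one 4**K-long distribution per 6-mer token.
    sequence:   the observed bases, length K * len(kmer_probs).
    """
    if not sequence or len(sequence) != K * len(kmer_probs):
        raise ValueError("sequence length must be a non-zero multiple of K")
    total_bits = 0.0
    for t, p_kmer in enumerate(kmer_probs):
        marginals = marginalize_kmer_probs(p_kmer)
        for pos in range(K):
            base = BASES.index(sequence[K * t + pos])
            # math.log2 raises ValueError if the marginal is exactly zero
            total_bits -= math.log2(marginals[pos][base])
    return total_bits / len(sequence)
```

A uniform 6-mer distribution gives every base a marginal of 0.25, so it scores exactly 2 bits/base under this sketch, which makes a convenient sanity check.
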
Note that Carbon-500M is itself a *draft model*, explicitly "not designed to be competitive on downstream benchmarks", so it's a fair, achievable target, not the 3B/8B flagships.

- 📦 **Model:** [`DaisyChainAI/daisychain-genomics`](https://huggingface.co/DaisyChainAI/daisychain-genomics)
- 🎮 **Live demo:** [`Daisychain-Genomics-Demo`](https://huggingface.co/spaces/DaisyChainAI/Daisychain-Genomics-Demo): paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.

---

## 📓 Build log: what we got right, and what we got wrong

We build in the open, mistakes included. This project's honest history:

**What worked**

- **Per-domain specialists + a learned router** reached **99.8%** held-out routing accuracy, with one ~74M model active per query.
- **Snapshot-then-pick-best** distillation: snapshot every few thousand steps, deploy the snapshot with the best *held-out* score, never the last one. This caught over-distillation (models that memorize the distill cache and regress on held-out data) and made every round regression-guarded.
- **Re-fitting the router after every specialist swap.** Router features are coupled to the checkpoints; skipping the re-fit once produced a fake "regression" that was pure routing drift.
- **FNS per-base distillation targets**: distilling the teacher's *base-pair* marginals, not the 4096-way 6-mer distribution, gave the small students a tractable, base-pair-correct objective.

**What we got wrong (and corrected)**

- **We reported the wrong metric for days.** We measured likelihood as **6-mer cross-entropy** (a softer proxy) instead of Carbon's **base-pair (FNS)** score. The proxy flattered us: it showed ~+0.043 behind and even "splice beats Carbon." On Carbon's actual metric the gap is **+0.089 and no domain is ahead.** We re-baselined the entire project history on the real metric.
- **We measured sequence recovery with the wrong decoder** (6-mer argmax) instead of Carbon's **FNS base-level argmax**. Re-measuring with their decoder changed the numbers (and actually *raised* our bacteria recovery).
- **An early eval had a frame-alignment bug**: feeding a context length not divisible by 6 knocked our 6-mer model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
- **Decoding took several wrong turns** before matching Carbon: greedy with no repetition control (collapsed to homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual **base-pair FNS decoder** (top-p at the 6-mer level → per-base selection).
- **One training round improved the proxy while regressing the real metric** (an early mRNA distill-only pass): invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.

**The lesson:** *measure the way the baseline measures, or you aren't comparing anything.* A stricter, honest evaluation didn't sink the project; it pointed to exactly which domains to attack and which "wins" were illusions.

More links on the chain, and more chains, coming. 🌼

## Citation

**If you use these models, please cite the author, Dean Byrne (Quazim0t0):**

```bibtex
@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain from Carbon-500M behind a learned router}
}
```
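
As a footnote to the frame-alignment bug in the build log: one way to keep a 6-mer model in phase is to trim the context to a multiple of 6 before tokenizing. The helper below is a hypothetical sketch, not the project's actual fix. The name `align_to_kmer_grid` is made up, and trimming from the left (so the most recent bases stay next to the generation point) is an assumption.

```python
def align_to_kmer_grid(context: str, k: int = 6) -> str:
    """Drop leading bases so that len(context) is a multiple of k.

    Assumption: the leftover bases are removed from the start, which keeps
    the end of the context (where generation continues) untouched.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    extra = len(context) % k
    return context[extra:]


if __name__ == "__main__":
    ctx = "ACGTACGTAC"             # 10 bases, 4 too many for one 6-mer
    aligned = align_to_kmer_grid(ctx)
    print(aligned, len(aligned))   # the trimmed context and its length
```

A context that is already on the grid passes through unchanged, so the helper is safe to apply unconditionally before every evaluation call.
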