# 🌼 DaisyChainAI

We build capable systems by *daisy-chaining* a handful of
small, sharp specialists behind a learned router, instead of training one giant model to do
everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together,
they behave like one model at a fraction of the active compute.

---

## What "daisy-chaining" means

A **daisy chain** links independent units in series so a signal can flow from one to the next,
each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

- **Each link is one small specialist:** a dense ~74M model trained on a *single* domain. It is
  excellent at its own data and (deliberately) surprised by everything else.
- **The router is the connector between links.** When an input arrives, every specialist reports how
  *surprised* it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work
  to the link that's most at home with it.
- **The chain grows link by link.** Because the specialists are trained *separately*, you can chain a
  new domain on without retraining the others: add a link, extend the router, done.
- **One link runs per query.** Only the routed specialist computes, so a chain of four ~74M experts
  costs ~74M of compute per token, roughly **7× cheaper** than a 500M monolith of comparable scope.

So "DaisyChain" is both the brand and the mechanism: **a chain of specialists, connected by routing,
that you extend one flower at a time.**
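
As a rough illustration of the routing idea, the sketch below picks the specialist that reports the lowest surprise and works out the active-compute ratio. It is a deliberately simplified stand-in: the real router is learned and also consumes hidden states, whereas this is a plain arg-min over bits/base. The function names and the example scores are hypothetical.

```python
from typing import Dict


def route_by_surprise(bits_per_base: Dict[str, float]) -> str:
    """Return the specialist with the lowest reported surprise (bits/base).

    Assumption: a plain arg-min baseline, not the learned router itself.
    """
    if not bits_per_base:
        raise ValueError("need at least one specialist score")
    return min(bits_per_base, key=bits_per_base.get)


def active_compute_ratio(monolith_params: float, specialist_params: float) -> float:
    """How many times more parameters the monolith activates per token."""
    return monolith_params / specialist_params


# Hypothetical surprise scores for a single input (not measured values).
scores = {"eukaryote": 1.93, "prokaryote": 1.71, "mRNA": 1.88, "mRNA-splice": 1.95}
chosen = route_by_surprise(scores)

# 500M monolith vs. one ~74M specialist: about 6.76, i.e. roughly 7x.
ratio = active_compute_ratio(500e6, 74e6)
```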

---

## 🛠️ How the models are built

Each specialist is grown by **interleaving two steps**, per domain:

1. **Continued pretraining:** next-token training on *only* that domain's data, so the specialist
   becomes genuinely crisp on its home distribution (and the router can tell the links apart).
2. **Per-domain distillation:** the specialist is distilled from a larger teacher foundation model
   *restricted to its own domain* (soft-target KD, plus a factorized per-nucleotide variant where the
   teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic
   clone; the specialization is what makes routing work.

We iterate those two steps until each link is as strong as its capacity allows, then train the
**router**. In lineage this is a **cluster Branch-Train-Merge (cBTM)** mixture of domain experts
(independent experts + perplexity-aware routing) with iterative distillation from a larger teacher.
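
For readers unfamiliar with soft-target KD, here is a minimal sketch of one common formulation: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The temperature value, the scaling, and the function names are assumptions for illustration; the exact loss used for these specialists is not specified here.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,  # assumed value, not taken from the project
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Identical logits give zero loss; mismatched logits give a positive loss.
same = soft_target_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = soft_target_kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```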

---

## 🧬 Current project: DaisyChain Genomics

Four DNA/RNA specialists (**eukaryote · prokaryote · mRNA · mRNA-splice**, ~74M each, **≈295M total,
under 500M**), each distilled per-domain from **[Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**
behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial)
maps one-to-one onto our four specialists.

### Where it actually stands (measured on Carbon's own base-pair / FNS metric)

We score likelihood the way Carbon does: marginalizing each 6-mer into six per-base distributions and
taking mean per-base log-prob (`score_sequence`). Our implementation reproduces Carbon's `compute_bp_probs`
to within **6e-08**, so these are apples-to-apples.
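
The marginalization described above can be sketched in a few lines. This is an illustrative reimplementation under simplifying assumptions (a plain dictionary over all 4096 6-mers, no special tokens, a single position scored), not Carbon's `compute_bp_probs` or `score_sequence`; the helper names are hypothetical.

```python
import math
from itertools import product
from typing import Dict, List

BASES = "ACGT"
K = 6


def marginalize_kmer(kmer_probs: Dict[str, float]) -> List[Dict[str, float]]:
    """Collapse a distribution over 6-mers into six per-base distributions."""
    per_base = [{b: 0.0 for b in BASES} for _ in range(K)]
    for kmer, p in kmer_probs.items():
        for pos, base in enumerate(kmer):
            per_base[pos][base] += p
    return per_base


def bits_per_base(per_base: List[Dict[str, float]], true_kmer: str) -> float:
    """Mean negative log2-probability of the true base at each position."""
    return -sum(math.log2(per_base[i][b]) for i, b in enumerate(true_kmer)) / K


# A uniform 6-mer distribution marginalizes to 0.25 per base, i.e. 2 bits/base.
uniform = {"".join(k): 1 / 4 ** K for k in product(BASES, repeat=K)}
marginals = marginalize_kmer(uniform)
uniform_bits = bits_per_base(marginals, "ACGTAC")
```

Two bits per base is the no-information ceiling for a four-letter alphabet, which gives a reference point for the scores in the table.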

| | DaisyChain | Carbon-500M |
|---|---|---|
| **Routing accuracy** (held-out) | **100.0%** | n/a |
| **Likelihood, base-pair bits/base** (↓) | **1.875** | **1.787** |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |

**Honest standing: ~+0.088 bits/base behind, and no single domain beats Carbon yet.** The gap is
concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest.
Note Carbon-500M is itself a *draft model*, explicitly "not designed to be competitive on downstream
benchmarks", so it's a fair, achievable target, not the 3B/8B flagships.

- 📦 **Model:** [`DaisyChainAI/daisychain-genomics`](https://huggingface.co/DaisyChainAI/daisychain-genomics)
- 🎮 **Live demo:** [`Daisychain-Genomics-Demo`](https://huggingface.co/spaces/DaisyChainAI/Daisychain-Genomics-Demo): paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.

---

## Build log: what we got right, and what we got wrong

We build in the open, mistakes included. This project's honest history:

**What worked**

- **Per-domain specialists + a learned router** reached **100%** held-out routing accuracy, with one ~74M model active per query.
- **Snapshot-then-pick-best** distillation: snapshot every few thousand steps, deploy the snapshot with the
  best *held-out* score, never the last one. This caught over-distillation (models that memorize the distill
  cache and regress on held-out data) and made every round regression-guarded.
- **Re-fitting the router after every specialist swap.** Router features are coupled to the checkpoints;
  skipping the re-fit once produced a fake "regression" that was pure routing drift.
- **FNS per-base distillation targets:** distilling the teacher's *base-pair* marginals, not the 4096-way
  6-mer distribution, gave the small students a tractable, base-pair-correct objective.
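
The snapshot-then-pick-best rule is simple to express in code. The sketch below assumes each snapshot has already been scored on held-out data in bits/base (lower is better); the step numbers and scores are made up for illustration, and `pick_best_snapshot` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def pick_best_snapshot(held_out_bits: Dict[int, float]) -> Tuple[int, float]:
    """Return (step, score) of the snapshot with the lowest held-out bits/base."""
    if not held_out_bits:
        raise ValueError("no snapshots were evaluated")
    step = min(held_out_bits, key=held_out_bits.get)
    return step, held_out_bits[step]


# Hypothetical run: the score improves, then regresses as the student
# starts to over-fit the distillation cache.
snapshots = {2000: 1.95, 4000: 1.90, 6000: 1.88, 8000: 1.91}
best_step, best_score = pick_best_snapshot(snapshots)

# The final snapshot (step 8000) is not the one that gets deployed.
last_step = max(snapshots)
```

Selecting on held-out score rather than on recency is what makes a round regression-guarded: a later snapshot only ships if it actually scores better.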

**What we got wrong (and corrected)**

- **We reported the wrong metric for days.** We measured likelihood as **6-mer cross-entropy** (a softer proxy)
  instead of Carbon's **base-pair (FNS)** score. The proxy flattered us: it showed ~+0.043 behind and even
  "splice beats Carbon." On Carbon's actual metric the gap is **~+0.088 and no domain is ahead.** We re-baselined
  the entire project history on the real metric.
- **We measured sequence recovery with the wrong decoder** (6-mer argmax) instead of Carbon's **FNS base-level
  argmax**. Re-measuring with their decoder changed the numbers (and actually *raised* our bacteria recovery).
- **An early eval had a frame-alignment bug:** feeding a context length not divisible by 6 knocked our 6-mer
  model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
- **Decoding took several wrong turns** before matching Carbon: greedy with no repetition control (collapsed to
  homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual
  **base-pair FNS decoder** (top-p at the 6-mer level → per-base selection).
- **One training round improved the proxy while regressing the real metric** (an early mRNA distill-only pass),
  invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
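
To make the frame-alignment fix concrete, here is a minimal sketch of trimming a context so its length is a multiple of 6 before tokenizing. It assumes non-overlapping 6-mer tokens and trims from the left so that the continuation starts on a 6-mer boundary; which end to trim depends on the pipeline, and both helper names are hypothetical.

```python
from typing import List

K = 6


def align_to_kmer_grid(context: str, k: int = K) -> str:
    """Drop leading bases so that len(context) is a multiple of k."""
    remainder = len(context) % k
    return context[remainder:]


def tokenize_kmers(seq: str, k: int = K) -> List[str]:
    """Split a sequence into non-overlapping k-mers; refuse misaligned input."""
    if len(seq) % k != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of {k}")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


# A 9-base context is 3 bases off the grid; trimming restores alignment.
raw_context = "ACGTACGTA"
aligned = align_to_kmer_grid(raw_context)
tokens = tokenize_kmers(aligned)
```

Raising on misaligned input, rather than silently padding or truncating, turns an out-of-phase evaluation into an immediate error instead of a suspicious recovery score.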

**The lesson:** *measure the way the baseline measures, or you aren't comparing anything.* A stricter, honest
evaluation didn't sink the project; it pointed to exactly which domains to attack and which "wins" were illusions.

More links on the chain, and more chains, coming. 🌼

| ## Citation | |
**If you use these models, please cite the author, Dean Byrne (Quazim0t0):**
| ```bibtex | |
| @misc{byrne2026daisychain, | |
| title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists}, | |
| author = {Byrne, Dean}, | |
| year = {2026}, | |
| howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}}, | |
| note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain | |
| from Carbon-500M behind a learned router} | |
| } | |
| ``` | |