---
title: README
emoji: ๐
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# 🌼 DaisyChainAI
We build capable systems by *daisy-chaining* a handful of
small, sharp specialists behind a learned router, instead of training one giant model to do
everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together,
they behave like one model at a fraction of the active compute.
---
## What "daisy-chaining" means
A **daisy chain** links independent units in series so a signal can flow from one to the next,
each unit handling what it's good at and passing the rest along. That's exactly how our systems work:
- **Each link is one small specialist**: a dense ~74M-parameter model trained on a *single* domain. It is
excellent at its own data and (deliberately) surprised by everything else.
- **The router is the connector between links.** When an input arrives, every specialist reports how
*surprised* it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work
to the link that's most at home with it.
- **The chain grows link by link.** Because the specialists are trained *separately*, you can chain a
new domain on without retraining the others: add a link, extend the router, done.
- **One link runs per query.** Only the routed specialist computes, so a chain of four ~74M experts
costs ~74M parameters of active compute per token, roughly **7× cheaper** than a 500M monolith of comparable scope.
So "DaisyChain" is both the brand and the mechanism: **a chain of specialists, connected by routing,
that you extend one flower at a time.**
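
As a rough illustration of the routing step described above, the sketch below hands the input to whichever specialist reports the lowest surprise. This is a simplification and an assumption: the actual router is learned and also sees hidden states, and the function name `route` and the surprise values are hypothetical. The last lines check the roughly sevenfold active-compute saving quoted in the list.

```python
from typing import Dict, Tuple


def route(surprise_bits_per_base: Dict[str, float]) -> Tuple[str, float]:
    """Pick the specialist least surprised by the input (lowest bits/base).

    Simplified stand-in for the learned router: the real one is trained and
    also uses hidden-state features; argmin is only the baseline idea.
    """
    if not surprise_bits_per_base:
        raise ValueError("need at least one specialist")
    name = min(surprise_bits_per_base, key=surprise_bits_per_base.get)
    return name, surprise_bits_per_base[name]


# Hypothetical surprise scores for one input (illustrative numbers only).
scores = {"eukaryote": 1.92, "prokaryote": 1.71, "mRNA": 1.88, "mRNA-splice": 1.95}
chosen, bits = route(scores)

# Active compute: one ~74M specialist per query vs. a 500M monolith.
ratio = 500 / 74
print(chosen, bits, round(ratio, 2))
```

The argmin rule is the simplest perplexity-aware policy; a learned router can improve on it when two specialists report similar surprise.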
---
## 🛠️ How the models are built
Each specialist is grown by **interleaving two steps**, per domain:
1. **Continued pretraining**: next-token training on *only* that domain's data, so the specialist
becomes genuinely crisp on its home distribution (and the router can tell the links apart).
2. **Per-domain distillation**: the specialist is distilled from a larger teacher foundation model
*restricted to its own domain* (soft-target KD, plus a factorized per-nucleotide variant where the
teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic
clone; the specialization is what makes routing work.

We iterate those two steps until each link is as strong as its capacity allows, then train the
**router**. In lineage this is a **cluster Branch-Train-Merge (cBTM)** mixture of domain experts
(independent experts + perplexity-aware routing) with iterative distillation from a larger teacher.
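
For readers unfamiliar with soft-target KD, here is a minimal sketch of the loss for a single token position. It assumes the common formulation (KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature); the temperature value, the function names, and the toy logits are illustrative assumptions, not the training code used for these models.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def kd_loss(student_logits: Sequence[float],
            teacher_logits: Sequence[float],
            temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, times T^2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary")
    log_p = log_softmax(teacher_logits, temperature)  # teacher (soft targets)
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Toy 3-way vocabulary: identical logits give zero loss, mismatched ones do not.
loss_same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(loss_same, loss_diff)
```

In practice this term would be averaged over positions and restricted to sequences from the specialist's own domain.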
---
## 🧬 Current project: DaisyChain Genomics
Four DNA/RNA specialists (**eukaryote · prokaryote · mRNA · mRNA-splice**, ~74M each, **≈295M total,
under 500M**), each distilled per-domain from **[Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**
behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial)
maps one-to-one onto our four specialists.
### Where it actually stands (measured on Carbon's own base-pair / FNS metric)
We score likelihood the way Carbon does: marginalizing each 6-mer into six per-base distributions and
taking mean per-base log-prob (`score_sequence`). Our implementation reproduces Carbon's `compute_bp_probs`
to **6e-08**, so these are apples-to-apples.
| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| **Routing accuracy** (held-out) | **100.0%** | n/a |
| **Likelihood, base-pair bits/base** (↓) | **1.875** | **1.787** |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |
**Honest standing: about 0.088 bits/base behind, and no single domain beats Carbon yet.** The gap is
concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest.
Note that Carbon-500M is itself a *draft model*, explicitly "not designed to be competitive on downstream
benchmarks", so it's a fair, achievable target, not the 3B/8B flagships.
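
The base-pair scoring behind the table above can be sketched as follows. This is an illustration of the marginalization idea only, under the assumption that the model emits a probability for each of the 4096 possible 6-mers; the function names are hypothetical and this is neither `score_sequence` nor Carbon's `compute_bp_probs`.

```python
import math
from itertools import product
from typing import Dict, List

BASES = "ACGT"
K = 6
KMERS = ["".join(p) for p in product(BASES, repeat=K)]  # all 4096 6-mers


def per_base_marginals(kmer_probs: Dict[str, float]) -> List[Dict[str, float]]:
    """Collapse one 6-mer distribution into six per-position base distributions."""
    marginals = [{b: 0.0 for b in BASES} for _ in range(K)]
    for kmer, p in kmer_probs.items():
        for pos, base in enumerate(kmer):
            marginals[pos][base] += p
    return marginals


def mean_per_base_log2_prob(kmer_probs: Dict[str, float], true_kmer: str) -> float:
    """Mean log2-probability of the true bases under the per-base marginals."""
    marginals = per_base_marginals(kmer_probs)
    return sum(math.log2(marginals[i][b]) for i, b in enumerate(true_kmer)) / K


# Sanity check: a uniform 6-mer distribution gives uniform per-base marginals,
# i.e. 2 bits/base, the no-information ceiling for a 4-letter alphabet.
uniform = {kmer: 1.0 / len(KMERS) for kmer in KMERS}
bits_per_base = -mean_per_base_log2_prob(uniform, "ACGTAC")
print(bits_per_base)
```

Against that 2 bits/base ceiling, the values in the table (1.875 and 1.787) show how much structure each model captures.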
- 📦 **Model:** [`DaisyChainAI/daisychain-genomics`](https://huggingface.co/DaisyChainAI/daisychain-genomics)
- 🎮 **Live demo:** [`Daisychain-Genomics-Demo`](https://huggingface.co/spaces/DaisyChainAI/Daisychain-Genomics-Demo): paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.
---
## Build log: what we got right, and what we got wrong
We build in the open, mistakes included. This project's honest history:
**What worked**
- **Per-domain specialists + a learned router** reached **100%** held-out routing accuracy, with one ~74M model active per query.
- **Snapshot-then-pick-best** distillation: snapshot every few thousand steps, deploy the snapshot with the
best *held-out* score, never the last one. This caught over-distillation (models that memorize the distill
cache and regress on held-out data) and made every round regression-guarded.
- **Re-fitting the router after every specialist swap.** Router features are coupled to the checkpoints;
skipping the re-fit once produced a fake "regression" that was pure routing drift.
- **FNS per-base distillation targets**: distilling the teacher's *base-pair* marginals, not the 4096-way
6-mer distribution, gave the small students a tractable, base-pair-correct objective.
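
The snapshot-then-pick-best rule from the list above can be sketched in a few lines. The step counts, scores, and the name `pick_best_snapshot` are hypothetical; the only idea taken from the text is that the deployed checkpoint is chosen by held-out score and never by recency.

```python
from typing import Dict, Optional


def pick_best_snapshot(held_out_bits: Dict[int, float],
                       deployed_bits: Optional[float] = None) -> Optional[int]:
    """Return the step of the snapshot with the lowest held-out bits/base.

    Returns None when no snapshot beats the currently deployed score, which
    acts as a regression guard for the round.
    """
    if not held_out_bits:
        return None
    best_step = min(held_out_bits, key=held_out_bits.get)
    if deployed_bits is not None and held_out_bits[best_step] >= deployed_bits:
        return None
    return best_step


# Hypothetical round: the student improves, then over-distills and regresses.
snapshots = {2000: 1.93, 4000: 1.89, 6000: 1.88, 8000: 1.91}
best = pick_best_snapshot(snapshots)                         # not the last step
guarded = pick_best_snapshot(snapshots, deployed_bits=1.87)  # nothing improves
print(best, guarded)
```

After a swap chosen this way, the router would still need its re-fit, as the list notes.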
**What we got wrong (and corrected)**
- **We reported the wrong metric for days.** We measured likelihood as **6-mer cross-entropy** (a softer proxy)
instead of Carbon's **base-pair (FNS)** score. The proxy flattered us: it showed us only ~0.043 behind and even
"splice beats Carbon." On Carbon's actual metric the gap is **+0.088 and no domain is ahead.** We re-baselined
the entire project history on the real metric.
- **We measured sequence recovery with the wrong decoder** (6-mer argmax) instead of Carbon's **FNS base-level
argmax**. Re-measuring with their decoder changed the numbers (and actually *raised* our bacteria recovery).
- **An early eval had a frame-alignment bug**: feeding a context length not divisible by 6 knocked our 6-mer
model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
- **Decoding took several wrong turns** before matching Carbon: greedy with no repetition control (collapsed to
homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual
**base-pair FNS decoder** (top-p at the 6-mer level → per-base selection).
- **One training round improved the proxy while regressing the real metric** (an early mRNA distill-only pass):
invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
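
The frame-alignment fix mentioned in the list above amounts to making the context length a multiple of 6 before it is split into 6-mers. A minimal sketch, assuming non-overlapping 6-mer tokens and assuming the trim is taken from the left so the bases next to the continuation are kept (both are assumptions, and the helper names are hypothetical):

```python
from typing import List

K = 6


def align_to_kmer_grid(context: str, k: int = K) -> str:
    """Drop leading bases so len(context) is a multiple of k."""
    extra = len(context) % k
    return context[extra:]


def split_kmers(seq: str, k: int = K) -> List[str]:
    """Split an aligned sequence into non-overlapping k-mers."""
    if len(seq) % k != 0:
        raise ValueError("sequence length must be a multiple of k")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


raw_context = "ACGTACGTA"                  # 9 bases: out of phase for a 6-mer model
aligned = align_to_kmer_grid(raw_context)  # keeps the last 6 bases
tokens = split_kmers(aligned)
print(aligned, tokens)
```

Raising on a misaligned input, instead of padding or truncating silently, makes this class of bug fail loudly.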
**The lesson:** *measure the way the baseline measures, or you aren't comparing anything.* A stricter, honest
evaluation didn't sink the project; it pointed to exactly which domains to attack and which "wins" were illusions.
More links on the chain, and more chains, coming. 🌼
## Citation
**If you use these models, please cite the author, Dean Byrne (Quazim0t0):**
```bibtex
@misc{byrne2026daisychain,
title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
author = {Byrne, Dean},
year = {2026},
howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
from Carbon-500M behind a learned router}
}
```