---
title: README
emoji: 📈
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---
# 🌼 DaisyChainAI

We build capable systems by *daisy-chaining* a handful of
small, sharp specialists behind a learned router, instead of training one giant model to do
everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together,
they behave like one model at a fraction of the active compute.

---
## 🔗 What "daisy-chaining" means

A **daisy chain** links independent units in series so a signal can flow from one to the next,
each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

- **Each link is one small specialist:** a dense ~74M model trained on a *single* domain. It is
excellent at its own data and (deliberately) surprised by everything else.
- **The router is the connector between links.** When an input arrives, every specialist reports how
*surprised* it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work
to the link that's most at home with it.
- **The chain grows link by link.** Because the specialists are trained *separately*, you can chain a
new domain on without retraining the others: add a link, extend the router, done.
- **One link runs per query.** Only the routed specialist computes, so a chain of four ~74M experts
uses only ~74M active parameters per token, roughly **7× fewer** than a 500M monolith of comparable scope.

So "DaisyChain" is both the brand and the mechanism: **a chain of specialists, connected by routing,
that you extend one flower at a time.**

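
As a rough illustration of the routing idea above, the sketch below scores an input under each specialist in bits/base and hands it to the least surprised one. This is a deliberately simplified stand-in: the real router is learned and also reads hidden states, which are not modeled here. The function names and the toy probabilities are hypothetical.

```python
import math

def bits_per_base(base_probs):
    """Mean surprise in bits/base, given the probability a model gave each observed base."""
    return -sum(math.log2(p) for p in base_probs) / len(base_probs)

def route_by_surprise(scores):
    """Pick the specialist with the lowest bits/base (simplified, not the learned router)."""
    return min(scores, key=scores.get)

# Made-up probabilities each specialist assigned to the same four observed bases.
toy_probs = {
    "eukaryote": [0.25, 0.30, 0.20, 0.25],
    "prokaryote": [0.50, 0.60, 0.40, 0.50],
    "mrna": [0.25, 0.25, 0.25, 0.25],
    "mrna_splice": [0.20, 0.25, 0.30, 0.20],
}
scores = {name: bits_per_base(probs) for name, probs in toy_probs.items()}
chosen = route_by_surprise(scores)

# Active-parameter ratio from the text: a 500M monolith vs. one ~74M specialist.
active_ratio = 500 / 74

print(chosen, round(active_ratio, 2))
```

With these toy numbers the prokaryote specialist wins because it assigns the highest probabilities, and the ratio works out to a little under 7.
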
---
## 🛠️ How the models are built

Each specialist is grown by **interleaving two steps**, per domain:

1. **Continued pretraining:** next-token training on *only* that domain's data, so the specialist
becomes genuinely crisp on its home distribution (and the router can tell the links apart).
2. **Per-domain distillation:** the specialist is distilled from a larger teacher foundation model
*restricted to its own domain* (soft-target KD, plus a factorized per-nucleotide variant where the
teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic
clone; the specialization is what makes routing work.

We iterate those two steps until each link is as strong as its capacity allows, then train the
**router**. In lineage this is a **cluster Branch-Train-Merge (cBTM)** mixture of domain experts
(independent experts + perplexity-aware routing) with iterative distillation from a larger teacher.

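
For readers unfamiliar with soft-target KD, here is a minimal sketch of the textbook form: a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The temperature value, the function names, and the toy logits are assumptions for illustration; the source does not state the exact loss or hyperparameters used for the specialists.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2

# Toy logits over a 4-symbol vocabulary (made up).
matched = soft_target_kd_loss([1.0, 2.0, 3.0, 0.5], [1.0, 2.0, 3.0, 0.5])
mismatched = soft_target_kd_loss([3.0, 2.0, 1.0, 0.5], [1.0, 2.0, 3.0, 0.5])
print(matched, mismatched)
```

The loss is zero when the student already matches the teacher and grows as the two distributions diverge. In the per-domain setting described above, the batches fed to it would come from a single domain only.
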
---
## 🧬 Current project: DaisyChain Genomics

Four DNA/RNA specialists (**eukaryote · prokaryote · mRNA · mRNA-splice**, ~74M each, **≈295M total,
under 500M**), each distilled per-domain from **[Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**,
sit behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial)
maps one-to-one onto our four specialists.

### Where it actually stands (measured on Carbon's own base-pair / FNS metric)

We score likelihood the way Carbon does: marginalizing each 6-mer into six per-base distributions and
taking the mean per-base log-prob (`score_sequence`). Our implementation reproduces Carbon's `compute_bp_probs`
to within **6e-08**, so these are apples-to-apples.

| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| **Routing accuracy** (held-out) | **100.0%** | n/a |
| **Likelihood, base-pair bits/base** (↓) | **1.875** | **1.787** |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |

**Honest standing: about 0.088 bits/base behind Carbon (1.875 vs. 1.787), and no single domain beats Carbon yet.** The gap is
concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest.
Note that Carbon-500M is itself a *draft model*, explicitly "not designed to be competitive on downstream
benchmarks", so it is a fair, achievable target, unlike the 3B/8B flagships.
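
The base-pair scoring described at the top of this section (marginalize each 6-mer into six per-base distributions, then average the per-base log-probs) can be sketched as follows. This is not Carbon's `compute_bp_probs` or our `score_sequence`. The helper names are hypothetical, and the lexicographic A/C/G/T vocabulary order is an assumption, since the real tokenizer order is not given here.

```python
import math
from itertools import product

BASES = "ACGT"
K = 6
# Assumed vocabulary order: lexicographic over A, C, G, T (4**6 = 4096 entries).
KMERS = ["".join(p) for p in product(BASES, repeat=K)]

def marginalize_kmer(kmer_probs):
    """Collapse a distribution over 4096 6-mers into six per-base distributions."""
    per_base = [dict.fromkeys(BASES, 0.0) for _ in range(K)]
    for kmer, prob in zip(KMERS, kmer_probs):
        for position, base in enumerate(kmer):
            per_base[position][base] += prob
    return per_base

def mean_per_base_log2_prob(per_base, observed_kmer):
    """Average log2-probability of the observed bases under the per-base marginals."""
    total = sum(math.log2(per_base[i][base]) for i, base in enumerate(observed_kmer))
    return total / len(observed_kmer)

# A uniform 6-mer distribution marginalizes to 0.25 per base at every position.
uniform = [1.0 / len(KMERS)] * len(KMERS)
per_base = marginalize_kmer(uniform)
score = mean_per_base_log2_prob(per_base, "ACGTAC")
print(score)  # -2.0, i.e. 2 bits/base for a model that knows nothing
```

Negating the mean log2-probability gives bits/base, the unit used in the table above. A uniform model lands at 2 bits/base, so both 1.875 and 1.787 sit below that ceiling.
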
- 📦 **Model:** [`DaisyChainAI/daisychain-genomics`](https://huggingface.co/DaisyChainAI/daisychain-genomics)
- 🎮 **Live demo:** [`Daisychain-Genomics-Demo`](https://huggingface.co/spaces/DaisyChainAI/Daisychain-Genomics-Demo): paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.

---
## 📓 Build log: what we got right, and what we got wrong
We build in the open, mistakes included. This project's honest history:
**What worked**
- **Per-domain specialists + a learned router** reached **100%** held-out routing accuracy, with one ~74M model active per query.
- **Snapshot-then-pick-best** distillation: snapshot every few thousand steps, deploy the snapshot with the
best *held-out* score, never the last one. This caught over-distillation (models that memorize the distill
cache and regress on held-out data) and made every round regression-guarded.
- **Re-fitting the router after every specialist swap.** Router features are coupled to the checkpoints;
skipping the re-fit once produced a fake "regression" that was pure routing drift.
- **FNS per-base distillation targets:** distilling the teacher's *base-pair* marginals, not the 4096-way
6-mer distribution, gave the small students a tractable, base-pair-correct objective.
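
The snapshot-then-pick-best rule from the list above can be sketched in a few lines. The record layout, the function names, and the scores are hypothetical; the only parts taken from the text are "deploy the best held-out snapshot, not the last one" and "guard against regressions".

```python
def pick_best_snapshot(snapshots):
    """Return the snapshot with the lowest held-out bits/base (lower is better)."""
    return min(snapshots, key=lambda snap: snap["heldout_bits_per_base"])

def should_deploy(candidate, deployed_score):
    """Regression guard: only swap in a snapshot that beats the deployed held-out score."""
    return candidate["heldout_bits_per_base"] < deployed_score

# Made-up scores: the last snapshot is worse on held-out data than the middle one,
# which is the over-distillation pattern described above.
snapshots = [
    {"step": 2000, "heldout_bits_per_base": 1.95},
    {"step": 4000, "heldout_bits_per_base": 1.90},
    {"step": 6000, "heldout_bits_per_base": 1.93},
]
best = pick_best_snapshot(snapshots)
print(best["step"], should_deploy(best, deployed_score=1.92))
```

After any swap decided this way, the router would still need re-fitting, since its features depend on the specialist checkpoints.
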
**What we got wrong (and corrected)**
- **We reported the wrong metric for days.** We measured likelihood as **6-mer cross-entropy** (a softer proxy)
instead of Carbon's **base-pair (FNS)** score. The proxy flattered us: it showed ~+0.043 behind and even
"splice beats Carbon." On Carbon's actual metric the gap is **+0.088 and no domain is ahead.** We re-baselined
the entire project history on the real metric.
- **We measured sequence recovery with the wrong decoder** (6-mer argmax) instead of Carbon's **FNS base-level
argmax**. Re-measuring with their decoder changed the numbers (and actually *raised* our bacteria recovery).
- **An early eval had a frame-alignment bug:** feeding a context length not divisible by 6 knocked our 6-mer
model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
- **Decoding took several wrong turns** before matching Carbon: greedy with no repetition control (collapsed to
homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual
**base-pair FNS decoder** (top-p at the 6-mer level → per-base selection).
- **One training round improved the proxy while regressing the real metric** (an early mRNA distill-only pass).
The regression was invisible on 6-mer CE but obvious on base-pair. A later base+distill round fixed it.
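
To make the frame-alignment fix from the list above concrete, here is a minimal sketch of snapping a context to the 6-mer grid before tokenizing. Trimming from the left (keeping the most recent bases) is our assumption for illustration; the text only says the context was aligned to the grid. Both helper names are hypothetical.

```python
K = 6

def align_to_kmer_grid(context, k=K):
    """Drop leading bases so the context length is a multiple of k."""
    remainder = len(context) % k
    return context[remainder:]

def split_into_kmers(context, k=K):
    """Tokenize an aligned context into non-overlapping k-mers."""
    if len(context) % k != 0:
        raise ValueError("context length must be a multiple of k")
    return [context[i:i + k] for i in range(0, len(context), k)]

raw_context = "ACGTACGTAC"  # 10 bases: not a multiple of 6
aligned = align_to_kmer_grid(raw_context)
tokens = split_into_kmers(aligned)
print(aligned, tokens)
```

Raising on a misaligned length, rather than silently padding or truncating, is one way to make an out-of-phase evaluation fail loudly instead of reporting near-zero recovery.
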
**The lesson:** *measure the way the baseline measures, or you aren't comparing anything.* A stricter, honest
evaluation didn't sink the project; it pointed to exactly which domains to attack and which "wins" were illusions.

More links on the chain, and more chains, coming. 🌼

## Citation
**If you use these models, please cite the author, Dean Byrne (Quazim0t0):**
```bibtex
@misc{byrne2026daisychain,
title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
author = {Byrne, Dean},
year = {2026},
howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
from Carbon-500M behind a learned router}
}
```