🌼 DaisyChainAI
We build capable systems by daisy-chaining a handful of small, sharp specialists behind a learned router, instead of training one giant model to do everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together, they behave like one model at a fraction of the active compute.
What "daisy-chaining" means
A daisy chain links independent units in series so a signal can flow from one to the next, each unit handling what it's good at and passing the rest along. That's exactly how our systems work:
- Each link is one small specialist: a dense ~74M model trained on a single domain. It is excellent at its own data and (deliberately) surprised by everything else.
- The router is the connector between links. When an input arrives, every specialist reports how surprised it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work to the link that's most at home with it.
- The chain grows link by link. Because the specialists are trained separately, you can chain a new domain on without retraining the others: add a link, extend the router, done.
- One link runs per query. Only the routed specialist computes, so a chain of four ~74M experts activates ~74M parameters per token, roughly 7× cheaper than a 500M monolith of comparable scope (500M / 74M ≈ 6.8).
So "DaisyChain" is both the brand and the mechanism: a chain of specialists, connected by routing, that you extend one flower at a time.
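The routing idea above can be sketched in a few lines. This is a deliberately simplified illustration, not the project's router: it assumes the simplest possible rule (pick the specialist with the lowest surprise), whereas the real router is learned and also uses hidden states. The function names and the surprise values are hypothetical.

```python
import math

def bits_per_base(log_probs):
    """Mean surprise in bits/base from natural-log per-base probabilities."""
    return -sum(log_probs) / (len(log_probs) * math.log(2))

def route(surprise_by_specialist):
    """Return the name of the least-surprised specialist (argmin of bits/base)."""
    return min(surprise_by_specialist, key=surprise_by_specialist.get)

# Hypothetical surprise scores (bits/base) reported by each link for one input.
surprise = {
    "eukaryote": 1.95,
    "prokaryote": 1.71,
    "mRNA": 1.88,
    "mRNA-splice": 1.92,
}
chosen = route(surprise)

# Active-compute arithmetic from the text: one ~74M link vs. a 500M monolith.
ratio = 500e6 / 74e6
print(f"routed to {chosen}; monolith/active ratio = {ratio:.2f}")
```

A learned router would replace `route` with a small classifier over these surprise scores plus hidden-state features; the argmin rule is only the baseline intuition.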
🛠️ How the models are built
Each specialist is grown by interleaving two steps, per domain:
- Continued pretraining: next-token training on only that domain's data, so the specialist becomes genuinely crisp on its home distribution (and the router can tell the links apart).
- Per-domain distillation: the specialist is distilled from a larger teacher foundation model restricted to its own domain (soft-target KD, plus a factorized per-nucleotide variant where the teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic clone; the specialization is what makes routing work.
We iterate those two steps until each link is as strong as its capacity allows, then train the router. In lineage this is a Cluster-Branch-Train-Merge (c-BTM) mixture of domain experts (independent experts plus perplexity-aware routing) with iterative distillation from a larger teacher.
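For readers unfamiliar with soft-target KD, here is a minimal sketch of the loss for a single prediction. It assumes the standard textbook formulation (KL divergence between temperature-softened teacher and student distributions, scaled by T²); the temperature value and function names are assumptions, and the project's actual loss, including the factorized per-nucleotide variant, may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over a 4-way vocabulary.
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.0, 1.0, -0.5, 0.2]
print(soft_target_kd_loss(student, teacher))
```

The loss is zero when the student matches the teacher exactly and grows as their softened distributions diverge.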
🧬 Current project: DaisyChain Genomics
Four DNA/RNA specialists (eukaryote · prokaryote · mRNA · mRNA-splice; ~74M each, ≈295M total, under 500M), each distilled per-domain from Carbon-500M behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial) maps one-to-one onto our four specialists.
Where it actually stands (measured on Carbon's own base-pair / FNS metric)
We score likelihood the way Carbon does: marginalizing each 6-mer into six per-base distributions and taking the mean per-base log-prob (`score_sequence`). Our implementation reproduces Carbon's `compute_bp_probs` to within 6e-08, so these are apples-to-apples.
| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| Routing accuracy (held-out) | 99.8% | n/a |
| Likelihood, base-pair bits/base (↓) | 1.876 | 1.787 |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |
Honest standing: about +0.089 bits/base behind (1.876 vs. 1.787), and no single domain beats Carbon yet. The gap is concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest. Note that Carbon-500M is itself a draft model, explicitly "not designed to be competitive on downstream benchmarks", so it is a fair, achievable target, unlike the 3B/8B flagships.
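The per-base scoring described at the top of this section can be sketched as follows. This is an illustration only, not Carbon's code: it assumes each 6-mer token index encodes its bases in base 4 with the first base most significant and the alphabet ordered A, C, G, T, and all function names are hypothetical.

```python
import math

BASES = "ACGT"  # assumed alphabet order; the real tokenizer may differ

def kmer_index(kmer):
    """Encode a k-mer as a base-4 integer, first base most significant (assumed)."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return index

def kmer_to_base_marginals(kmer_probs, k=6):
    """Marginalize a 4**k distribution into k per-base distributions (k x 4)."""
    marginals = [[0.0] * 4 for _ in range(k)]
    for index, p in enumerate(kmer_probs):
        for pos in range(k):
            digit = (index // 4 ** (k - 1 - pos)) % 4
            marginals[pos][digit] += p
    return marginals

def mean_bits_per_base(kmer_probs_seq, true_kmers):
    """Mean surprise in bits/base of the true bases under the per-base marginals."""
    total, count = 0.0, 0
    for probs, kmer in zip(kmer_probs_seq, true_kmers):
        marginals = kmer_to_base_marginals(probs, k=len(kmer))
        for pos, base in enumerate(kmer):
            total += math.log2(marginals[pos][BASES.index(base)])
            count += 1
    return -total / count

# A uniform 4096-way prediction carries exactly 2 bits of surprise per base.
uniform = [1.0 / 4096] * 4096
print(mean_bits_per_base([uniform], ["ACGTAC"]))
```

Scoring on the marginals rather than on the 4096-way token is what makes the number comparable across models with different tokenizations.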
- 📦 Model: DaisyChainAI/daisychain-genomics
- 🎮 Live demo: Daisychain-Genomics-Demo. Paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.
Build log: what we got right, and what we got wrong
We build in the open, mistakes included. This project's honest history:
What worked
- Per-domain specialists + a learned router reached 99.8% held-out routing, with one ~74M model active per query.
- Snapshot-then-pick-best distillation: snapshot every few thousand steps, deploy the snapshot with the best held-out score, never the last one. This caught over-distillation (models that memorize the distill cache and regress on held-out data) and made every round regression-guarded.
- Re-fitting the router after every specialist swap. Router features are coupled to the checkpoints; skipping the re-fit once produced a fake "regression" that was pure routing drift.
- FNS per-base distillation targets: distilling the teacher's base-pair marginals, not the 4096-way 6-mer distribution, gave the small students a tractable, base-pair-correct objective.
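The snapshot-then-pick-best rule from the list above is simple enough to show directly. This is a minimal sketch with made-up numbers; the snapshot interval, the score values, and the function name are all hypothetical, and it assumes the held-out score is bits/base (lower is better).

```python
def pick_best_snapshot(snapshots):
    """Return the (step, held_out_bits_per_base) pair with the lowest score."""
    return min(snapshots, key=lambda snapshot: snapshot[1])

# Hypothetical held-out scores recorded every 2000 distillation steps.
history = [(2000, 1.95), (4000, 1.90), (6000, 1.88), (8000, 1.91)]

best_step, best_score = pick_best_snapshot(history)
last_step, last_score = history[-1]

# If the final snapshot is worse than the best one, the run over-distilled.
over_distilled = last_score > best_score
print(f"deploy step {best_step} ({best_score}), not step {last_step} ({last_score})")
```

Deploying `best_step` instead of `last_step` is what makes each round regression-guarded: a round can fail to improve, but it cannot ship a worse model.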
What we got wrong (and corrected)
- We reported the wrong metric for days. We measured likelihood as 6-mer cross-entropy (a softer proxy) instead of Carbon's base-pair (FNS) score. The proxy flattered us: it showed ~+0.043 behind and even "splice beats Carbon." On Carbon's actual metric the gap is +0.089 and no domain is ahead. We re-baselined the entire project history on the real metric.
- We measured sequence recovery with the wrong decoder (6-mer argmax) instead of Carbon's FNS base-level argmax. Re-measuring with their decoder changed the numbers (and actually raised our bacteria recovery).
- An early eval had a frame-alignment bug: feeding a context length not divisible by 6 knocked our 6-mer model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
- Decoding took several wrong turns before matching Carbon: greedy with no repetition control (collapsed to homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual base-pair FNS decoder (top-p at the 6-mer level, then per-base selection).
- One training round improved the proxy while regressing the real metric (an early mRNA distill-only pass): invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
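The frame-alignment fix from the list above amounts to making the context length a multiple of 6 before tokenizing. A minimal sketch follows; trimming from the left (so the end of the context stays in phase with whatever follows it) is an assumption, as are the function names.

```python
def align_to_kmer_grid(sequence, k=6):
    """Drop leading bases so the length is a multiple of k (left-trim is assumed)."""
    remainder = len(sequence) % k
    return sequence[remainder:]

def to_kmers(sequence, k=6):
    """Split an aligned sequence into non-overlapping k-mers."""
    if len(sequence) % k != 0:
        raise ValueError("sequence length must be a multiple of k")
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

context = "ACGTACGTAC"  # 10 bases: 4 too many for the 6-mer grid
aligned = align_to_kmer_grid(context)
print(aligned, to_kmers(aligned))
```

Raising on a misaligned input, rather than silently padding or truncating, is one way to make an out-of-phase evaluation fail loudly instead of producing a near-zero score.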
The lesson: measure the way the baseline measures, or you aren't comparing anything. A stricter, honest evaluation didn't sink the project; it pointed to exactly which domains to attack and which "wins" were illusions.
More links on the chain, and more chains, coming. 🌼
Citation
If you use these models, please cite the author, Dean Byrne (Quazim0t0):
@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain from Carbon-500M behind a learned router}
}