title: README
emoji: 📈
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false

🌼 DaisyChainAI

We build capable systems by daisy-chaining a handful of small, sharp specialists behind a learned router, instead of training one giant model to do everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together, they behave like one model at a fraction of the active compute.


🔗 What "daisy-chaining" means

A daisy chain links independent units in series so a signal can flow from one to the next, each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

  • Each link is one small specialist: a dense ~74M model trained on a single domain. It is excellent at its own data and (deliberately) surprised by everything else.
  • The router is the connector between links. When an input arrives, every specialist reports how surprised it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work to the link that's most at home with it.
  • The chain grows link by link. Because the specialists are trained separately, you can chain a new domain on without retraining the others: add a link, extend the router, done.
  • One link runs per query. Only the routed specialist computes, so a chain of four ~74M experts costs ~74M of compute per token, roughly 7× cheaper than a 500M monolith of comparable scope.
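As a rough illustration of the routing idea and the compute arithmetic above, here is a minimal sketch. It is a simplification, not our router: it routes on surprise alone (lowest bits/base wins), whereas the learned router also uses hidden-state features. The link names and surprise values are made up for the example.

```python
import math

def bits_per_base(base_probs):
    """Mean surprise in bits/base, given the probability a model
    assigned to each observed base."""
    return -sum(math.log2(p) for p in base_probs) / len(base_probs)

def route_by_surprise(surprise_by_link):
    """Simplified stand-in for the learned router (assumption):
    pick the link with the lowest bits/base on this input."""
    return min(surprise_by_link, key=surprise_by_link.get)

# Hypothetical per-link surprise on one input (illustrative values only).
surprise = {"link_a": 1.95, "link_b": 1.80, "link_c": 2.05, "link_d": 2.10}
chosen = route_by_surprise(surprise)

# Active-compute arithmetic from the text: one ~74M link vs. a 500M monolith.
active_params = 74e6
monolith_params = 500e6
ratio = monolith_params / active_params  # about 6.76, i.e. "roughly 7x"
```

A model that assigns probability 0.25 to every observed base scores exactly 2 bits/base, which is the "knows nothing" ceiling for a four-letter alphabet.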

So "DaisyChain" is both the brand and the mechanism: a chain of specialists, connected by routing, that you extend one flower at a time.


๐Ÿ› ๏ธ How the models are built

Each specialist is grown by interleaving two steps, per domain:

  1. Continued pretraining: next-token training on only that domain's data, so the specialist becomes genuinely crisp on its home distribution (and the router can tell the links apart).
  2. Per-domain distillation: the specialist is distilled from a larger teacher foundation model restricted to its own domain (soft-target KD, plus a factorized per-nucleotide variant where the teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic clone; the specialization is what makes routing work.

We iterate those two steps until each link is as strong as its capacity allows, then train the router. In lineage this is a cluster Branch-Train-Merge (cBTM) mixture of domain experts (independent experts plus perplexity-aware routing) with iterative distillation from a larger teacher.
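The soft-target distillation in step 2 can be sketched as follows. This is the textbook formulation (KL divergence between temperature-softened teacher and student distributions, scaled by T²), shown on a single position. The temperature value, the T² scaling, and the function names are assumptions for illustration, not a description of our training code.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 (standard soft-target KD; hyperparameters assumed)."""
    p = softmax(teacher_logits, temperature)  # soft targets
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Toy logits over a 3-symbol vocabulary (illustrative values only).
teacher = [2.0, 0.0, -1.0]
matched = soft_target_kd_loss(teacher, teacher)             # student == teacher
mismatched = soft_target_kd_loss([0.0, 0.0, 0.0], teacher)  # uniform student
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.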


🧬 Current project: DaisyChain Genomics

Four DNA/RNA specialists (eukaryote · prokaryote · mRNA · mRNA-splice, ~74M each, ≈295M total, under 500M), each distilled per-domain from Carbon-500M behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial) maps one-to-one onto our four specialists.

Where it actually stands (measured on Carbon's own base-pair / FNS metric)

We score likelihood the way Carbon does: marginalize each 6-mer prediction into six per-base distributions, then take the mean per-base log-prob (score_sequence). Our implementation reproduces Carbon's compute_bp_probs to within 6e-08, so the comparison is apples-to-apples.
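To make the marginalization concrete, here is a minimal sketch of collapsing one 4096-way 6-mer distribution into six 4-way per-base distributions and scoring the true bases. It is not Carbon's compute_bp_probs. The token-index encoding (base-4, first base most significant, A=0, C=1, G=2, T=3) and all function names are assumptions made for this example.

```python
import math

BASES = "ACGT"
K = 6  # 4**6 == 4096 possible 6-mers

def kmer_to_index(kmer):
    """Assumed encoding: base-4 number, first base most significant."""
    idx = 0
    for b in kmer:
        idx = idx * 4 + BASES.index(b)
    return idx

def marginalize_kmer_probs(kmer_probs):
    """Collapse a 4096-way distribution into six 4-way per-base marginals."""
    marginals = [[0.0] * 4 for _ in range(K)]
    for idx, p in enumerate(kmer_probs):
        for pos in range(K):
            base = (idx // 4 ** (K - 1 - pos)) % 4  # base-4 digit at this position
            marginals[pos][base] += p
    return marginals

def mean_per_base_log2_prob(kmer_probs, true_kmer):
    """Mean log2-probability of the true bases under the per-base marginals."""
    marginals = marginalize_kmer_probs(kmer_probs)
    logs = [math.log2(marginals[pos][BASES.index(b)])
            for pos, b in enumerate(true_kmer)]
    return sum(logs) / K

# A uniform 6-mer distribution gives uniform per-base marginals (0.25 each).
uniform = [1.0 / 4 ** K] * 4 ** K
uniform_score = mean_per_base_log2_prob(uniform, "ACGTAC")

# A point mass on the true 6-mer gives probability 1 to every true base.
point_mass = [0.0] * 4 ** K
point_mass[kmer_to_index("ACGTAC")] = 1.0
point_score = mean_per_base_log2_prob(point_mass, "ACGTAC")
```

Negating the mean log2-probability gives bits/base, so the uniform case corresponds to 2 bits/base and the point-mass case to 0.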

| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| Routing accuracy (held-out) | 99.8% | n/a |
| Likelihood, base-pair bits/base (↓) | 1.876 | 1.787 |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |

Honest standing: we are about 0.089 bits/base behind (1.876 vs. 1.787), and no single domain beats Carbon yet. The gap is concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest. Note that Carbon-500M is itself a draft model, explicitly "not designed to be competitive on downstream benchmarks", so it's a fair, achievable target, unlike the 3B/8B flagships.


📓 Build log: what we got right, and what we got wrong

We build in the open, mistakes included. This project's honest history:

What worked

  • Per-domain specialists + a learned router reached 99.8% held-out routing, with one ~74M model active per query.
  • Snapshot-then-pick-best distillation: snapshot every few thousand steps, deploy the snapshot with the best held-out score, never the last one. This caught over-distillation (models that memorize the distill cache and regress on held-out data) and made every round regression-guarded.
  • Re-fitting the router after every specialist swap. Router features are coupled to the checkpoints; skipping the re-fit once produced a fake "regression" that was pure routing drift.
  • FNS per-base distillation targets: distilling the teacher's base-pair marginals, not the 4096-way 6-mer distribution, gave the small students a tractable, base-pair-correct objective.
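The snapshot-then-pick-best rule can be sketched in a few lines. The snapshot history below is made up, and the function names and the "lower held-out bits/base is better" convention are assumptions for illustration.

```python
def pick_best_snapshot(snapshots):
    """snapshots: list of (step, heldout_bits_per_base); lower is better.
    Returns the best snapshot, which is not necessarily the last one."""
    return min(snapshots, key=lambda s: s[1])

def should_deploy(candidate_score, deployed_score):
    """Regression guard: only replace the deployed checkpoint if the
    candidate is strictly better on held-out data."""
    return candidate_score < deployed_score

# Hypothetical held-out scores: improvement, then over-distillation.
history = [(2000, 1.95), (4000, 1.91), (6000, 1.89), (8000, 1.92)]
best_step, best_score = pick_best_snapshot(history)
last_step, last_score = history[-1]

deploy_best = should_deploy(best_score, deployed_score=1.90)
deploy_last = should_deploy(last_score, deployed_score=1.90)
```

In this toy history the final snapshot has regressed on held-out data, so deploying "the last one" would ship a worse model than the step-6000 snapshot.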

What we got wrong (and corrected)

  • We reported the wrong metric for days. We measured likelihood as 6-mer cross-entropy (a softer proxy) instead of Carbon's base-pair (FNS) score. The proxy flattered us: it showed ~+0.043 behind and even "splice beats Carbon." On Carbon's actual metric the gap is +0.089 and no domain is ahead. We re-baselined the entire project history on the real metric.
  • We measured sequence recovery with the wrong decoder (6-mer argmax) instead of Carbon's FNS base-level argmax. Re-measuring with their decoder changed the numbers (and actually raised our bacteria recovery).
  • An early eval had a frame-alignment bug: feeding a context length not divisible by 6 knocked our 6-mer model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
  • Decoding took several wrong turns before matching Carbon: greedy with no repetition control (collapsed to homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual base-pair FNS decoder (top-p at the 6-mer level → per-base selection).
  • One training round (an early mRNA distill-only pass) improved the proxy while regressing the real metric: invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
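The frame-alignment fix above amounts to making the context length a multiple of 6 before tokenizing. A minimal sketch, assuming non-overlapping 6-mer tokens counted from the start of the context and trimming from the left so the context still ends right where generation begins (which end to trim, and the helper names, are assumptions):

```python
K = 6

def align_to_kmer_grid(context, k=K):
    """Drop leading bases so len(context) is a multiple of k.
    Trimming on the left keeps the end of the context intact."""
    excess = len(context) % k
    return context[excess:]

def split_kmers(context, k=K):
    """Split an aligned context into non-overlapping k-mers."""
    if len(context) % k != 0:
        raise ValueError("context length must be a multiple of k")
    return [context[i:i + k] for i in range(0, len(context), k)]

raw = "ACGTACGT"                   # 8 bases: 2 past the 6-mer grid
aligned = align_to_kmer_grid(raw)  # "GTACGT"
tokens = split_kmers(aligned)      # ["GTACGT"]
```

Raising on a misaligned length, rather than silently truncating inside the tokenizer, makes this class of bug fail loudly instead of showing up as a near-zero recovery score.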

The lesson: measure the way the baseline measures, or you aren't comparing anything. A stricter, honest evaluation didn't sink the project; it pointed to exactly which domains to attack and which "wins" were illusions.

More links on the chain, and more chains, coming. 🌼

Citation

If you use these models, please cite the author, Dean Byrne (Quazim0t0):

@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
                  from Carbon-500M behind a learned router}
}