🌼 DaisyChainAI

We build capable systems by daisy-chaining a handful of small, sharp specialists behind a learned router — instead of training one giant model to do everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together, they behave like one model at a fraction of the active compute.

🔗 What "daisy-chaining" means

A daisy chain links independent units in series so a signal can flow from one to the next, each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

Each link is one small specialist — a dense ~74M model trained on a single domain. It is excellent at its own data and (deliberately) surprised by everything else.
The router is the connector between links. When an input arrives, every specialist reports how surprised it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work to the link that's most at home with it.
The chain grows link by link. Because the specialists are trained separately, you can chain a new domain on without retraining the others — add a link, extend the router, done.
One link runs per query. Once the router has chosen, only the routed specialist computes, so a chain of four ~74M experts uses ~74M active parameters per token, roughly 7× fewer than a 500M monolith of comparable scope (500M / 74M ≈ 6.8).

So "DaisyChain" is both the brand and the mechanism: a chain of specialists, connected by routing, that you extend one flower at a time.
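
As a rough illustration of the routing step described above, the sketch below picks a specialist from per-specialist surprise scores. It is a minimal sketch under stated assumptions, not the project's router: the function names, the linear scoring, and the example numbers are hypothetical, and the hidden-state features mentioned above are left out for brevity.

```python
import math

DOMAINS = ["eukaryote", "prokaryote", "mRNA", "mRNA-splice"]

def bits_per_base(log_probs):
    """Mean surprise in bits/base from per-base natural-log probabilities."""
    return -sum(log_probs) / (len(log_probs) * math.log(2))

def route_by_surprise(surprise):
    """Baseline: hand the input to the least-surprised specialist."""
    return min(surprise, key=surprise.get)

def route_linear(surprise, weights, bias):
    """Toy learned router (assumption): linear logits over the surprise vector."""
    x = [surprise[d] for d in DOMAINS]
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    best = max(range(len(DOMAINS)), key=logits.__getitem__)
    return DOMAINS[best]

# Illustrative numbers only, not measured values.
surprise = {"eukaryote": 1.95, "prokaryote": 1.62, "mRNA": 1.88, "mRNA-splice": 1.91}
chosen = route_by_surprise(surprise)
```

With a negated identity matrix as `weights` and a zero `bias`, `route_linear` reduces to the least-surprise baseline; a trained router would learn those weights from held-out routing labels instead.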

🛠️ How the models are built

Each specialist is grown by interleaving two steps, per domain:

Continued pretraining — next-token training on only that domain's data, so the specialist becomes genuinely crisp on its home distribution (and the router can tell the links apart).
Per-domain distillation — the specialist is distilled from a larger teacher foundation model restricted to its own domain (soft-target KD, plus a factorized per-nucleotide variant where the teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic clone — the specialization is what makes routing work.

We iterate those two steps until each link is as strong as its capacity allows, then train the router. In lineage, this is a Cluster-Branch-Train-Merge (c-BTM) style mixture of domain experts (independent experts plus perplexity-aware routing), extended with iterative distillation from a larger teacher.
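
To make the soft-target KD step concrete, here is a minimal sketch of a commonly used distillation loss: a temperature-softened KL term against the teacher, mixed with ordinary cross-entropy on the true token. The temperature, the mixing weight `alpha`, the T² scaling, and the function names are assumptions for illustration; the source does not state the project's actual loss, and the factorized per-nucleotide variant is not shown.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, target, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, target)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[target])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

When student and teacher logits agree, the KL term vanishes and only the cross-entropy term remains, which is one way to sanity-check an implementation.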

🧬 Current project — DaisyChain Genomics

Four DNA/RNA specialists (eukaryote · prokaryote · mRNA · mRNA-splice, ~74M each, ≈295M total — under 500M), each distilled per-domain from Carbon-500M behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial) maps one-to-one onto our four specialists.
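
The parameter budget above can be checked with a few lines of arithmetic. This sketch uses the rounded "~74M" figure, so the total comes out at 296M rather than the ≈295M quoted; the variable names are illustrative and the router's own parameters are not counted.

```python
SPECIALIST_PARAMS_M = 74   # nominal "~74M" per specialist (rounded)
NUM_SPECIALISTS = 4
TEACHER_PARAMS_M = 500     # Carbon-500M

total_params_m = SPECIALIST_PARAMS_M * NUM_SPECIALISTS  # all links stored
active_params_m = SPECIALIST_PARAMS_M                   # one link runs per query
active_ratio = TEACHER_PARAMS_M / active_params_m       # teacher vs. active params

print(f"total ~{total_params_m}M, active ~{active_params_m}M, "
      f"ratio ~{active_ratio:.1f}x")
```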

Where it actually stands (measured on Carbon's own base-pair / FNS metric)

We score likelihood the way Carbon does: each 6-mer distribution is marginalized into six per-base distributions, and the score is the mean per-base log-probability (score_sequence). Our implementation reproduces Carbon's compute_bp_probs to within 6e-08, so the comparison is apples-to-apples.

| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| Routing accuracy (held-out) | 99.8% | n/a |
| Likelihood, base-pair bits/base (↓) | 1.876 | 1.787 |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |

Honest standing: DaisyChain is about 0.089 bits/base behind Carbon (1.876 vs. 1.787), and no single domain beats Carbon yet. The gap is concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest. Note that Carbon-500M is itself a draft model, explicitly "not designed to be competitive on downstream benchmarks", so it is a fair, achievable target, unlike the 3B/8B flagships.
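
The per-base scoring described above can be sketched as follows. This is one plausible reading of "marginalizing each 6-mer into six per-base distributions", written for illustration; it is not Carbon's compute_bp_probs or score_sequence, and the function names and the ACGT ordering of the 4096 6-mers are assumptions.

```python
import math
from itertools import product

BASES = "ACGT"
KMERS = ["".join(p) for p in product(BASES, repeat=6)]  # 4**6 = 4096 entries

def per_base_marginals(kmer_probs):
    """Collapse a 4096-way 6-mer distribution into six 4-way distributions."""
    marginals = [dict.fromkeys(BASES, 0.0) for _ in range(6)]
    for kmer, p in zip(KMERS, kmer_probs):
        for pos, base in enumerate(kmer):
            marginals[pos][base] += p
    return marginals

def mean_per_base_bits(kmer_probs, true_kmer):
    """Mean surprise (bits/base) of the true 6-mer under the per-base marginals."""
    marginals = per_base_marginals(kmer_probs)
    bits = [-math.log2(marginals[pos][base])
            for pos, base in enumerate(true_kmer)]
    return sum(bits) / len(bits)

# A uniform 6-mer distribution gives uniform per-base marginals.
uniform = [1.0 / len(KMERS)] * len(KMERS)
uniform_bits = mean_per_base_bits(uniform, "ACGTAC")
```

Scoring on the marginals rather than on the joint 6-mer probability is what makes this a per-base measure: a prediction that gets five of six bases right is penalized only at the position it misses.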

📦 Model: DaisyChainAI/daisychain-genomics
🎮 Live demo: Daisychain-Genomics-Demo — paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.

📓 Build log — what we got right, and what we got wrong

We build in the open, mistakes included. This project's honest history:

What worked

Per-domain specialists + a learned router reached 99.8% held-out routing — one ~74M model active per query.
Snapshot-then-pick-best distillation: snapshot every few thousand steps, deploy the snapshot with the best held-out score, never the last one. This caught over-distillation (models that memorize the distill cache and regress on held-out data) and made every round regression-guarded.
Re-fitting the router after every specialist swap. Router features are coupled to the checkpoints; skipping the re-fit once produced a fake "regression" that was pure routing drift.
FNS per-base distillation targets — distilling the teacher's base-pair marginals, not the 4096-way 6-mer distribution, gave the small students a tractable, base-pair-correct objective.
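
The snapshot-then-pick-best rule from the list above can be sketched in a few lines. This is a minimal sketch, assuming each snapshot records a held-out bits/base score where lower is better; the dictionary layout, function names, and all numbers are invented for illustration.

```python
def pick_best_snapshot(snapshots):
    """Return the snapshot with the lowest held-out score, not the last one."""
    return min(snapshots, key=lambda snap: snap["held_out_bits_per_base"])

def should_deploy(candidate, deployed):
    """Regression guard: swap only if the candidate beats what is deployed."""
    return (candidate["held_out_bits_per_base"]
            < deployed["held_out_bits_per_base"])

# Illustrative values only: the last snapshot has regressed on held-out data.
snapshots = [
    {"step": 2000, "held_out_bits_per_base": 1.93},
    {"step": 4000, "held_out_bits_per_base": 1.90},
    {"step": 6000, "held_out_bits_per_base": 1.92},
]
best = pick_best_snapshot(snapshots)
```

Per the point about routing drift, a deployment that passes this guard would still be followed by a router re-fit before its numbers are compared with earlier rounds.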

What we got wrong (and corrected)

We reported the wrong metric for days. We measured likelihood as 6-mer cross-entropy (a softer proxy) instead of Carbon's base-pair (FNS) score. The proxy flattered us: it showed a gap of only about 0.043 and even suggested "splice beats Carbon." On Carbon's actual metric the gap is 0.089 bits/base and no domain is ahead. We re-baselined the entire project history on the real metric.
We measured sequence recovery with the wrong decoder (6-mer argmax) instead of Carbon's FNS base-level argmax. Re-measuring with their decoder changed the numbers (and actually raised our bacteria recovery).
An early eval had a frame-alignment bug — feeding a context length not divisible by 6 knocked our 6-mer model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
Decoding took several wrong turns before matching Carbon: greedy with no repetition control (collapsed to homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual base-pair FNS decoder (top-p at the 6-mer level → per-base selection).
One training round improved the proxy while regressing the real metric (an early mRNA distill-only pass) — invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
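
The frame-alignment fix mentioned in the list above amounts to making the context length a multiple of 6 before tokenizing. The sketch below shows one way to do that; trimming leading bases (rather than trailing ones, or padding) is an assumption, as are the function names, since the source only says the context was aligned to the 6-mer grid.

```python
K = 6  # 6-mer tokens

def align_to_kmer_grid(context, k=K):
    """Drop leading bases so len(context) is a multiple of k (assumed policy)."""
    remainder = len(context) % k
    return context[remainder:]

def to_kmers(context, k=K):
    """Split an aligned context into non-overlapping k-mers."""
    if len(context) % k != 0:
        raise ValueError(f"context length {len(context)} is not a multiple of {k}")
    return [context[i:i + k] for i in range(0, len(context), k)]

aligned = align_to_kmer_grid("ACGTACGTAC")  # 10 bases, 4 dropped from the front
tokens = to_kmers(aligned)
```

Raising on a misaligned length in `to_kmers`, instead of silently truncating, turns an out-of-phase evaluation into an immediate failure rather than an implausible recovery number.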

The lesson: measure the way the baseline measures, or you aren't comparing anything. A stricter, honest evaluation didn't sink the project — it pointed to exactly which domains to attack and which "wins" were illusions.

More links on the chain — and more chains — coming. 🌼

Citation

If you use these models, please cite the author — Dean Byrne (Quazim0t0):

@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
                  from Carbon-500M behind a learned router}
}