---
title: README
emoji: 📈
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
---

# 🌼 DaisyChainAI

We build capable systems by *daisy-chaining* a handful of
small, sharp specialists behind a learned router — instead of training one giant model to do
everything. Each specialist is cheap, swappable, and crisp on its own domain; chained together,
they behave like one model at a fraction of the active compute.

---

## 🔗 What "daisy-chaining" means

A **daisy chain** links independent units in series so a signal can flow from one to the next,
each unit handling what it's good at and passing the rest along. That's exactly how our systems work:

- **Each link is one small specialist** — a dense ~74M model trained on a *single* domain. It is
  excellent at its own data and (deliberately) surprised by everything else.
- **The router is the connector between links.** When an input arrives, every specialist reports how
  *surprised* it is (bits/base) and exposes its hidden state, and a tiny learned router hands the work
  to the link that's most at home with it.
- **The chain grows link by link.** Because the specialists are trained *separately*, you can chain a
  new domain on without retraining the others — add a link, extend the router, done.
- **One link runs per query.** Only the routed specialist computes, so a chain of four ~74M experts
  activates ~74M parameters per token, roughly **7× fewer active parameters** than a 500M monolith of comparable scope (500 / 74 ≈ 6.8).
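
The routing idea above can be sketched in a few lines. This is a deliberately simplified stand-in, assuming the router only picks the specialist with the lowest bits/base; the actual router is learned and also sees hidden states. The function names and the example scores are hypothetical.

```python
from typing import Dict


def route_by_surprise(bits_per_base: Dict[str, float]) -> str:
    """Return the name of the specialist that is least surprised by the input."""
    if not bits_per_base:
        raise ValueError("need at least one specialist")
    return min(bits_per_base, key=bits_per_base.get)


def active_param_ratio(monolith_params: float, specialist_params: float) -> float:
    """How many times more parameters a monolith activates per token."""
    return monolith_params / specialist_params


# Hypothetical surprise scores for one input (lower means more at home).
scores = {"eukaryote": 1.92, "prokaryote": 1.71, "mRNA": 2.05, "mRNA-splice": 2.10}
chosen = route_by_surprise(scores)
ratio = active_param_ratio(500e6, 74e6)  # about 6.76, i.e. roughly 7x
```

Only `chosen` would then run the forward pass for generation; the other links stay idle for that query.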

So "DaisyChain" is both the brand and the mechanism: **a chain of specialists, connected by routing,
that you extend one flower at a time.**

---

## 🛠️ How the models are built

Each specialist is grown by **interleaving two steps**, per domain:

1. **Continued pretraining** — next-token training on *only* that domain's data, so the specialist
   becomes genuinely crisp on its home distribution (and the router can tell the links apart).
2. **Per-domain distillation** — the specialist is distilled from a larger teacher foundation model
   *restricted to its own domain* (soft-target KD, plus a factorized per-nucleotide variant where the
   teacher supports it). It learns the teacher's behavior on its slice without ever becoming a generic
   clone — the specialization is what makes routing work.
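
For readers unfamiliar with soft-target KD, here is a minimal sketch of the loss on a single position. It assumes the common formulation (KL divergence between temperature-softened teacher and student distributions, scaled by T²); the temperature value and function names are illustrative and not taken from the DaisyChain training code.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,  # assumed value, for illustration only
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# The loss is zero when the student matches the teacher and positive otherwise.
same = soft_target_kd_loss([1.0, 0.5, -0.2], [1.0, 0.5, -0.2])
different = soft_target_kd_loss([0.0, 0.0, 0.0], [2.0, 0.0, -1.0])
```

Restricting the batches fed to this loss to one domain's data is what makes the distillation "per-domain".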

We iterate those two steps until each link is as strong as its capacity allows, then train the
**router**. By lineage, this is a **cluster Branch-Train-Merge (cBTM)** mixture of domain experts
(independent experts plus perplexity-aware routing) with iterative distillation from a larger teacher.

---

## 🧬 Current project — DaisyChain Genomics

Four DNA/RNA specialists (**eukaryote · prokaryote · mRNA · mRNA-splice**, ~74M each, **≈295M total —
under 500M**), each distilled per-domain from **[Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**
behind a learned router. Carbon's domain mixture (50% eukaryotic / 25% mRNA / 10% splice / 15% bacterial)
maps one-to-one onto our four specialists.

### Where it actually stands (measured on Carbon's own base-pair / FNS metric)

We score likelihood the way Carbon does: each 6-mer distribution is marginalized into six per-base distributions,
and the score is the mean per-base log-prob (`score_sequence`). Our implementation matches Carbon's `compute_bp_probs`
to within **6e-08**, so the comparison is apples-to-apples.
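
The marginalization step can be illustrated with a small sketch. It is not Carbon's `compute_bp_probs`; it assumes the 6-mer distribution is given as a dictionary keyed by the 6-mer string, and it reports the negative mean log2-probability so the result reads as bits/base. All names are hypothetical.

```python
import itertools
import math
from typing import Dict, List

BASES = "ACGT"
K = 6


def marginalize_kmer(kmer_probs: Dict[str, float]) -> List[Dict[str, float]]:
    """Collapse a distribution over 6-mers into six per-base distributions."""
    per_base = [{b: 0.0 for b in BASES} for _ in range(K)]
    for kmer, p in kmer_probs.items():
        for pos, base in enumerate(kmer):
            per_base[pos][base] += p
    return per_base


def mean_bits_per_base(per_base: List[Dict[str, float]], true_kmer: str) -> float:
    """Negative mean log2-probability of the true bases over the six positions."""
    return -sum(math.log2(per_base[i][b]) for i, b in enumerate(true_kmer)) / K


# A uniform 4096-way distribution carries no information: 2 bits per base.
uniform = {"".join(k): 1.0 / 4 ** K for k in itertools.product(BASES, repeat=K)}
marginals = marginalize_kmer(uniform)
bits = mean_bits_per_base(marginals, "ACGTAC")
```

The uniform case is a useful sanity check: any trained model should score below 2 bits/base on its home domain.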

| Metric | DaisyChain | Carbon-500M |
|---|---|---|
| **Routing accuracy** (held-out) | **100.0%** | — |
| **Likelihood — base-pair bits/base** (↓) | **1.875** | **1.787** |
| Seq-recovery, eukaryote (FNS, ↑) | 31.5% | 38.9% |
| Seq-recovery, bacteria (FNS, ↑) | 40.9% | 54.1% |
| Active params / query | ~74M (one specialist) | 500M |

**Honest standing: ~0.088 bits/base behind (1.875 vs. 1.787), and no single domain beats Carbon yet.** The gap is
concentrated in mRNA and bacterial DNA (Carbon's strongest domains); eukaryote and splice are closest.
Note that Carbon-500M is itself a *draft model*, explicitly "not designed to be competitive on downstream
benchmarks", so it is a fair, achievable target; we are not comparing against the 3B/8B flagships.

- 📦 **Model:** [`DaisyChainAI/daisychain-genomics`](https://huggingface.co/DaisyChainAI/daisychain-genomics)
- 🎮 **Live demo:** [`Daisychain-Genomics-Demo`](https://huggingface.co/spaces/DaisyChainAI/Daisychain-Genomics-Demo) — paste DNA, watch the chain light up specialist-by-specialist and route in real time, then generate with Carbon's base-pair (FNS) decoder.

---

## 📓 Build log — what we got right, and what we got wrong

We build in the open, mistakes included. This project's honest history:

**What worked**
- **Per-domain specialists + a learned router** reached **100%** held-out routing — one ~74M model active per query.
- **Snapshot-then-pick-best** distillation: snapshot every few thousand steps, deploy the snapshot with the
  best *held-out* score, never the last one. This caught over-distillation (models that memorize the distill
  cache and regress on held-out data) and made every round regression-guarded.
- **Re-fitting the router after every specialist swap.** Router features are coupled to the checkpoints;
  skipping the re-fit once produced a fake "regression" that was pure routing drift.
- **FNS per-base distillation targets** — distilling the teacher's *base-pair* marginals, not the 4096-way
  6-mer distribution, gave the small students a tractable, base-pair-correct objective.
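
The snapshot-then-pick-best rule is simple enough to sketch. This assumes held-out bits/base (lower is better) is the selection score and that a snapshot is only deployed if it beats the currently deployed model; the function name and the numbers are hypothetical.

```python
from typing import Dict, Optional


def pick_best_snapshot(
    held_out_bits: Dict[int, float],
    deployed_bits: Optional[float] = None,
) -> Optional[int]:
    """Return the step with the lowest held-out bits/base.

    Returns None when there are no snapshots or when no snapshot
    improves on the currently deployed model (regression guard).
    """
    if not held_out_bits:
        return None
    best_step = min(held_out_bits, key=held_out_bits.get)
    if deployed_bits is not None and held_out_bits[best_step] >= deployed_bits:
        return None
    return best_step


# Hypothetical round: held-out score improves, then over-distillation sets in.
history = {2000: 1.93, 4000: 1.89, 6000: 1.88, 8000: 1.91}
best = pick_best_snapshot(history)                          # step 6000, not the last one
guarded = pick_best_snapshot(history, deployed_bits=1.87)   # nothing beats the deployed model
```

After a swap chosen this way, the router would be re-fit against the new checkpoint, as described in the bullet on routing drift.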

**What we got wrong (and corrected)**
- **We reported the wrong metric for days.** We measured likelihood as **6-mer cross-entropy** (a softer proxy)
  instead of Carbon's **base-pair (FNS)** score. The proxy flattered us: it showed ~+0.043 behind and even
  "splice beats Carbon." On Carbon's actual metric the gap is **+0.088 and no domain is ahead.** We re-baselined
  the entire project history on the real metric.
- **We measured sequence recovery with the wrong decoder** (6-mer argmax) instead of Carbon's **FNS base-level
  argmax**. Re-measuring with their decoder changed the numbers (and actually *raised* our bacteria recovery).
- **An early eval had a frame-alignment bug** — feeding a context length not divisible by 6 knocked our 6-mer
  model out of phase and produced an impossible near-zero recovery. Fixed by aligning context to the 6-mer grid.
- **Decoding took several wrong turns** before matching Carbon: greedy with no repetition control (collapsed to
  homopolymers), then top-k sampling (trapped on low-complexity GC/AT loops), before adopting Carbon's actual
  **base-pair FNS decoder** (top-p at the 6-mer level → per-base selection).
- **One training round improved the proxy while regressing the real metric** (an early mRNA distill-only pass)
  — invisible on 6-mer CE, obvious on base-pair. A later base+distill round fixed it.
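
The frame-alignment fix can be sketched as follows. This assumes alignment is done by dropping bases from the left so the context length becomes a multiple of 6; whether the actual fix trims left, trims right, or pads is not stated here, and the function names are hypothetical.

```python
from typing import List

K = 6


def align_to_kmer_grid(context: str, k: int = K) -> str:
    """Drop leading bases so the context length is a multiple of k.

    Trimming from the left (an assumption) keeps the most recent bases
    next to the position where generation continues.
    """
    remainder = len(context) % k
    return context[remainder:]


def to_kmers(context: str, k: int = K) -> List[str]:
    """Split an aligned context into non-overlapping k-mers."""
    aligned = align_to_kmer_grid(context, k)
    return [aligned[i:i + k] for i in range(0, len(aligned), k)]


# A 14-base context is out of phase; two bases are dropped to reach 12.
kmers = to_kmers("ACGTACGTACGTAC")
```

Without the trim, the final token would be a partial 6-mer and every later prediction would be shifted off the grid the model was trained on.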

**The lesson:** *measure the way the baseline measures, or you aren't comparing anything.* A stricter, honest
evaluation didn't sink the project — it pointed to exactly which domains to attack and which "wins" were illusions.

More links on the chain — and more chains — coming. 🌼

## Citation

**If you use these models, please cite the author — Dean Byrne (Quazim0t0):**

```bibtex
@misc{byrne2026daisychain,
  title        = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
  author       = {Byrne, Dean},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
  note         = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
                  from Carbon-500M behind a learned router}
}
```