---
license: apache-2.0
tags:
- biology
- genomics
- dna
- mixture-of-experts
- modular
library_name: pytorch
pipeline_tag: text-generation
---
# 🌼 DaisyChain Genomics
A **modular genomic mind**: four dense ~74M DNA/RNA specialists (≈295M params total,
**under Carbon-500M**) behind a learned router. Instead of one monolithic foundation
model, DaisyChain trains one crisp specialist per biological domain, each
**distilled per-domain from [Carbon-500M](https://huggingface.co/HuggingFaceBio/Carbon-500M)**,
and routes each sequence to its home specialist.
| Specialist | Domain | Params |
|---|---|---|
| `eukaryote` | Eukaryotic genomic DNA | ~74M |
| `prokaryote` | Bacterial / prokaryotic DNA | ~74M |
| `mrna` | Mature mRNA (coding transcript) | ~74M |
| `mrna_splice` | Pre-mRNA / splice-site regions | ~74M |
## The router
A small learned router reads each specialist's **surprise** (bits/base) and a PCA of its
**hidden state**, then predicts the home domain, recovering the bias-corrections a plain
argmin-perplexity rule misses. Held-out routing accuracy: **100.0%** (vs 87.5% argmin).
Only one ~74M specialist runs per query, so inference is ~7× cheaper per token than the
500M monolith.
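
The gap between plain argmin routing and a bias-corrected rule can be sketched in a few lines. This is a toy illustration, not the shipped router: the real router is learned and also reads a PCA of hidden states, whereas the `BASELINE` offsets and the bits/base readings below are hypothetical numbers chosen only to show the failure mode.

```python
# Toy illustration of why argmin-perplexity routing can misroute.
# BASELINE and the example readings are hypothetical, not measured values.

# Assumed typical bits/base each specialist reports on its own home domain.
BASELINE = {
    "eukaryote": 1.92,
    "prokaryote": 1.87,
    "mrna": 1.78,
    "mrna_splice": 1.87,
}


def route_argmin(surprise):
    """Pick the specialist with the lowest raw bits/base."""
    return min(surprise, key=surprise.get)


def route_bias_corrected(surprise, baseline):
    """Pick the specialist whose bits/base is lowest relative to its own baseline."""
    return min(surprise, key=lambda domain: surprise[domain] - baseline[domain])


# Hypothetical readings for one pre-mRNA sequence: the mrna specialist scores
# lowest in absolute terms simply because its domain is easier to predict.
reading = {"eukaryote": 1.95, "prokaryote": 2.10, "mrna": 1.80, "mrna_splice": 1.82}

print(route_argmin(reading))
print(route_bias_corrected(reading, BASELINE))
```

In this made-up case the raw argmin picks `mrna`, while subtracting each specialist's baseline picks `mrna_splice`. A learned router can absorb this kind of per-specialist offset from data instead of relying on a hand-set table.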
## How each specialist is built
Each specialist is trained by interleaving **continued pretraining** (next-token CE on its
domain) with **offline knowledge distillation** from Carbon-500M (soft-target + a factorized
per-nucleotide variant via Carbon's FNS branch), i.e. cBTM-style domain experts, iterated per expert.
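
The two objectives can be sketched as follows. This is a minimal sketch under stated assumptions, not the training code: the temperature, the `kd_every` interleaving ratio, and all function names are hypothetical, the soft-target loss follows the standard Hinton et al. formulation, and the factorized per-nucleotide (FNS) variant is not shown.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def ce_loss(student_logits, target_index):
    """Next-token cross-entropy (in nats) against the observed token."""
    return -log_softmax(student_logits)[target_index]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: T^2 * KL(teacher || student) at temperature T."""
    log_teacher = log_softmax(teacher_logits, temperature)
    log_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_teacher, log_student))
    return temperature ** 2 * kl


def interleaved_loss(step, student_logits, target_index, teacher_logits, kd_every=2):
    """Alternate objectives: every kd_every-th step distills, the others use CE."""
    if step % kd_every == 0:
        return kd_loss(student_logits, teacher_logits)
    return ce_loss(student_logits, target_index)


# "Offline" here means the teacher logits were computed and stored beforehand.
student = [2.0, 0.5, 0.1, -1.0]
stored_teacher = [2.5, 0.2, 0.0, -1.5]
for step in range(4):
    print(step, interleaved_loss(step, student, 0, stored_teacher))
```

The strict alternation is only one possible schedule; the source does not state the actual ratio of pretraining steps to distillation steps.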
## Capability vs Carbon-500M (the fair baseline)
| metric | DaisyChain | Carbon-500M |
|---|---|---|
| likelihood, bits/base, **base-pair (FNS)** (↓) | 1.88 | 1.79 |
| seq-recovery eukaryote, FNS base-level (↑) | 31.5% | 38.9% |
| seq-recovery bacteria, FNS base-level (↑) | 40.9% | 54.1% |
*Likelihood is the base-pair (FNS) score: Carbon's own `score_sequence` / `compute_bp_probs`,
verified to 6e-08. The progress log below tracks this same metric; the earlier 6-mer joint CE, a softer proxy, is kept in a collapsed table for reference.*
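
For readers unfamiliar with the unit, bits/base is the average negative log2-probability the model assigns to each observed base. The helper below is a minimal sketch of that definition; it does not call Carbon's `score_sequence`, and the function names are hypothetical.

```python
import math


def bits_per_base(prob_of_true_base):
    """Mean negative log2-probability assigned to each observed base."""
    if not prob_of_true_base:
        raise ValueError("need at least one base")
    return -sum(math.log2(p) for p in prob_of_true_base) / len(prob_of_true_base)


def nats_to_bits(ce_nats):
    """Convert a cross-entropy measured in nats to bits."""
    return ce_nats / math.log(2)


# A model that guesses uniformly over A/C/G/T scores exactly 2 bits/base.
print(bits_per_base([0.25] * 8))
```

A uniform guess over four bases gives 2.0 bits/base, so that value is the reference ceiling against which the scores in the table can be read.
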
DaisyChain is behind a 500M/1T-token monolith but within striking distance at ~15% of the active
compute, and the gap keeps closing with more per-domain training (work in progress).
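
The ~15% and ~7× figures can be checked with back-of-envelope arithmetic. This sketch assumes that parameter count is a fair proxy for per-token compute and ignores the cost of routing; both are simplifying assumptions, not statements from the source.

```python
# Rounded parameter counts in millions, as quoted in this README.
SPECIALIST_PARAMS_M = 74
MONOLITH_PARAMS_M = 500
N_SPECIALISTS = 4

total_params_m = N_SPECIALISTS * SPECIALIST_PARAMS_M       # all four specialists
active_fraction = SPECIALIST_PARAMS_M / MONOLITH_PARAMS_M  # one specialist vs monolith
cost_ratio = MONOLITH_PARAMS_M / SPECIALIST_PARAMS_M       # monolith vs one specialist

print(f"total: {total_params_m}M")
print(f"active fraction: {active_fraction:.1%}")
print(f"cost ratio: {cost_ratio:.2f}x")
```

The total comes out at 296M rather than ≈295M only because ~74M is itself a rounded figure.
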
### Progress log: closing the gap to Carbon-500M (BASE-PAIR / FNS metric)
Re-baselined on Carbon's **base-pair (FNS)** metric (`score_sequence`, verified to 6e-08), with each
round's full 4-specialist set re-scored. **trailing** = `mean ours - mean Carbon` (Carbon mean 1.7870).
The final row (1.8622 / +0.0752) matches the independently-verified current standing.
| date | update | euk | prok | mrna | splice | mean | **trailing** |
|---|---|---|---|---|---|---|---|
| 2026-06-22 | baseline (round-1 distill + router) | 1.965 | 1.994 | 1.910 | 1.935 | 1.9510 | +0.1640 |
| 2026-06-23 | mrna 12k-distill | 1.965 | 1.994 | 1.927 | 1.935 | 1.9552 | +0.1682 |
| 2026-06-23 | prokaryote round 1 | 1.965 | 1.918 | 1.927 | 1.935 | 1.9363 | +0.1493 |
| 2026-06-24 | eukaryote | 1.928 | 1.918 | 1.927 | 1.935 | 1.9272 | +0.1403 |
| 2026-06-25 | mrna | 1.928 | 1.918 | 1.788 | 1.935 | 1.8924 | +0.1054 |
| 2026-06-25 | mrna_splice | 1.928 | 1.918 | 1.788 | 1.873 | 1.8768 | +0.0898 |
| 2026-06-26 | prokaryote round 2 | 1.928 | 1.914 | 1.788 | 1.873 | 1.8758 | +0.0889 |
| 2026-06-27 | eukaryote round 2 (routing 100%) | 1.924 | 1.914 | 1.788 | 1.873 | 1.8747 | +0.0878 |
| 2026-06-29 | Muon passes: mrna + prokaryote | 1.924 | 1.868 | 1.784 | 1.873 | **1.8622** | **+0.0752** |
(The mrna 12k-distill round *worsened* base-pair likelihood (+0.1640 → +0.1682) even though it improved
6-mer CE; the later base+distill mrna round fixed it. The soft 6-mer metric hid that.)
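
The **mean** and **trailing** columns can be recomputed from the per-domain scores. The snippet below is a small check using two rows copied from the table; the function name is hypothetical. Because the per-domain values are rounded to three decimals, a recomputed mean may differ from the table in the fourth decimal.

```python
CARBON_MEAN = 1.7870  # Carbon-500M mean bits/base, as stated above

# (euk, prok, mrna, splice) copied from the first and last table rows.
ROUNDS = {
    "2026-06-22 baseline": (1.965, 1.994, 1.910, 1.935),
    "2026-06-29 Muon passes": (1.924, 1.868, 1.784, 1.873),
}


def mean_and_trailing(scores, carbon_mean=CARBON_MEAN):
    """Return (mean over specialists, mean minus Carbon's mean)."""
    mean = sum(scores) / len(scores)
    return mean, mean - carbon_mean


for label, scores in ROUNDS.items():
    mean, trailing = mean_and_trailing(scores)
    print(f"{label}: mean={mean:.4f} trailing={trailing:+.4f}")
```
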
<details><summary>Earlier 6-mer joint CE trajectory (softer proxy, kept for reference)</summary>

| date | update | mean DaisyChain (6-mer CE) | mean Carbon-500M (6-mer CE) | **trailing** |
|---|---|---|---|---|
| 2026-06-22 | baseline (94.8% routing) | 1.8644 | 1.7502 | +0.1142 |
| 2026-06-23 | mrna 12k-distill (95.7%) | 1.8599 | 1.7502 | +0.1096 |
| 2026-06-23 | prokaryote round 1 (96.2%) | 1.8528 | 1.7502 | +0.1026 |
| 2026-06-24 | eukaryote (98.0%) | 1.8413 | 1.7502 | +0.0911 |
| 2026-06-25 | mrna (98.3%) | 1.8075 | 1.7502 | +0.0573 |
| 2026-06-25 | mrna_splice (99.8%) | 1.7959 | 1.7502 | +0.0457 |
| 2026-06-26 | prokaryote round 2 (99.8%) | 1.7929 | 1.7502 | +0.0427 |
</details>
*The main progress-log table is the **base-pair (FNS)** trajectory: Carbon's actual metric, fully re-baselined.
Honest current standing: mean **1.8622 vs 1.7870 (+0.0752)**, base-pair **wins ~36/100**, **no domain ahead**
(splice's earlier "beats Carbon" was only on the 6-mer proxy). Recovery uses Carbon's FNS base-level argmax
decoder (per-base accuracy, next-30bp, n=50, ctx=1536). These are our measurements of Carbon-500M (a draft model,
explicitly not benchmark-competitive), not Carbon's published 3B figures.*
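
Per-base recovery, as used in the seq-recovery rows, is the fraction of predicted bases that match the reference continuation. The sketch below shows only the scoring step on toy strings. The decoder itself is not reproduced, the function names are hypothetical, and averaging per sequence (rather than pooling all bases) is an assumption.

```python
def per_base_recovery(predicted, reference):
    """Fraction of positions where the predicted base equals the reference base."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def mean_recovery(pairs):
    """Average per-base recovery over (predicted, reference) pairs."""
    return sum(per_base_recovery(p, r) for p, r in pairs) / len(pairs)


# Toy 10-base continuations; the reported protocol scores the next 30 bp for n=50.
toy_pairs = [
    ("ACGTACGTAC", "ACGTTCGTAA"),  # 8 of 10 bases match
    ("GGGGGGGGGG", "GGGGGCCCCC"),  # 5 of 10 bases match
]
print(mean_recovery(toy_pairs))
```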
## Usage
```python
from daisychain import DaisyChain
dc = DaisyChain(root=".", device="cpu")
home, bits_per_base = dc.route("ACGTACGT...") # which domain?
print(home, bits_per_base)
print(dc.generate(home, length=180)) # sample from the home specialist
```
Files: `daisychain.py` (inference), `model.py` / `specialist_presets.py` /
`spike_tokenizer.py` / `registry.py` (architecture), `tokenizer.json`,
`<domain>/model.safetensors` (the 4 specialists), `router2.pt` (router).
> Interactive demo: the **DaisyChain Space** routes DNA in real time.
## Citation
**If you use these models, please cite the author, Dean Byrne (Quazim0t0):**
```bibtex
@misc{byrne2026daisychain,
title = {DaisyChain Genomics: A Modular Mixture of Per-Domain Distilled Genomic Specialists},
author = {Byrne, Dean},
year = {2026},
howpublished = {\url{https://huggingface.co/DaisyChainAI/daisychain-genomics}},
note = {DaisyChainAI (Quazim0t0). Four ~74M DNA/RNA specialists distilled per-domain
from Carbon-500M behind a learned router}
}
```
### Built on
DaisyChain stands on these works:
```bibtex
@misc{carbon2025,
title = {Carbon: Genomic Foundation Models},
author = {{HuggingFaceBio}},
year = {2025},
howpublished = {\url{https://huggingface.co/HuggingFaceBio/Carbon-500M}}
}
@article{li2022branchtrainmerge,
title = {Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models},
author = {Li, Margaret and Gururangan, Suchin and Dettmers, Tim and Lewis, Mike and
Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
journal = {arXiv preprint arXiv:2208.03306},
year = {2022}
}
@article{gururangan2023cbtm,
title = {Scaling Expert Language Models with Unsupervised Domain Discovery},
author = {Gururangan, Suchin and Li, Margaret and Lewis, Mike and Shi, Weijia and
Althoff, Tim and Smith, Noah A. and Zettlemoyer, Luke},
journal = {arXiv preprint arXiv:2303.14177},
year = {2023}
}
@article{sukhbaatar2024btx,
title = {Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM},
author = {Sukhbaatar, Sainbayar and Golovneva, Olga and Sharma, Vasu and Xu, Hu and
Lin, Xi Victoria and Roziere, Baptiste and Kahn, Jacob and Li, Daniel and
Yih, Wen-tau and Weston, Jason and Li, Xian},
journal = {arXiv preprint arXiv:2403.07816},
year = {2024}
}
@article{hinton2015distilling,
title = {Distilling the Knowledge in a Neural Network},
author = {Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff},
journal = {arXiv preprint arXiv:1503.02531},
year = {2015}
}
@inproceedings{furlanello2018born,
title = {Born-Again Neural Networks},
author = {Furlanello, Tommaso and Lipton, Zachary C. and Tschannen, Michael and
Itti, Laurent and Anandkumar, Anima},
booktitle = {ICML},
year = {2018}
}
@inproceedings{gururangan2020dapt,
title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
author = {Gururangan, Suchin and Marasovi{\'c}, Ana and Swayamdipta, Swabha and
Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A.},
booktitle = {ACL},
year = {2020}
}
```