Intro polish: pareto chart in release rail + rewritten DNA Lab / Recipe ledes
- Release hero: move pareto throughput-vs-win-rate figure inside the green
left rail so it visually belongs to the announcement; add hairline border
and 640px max-width.
- DNA Lab tab-lede: lead with what the model is + invite to explore, drop
the "no curriculum" framing, soften the supervision claim (acknowledges
species/biotype tags).
- Recipe tab-lede: promote the curated data mixture to a third pillar
alongside the 6-mer tokenizer and FNS loss.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- assets/styles/layout.css +23 -0
- demo.html +18 -12
- img/pareto.png +3 -0
assets/styles/layout.css
CHANGED
|
@@ -48,6 +48,29 @@
|
|
| 48 |
color: #2d2d2a;
|
| 49 |
max-width: 760px;
|
| 50 |
}
|
| 51 |
/* Secondary "navigator" paragraph that follows the lede. Drops a step
|
| 52 |
in size + weight + saturation so the eye reads it as a follow-up /
|
| 53 |
table of contents rather than another full lede paragraph; the inline
|
|
|
|
| 48 |
color: #2d2d2a;
|
| 49 |
max-width: 760px;
|
| 50 |
}
|
| 51 |
+
/* Release-hero figure (pareto frontier). Sits below the lede rail,
|
| 52 |
+
left-aligned with the text column so the figure feels anchored to the
|
| 53 |
+
announcement rather than floating. Caption mirrors the secondary-note
|
| 54 |
+
typography for visual continuity. */
|
| 55 |
+
.tab-lede__figure {
|
| 56 |
+
margin: 28px 0 0;
|
| 57 |
+
max-width: 640px;
|
| 58 |
+
padding: 0;
|
| 59 |
+
}
|
| 60 |
+
.tab-lede__figure img {
|
| 61 |
+
display: block;
|
| 62 |
+
width: 100%;
|
| 63 |
+
height: auto;
|
| 64 |
+
border: 1px solid var(--hairline);
|
| 65 |
+
}
|
| 66 |
+
.tab-lede__figure figcaption {
|
| 67 |
+
margin-top: 10px;
|
| 68 |
+
font-family: "Inter", "Helvetica Neue", sans-serif;
|
| 69 |
+
font-size: 13px;
|
| 70 |
+
line-height: 1.55;
|
| 71 |
+
color: #5b5b56;
|
| 72 |
+
}
|
| 73 |
+
|
| 74 |
/* Secondary "navigator" paragraph that follows the lede. Drops a step
|
| 75 |
in size + weight + saturation so the eye reads it as a follow-up /
|
| 76 |
table of contents rather than another full lede paragraph; the inline
|
demo.html
CHANGED
|
@@ -275,6 +275,10 @@
|
|
| 275 |
shipping with the full training code, the data pipeline, and the model weights.
|
| 276 |
Everything is open source on the Hugging Face Hub.
|
| 277 |
</p>
|
| 278 |
</div>
|
| 279 |
</div>
|
| 280 |
|
|
@@ -545,16 +549,16 @@
|
|
| 545 |
<div class="tab-lede__rail">
|
| 546 |
<span class="tab-lede__eyebrow">Intro</span>
|
| 547 |
<p>
|
| 548 |
-
<strong>Carbon-3B</strong> is a 3-billion-parameter language model for DNA.
|
| 549 |
-
|
| 550 |
-
|
| 551 |
-
|
| 552 |
-
|
| 553 |
</p>
|
| 554 |
<p class="tab-lede__note">
|
| 555 |
-
|
| 556 |
-
|
| 557 |
-
|
| 558 |
anyway: autocomplete a gene <a class="lede-chip" href="#completion">§1</a>, see
|
| 559 |
structure emerge in its confidence <a class="lede-chip" href="#track">§2</a>, score
|
| 560 |
a disease variant against a healthy one <a class="lede-chip" href="#vep">§3</a>,
|
|
@@ -1368,12 +1372,14 @@ for name, ids in zip(species_prefixes, new_ids):
|
|
| 1368 |
<span class="tab-lede__eyebrow">Intro</span>
|
| 1369 |
<p>
|
| 1370 |
Carbon's architecture is deliberately vanilla. What's <em>not</em> vanilla, and what
|
| 1371 |
-
gets the headline numbers in the DNA Lab tab, is
|
| 1372 |
tokenizer</strong> that lets the model see ~6× more genomic context per
|
| 1373 |
-
forward pass,
|
| 1374 |
that gives the model partial credit for near-miss tokens once cross-entropy
|
| 1375 |
-
training starts to wobble
|
| 1376 |
-
|
| 1377 |
</p>
|
| 1378 |
<p class="tab-lede__note">
|
| 1379 |
The sections below walk through each of those choices: how the tokenizer changes
|
|
|
|
| 275 |
shipping with the full training code, the data pipeline, and the model weights.
|
| 276 |
Everything is open source on the Hugging Face Hub.
|
| 277 |
</p>
|
| 278 |
+
<figure class="tab-lede__figure">
|
| 279 |
+
<img src="/img/pareto.png" alt="Throughput vs win rate pareto frontier: Carbon 3B/8B sit at high win rate and ~275× the throughput of Arc Evo2 7B, well ahead of GENERator-v2.">
|
| 280 |
+
<figcaption>Throughput (base pairs per second, log scale) vs win rate across open DNA foundation models. Carbon 3B matches Evo2 7B's win rate at roughly 275× the throughput.</figcaption>
|
| 281 |
+
</figure>
|
| 282 |
</div>
|
| 283 |
</div>
|
| 284 |
|
|
|
|
| 549 |
<div class="tab-lede__rail">
|
| 550 |
<span class="tab-lede__eyebrow">Intro</span>
|
| 551 |
<p>
|
| 552 |
+
<strong>Carbon-3B</strong> is a 3-billion-parameter language model for DNA. It is trained on
|
| 553 |
+
roughly 1 trillion tokens (6 trillion base pairs) of genomic sequence with a simple
|
| 554 |
+
objective: given some DNA, predict what comes next (six bases at a time, autoregressively).
|
| 555 |
+
Even though the objective is simple, the resulting model is versatile. In the DNA Lab you can
|
| 556 |
+
explore all the cool things we can do with a DNA model.
|
| 557 |
</p>
|
| 558 |
<p class="tab-lede__note">
|
| 559 |
+
Carbon-3B was trained without supervision, apart from simple tags for species and gene biotypes.
|
| 560 |
+
It wasn't trained to tell which mutations are pathogenic or how genes differ between species.
|
| 561 |
+
The sections below highlight what it picked up
|
| 562 |
anyway: autocomplete a gene <a class="lede-chip" href="#completion">§1</a>, see
|
| 563 |
structure emerge in its confidence <a class="lede-chip" href="#track">§2</a>, score
|
| 564 |
a disease variant against a healthy one <a class="lede-chip" href="#vep">§3</a>,
|
|
|
|
| 1372 |
<span class="tab-lede__eyebrow">Intro</span>
|
| 1373 |
<p>
|
| 1374 |
Carbon's architecture is deliberately vanilla. What's <em>not</em> vanilla, and what
|
| 1375 |
+
gets the headline numbers in the DNA Lab tab, is three things: a <strong>6-mer
|
| 1376 |
tokenizer</strong> that lets the model see ~6× more genomic context per
|
| 1377 |
+
forward pass, a <strong>Factorized Nucleotide Supervision (FNS)</strong> loss
|
| 1378 |
that gives the model partial credit for near-miss tokens once cross-entropy
|
| 1379 |
+
training starts to wobble, and a <strong>multi-stage curated data mixture</strong>,
|
| 1380 |
+
biased toward functional genomic regions. Everything else (architecture, optimizer)
|
| 1381 |
+
is a standard recipe. The technical report details each choice and the ablations
|
| 1382 |
+
behind it.
|
| 1383 |
</p>
|
| 1384 |
<p class="tab-lede__note">
|
| 1385 |
The sections below walk through each of those choices: how the tokenizer changes
|
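Reviewer note: the ledes above state that the model predicts "six bases at a time" and that roughly 1 trillion tokens correspond to 6 trillion base pairs. The sketch below illustrates that arithmetic with a toy non-overlapping 6-mer split. It is an assumption-laden illustration, not Carbon's actual tokenizer: the function names `kmer_tokenize` and `bases_covered` are hypothetical, and the real tokenizer may treat trailing bases, ambiguous nucleotides, and special tags differently.

```python
from typing import List

K = 6  # k-mer width, taken from the "6-mer tokenizer" wording; all else is assumed


def kmer_tokenize(seq: str, k: int = K) -> List[str]:
    """Split a DNA string into non-overlapping k-mers.

    Assumption: a trailing remainder shorter than k is simply dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % k  # length of the prefix divisible by k
    return [seq[i:i + k] for i in range(0, usable, k)]


def bases_covered(num_tokens: int, k: int = K) -> int:
    """Number of base pairs spanned by num_tokens non-overlapping k-mers."""
    return num_tokens * k


tokens = kmer_tokenize("ATGGCGTACGTTAGCCTA")  # 18 bases
print(tokens)                  # three 6-mer tokens
print(bases_covered(10**12))   # base pairs spanned by 1 trillion tokens
```

Under these assumptions, each token spans six bases, so a fixed token budget covers six times as many base pairs as a single-nucleotide tokenizer would, which is consistent with the "~6× more genomic context per forward pass" wording in the Recipe lede.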
img/pareto.png
ADDED
|
Git LFS Details
|