lvwerra HF Staff Claude Opus 4.7 (1M context) committed on
Commit d835cbf · 1 Parent(s): 616acfe

Intro polish: pareto chart in release rail + rewritten DNA Lab / Recipe ledes


- Release hero: move pareto throughput-vs-win-rate figure inside the green
left rail so it visually belongs to the announcement; add hairline border
and 640px max-width.
- DNA Lab tab-lede: lead with what the model is + invite to explore, drop
the "no curriculum" framing, soften the supervision claim (acknowledges
species/biotype tags).
- Recipe tab-lede: promote the curated data mixture to a third pillar
alongside the 6-mer tokenizer and FNS loss.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. assets/styles/layout.css +23 -0
  2. demo.html +18 -12
  3. img/pareto.png +3 -0
assets/styles/layout.css CHANGED
@@ -48,6 +48,29 @@
  color: #2d2d2a;
  max-width: 760px;
  }
+ /* Release-hero figure (pareto frontier). Sits below the lede rail,
+ left-aligned with the text column so the figure feels anchored to the
+ announcement rather than floating. Caption mirrors the secondary-note
+ typography for visual continuity. */
+ .tab-lede__figure {
+ margin: 28px 0 0;
+ max-width: 640px;
+ padding: 0;
+ }
+ .tab-lede__figure img {
+ display: block;
+ width: 100%;
+ height: auto;
+ border: 1px solid var(--hairline);
+ }
+ .tab-lede__figure figcaption {
+ margin-top: 10px;
+ font-family: "Inter", "Helvetica Neue", sans-serif;
+ font-size: 13px;
+ line-height: 1.55;
+ color: #5b5b56;
+ }
+
  /* Secondary "navigator" paragraph that follows the lede. Drops a step
  in size + weight + saturation so the eye reads it as a follow-up /
  table of contents rather than another full lede paragraph; the inline
demo.html CHANGED
@@ -275,6 +275,10 @@
  shipping with the full training code, the data pipeline, and the model weights.
  Everything is open source on the Hugging Face Hub.
  </p>
+ <figure class="tab-lede__figure">
+ <img src="/img/pareto.png" alt="Throughput vs win rate pareto frontier: Carbon 3B/8B sit at high win rate and ~275× the throughput of Arc Evo2 7B, well ahead of GENERator-v2.">
+ <figcaption>Throughput (base pairs per second, log scale) vs win rate across open DNA foundation models. Carbon 3B matches Evo2 7B's win rate at roughly 275× the throughput.</figcaption>
+ </figure>
  </div>
  </div>
 
@@ -545,16 +549,16 @@
  <div class="tab-lede__rail">
  <span class="tab-lede__eyebrow">Intro</span>
  <p>
- <strong>Carbon-3B</strong> is a 3-billion-parameter language model for DNA. We trained it
- on roughly 1&nbsp;trillion tokens (6&nbsp;trillion base pairs) of genomic sequence with a
- single objective: given some DNA, predict what comes next (six bases at a time,
- autoregressively). That's it: no annotations, no labels, no biology curriculum.
- Just <em>read DNA, predict more DNA</em>.
+ <strong>Carbon-3B</strong> is a 3-billion-parameter language model for DNA. It is trained on
+ roughly 1&nbsp;trillion tokens (6&nbsp;trillion base pairs) of genomic sequence with a simple
+ objective: given some DNA, predict what comes next (six bases at a time, autoregressively).
+ Even though the objective is simple the resulting model is versatile. In the DNA lab you can
+ explore all the cool things we can do with a DNA model.
  </p>
  <p class="tab-lede__note">
- The interesting question is what else falls out of that. We didn't tell Carbon-3B what an
- exon is. We didn't tell it which mutations are pathogenic. We didn't tell it how genes
- differ between species. The sections below are ways to read what it picked up
+ Carbon-3B was trained unsupervised besides some simple tags for species and gene biotypes.
+ It wasn't trained to tell which mutations are pathogenic or how genes differ between species.
+ The sections below highlight what it picked up
  anyway: autocomplete a gene <a class="lede-chip" href="#completion">§1</a>, see
  structure emerge in its confidence <a class="lede-chip" href="#track">§2</a>, score
  a disease variant against a healthy one <a class="lede-chip" href="#vep">§3</a>,
@@ -1368,12 +1372,14 @@ for name, ids in zip(species_prefixes, new_ids):
  <span class="tab-lede__eyebrow">Intro</span>
  <p>
  Carbon's architecture is deliberately vanilla. What's <em>not</em> vanilla, and what
- gets the headline numbers in the DNA Lab tab, is two things: a <strong>6-mer
+ gets the headline numbers in the DNA Lab tab, is three things: a <strong>6-mer
  tokenizer</strong> that lets the model see ~6&times; more genomic context per
- forward pass, and a <strong>Factorized Nucleotide Supervision (FNS)</strong> loss
+ forward pass, a <strong>Factorized Nucleotide Supervision (FNS)</strong> loss
  that gives the model partial credit for near-miss tokens once cross-entropy
- training starts to wobble. Everything else (architecture, data mix, optimizer) is
- standard recipe.
+ training starts to wobble, and a <strong>multi-stage curated data mixture</strong>,
+ biased toward functional genomic regions. Everything else (architecture, optimizer)
+ is standard recipe. The technical report details each choice and the ablations
+ behind it.
  </p>
  <p class="tab-lede__note">
  The sections below walk through each of those choices: how the tokenizer changes
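
Reviewer note on the "6-mer tokenizer" and "1 trillion tokens (6 trillion base pairs)" wording in the ledes above: a minimal Python sketch of what a non-overlapping 6-mer tokenizer could look like, and why it gives roughly 6× more bases per token. This is an illustration only, not Carbon's tokenizer. The names `K`, `BASES`, `VOCAB`, `tokenize`, and `base_pairs_covered` are hypothetical, the id ordering is arbitrary, and dropping leftover bases is an assumption (the real tokenizer may also handle `N` and special tokens, which this sketch does not).

```python
from itertools import product

K = 6            # bases per token (assumed non-overlapping chunks)
BASES = "ACGT"   # hypothetical alphabet; ambiguous bases such as N are not handled

# All 4**6 = 4096 possible 6-mers, each mapped to an integer id.
# The ordering is whatever itertools.product yields, not the real vocabulary.
VOCAB = {"".join(kmer): idx for idx, kmer in enumerate(product(BASES, repeat=K))}


def tokenize(seq: str) -> list[int]:
    """Split seq into non-overlapping 6-mers and map each one to its id.

    Bases left over after the last full 6-mer are dropped (an assumption).
    Raises KeyError if a chunk contains a character outside BASES.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % K
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]


def base_pairs_covered(num_tokens: int) -> int:
    """Number of bases represented by num_tokens 6-mer tokens."""
    return num_tokens * K
```

Under these assumptions, a context window of a fixed number of tokens spans six times as many bases as a single-nucleotide tokenizer would, which is consistent with the "~6×" claim and with 1 trillion tokens corresponding to 6 trillion base pairs.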
img/pareto.png ADDED

Git LFS Details

  • SHA256: 2cee784b63d5f933f8a64ead2c5e7eeecb576c8593fd27e977f89b17ca5ecd0a
  • Pointer size: 131 Bytes
  • Size of remote file: 170 kB