Spaces: HuggingFaceBio/carbon-demo

lvwerra HF Staff Claude Opus 4.7 (1M context) committed 17 days ago

Commit

d835cbf

1 Parent(s): 616acfe

Intro polish: pareto chart in release rail + rewritten DNA Lab / Recipe ledes

- Release hero: move pareto throughput-vs-win-rate figure inside the green
left rail so it visually belongs to the announcement; add hairline border
and 640px max-width.
- DNA Lab tab-lede: lead with what the model is + invite to explore, drop
the "no curriculum" framing, soften the supervision claim (acknowledges
species/biotype tags).
- Recipe tab-lede: promote the curated data mixture to a third pillar
alongside the 6-mer tokenizer and FNS loss.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (3)

assets/styles/layout.css +23 -0
demo.html +18 -12
img/pareto.png +3 -0

assets/styles/layout.css CHANGED

@@ -48,6 +48,29 @@
   color: #2d2d2a;
   max-width: 760px;
 }
+/* Release-hero figure (pareto frontier). Sits below the lede rail,
+   left-aligned with the text column so the figure feels anchored to the
+   announcement rather than floating. Caption mirrors the secondary-note
+   typography for visual continuity. */
+.tab-lede__figure {
+  margin: 28px 0 0;
+  max-width: 640px;
+  padding: 0;
+}
+.tab-lede__figure img {
+  display: block;
+  width: 100%;
+  height: auto;
+  border: 1px solid var(--hairline);
+}
+.tab-lede__figure figcaption {
+  margin-top: 10px;
+  font-family: "Inter", "Helvetica Neue", sans-serif;
+  font-size: 13px;
+  line-height: 1.55;
+  color: #5b5b56;
+}
 /* Secondary "navigator" paragraph that follows the lede. Drops a step
    in size + weight + saturation so the eye reads it as a follow-up /
    table of contents rather than another full lede paragraph; the inline

demo.html CHANGED

@@ -275,6 +275,10 @@
         shipping with the full training code, the data pipeline, and the model weights.
         Everything is open source on the Hugging Face Hub.
       </p>
+      <figure class="tab-lede__figure">
+        <img src="/img/pareto.png" alt="Throughput vs win rate pareto frontier: Carbon 3B/8B sit at high win rate and ~275× the throughput of Arc Evo2 7B, well ahead of GENERator-v2.">
+        <figcaption>Throughput (base pairs per second, log scale) vs win rate across open DNA foundation models. Carbon 3B matches Evo2 7B's win rate at roughly 275× the throughput.</figcaption>
+      </figure>
     </div>
   </div>
@@ -545,16 +549,16 @@
   <div class="tab-lede__rail">
     <span class="tab-lede__eyebrow">Intro</span>
     <p>
-      <strong>Carbon-3B</strong> is a 3-billion-parameter language model for DNA. We trained it
-      on roughly 1&nbsp;trillion tokens (6&nbsp;trillion base pairs) of genomic sequence with a
-      single objective: given some DNA, predict what comes next (six bases at a time,
-      autoregressively). That's it: no annotations, no labels, no biology curriculum.
-      Just <em>read DNA, predict more DNA</em>.
+      <strong>Carbon-3B</strong> is a 3-billion-parameter language model for DNA. It is trained on
+      roughly 1&nbsp;trillion tokens (6&nbsp;trillion base pairs) of genomic sequence with a simple
+      objective: given some DNA, predict what comes next (six bases at a time, autoregressively).
+      Even though the objective is simple the resulting model is versatile. In the DNA lab you can
+      explore all the cool things we can do with a DNA model.
     </p>
     <p class="tab-lede__note">
-      The interesting question is what else falls out of that. We didn't tell Carbon-3B what an
-      exon is. We didn't tell it which mutations are pathogenic. We didn't tell it how genes
-      differ between species. The sections below are ways to read what it picked up
+      Carbon-3B was trained unsupervised besides some simple tags for species and gene biotypes.
+      It wasn't trained to tell which mutations are pathogenic or how genes differ between species.
+      The sections below highlight what it picked up
       anyway: autocomplete a gene <a class="lede-chip" href="#completion">§1</a>, see
       structure emerge in its confidence <a class="lede-chip" href="#track">§2</a>, score
       a disease variant against a healthy one <a class="lede-chip" href="#vep">§3</a>,
@@ -1368,12 +1372,14 @@ for name, ids in zip(species_prefixes, new_ids):
     <span class="tab-lede__eyebrow">Intro</span>
     <p>
       Carbon's architecture is deliberately vanilla. What's <em>not</em> vanilla, and what
-      gets the headline numbers in the DNA Lab tab, is two things: a <strong>6-mer
+      gets the headline numbers in the DNA Lab tab, is three things: a <strong>6-mer
       tokenizer</strong> that lets the model see ~6&times; more genomic context per
-      forward pass, and a <strong>Factorized Nucleotide Supervision (FNS)</strong> loss
+      forward pass, a <strong>Factorized Nucleotide Supervision (FNS)</strong> loss
       that gives the model partial credit for near-miss tokens once cross-entropy
-      training starts to wobble. Everything else (architecture, data mix, optimizer) is
-      standard recipe.
+      training starts to wobble, and a <strong>multi-stage curated data mixture</strong>,
+      biased toward functional genomic regions. Everything else (architecture, optimizer)
+      is standard recipe. The technical report details each choice and the ablations
+      behind it.
     </p>
     <p class="tab-lede__note">
       The sections below walk through each of those choices: how the tokenizer changes
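
The Recipe lede in this diff credits a 6-mer tokenizer with letting the model see roughly 6× more genomic context per forward pass, and the DNA Lab lede equates about 1 trillion tokens with 6 trillion base pairs. A minimal sketch of that arithmetic, assuming non-overlapping 6-mers over the alphabet A/C/G/T. The function name `kmer_token_ids`, the base-4 id scheme, and the choice to drop a trailing partial 6-mer are illustrative assumptions, not Carbon's actual tokenizer:

```python
# Illustrative only: a non-overlapping 6-mer tokenizer sketch.
# The id scheme and remainder handling are assumptions, not Carbon's code.
K = 6
ALPHABET = "ACGT"
BASE_INDEX = {base: i for i, base in enumerate(ALPHABET)}


def kmer_token_ids(seq, k=K):
    """Map a DNA string to one integer id per non-overlapping k-mer.

    Each k-mer is read as a base-4 number (A=0, C=1, G=2, T=3).
    Trailing bases that do not fill a whole k-mer are dropped (assumption).
    Characters outside A/C/G/T (for example N) raise KeyError.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % k
    ids = []
    for start in range(0, usable, k):
        token_id = 0
        for base in seq[start:start + k]:
            token_id = token_id * len(ALPHABET) + BASE_INDEX[base]
        ids.append(token_id)
    return ids


# 4 bases over 6 positions gives 4**6 = 4096 possible 6-mers.
vocab_size = len(ALPHABET) ** K

# 14 bases -> 2 tokens; the last 2 bases ("TT") are dropped in this sketch.
ids = kmer_token_ids("ACGTACGGGGGGTT")
print(vocab_size, ids)
```

Under this scheme each token covers six bases, so a context window of N tokens spans 6N base pairs, which is where the "~6×" and the 1 trillion tokens to 6 trillion base pairs correspondence come from.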

img/pareto.png ADDED

Git LFS Details

SHA256: 2cee784b63d5f933f8a64ead2c5e7eeecb576c8593fd27e977f89b17ca5ecd0a
Pointer size: 131 Bytes
Size of remote file: 170 kB
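
The pointer size above can be sanity-checked against the Git LFS pointer format: a `version` line, an `oid sha256:` line, and a `size` line, each newline-terminated. A small sketch follows. The exact byte count of the remote file is not shown on the page (only "170 kB"), so `assumed_size` is a placeholder six-digit value, not the real size, and `lfs_pointer` is a hypothetical helper name:

```python
# Rebuild the text of a Git LFS pointer file and measure its length.
def lfs_pointer(oid_sha256, size_bytes):
    """Return the pointer file text for the given object id and size."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid_sha256}\n"
        f"size {size_bytes}\n"
    )


# SHA256 as listed under "Git LFS Details".
oid = "2cee784b63d5f933f8a64ead2c5e7eeecb576c8593fd27e977f89b17ca5ecd0a"

# Assumption: placeholder size; the page only reports "170 kB".
assumed_size = 170_000

pointer = lfs_pointer(oid, assumed_size)
pointer_len = len(pointer.encode("utf-8"))
print(pointer_len)
```

Any six-digit size produces a pointer of the same length, so the reported 131-byte pointer is consistent with a remote file of roughly 170 kB.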