Spaces: HuggingFaceBio/carbon-demo

lvwerra HF Staff committed on 17 days ago

Commit b9445bd

1 Parent(s): 5798713

Recipe §3: extend BP section to cover scoring as a second endpoint

The same marginalization that powers FNS at training time also factors
the scoring endpoint, where you read P(actual base | context) directly
off the per-position marginals instead of forcing a token. Renamed the
section "BP-level inference" (id bpinference) to reflect both endpoints,
rewrote the lede, split the visual's last station into step 3a (generate)
and step 3b (score), and added a "score" tab to the code snippet showing
score_sequence() via the -remote checkpoints.
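
To make the marginalization described in this commit message concrete, here is a minimal sketch in plain Python. It assumes a vocabulary of all 4,096 6-mers in lexicographic A/C/G/T order. That ordering and the names BASES, KMERS, marginalize and score_kmer are illustrative assumptions, not Carbon's actual tokenizer layout or API.

```python
import itertools
import math

BASES = "ACGT"  # assumed base order; the real tokenizer's order may differ
K = 6

# Hypothetical vocabulary: all 4**6 = 4096 6-mers in lexicographic order.
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=K)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize(logits):
    """Turn 4096 6-mer logits into six 4-way per-position base distributions.

    For each position p, the probability of a base is the sum of the
    probabilities of every 6-mer that carries that base at p.
    """
    probs = softmax(logits)
    marginals = [[0.0] * len(BASES) for _ in range(K)]
    for kmer, p in zip(KMERS, probs):
        for pos, base in enumerate(kmer):
            marginals[pos][BASES.index(base)] += p
    return marginals


def score_kmer(logits, observed):
    """Scoring endpoint: P(observed base | context) at each of the six positions."""
    marginals = marginalize(logits)
    return [marginals[pos][BASES.index(b)] for pos, b in enumerate(observed)]


# Usage: toy logits that favor the 6-mer "ACGTAT".
toy_logits = [0.0] * len(KMERS)
toy_logits[KMERS.index("ACGTAT")] = 5.0
print(score_kmer(toy_logits, "ACGTAT"))
```

Each of the six rows returned by marginalize sums to 1, so the values read off by score_kmer are proper per-base probabilities rather than the likelihood of the whole 6-mer.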

Files changed (1)

demo.html +94 -36

demo.html CHANGED

@@ -1204,8 +1204,8 @@ for name, ids in zip(species_prefixes, new_ids):
       The sections below walk through each of those choices: how the tokenizer changes
       what a "token" means in DNA <a class="lede-chip" href="#tokenizer">§1</a>, how
       FNS rescues training in the BF16 regime <a class="lede-chip" href="#loss">§2</a>,
-      how bp-level generation falls out of the same marginalisation
-      <a class="lede-chip" href="#bpgen">§3</a>, what's in the training corpus
       <a class="lede-chip" href="#data">§4</a>, what the architecture looks like
       <a class="lede-chip" href="#architecture">§5</a>, how 8k-token pretraining reaches
       786 kbp at inference <a class="lede-chip" href="#longcontext">§6</a>, how Carbon
@@ -1331,23 +1331,27 @@ for name, ids in zip(species_prefixes, new_ids):
 </section>
 <!-- ============================================================ -->
-<!-- §8.5 · BP-LEVEL GENERATION                                    -->
 <!-- ============================================================ -->
-<section id="bpgen" class="section--two-col">
   <div class="section-narrative">
-  <div class="section-num">§3 · BP-level generation</div>
-  <div class="section-title">Sample bases, not 6-mers</div>
   <p class="lede">
-    The 6-mer tokenizer makes Carbon fast, but it's coarse at sampling time: each
-    step advances the sequence by 6 bases at once, temperature acts on a 4,096-way
-    distribution rather than per nucleotide, and stopping at an odd base count is
-    awkward. The same marginalisation that powers FNS at training time inverts the
-    tokenizer at inference: softmax over the 6-mer logits, then for each position
-    <code>p</code> sum the probabilities of every 6-mer that shares a given base at
-    <code>p</code>, and you recover six per-position 4-way base distributions.
-    Sample (or argmax) each independently, look up the matching 6-mer token id,
-    and force that token as the next selection. The decoder still emits one token
-    per step so throughput is unchanged, but the choice is now base-pair resolved.
   </p>
   </div>
@@ -1499,16 +1503,33 @@ for name, ids in zip(species_prefixes, new_ids):
         </div>
       </div>
-      <div style="text-align:center;color:#888;font-size:11px">▼ &nbsp; argmax (greedy) or multinomial (sampled) per position, then reassemble</div>
-      <div>
-        <div style="font-size:10px;color:#888;letter-spacing:1px;text-transform:uppercase;margin-bottom:6px">step 3 · forced as the next 6-mer token</div>
-        <div style="display:flex;align-items:center;justify-content:center;gap:10px;padding:12px;background:#fafaf6;border:1px solid #eee">
-          <div style="display:flex;gap:6px;font-size:18px;font-weight:700;color:#1A7A40;letter-spacing:2px">
-            <span>A</span><span>C</span><span>G</span><span>T</span><span>A</span><span>T</span>
           </div>
-          <span style="color:#888">→</span>
-          <span style="font-size:11px;color:#666">matching 6-mer token id forced via <code>scores.fill_(-inf); scores[id] = 0</code></span>
         </div>
       </div>
@@ -1517,25 +1538,33 @@ for name, ids in zip(species_prefixes, new_ids):
   <div class="takeaway">
     <strong>When to switch on bp-level</strong>
-    Reach for plain 6-mer sampling when 6-base granularity is fine: throughput-bound
-    decoding, long retrieval haystacks, large-scale screening. Switch to bp-level
-    when you need exact base counts, per-position masks, or temperature and top-p
-    applied at the base axis rather than the 4,096-way 6-mer axis. Same model, same
-    weights, same sampling controls; only the last step of the logits chain changes.
-    The <code>HuggingFaceBio/carbon-generate</code> repo ships this as a transformers
-    <code>custom_generate</code> method, so plain <code>LlamaForCausalLM</code>
-    checkpoints get bp-level generation without a custom modeling file or
-    <code>trust_remote_code</code> on the weights.
   </div>
   <details class="code-snippet">
     <summary>Run this from code</summary>
     <div class="code-snippet__body">
       <div class="code-snippet__tabs">
-        <button class="code-snippet__tab active" data-tab="local" type="button">transformers</button>
       </div>
       <button class="code-snippet__copy" type="button">Copy</button>
-      <div class="code-snippet__panel active" data-tab="local"><pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 tok = AutoTokenizer.from_pretrained(
@@ -1549,7 +1578,7 @@ model = AutoModelForCausalLM.from_pretrained(
 prompt = "&lt;dna&gt;ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
 inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
-# `custom_generate` injects a logits processor that marginalises the
 # 6-mer logits to per-base distributions and samples each of the 6
 # positions independently, then forces the matching 6-mer token. All
 # standard generation knobs (temperature, top_p, top_k, repetition_penalty)
@@ -1566,6 +1595,35 @@ out = model.generate(
 # Slice off the prompt and decode the continuation as plain DNA.
 new_ids = out[0, inputs["input_ids"].shape[1]:]
 print(tok.decode(new_ids, skip_special_tokens=True))</code></pre></div>
     </div>
   </details>
   </div>

       The sections below walk through each of those choices: how the tokenizer changes
       what a "token" means in DNA <a class="lede-chip" href="#tokenizer">§1</a>, how
       FNS rescues training in the BF16 regime <a class="lede-chip" href="#loss">§2</a>,
+      how bp-level generation and scoring fall out of the same marginalization
+      <a class="lede-chip" href="#bpinference">§3</a>, what's in the training corpus
       <a class="lede-chip" href="#data">§4</a>, what the architecture looks like
       <a class="lede-chip" href="#architecture">§5</a>, how 8k-token pretraining reaches
       786 kbp at inference <a class="lede-chip" href="#longcontext">§6</a>, how Carbon
 </section>
 <!-- ============================================================ -->
+<!-- §8.5 · BP-LEVEL INFERENCE                                     -->
 <!-- ============================================================ -->
+<section id="bpinference" class="section--two-col">
   <div class="section-narrative">
+  <div class="section-num">§3 · BP-level inference</div>
+  <div class="section-title">Bases, not 6-mers</div>
   <p class="lede">
+    The 6-mer tokenizer makes Carbon fast, but it's coarse in both directions
+    of inference. When <em>generating</em>, each step advances the sequence by
+    6 bases at once and temperature acts on a 4,096-way distribution rather
+    than per nucleotide. When <em>scoring</em> an existing sequence, the raw
+    next-token likelihood answers "how likely is this 6-mer in context?", not
+    "how likely is this exact base at this exact position?", which is the
+    version you want for variant-effect prediction. The same marginalization
+    that powers FNS at training time fixes both: softmax over the 6-mer
+    logits, then for each position <code>p</code> sum the probabilities of
+    every 6-mer that shares a given base at <code>p</code>, and you recover
+    six per-position 4-way base distributions. To generate, sample (or argmax)
+    each independently and force the matching 6-mer token. To score, read
+    <em>P(actual base | context)</em> directly off the marginals at every
+    position. Same logits, same math, two endpoints.
   </p>
   </div>
         </div>
       </div>
+      <div style="text-align:center;color:#888;font-size:11px">▼ &nbsp; same marginals feed two endpoints: generate (force a token) or score (read off P(base))</div>
+      <div style="display:grid;grid-template-columns:1fr 1fr;gap:10px">
+        <!-- step 3a · generation endpoint -->
+        <div>
+          <div style="font-size:10px;color:#888;letter-spacing:1px;text-transform:uppercase;margin-bottom:6px">step 3a · generate</div>
+          <div style="display:flex;flex-direction:column;align-items:center;justify-content:center;gap:6px;padding:12px;background:#fafaf6;border:1px solid #eee;height:88px;box-sizing:border-box">
+            <div style="display:flex;gap:6px;font-size:18px;font-weight:700;color:#1A7A40;letter-spacing:2px">
+              <span>A</span><span>C</span><span>G</span><span>T</span><span>A</span><span>T</span>
+            </div>
+            <div style="font-size:10px;color:#666;text-align:center;line-height:1.4">
+              argmax / multinomial &rarr; force matching 6-mer token
+            </div>
+          </div>
+        </div>
+        <!-- step 3b · scoring endpoint -->
+        <div>
+          <div style="font-size:10px;color:#888;letter-spacing:1px;text-transform:uppercase;margin-bottom:6px">step 3b · score</div>
+          <div style="display:flex;flex-direction:column;align-items:center;justify-content:center;gap:6px;padding:12px;background:#fafaf6;border:1px solid #eee;height:88px;box-sizing:border-box">
+            <div style="display:flex;gap:8px;font-size:11px;color:#1A7A40;font-weight:600;font-feature-settings:'tnum'">
+              <span>.83</span><span>.71</span><span>.92</span><span>.67</span><span>.48</span><span>.79</span>
+            </div>
+            <div style="font-size:10px;color:#666;text-align:center;line-height:1.4">
+              read P(actual base | context) at each position
+            </div>
           </div>
         </div>
       </div>
   <div class="takeaway">
     <strong>When to switch on bp-level</strong>
+    Use plain 6-mer decoding when 6-base granularity is fine: throughput-bound
+    generation, long retrieval haystacks, large-scale screening. Reach for
+    bp-level <em>generation</em> when you need exact base counts, per-position
+    masks, or temperature applied at the base axis rather than the 4,096-way
+    6-mer axis. Reach for bp-level <em>scoring</em> whenever the task is about
+    a specific base: variant-effect prediction, single-nucleotide mutational
+    scans, comparing the likelihood of a reference and an alternate allele at
+    one position. Two complementary delivery paths: generation ships as a
+    transformers <code>custom_generate</code> method at
+    <code>HuggingFaceBio/carbon-generate</code> that works on the plain
+    <code>Carbon-3B</code>/<code>8B</code>/<code>500M</code> checkpoints
+    (standard <code>LlamaForCausalLM</code>, no custom modeling file).
+    Scoring ships in the <code>-remote</code> variants of those same
+    checkpoints, which add a <code>score_sequence(seq)</code> method that
+    returns per-base distributions and the probability of the observed base
+    at every position.
   </div>
   <details class="code-snippet">
     <summary>Run this from code</summary>
     <div class="code-snippet__body">
       <div class="code-snippet__tabs">
+        <button class="code-snippet__tab active" data-tab="generate" type="button">generate</button>
+        <button class="code-snippet__tab"        data-tab="score"    type="button">score</button>
       </div>
       <button class="code-snippet__copy" type="button">Copy</button>
+      <div class="code-snippet__panel active" data-tab="generate"><pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 tok = AutoTokenizer.from_pretrained(
 prompt = "&lt;dna&gt;ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
 inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+# `custom_generate` injects a logits processor that marginalizes the
 # 6-mer logits to per-base distributions and samples each of the 6
 # positions independently, then forces the matching 6-mer token. All
 # standard generation knobs (temperature, top_p, top_k, repetition_penalty)
 # Slice off the prompt and decode the continuation as plain DNA.
 new_ids = out[0, inputs["input_ids"].shape[1]:]
 print(tok.decode(new_ids, skip_special_tokens=True))</code></pre></div>
+      <div class="code-snippet__panel" data-tab="score"><pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch, math
+# The -remote variants bundle modeling code that exposes
+# `score_sequence(seq)` directly on the model. It returns, for every
+# position in the input DNA, the marginal P(base | context) and the
+# probability of the observed base.
+tok = AutoTokenizer.from_pretrained(
+    "HuggingFaceBio/Carbon-3B-remote", trust_remote_code=True,
+)
+model = AutoModelForCausalLM.from_pretrained(
+    "HuggingFaceBio/Carbon-3B-remote",
+    trust_remote_code=True,
+    dtype=torch.bfloat16, device_map="auto",
+)
+ref = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+alt = ref[:20] + "G" + ref[21:]          # single-base substitution at pos 20
+# bp_probs: [seq_len, 4]   marginal P(A/T/C/G | context) at each position
+# actual:   [seq_len]      P(observed base | context) at each position
+bp_probs_ref, actual_ref = model.score_sequence(ref)
+bp_probs_alt, actual_alt = model.score_sequence(alt)
+# log-likelihood delta at the substituted position
+# is the per-base variant-effect score in its simplest form.
+delta = math.log(actual_alt[20].item() + 1e-12) \
+      - math.log(actual_ref[20].item() + 1e-12)
+print(f"log P(alt) - log P(ref) at pos 20: {delta:+.3f}")</code></pre></div>
     </div>
   </details>
   </div>
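
The generate endpoint in the diff above (pick one base per position, then force the matching 6-mer token) can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the carbon-generate implementation: the lexicographic token ids and the helper names pick_bases and force_token are hypothetical, and the forcing step only mirrors the `scores.fill_(-inf); scores[id] = 0` idea on a plain list.

```python
import itertools
import math
import random

BASES = "ACGT"  # assumed base order; the real tokenizer's order may differ
K = 6

# Hypothetical vocabulary and id mapping: 6-mers in lexicographic order.
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=K)]
TOKEN_ID = {kmer: i for i, kmer in enumerate(KMERS)}


def pick_bases(marginals, greedy=True, rng=None):
    """Choose one base per position from six 4-way distributions.

    greedy=True takes the argmax at each position; otherwise each position
    is sampled independently from its own distribution.
    """
    rng = rng or random.Random()
    chosen = []
    for dist in marginals:
        if greedy:
            idx = max(range(len(BASES)), key=dist.__getitem__)
        else:
            idx = rng.choices(range(len(BASES)), weights=dist, k=1)[0]
        chosen.append(BASES[idx])
    return "".join(chosen)


def force_token(num_tokens, token_id):
    """Build scores that leave exactly one selectable token.

    Every entry is -inf except the forced id, which is set to 0.
    """
    scores = [-math.inf] * num_tokens
    scores[token_id] = 0.0
    return scores


# Usage: hand-written marginals whose per-position argmax spells ACGTAT.
marginals = [
    [0.7, 0.1, 0.1, 0.1],  # A
    [0.1, 0.7, 0.1, 0.1],  # C
    [0.1, 0.1, 0.7, 0.1],  # G
    [0.1, 0.1, 0.1, 0.7],  # T
    [0.7, 0.1, 0.1, 0.1],  # A
    [0.1, 0.1, 0.1, 0.7],  # T
]
kmer = pick_bases(marginals)
forced_scores = force_token(len(KMERS), TOKEN_ID[kmer])
print(kmer, TOKEN_ID[kmer])
```

Because the forced scores leave a single finite entry, any downstream sampler still emits one token per step, which is why throughput matches plain 6-mer decoding.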