HuggingFaceBio/Carbon-3B

@@ -188,6 +188,77 @@ prompt = "<vertebrate_mammalian><protein_coding_region><dna>ATGCGCTAG..."
 The unconditional `<dna>SEQUENCE</dna>` format remains supported and is the default. See the Carbon technical report for the full list of supported metadata tags.
+### Base-pair-level generation and scoring
+The `fns` branch loads custom modeling code for Factorized Nucleotide Supervision (FNS). Carbon still uses its efficient 6-mer tokenizer, but during generation each selected 6-mer is assembled from six per-position nucleotide distributions, giving base-pair-level control over decoded DNA. Use this branch when you need exact base-pair counts, per-position masks, or temperature/top-p behavior applied at the nucleotide level rather than over the 4,096-way 6-mer distribution:
+```py
+import math
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "HuggingFaceBio/Carbon-3B"
+revision = "fns"
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to(device).eval()
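+# Context to continue; its length (42 bp) is a multiple of the 6-mer size.
+context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+n_bp = 60  # number of bases to generate
+inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        # Each generated token covers tokenizer.k bases, so round up to reach n_bp.
+        max_new_tokens=math.ceil(n_bp / tokenizer.k),
+        do_sample=False,  # greedy decoding
+        pad_token_id=tokenizer.eos_token_id,
+    )
+# Drop the prompt tokens, decode the continuation, and trim to exactly n_bp bases.
+generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
+generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]
+print(generated_dna)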
+```
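The per-position assembly described above can be illustrated with a toy sketch in plain Python. This is not the `fns` branch's implementation: the six 4-way distributions are made-up numbers, the helper names (`apply_temperature`, `apply_mask`, `greedy_kmer`, `kmer_probability`) are hypothetical, and the sketch assumes the six positions combine as a simple product, which the real modeling code may not do.

```python
import itertools
import math

BASES = "ACGT"

# Made-up per-position distributions over A, C, G, T for one 6-mer (illustrative only).
per_position = [
    [0.70, 0.15, 0.10, 0.05],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.20, 0.20, 0.50, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]


def apply_temperature(dist, temperature):
    """Rescale one 4-way distribution; temperature < 1 sharpens it."""
    weights = [p ** (1.0 / temperature) for p in dist]
    total_weight = sum(weights)
    return [w / total_weight for w in weights]


def apply_mask(dist, allowed):
    """Zero out bases not in `allowed` and renormalize the rest."""
    weights = [p if base in allowed else 0.0 for base, p in zip(BASES, dist)]
    total_weight = sum(weights)
    return [w / total_weight for w in weights]


def greedy_kmer(dists):
    """Pick the most likely base at each position and join them into a k-mer."""
    return "".join(BASES[max(range(4), key=dist.__getitem__)] for dist in dists)


def kmer_probability(kmer, dists):
    """Probability of a k-mer under the assumed product of per-position terms."""
    return math.prod(dist[BASES.index(base)] for base, dist in zip(kmer, dists))


# Under the product assumption, the 4**6 = 4,096 implied 6-mer probabilities sum to 1.
total = sum(
    kmer_probability("".join(kmer), per_position)
    for kmer in itertools.product(BASES, repeat=6)
)

greedy = greedy_kmer(per_position)

# Per-position mask: forbid A at the first base only, leave the other five untouched.
masked = [apply_mask(per_position[0], allowed="CGT")] + per_position[1:]
masked_kmer = greedy_kmer(masked)

# Temperature applied to a single 4-way distribution instead of the 4,096-way one.
sharpened = apply_temperature(per_position[0], temperature=0.5)

print(greedy, masked_kmer)
print(round(total, 6))
print([round(p, 3) for p in sharpened])
```

Working on six 4-way distributions keeps constraints local: masking one base at one position touches a single row, whereas the same constraint on the 6-mer vocabulary would mean zeroing 1,024 of the 4,096 entries.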
+The same per-base marginals are exposed through `score_sequence()`. Its second return value (`actual_probs` in the example below) holds the probability assigned to the observed base at each position. Taking the mean log probability gives a base-pair-level sequence score, where higher values indicate higher model likelihood:
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "HuggingFaceBio/Carbon-3B"
+revision = "fns"
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to(device).eval()
+# The perturbed sequence replaces the TATAAA motif near the start with GCGCGC.
+reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
+perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"
+with torch.no_grad():
+    # actual_probs holds, per sequence, the probability of each observed base.
+    bp_probs, actual_probs = model.score_sequence([reference, perturbed])
+# Mean per-base log probability; the clamp avoids log(0).
+scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]
+print(f"reference mean bp logp: {scores[0]:.4f}")
+print(f"perturbed mean bp logp: {scores[1]:.4f}")
+print(f"reference preferred: {scores[0] > scores[1]}")
+```
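As a rough guide to reading these scores, the mean log probability maps to a per-base perplexity via `exp(-mean)`, and a model guessing uniformly over the four bases would score `ln(0.25) ≈ -1.386` (perplexity 4). The sketch below works through that arithmetic in plain Python. The probability lists are made-up stand-ins for `actual_probs`, and the helper names are hypothetical, not part of the Carbon API.

```python
import math


def mean_log_prob(probs, floor=1e-12):
    """Average natural-log probability of the observed bases (floor avoids log(0))."""
    return sum(math.log(max(p, floor)) for p in probs) / len(probs)


def per_base_perplexity(probs):
    """Perplexity per base: exp of the negative mean log probability."""
    return math.exp(-mean_log_prob(probs))


# Made-up per-base probabilities for illustration; real values come from the model.
uniform = [0.25] * 8
confident = [0.9, 0.8, 0.95, 0.85, 0.9, 0.7, 0.9, 0.8]
uncertain = [0.9, 0.8, 0.3, 0.2, 0.25, 0.3, 0.9, 0.8]

for name, probs in [("uniform", uniform), ("confident", confident), ("uncertain", uncertain)]:
    print(f"{name}: mean logp {mean_log_prob(probs):.4f}, "
          f"perplexity {per_base_perplexity(probs):.3f}")

# For two equal-length sequences, per-position log-probability gaps show
# where the scores diverge most.
deltas = [math.log(c) - math.log(u) for c, u in zip(confident, uncertain)]
worst = max(range(len(deltas)), key=deltas.__getitem__)
print(f"largest gap at position {worst}: {deltas[worst]:.4f}")
```

Comparing scores against the uniform baseline gives a quick sanity check: a mean log probability below about -1.386 means the model did worse than guessing each base at random.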
 ## Evaluation
 All evaluations are zero-shot and use the [public Carbon evaluation pipeline](https://github.com/huggingface/carbon/tree/main/evaluation). The suite covers seven tasks across four capability families: