HuggingFaceBio/Carbon-3B

@@ -57,13 +57,10 @@ import torch
 repo = "HuggingFaceBio/Carbon-3B"
-# Tokenizer needs trust_remote_code for the DNA-specific logic
 tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
-# Model is standard Llama-family — no trust_remote_code needed
 model = AutoModelForCausalLM.from_pretrained(
     repo,
-    torch_dtype=torch.bfloat16,
 ).cuda().eval()
 # Wrap a DNA prompt with the <dna> tag (the model is trained with this format).
@@ -133,41 +130,6 @@ def score(seq: str) -> float:
     targets = ids[:, 1:]
     logp = F.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
     return logp.mean().item()
-# QY: not sure if we still want to keep this per-token log-probabilities score function,
-# because we now have a more elegant one in modeling_carbon.py:
-import torch
-from transformers import AutoTokenizer, AutoModelForCausalLM
-repo = "HuggingFaceBio/Carbon-3B"
-# Load tokenizer and model
-tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(
-    repo,
-    torch_dtype=torch.bfloat16,
-    trust_remote_code=True
-).cuda().eval()
-# Setup tokenizer for bp-level scoring (required for score_sequence)
-model.setup_tokenizer(tok)
-# Score sequences - automatically handles BOS token and padding
-sequences = ["ATCG" * 1024, "ACAT" * 2048]
-bp_probs_list, actual_probs_list = model.score_sequence(sequences)
-# bp_probs_list: list of [seq_len_i, 4] tensors - probability distribution over A/T/C/G at each position
-# actual_probs_list: list of [seq_len_i] tensors - probability of the actual base at each position
-# Compute metrics for each sequence
-for i, (seq, actual_probs) in enumerate(zip(sequences, actual_probs_list)):
-    log_likelihood = actual_probs.log().mean().item()  # Total log-likelihood
-    perplexity = torch.exp(-actual_probs.log().mean()).item()  # Perplexity
-    print(f"Sequence {i+1} (length {len(seq)}):")
-    print(f"  Mean log-likelihood: {log_likelihood:.2f}")
-    print(f"  Perplexity: {perplexity:.4f}")
-    print(f"  Mean probability: {actual_probs.mean().item():.4f}")
 ```
 For batched scoring with attention masking and full reproducible evaluation pipelines (sequence recovery, ClinVar / BRCA2 / TraitGym VEP, TATA / synonymous-codon perturbation, Genome-NIAH), use the official scripts in the [Carbon evaluation directory](https://github.com/huggingface/carbon/tree/main/evaluation) — see [`perturbation_tasks.py`](https://github.com/huggingface/carbon/blob/main/evaluation/perturbation_tasks.py) for the canonical `score_hf` implementation and [`README.md`](https://github.com/huggingface/carbon/blob/main/evaluation/README.md) for run instructions across all tasks.
@@ -187,7 +149,7 @@ config.rope_scaling = {
     "original_max_position_embeddings": 32768,
 }
 model = AutoModelForCausalLM.from_pretrained(
-    repo, config=config, torch_dtype=torch.bfloat16
 ).cuda().eval()
 ```
@@ -202,8 +164,8 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 draft = AutoModelForCausalLM.from_pretrained(
-    "HuggingFaceBio/carbon-500M",
-    torch_dtype=torch.bfloat16,
 ).cuda().eval()
 target = model  # Carbon-3B, loaded above
@@ -256,8 +218,8 @@ Below we highlight the three short-context probes for which we report headline n
 | | SYN v2 | <u>82.78</u> | 74.08 | **84.90** |
 Carbon-3B is competitive with Evo2-7B while being much faster to run.
-> TODO update TATA v2 and SYN v2 scores with teh new results!
->
 ### Long-context retrieval (Genome-NIAH)
 [Genome-NIAH](https://huggingface.co/datasets/HuggingFaceBio/genome-niah) is a long-context benchmark inspired by the NIAH and RULER benchmarks for English. The model must retrieve a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
@@ -270,7 +232,7 @@ Below are the scores on `niah`:
 | 64 k tokens (393 kbp)  | — / 0.79       | —               | 0.80    |
 Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20 at 64k due to the slow inference speed.
-> TODO try to run more 64k samples for Evo2 7B
 - **4× longer effective context than Generator-v2-3B.** Generator-v2-3B caps at 16 k tokens (≈ 98 kbp). Carbon-3B has a native context of 32 k tokens (≈ 197 kbp) and extends to 64 k tokens (≈ 393 kbp) at inference time with YaRN. It matches Generator-v2-3B on `niah` at 98 kbp.
 - **Matches Evo2-7B (1 M context) on `niah` at 393 kbp** (64 k tokens) under YaRN, despite being substantially smaller.

 repo = "HuggingFaceBio/Carbon-3B"
 tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
     repo,
+    dtype=torch.bfloat16,
 ).cuda().eval()
 # Wrap a DNA prompt with the <dna> tag (the model is trained with this format).
     targets = ids[:, 1:]
     logp = F.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
     return logp.mean().item()
 ```
 For batched scoring with attention masking and full reproducible evaluation pipelines (sequence recovery, ClinVar / BRCA2 / TraitGym VEP, TATA / synonymous-codon perturbation, Genome-NIAH), use the official scripts in the [Carbon evaluation directory](https://github.com/huggingface/carbon/tree/main/evaluation) — see [`perturbation_tasks.py`](https://github.com/huggingface/carbon/blob/main/evaluation/perturbation_tasks.py) for the canonical `score_hf` implementation and [`README.md`](https://github.com/huggingface/carbon/blob/main/evaluation/README.md) for run instructions across all tasks.
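The single-sequence `score` above averages over every position, which is only valid when there is no padding. As a minimal sketch of the masking step for a padded batch, the mean can be restricted to real tokens as shown below. This is not the canonical `score_hf` from `perturbation_tasks.py`: the function name `masked_mean_logprob` is hypothetical, and the synthetic tensors stand in for `model(**batch).logits` and the tokenizer's `input_ids` / `attention_mask`.

```python
import torch
import torch.nn.functional as F


def masked_mean_logprob(logits, ids, attention_mask):
    """Mean next-token log-probability per sequence, ignoring padded positions."""
    # Position t predicts token t+1, so drop the last logit and the first target.
    logp = F.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = ids[:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Count a prediction only when both the context token and the target are real,
    # which works for right- and left-padded batches alike.
    mask = (attention_mask[:, 1:] * attention_mask[:, :-1]).to(token_logp.dtype)
    return (token_logp * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


# Synthetic stand-ins (assumption): batch of 2, length 5, vocabulary of 8 tokens.
torch.manual_seed(0)
logits = torch.randn(2, 5, 8)
ids = torch.randint(0, 8, (2, 5))
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])  # second sequence is right-padded

scores = masked_mean_logprob(logits, ids, attention_mask)
print(scores.shape)  # one score per sequence
```

With a real model, the same function would presumably be called on the logits from a forward pass that also receives `attention_mask`, so that padded positions do not influence attention.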
     "original_max_position_embeddings": 32768,
 }
 model = AutoModelForCausalLM.from_pretrained(
+    repo, config=config, dtype=torch.bfloat16
 ).cuda().eval()
 ```
 import torch
 draft = AutoModelForCausalLM.from_pretrained(
+    "HuggingFaceBio/Carbon-500M",
+    dtype=torch.bfloat16,
 ).cuda().eval()
 target = model  # Carbon-3B, loaded above
 | | SYN v2 | <u>82.78</u> | 74.08 | **84.90** |
 Carbon-3B is competitive with Evo2-7B while being much faster to run.
+> TODO: update TATA v2 and SYN v2 scores with the new results
 ### Long-context retrieval (Genome-NIAH)
 [Genome-NIAH](https://huggingface.co/datasets/HuggingFaceBio/genome-niah) is a long-context benchmark inspired by the NIAH and RULER benchmarks for English. The model must retrieve a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
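To make the setup concrete, here is a rough sketch of planting a 24 bp needle at a fractional depth in a haystack. It is an illustration under stated assumptions, not the benchmark's construction code: the haystack is random DNA rather than a real genome, the five depth values are hypothetical (the description only says "five depths"), the needle overwrites a window rather than being inserted, and `plant_needle` is an invented helper name.

```python
import random

NEEDLE_LEN = 24  # bp, as stated in the benchmark description
DEPTHS = (0.0, 0.25, 0.5, 0.75, 1.0)  # hypothetical values; the source only says "five depths"


def plant_needle(haystack, needle, depth):
    """Overwrite a window of `haystack` with `needle` at a fractional depth in [0, 1]."""
    start = round(depth * (len(haystack) - len(needle)))
    planted = haystack[:start] + needle + haystack[start + len(needle):]
    return planted, start


rng = random.Random(0)
# Random stand-in for a real-genome haystack (assumption), 24 kbp long.
haystack = "".join(rng.choice("ACGT") for _ in range(24_000))
needle = "".join(rng.choice("ACGT") for _ in range(NEEDLE_LEN))

for depth in DEPTHS:
    context, start = plant_needle(haystack, needle, depth)
    # The needle sits exactly where we planted it and the context length is unchanged.
    assert context[start:start + NEEDLE_LEN] == needle
    assert len(context) == len(haystack)
    print(f"depth={depth:.2f} -> needle starts at bp {start}")
```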
 | 64 k tokens (393 kbp)  | — / 0.79       | —               | 0.80    |
 Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20 at 64k due to the slow inference speed.
+> TODO: run more 64k samples for Evo2 7B
 - **4× longer effective context than Generator-v2-3B.** Generator-v2-3B caps at 16 k tokens (≈ 98 kbp). Carbon-3B has a native context of 32 k tokens (≈ 197 kbp) and extends to 64 k tokens (≈ 393 kbp) at inference time with YaRN. It matches Generator-v2-3B on `niah` at 98 kbp.
 - **Matches Evo2-7B (1 M context) on `niah` at 393 kbp** (64 k tokens) under YaRN, despite being substantially smaller.
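The token and base-pair figures quoted here can be related with a short calculation. The sketch below assumes roughly 6 bp per token and "k" meaning 1024 tokens; both are inferred from the stated pairs (16 k tokens ≈ 98 kbp, 32 k tokens ≈ 197 kbp) rather than taken from the tokenizer, and `tokens_to_kbp` is a hypothetical helper.

```python
BP_PER_TOKEN = 6  # assumption inferred from "16 k tokens ≈ 98 kbp"


def tokens_to_kbp(n_tokens):
    """Approximate context size in kbp for a given number of tokens."""
    return n_tokens * BP_PER_TOKEN / 1000


for k in (16, 32, 64):
    n_tokens = k * 1024  # assumption: "k" denotes 1024 tokens
    print(f"{k} k tokens ~ {tokens_to_kbp(n_tokens):.0f} kbp")
```

Under these assumptions, 64 k tokens comes out at about 393 kbp, which agrees with the `niah` table row for that context length.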