HuggingFaceBio/Carbon-3B

@@ -28,7 +28,7 @@ A generative DNA foundation model from the **Carbon** family.
 **Carbon-3B** is a 3B-parameter decoder-only autoregressive genomic foundation model trained on DNA and RNA sequences, with a primary focus on eukaryotes. It has a native context length of **32,768 6-mer tokens (≈ 197k DNA base pairs)** and extends to **64,000 tokens (≈ 384 kbp)** at inference time via YaRN. Carbon-3B is designed to be both strong and efficient: on generative tasks (sequence recovery), variant-effect prediction, and motif-perturbation discrimination, it matches the capability of substantially larger single-nucleotide baselines such as Evo2-7B while running several times faster.
-Carbon-3B is the **flagship** model of the Carbon family. We also release [**Carbon-8B**](https://huggingface.co/hf-carbon/) for users who need additional capability at higher inference cost, and [**Carbon-500M**](https://huggingface.co/hf-carbon/) — a small generative model intended for speculative decoding alongside Carbon-3B (or Carbon-8B).
 ### Key features
@@ -55,7 +55,7 @@ pip install -U transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-repo = "hf-carbon/Carbon-3B"
 # Tokenizer needs trust_remote_code for the DNA-specific logic
 tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
@@ -164,14 +164,14 @@ We do **not** recommend pushing beyond 64 k tokens: retrieval quality degrades s
 ### Speculative decoding with Carbon-500M
-Carbon-3B and Carbon-500M share the same tokenizer and DNA template format, so [Carbon-500M](https://huggingface.co/hf-carbon/) can be used as a draft model for speculative decoding with Carbon-3B (or Carbon-8B) as the target model, reducing wall-clock generation cost with no loss in output quality.
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 draft = AutoModelForCausalLM.from_pretrained(
-    "hf-carbon/carbon-500M",
     torch_dtype=torch.bfloat16,
 ).cuda().eval()
 target = model  # Carbon-3B, loaded above
@@ -229,7 +229,7 @@ Carbon-3B is competitive with Evo2-7B while being much faster to run.
 >
 ### Long-context retrieval (Genome-NIAH)
-[Genome-NIAH](https://huggingface.co/datasets/hf-carbon/genome-niah) is a long-context benchmark inspired by the NIAH and RULER benchmarks for English. The model must retrieve a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
 Below are the scores on `niah`:
 | Context length         | Carbon 3B 32k (native / YaRN 4×) | GENERator-v2 3B | Evo2-7B |
@@ -263,7 +263,7 @@ Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20
 Carbon-3B is pre-trained for **1T 6-mer tokens (≈ 6T DNA base pairs)** at sequence length 8 192, with a global batch size of 256 sequences (≈ 2 M tokens / step). The optimizer is AdamW throughout.
-The data mixture during the stable phases of pre-training (Phase 1 and the stable portion of Phase 2) is the one documented on the [`hf-carbon/carbon-pretraining-corpus` dataset card](https://huggingface.co/datasets/hf-carbon/carbon-pretraining-corpus): ≈ 70 % Generator-style eukaryotic genomic DNA, with mRNA, splice-enriched mRNA, and GTDB bacterial genomes alongside metadata-conditioned templates.
 The training uses a **staged objective and learning-rate schedule**:

 **Carbon-3B** is a 3B-parameter decoder-only autoregressive genomic foundation model trained on DNA and RNA sequences, with a primary focus on eukaryotes. It has a native context length of **32,768 6-mer tokens (≈ 197k DNA base pairs)** and extends to **64,000 tokens (≈ 384 kbp)** at inference time via YaRN. Carbon-3B is designed to be both strong and efficient: on generative tasks (sequence recovery), variant-effect prediction, and motif-perturbation discrimination, it matches the capability of substantially larger single-nucleotide baselines such as Evo2-7B while running several times faster.
+Carbon-3B is the **flagship** model of the Carbon family. We also release [**Carbon-8B**](https://huggingface.co/HuggingFaceBio/) for users who need additional capability at higher inference cost, and [**Carbon-500M**](https://huggingface.co/HuggingFaceBio/) — a small generative model intended for speculative decoding alongside Carbon-3B (or Carbon-8B).
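The context-length figures quoted above follow from simple arithmetic on the tokenization. A minimal sketch, assuming each token covers six non-overlapping bases and ignoring any special tokens (the helper name `tokens_to_bp` is hypothetical, not part of the Carbon code):

```python
BASES_PER_TOKEN = 6  # 6-mer tokens, as stated in the model card


def tokens_to_bp(n_tokens: int, bases_per_token: int = BASES_PER_TOKEN) -> int:
    """Convert a token count to the number of DNA base pairs it spans."""
    return n_tokens * bases_per_token


native_bp = tokens_to_bp(32_768)  # native context length
yarn_bp = tokens_to_bp(64_000)    # extended context length via YaRN

print(f"native: {native_bp:,} bp, YaRN: {yarn_bp:,} bp")
```

This gives 196,608 bp for the native window and 384,000 bp for the extended one, matching the rounded values in the description.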
 ### Key features
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
+repo = "HuggingFaceBio/Carbon-3B"
 # Tokenizer needs trust_remote_code for the DNA-specific logic
 tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
 ### Speculative decoding with Carbon-500M
+Carbon-3B and Carbon-500M share the same tokenizer and DNA template format, so [Carbon-500M](https://huggingface.co/HuggingFaceBio/) can be used as a draft model for speculative decoding with Carbon-3B (or Carbon-8B) as the target model, reducing wall-clock generation cost with no loss in output quality.
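To illustrate why a draft model does not change the output, here is a toy sketch of greedy speculative decoding. It is not the `transformers` implementation and does not load any Carbon model: `target_next` and `draft_next` are hypothetical stand-ins for the target and draft models, each mapping a token context to its next token. Sampling-based decoding uses a rejection-sampling acceptance rule instead of the exact-match check shown here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy 'large' model: a deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_next(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7


def greedy(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; target keeps the longest matching prefix."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks every proposed position. With a real model
        #    this is a single batched forward pass, which is where the
        #    speed-up comes from.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out


prompt = [1, 2, 3]
baseline = greedy(target_next, prompt, 20)
speculative = speculative_greedy(target_next, draft_next, prompt, 20)
```

Every emitted token is either verified against or produced by the target, so `speculative` equals `baseline`; the draft only affects how many target passes are needed.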
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 draft = AutoModelForCausalLM.from_pretrained(
+    "HuggingFaceBio/carbon-500M",
     torch_dtype=torch.bfloat16,
 ).cuda().eval()
 target = model  # Carbon-3B, loaded above
 >
 ### Long-context retrieval (Genome-NIAH)
+[Genome-NIAH](https://huggingface.co/datasets/HuggingFaceBio/genome-niah) is a long-context benchmark inspired by the NIAH and RULER benchmarks for English. The model must retrieve a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
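The planting step described above can be sketched as follows. This is an illustration only, not the benchmark's actual construction code: the helper names, the five evenly spaced depths, and the synthetic random haystack (standing in for real genomic sequence) are all assumptions, and the benchmark's prompt format is not reproduced.

```python
import random

# Assumed: five evenly spaced insertion depths (fraction of the haystack).
DEPTHS = (0.0, 0.25, 0.5, 0.75, 1.0)


def random_dna(n: int, rng: random.Random) -> str:
    """Return a random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def plant_value(haystack: str, depth: float, rng: random.Random,
                value_len: int = 24):
    """Insert a random value_len-bp VALUE at the given relative depth."""
    value = random_dna(value_len, rng)
    pos = int(depth * len(haystack))
    return haystack[:pos] + value + haystack[pos:], value, pos


rng = random.Random(0)
haystack = random_dna(24_000, rng)  # synthetic stand-in for a genome window
examples = [plant_value(haystack, depth, rng) for depth in DEPTHS]
```

Each entry in `examples` holds the full sequence, the planted VALUE, and its offset, which is enough to check whether a model's output reproduces the VALUE.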
 Below are the scores on `niah`:
 | Context length         | Carbon 3B 32k (native / YaRN 4×) | GENERator-v2 3B | Evo2-7B |
 Carbon-3B is pre-trained for **1T 6-mer tokens (≈ 6T DNA base pairs)** at sequence length 8 192, with a global batch size of 256 sequences (≈ 2 M tokens / step). The optimizer is AdamW throughout.
+The data mixture during the stable phases of pre-training (Phase 1 and the stable portion of Phase 2) is the one documented on the [`HuggingFaceBio/carbon-pretraining-corpus` dataset card](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus): ≈ 70 % Generator-style eukaryotic genomic DNA, with mRNA, splice-enriched mRNA, and GTDB bacterial genomes alongside metadata-conditioned templates.
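The batch-size figure quoted above can be checked with a few lines of arithmetic. The step count at the end is a rough estimate that assumes the sequence length and batch size stay constant for the whole token budget, which the staged schedule may not do.

```python
SEQ_LEN = 8_192            # tokens per sequence
GLOBAL_BATCH = 256         # sequences per optimizer step
BASES_PER_TOKEN = 6        # 6-mer tokens
TOTAL_TOKENS = 10**12      # 1T-token pre-training budget

tokens_per_step = SEQ_LEN * GLOBAL_BATCH
bp_per_step = tokens_per_step * BASES_PER_TOKEN
total_bp = TOTAL_TOKENS * BASES_PER_TOKEN

# Assumption: constant sequence length and batch size throughout training.
steps_if_constant = TOTAL_TOKENS // tokens_per_step

print(f"{tokens_per_step:,} tokens/step, {bp_per_step:,} bp/step")
print(f"{total_bp:,} bp total, about {steps_if_constant:,} steps")
```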
 The training uses a **staged objective and learning-rate schedule**: