---
library_name: transformers
license: apache-2.0
language:
- dna
tags:
- dna
- genomic
- transformers
---
# Carbon-3B
A generative DNA foundation model from the **Carbon** family.
> TODO: add a banner
## Table of Contents
1. [Model Summary](#model-summary)
2. [How to use](#how-to-use)
3. [Evaluation](#evaluation)
4. [Training](#training)
5. [Limitations](#limitations)
6. [License](#license)
7. [Citation](#citation)
## Model Summary
**Carbon-3B** is a 3B-parameter decoder-only autoregressive genomic foundation model trained on DNA and RNA sequences, with a primary focus on eukaryotes. It has a native context length of **32,768 6-mer tokens (≈ 197k DNA base pairs)** and extends to **64,000 tokens (≈ 384 kbp)** at inference time via YaRN. Carbon-3B is designed to be both strong and efficient: on generative tasks (sequence recovery), variant-effect prediction, and motif-perturbation discrimination, it matches the capability of substantially larger single-nucleotide baselines such as Evo2-7B while running several times faster.
Carbon-3B is the **flagship** model of the Carbon family. We also release [**Carbon-8B**](https://huggingface.co/hf-carbon/) for users who need additional capability at higher inference cost, and [**Carbon-500M**](https://huggingface.co/hf-carbon/), a small generative model intended for speculative decoding alongside Carbon-3B (or Carbon-8B).
### Key features
- **3B parameters**, 30 layers, hidden size 3072, GQA (32 heads, 4 KV groups), SwiGLU, RMSNorm.
- **Hybrid tokenizer**: non-overlapping 6-mer tokenization for DNA combined with the [Qwen3](https://huggingface.co/Qwen/Qwen3-4B) BPE vocabulary for English text. We found 6-mer tokenization to work substantially better than BPE for DNA, hence the hybrid setup: it keeps 6-mer for DNA while preserving a BPE vocabulary for future joint English + DNA training, the direction in which we believe genomic foundation models should converge. Each DNA token encodes 6 nucleotides, so 1 DNA token ≈ 6 bp.
- **Native context: 32,768 tokens ≈ 197 kbp.** Extendable to 64 k tokens (≈ 384 kbp) at inference time using YaRN.
- **Trained with a Cross-Entropy → Factorised Nucleotide Supervision (FNS) objective schedule** to bridge coarse tokenization and single-nucleotide resolution (see the Carbon technical report).
- **Metadata-conditioned**: optional species-type and gene-type metadata tokens enable conditional generation.
- **Efficient inference**: TODO
Across our zero-shot evaluation suite, which covers sequence recovery, four variant-effect-prediction (VEP) benchmarks (ClinVar coding, ClinVar non-coding, BRCA2, TraitGym Mendelian), and two sequence-level perturbation tasks (TATA-box and synonymous codon), Carbon-3B is competitive with Evo2-7B. It also handles long context well: it retrieves needles reliably from up to ≈ 384 kbp of distal context on the Genome-NIAH long-context benchmark, while remaining several times faster than Evo2-7B.
For full design rationale and ablations, see the Carbon technical report and the [Carbon GitHub repository](https://github.com/huggingface/carbon).
## How to use
Carbon-3B is a standard Hugging Face causal LM. The custom DNA tokenizer requires `trust_remote_code=True` on the tokenizer; the model itself is stock `LlamaForCausalLM` and does not require it.
```bash
pip install -U transformers
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "hf-carbon/Carbon-3B"

# Tokenizer needs trust_remote_code for the DNA-specific logic
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# Model is standard Llama-family, so no trust_remote_code is needed
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
).cuda().eval()

# Wrap a DNA prompt with the <dna> tag (the model is trained with this format).
# DNA length should be a multiple of 6; see the Tokenizer section below.
dna_prompt = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"  # 42 bp = 7 × 6-mer
prompt = f"<dna>{dna_prompt}"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
)

# NOTE: do not pass skip_special_tokens=True; the hybrid tokenizer currently mis-handles it (TODO: fix)
print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
```
> TODO: fix skip_special_tokens=True
### Tokenizer: working with DNA inputs
The Carbon tokenizer is a **hybrid** of BPE (for English text) and a fixed 6-mer scheme (for DNA). 6-mer tokenization is the central DNA-modeling design choice in Carbon: we found it works substantially better than BPE for DNA (see the Carbon technical report for the analysis), but it comes with a few practical constraints when feeding input to the model. The tokenizer switches into 6-mer mode only when it sees the `<dna>` tag, and it emits the `<oov>` token for any DNA token containing characters outside [ACGT].
#### 1. Always wrap DNA sequences with `<dna>` (this is critical)
If you pass a raw DNA sequence **without** the `<dna>` tag, the tokenizer treats it as English text and applies BPE. BPE-tokenized DNA is essentially a different language for Carbon-3B, and performance collapses across every benchmark. Always prepend `<dna>` before any DNA content. Use `</dna>` to close the DNA block if you intend to follow it with non-DNA tokens.
```python
# ❌ Wrong: tokenized as BPE, performance collapses
prompt = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAG"

# ✅ Correct: `<dna>` flips the tokenizer into 6-mer mode, for generation
prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAG"

# ✅ Also correct: explicitly close the DNA block
prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAG</dna>"
```
The `<dna>` and `</dna>` tags tell the tokenizer where to switch modes.
#### 2. Only uppercase A, C, G, T are in the 6-mer vocab
Anything else inside a `<dna>...` block (lowercase bases, IUPAC ambiguity codes such as `N`, `Y`, `R`, or any other character) is mapped to the `<oov>` token. Filter to canonical uppercase ACGT before passing input if you don't want `<oov>`.
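A minimal sketch of such a filter is shown below. The helper names `count_non_acgt` and `normalize_dna` are illustrative and are not part of the Carbon tokenizer; the sketch assumes that soft-masked (lowercase) bases should simply be uppercased.

```python
import re

# Matches any character outside the uppercase ACGT alphabet.
_NON_ACGT = re.compile(r"[^ACGT]")


def count_non_acgt(seq: str) -> int:
    """Count characters that are still outside ACGT after uppercasing."""
    return len(_NON_ACGT.findall(seq.upper()))


def normalize_dna(seq: str, drop_ambiguous: bool = True) -> str:
    """Uppercase `seq`; optionally delete every character that is not A, C, G, or T."""
    seq = seq.upper()
    return _NON_ACGT.sub("", seq) if drop_ambiguous else seq


print(normalize_dna("atgcNNgcta"))  # ATGCGCTA
```

Deleting characters shifts every downstream base into a different 6-mer frame and changes genomic coordinates, so for position-sensitive tasks it may be preferable to check `count_non_acgt` first and skip or re-extract the affected windows.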
#### 3. DNA length should be a multiple of 6
Each DNA token encodes 6 nucleotides, so the tokenizer groups input in non-overlapping 6-mer blocks. If the sequence is not a multiple of 6, the current tokenizer right-pads the trailing partial block with `A`s (e.g. `...CTAG` → token `TAGAAA`). We recommend truncating to a multiple of 6 before passing the sequence in:
```python
def truncate_to_6mer(seq: str) -> str:
    return seq[: (len(seq) // 6) * 6]

seq = "ATGCGCTAGCTACGATCGAT"  # example input: 20 bp, truncated to 18 bp
prompt = f"<dna>{truncate_to_6mer(seq)}"
```
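To see what the padding does to a sequence, the grouping can be mimicked in plain Python. This is only an illustration of the right-padding behavior described above, not the tokenizer's actual code; `split_6mers` is a hypothetical helper.

```python
def split_6mers(seq: str, k: int = 6, pad: str = "A") -> list[str]:
    """Split `seq` into non-overlapping k-mers, right-padding the last block with `pad`."""
    remainder = len(seq) % k
    if remainder:
        seq = seq + pad * (k - remainder)
    return [seq[i : i + k] for i in range(0, len(seq), k)]


print(split_6mers("ATGCGCTAG"))  # ['ATGCGC', 'TAGAAA']
```

The padded `A`s are indistinguishable from real bases once tokenized, which is why truncating up front is the safer default.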
> TODO add Kashif's PR for left padding and the auto dna tags flag.
> TODO edit text and example to say left padding instead of right padding
### Likelihood-based scoring
For variant-effect or perturbation tasks, score sequences with the model's per-token log-probabilities. A minimal single-sequence helper:
```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def score(seq: str) -> float:
    """Mean log-prob per token of `seq` (single sequence, no padding)."""
    text = f"<dna>{seq}</dna>"
    ids = tok(text, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    logits = model(ids).logits[:, :-1, :]  # position t predicts token t + 1
    targets = ids[:, 1:]  # every token after the first, including the closing tag
    logp = F.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logp.mean().item()
```
For batched scoring with attention masking and full reproducible evaluation pipelines (sequence recovery, ClinVar / BRCA2 / TraitGym VEP, TATA / synonymous-codon perturbation, Genome-NIAH), use the official scripts in the [Carbon evaluation directory](https://github.com/huggingface/carbon/tree/main/evaluation). See [`perturbation_tasks.py`](https://github.com/huggingface/carbon/blob/main/evaluation/perturbation_tasks.py) for the canonical `score_hf` implementation and [`README.md`](https://github.com/huggingface/carbon/blob/main/evaluation/README.md) for run instructions across all tasks.
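As a rough illustration of how a scorer like `score` can be turned into a variant-effect signal, the sketch below computes a generic alternate-minus-reference delta for a single-nucleotide substitution. It is not the official recipe (the evaluation scripts use task-specific windows and scoring modes); `apply_snv` and `variant_delta` are hypothetical helpers, and positions are assumed to be 0-based within the window.

```python
from typing import Callable


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return `ref_seq` with the base at 0-based `pos` changed from `ref` to `alt`."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1 :]


def variant_delta(
    score_fn: Callable[[str], float], ref_seq: str, pos: int, ref: str, alt: str
) -> float:
    """score(alt window) - score(ref window); more negative means less likely under the model."""
    return score_fn(apply_snv(ref_seq, pos, ref, alt)) - score_fn(ref_seq)
```

With the model loaded, a call would look like `variant_delta(score, window, pos, "A", "G")`. Because a substitution keeps the window length fixed, the mean log-prob delta and the summed log-likelihood delta rank variants identically.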
### Long context: extending to 64 k tokens (≈ 384 kbp) with YaRN
The released `config.json` is configured for the native 32 k context. To extend to **64 k tokens (≈ 384 kbp)** at inference time, override `max_position_embeddings` and add a YaRN `rope_scaling` block. We recommend a **YaRN factor of 4**, which we observed gives better retrieval quality at 64 k than a tighter factor of 2:
```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# `repo` is the model id defined in the quickstart above
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
config.max_position_embeddings = 65536  # 64 k tokens ≈ 384 kbp
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    repo, config=config, torch_dtype=torch.bfloat16
).cuda().eval()
```
We do **not** recommend pushing beyond 64 k tokens: retrieval quality degrades sharply at 128 k context in our benchmarks.
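A small helper can make the base-pair budget explicit before choosing a configuration. This is a sketch under stated assumptions: it uses the 32,768-token native limit and the 64,000-token extended figure quoted in this card, and `dna_tokens`, `context_mode`, and `reserved_tokens` are illustrative names rather than part of any Carbon API.

```python
BP_PER_TOKEN = 6        # each DNA token is one non-overlapping 6-mer
NATIVE_TOKENS = 32_768  # native context stated in this card
YARN_TOKENS = 64_000    # extended context stated in this card


def dna_tokens(n_bp: int) -> int:
    """6-mer tokens needed for `n_bp` bases (a partial last block still costs a token)."""
    return -(-n_bp // BP_PER_TOKEN)  # ceiling division


def context_mode(n_bp: int, reserved_tokens: int = 0) -> str:
    """Classify an input as 'native', 'yarn', or 'too_long'.

    `reserved_tokens` leaves room for tags such as <dna> and for generated tokens.
    """
    total = dna_tokens(n_bp) + reserved_tokens
    if total <= NATIVE_TOKENS:
        return "native"
    if total <= YARN_TOKENS:
        return "yarn"
    return "too_long"


print(context_mode(150_000))  # native
print(context_mode(300_000))  # yarn
```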
### Speculative decoding with Carbon-500M
Carbon-3B and Carbon-500M share the same tokenizer and DNA template format, so [Carbon-500M](https://huggingface.co/hf-carbon/) can be used as a draft model for speculative decoding with Carbon-3B (or Carbon-8B) as the target model, reducing wall-clock generation cost at no quality loss.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft = AutoModelForCausalLM.from_pretrained(
    "hf-carbon/carbon-500M",
    torch_dtype=torch.bfloat16,
).cuda().eval()
target = model  # Carbon-3B, loaded above

inputs = tok(f"<dna>{dna_prompt}", return_tensors="pt", add_special_tokens=False).to("cuda")
out = target.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    assistant_model=draft,
)
```
### Conditional generation with metadata
The model is trained with a mixed-template objective; some examples are prefixed with species-type and/or gene-type metadata tokens. Generation can be conditioned on these by prepending the corresponding tokens:
```python
prompt = "<vertebrate_mammalian><protein_coding_region><dna>ATGCGCTAG..."
```
The unconditional `<dna>SEQUENCE</dna>` format remains supported and is the default. See the Carbon technical report for the full list of supported metadata tags.
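For convenience, prompt assembly can be wrapped in a small function. The sketch below only concatenates strings in the order used in the example above; `build_prompt` is a hypothetical helper, and it does not validate that a tag is one the model was actually trained with.

```python
from typing import Optional


def build_prompt(
    seq: str,
    species_tag: Optional[str] = None,
    gene_tag: Optional[str] = None,
    close: bool = False,
) -> str:
    """Assemble `[species][gene]<dna>SEQ[</dna>]`; omitted tags give the unconditional format."""
    prefix = "".join(tag for tag in (species_tag, gene_tag) if tag)
    suffix = "</dna>" if close else ""
    return f"{prefix}<dna>{seq}{suffix}"


print(build_prompt("ATGCGC", "<vertebrate_mammalian>", "<protein_coding_region>"))
# <vertebrate_mammalian><protein_coding_region><dna>ATGCGC
```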
## Evaluation
All evaluations are zero-shot and use the [public Carbon evaluation pipeline](https://github.com/huggingface/carbon/tree/main/evaluation). The suite covers seven tasks across four capability families:
| Family | Task | Metric |
|---|---|---|
| Generative | **Sequence recovery** (eukaryote, bacteria, others splits) | Per-base accuracy on the next 30 bp |
| Variant-effect prediction (VEP) | **ClinVar coding** | AUROC, AUPRC (right-end / next-token scoring) |
| | **ClinVar non-coding** | AUROC, AUPRC |
| | **BRCA2** | AUROC, AUPRC, Spearman ρ (centered 8 kb window, full-LL delta) |
| | **TraitGym Mendelian** | AUROC, AUPRC, Spearman ρ |
| Sequence-level perturbation | **TATA-box perturbation v2** | Pairwise discrimination accuracy |
| | **Uniform synonymous-codon substitution v2** | Pairwise discrimination accuracy |
| Long-context retrieval | **Genome-NIAH** (4 task variants × 6 context lengths, up to 786 kbp) | `gen_exact_match`, `ll_correct` |
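To make two of the metrics in the table concrete, here is a minimal sketch of per-base accuracy over a 30 bp continuation (5 DNA tokens) and of pairwise discrimination accuracy. The exact definitions live in the evaluation pipeline and may differ in detail; in particular, the convention that the unperturbed sequence should score higher is an assumption, and both function names are illustrative.

```python
def per_base_accuracy(predicted: str, reference: str, n_bp: int = 30) -> float:
    """Fraction of the first `n_bp` reference bases that the prediction reproduces."""
    ref = reference[:n_bp]
    if not ref:
        raise ValueError("reference is empty")
    hits = sum(p == r for p, r in zip(predicted, ref))
    return hits / len(ref)  # bases missing from a short prediction count as errors


def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Share of (original, perturbed) score pairs in which the original scores higher."""
    if not pairs:
        raise ValueError("no pairs given")
    return sum(orig > pert for orig, pert in pairs) / len(pairs)
```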
Below we report headline numbers for the three short-context families (generative, variant-effect prediction, and perturbation). Full results, including all VEP benchmarks and Genome-NIAH heatmaps, are in the Carbon technical report.
### Downstream tasks
| Category | Metric | Carbon 3B | GENERator-v2 3B | Evo2 7B (1M) |
|---|---|---|---|---|
| Generative | SR eukaryote % | **61.50** | 55.72 | <u>59.80</u> |
| Variant effect prediction | BRCA2 AUROC | **0.8464** | 0.8057 | <u>0.8352</u> |
| | TraitGym Mendelian AUPRC by-chrom | <u>0.3424</u> | 0.2068 | **0.3736** |
| | ClinVar coding AUROC (48 kb) | <u>0.9330</u> | 0.9198 | **0.9370** |
| | ClinVar non-coding AUROC (48 kb) | **0.9156** | <u>0.9061</u> | 0.9003 |
| Perturbation | TATA v2 % | **65.94** | 49.82 | <u>63.78</u> |
| | SYN v2 % | <u>82.78</u> | 74.08 | **84.90** |
Carbon-3B is competitive with Evo2-7B while being much faster to run.
> TODO: update TATA v2 and SYN v2 scores with the new results
### Long-context retrieval (Genome-NIAH)
[Genome-NIAH](https://huggingface.co/datasets/hf-carbon/genome-niah) is a long-context benchmark inspired by the NIAH and RULER benchmarks for English. The model must retrieve a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
Below are the scores on `niah`:
| Context length | Carbon 3B 32k (native / YaRN 4×) | GENERator-v2 3B | Evo2-7B |
|------------------------|----------------------------------|-----------------|---------|
| 16 k tokens (98 kbp) | 0.73 / - | 0.74 | 0.97 |
| 32 k tokens (196 kbp) | 0.55 / 0.90 | - | 0.95 |
| 64 k tokens (393 kbp) | - / 0.79 | - | 0.80 |
Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20 at 64k due to the slow inference speed.
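Because the sample sizes differ so much, it helps to attach a rough uncertainty to each score. The sketch below uses the binomial standard error, which assumes each score is the fraction of independent examples answered correctly; that is our reading of the `niah` metric, not something this card states explicitly.

```python
import math


def accuracy_std_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy `p` estimated from `n` examples."""
    return math.sqrt(p * (1.0 - p) / n)


# 64 k-token row of the table above
for label, p, n in [("Carbon-3B (YaRN)", 0.79, 500), ("Evo2-7B", 0.80, 20)]:
    print(f"{label}: {p:.2f} ± {accuracy_std_error(p, n):.3f}")
# Carbon-3B (YaRN): 0.79 ± 0.018
# Evo2-7B: 0.80 ± 0.089
```

Under this assumption the Evo2-7B estimate at 64 k tokens carries roughly five times the uncertainty of the Carbon-3B estimate, so small gaps in that row should be read cautiously.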
> TODO try to run more 64k samples for Evo2 7B
- **4× longer effective context than GENERator-v2-3B.** GENERator-v2-3B caps at 16 k tokens (≈ 98 kbp). Carbon-3B has a native context of 32 k tokens (≈ 197 kbp) and extends to 64 k tokens (≈ 384 kbp) at inference time with YaRN. It matches GENERator-v2-3B on `niah` at 98 kbp.
- **Matches Evo2-7B (1 M context) on `niah` at 384 kbp** (64 k tokens) under YaRN, despite being substantially smaller.
- **YaRN helps at the native-context boundary.** At 32 k tokens (≈ 197 kbp), retrieval quality near the model's native maximum begins to degrade; applying YaRN at inference time smooths the boundary and recovers most of the lost accuracy.
### Inference efficiency
> TODO: add Ed's benchmarks
## Training
### Model
- **Architecture:** decoder-only Transformer (Llama-style), 30 layers, hidden 3072, FFN 8448, 32 attention heads with GQA (4 KV groups), SwiGLU, RMSNorm.
- **Tokenizer:** Carbon 6-mer hybrid (vocab ≈ 156 k, including DNA tags, metadata tokens, and BPE tokens for future English + DNA continual pretraining).
- **Precision:** bfloat16
- **Positional embedding:** RoPE, base θ = 5 × 10^6, max position 32,768.
### Pre-training
Carbon-3B is pre-trained for **1T 6-mer tokens (≈ 6T DNA base pairs)** at sequence length 8,192, with a global batch size of 256 sequences (≈ 2 M tokens / step). The optimizer is AdamW throughout.
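The figures above fit together as simple arithmetic. The step count below is derived here, not quoted from the technical report, and assumes the batch size and sequence length stay constant for the whole 1T-token run.

```python
SEQ_LEN = 8_192        # tokens per sequence
GLOBAL_BATCH = 256     # sequences per optimizer step
TOTAL_TOKENS = 10**12  # 1T 6-mer tokens
BP_PER_TOKEN = 6

tokens_per_step = SEQ_LEN * GLOBAL_BATCH
approx_steps = TOTAL_TOKENS // tokens_per_step
total_bp = TOTAL_TOKENS * BP_PER_TOKEN

print(tokens_per_step)  # 2097152 (about 2 M tokens per step)
print(approx_steps)     # 476837
print(total_bp)         # 6000000000000 (6T base pairs)
```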
The data mixture during the stable phases of pre-training (Phase 1 and the stable portion of Phase 2) is the one documented on the [`hf-carbon/carbon-pretraining-corpus` dataset card](https://huggingface.co/datasets/hf-carbon/carbon-pretraining-corpus): ≈ 70 % Generator-style eukaryotic genomic DNA, with mRNA, splice-enriched mRNA, and GTDB bacterial genomes alongside metadata-conditioned templates.
The training uses a **staged objective and learning-rate schedule**:
- **Phase 1: Cross-Entropy (0 → 100B tokens)**. WSD learning-rate schedule with peak LR = 3e-4 and a 2,000-step linear warmup, then stable at peak through the end of Phase 1.
- **Phase 2: Factorised Nucleotide Supervision (100B → 1T tokens)**. Switch to the hybrid FNS loss, lower the peak LR to 2e-5, and continue with a WSD schedule whose decay phase covers the last 20% of Phase-2 steps. During the decay phase, we rebalance the data mixture to upsample mRNA and prokaryotic data (we found that mRNA in particular meaningfully helps downstream tasks), using the following ratios: 50% Generator-style eukaryotic genes · 25% mature mRNA · 10% splice-enriched mRNA · 15% GTDB bacterial genomes.
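The phase boundaries can be expressed in tokens as a quick sanity check. This sketch assumes a constant batch size, so that "the last 20% of Phase-2 steps" corresponds to the last 20% of Phase-2 tokens; the labels `stable_mix` and `decay_mix` are placeholders for the two data mixtures described above.

```python
PHASE1_END = 100 * 10**9  # cross-entropy until 100B tokens
TOTAL = 10**12            # pre-training ends at 1T tokens

# Decay covers the last 20% (one fifth) of Phase 2; integer maths avoids float rounding.
DECAY_START = TOTAL - (TOTAL - PHASE1_END) // 5


def stage_at(tokens_seen: int) -> tuple[str, str]:
    """Return (objective, data mixture) in effect after `tokens_seen` tokens."""
    if tokens_seen < PHASE1_END:
        return ("cross_entropy", "stable_mix")
    if tokens_seen < DECAY_START:
        return ("fns", "stable_mix")
    return ("fns", "decay_mix")


print(DECAY_START)  # 820000000000, i.e. the decay starts at 820B tokens
```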
See the Carbon technical report for the full pre-training recipe.
### Long-context training
After pre-training, the model undergoes continued training for 50B additional tokens at sequence length 32,768, with the rotary base shifted from 5 × 10^5 to 5 × 10^6. The long-context training mixture is:
| Component | Fraction |
|---|---|
| Gener-style annotated genes (metadata-conditioned) | 35.0 % |
| Concatenated annotated genes (long-context-data) | 13.8 % |
| mRNA transcripts | 25.0 % |
| Splice-enriched mRNA | 10.0 % |
| GTDB bacterial genomes | 15.0 % |
| Promoter sequences | 1.2 % |
The optimizer is AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay = 0.1, gradient clipping = 1.0), with a WSD learning-rate schedule: 2,000 steps of linear warmup from 0 to 3e-5, a stable phase, then a 4,000-step linear decay to 3e-6. Global batch size: 64 sequences × 32,768 tokens.
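The warmup-stable-decay (WSD) shape described above can be sketched as a plain function of the step index. This is an illustration rather than the training code: it assumes the decay occupies the final 4,000 steps of the run, and the total step count is derived here from 50B tokens at 64 × 32,768 tokens per step.

```python
def wsd_lr(
    step: int,
    total_steps: int,
    peak: float = 3e-5,
    final: float = 3e-6,
    warmup: int = 2_000,
    decay: int = 4_000,
) -> float:
    """Warmup-stable-decay LR: linear 0 -> peak, hold at peak, then linear peak -> final."""
    if total_steps < warmup + decay:
        raise ValueError("total_steps must cover both warmup and decay")
    decay_start = total_steps - decay
    if step < warmup:
        return peak * step / warmup
    if step < decay_start:
        return peak
    progress = min(step - decay_start, decay) / decay
    return peak + (final - peak) * progress


# Derived estimate: 50B tokens / (64 sequences * 32,768 tokens) per step
total_steps = 50 * 10**9 // (64 * 32_768)
print(total_steps)  # 23841
```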
### Software & hardware
- **GPUs:** 128 H100 (16 nodes × 8 H100).
- **Wall clock (long-context phase):** ≈ 35 hours.
- **Training framework:** [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) with the Carbon patch [Megatron-LM-Carbon](https://github.com/huggingface/Megatron-LM-Carbon).
- **Conversion:** [Megatron-Bridge](https://github.com/NVIDIA/Megatron-Bridge) (Megatron → Hugging Face).
## Limitations
- **Primarily eukaryotic training.** Carbon-3B is trained mostly on eukaryotic genomic and transcript sequence; the pre-training mixture deliberately includes a smaller prokaryotic component (GTDB bacterial genomes) so that continual pre-training on prokaryotic data remains straightforward. Despite this modest prokaryotic share, we observed consistent gains on prokaryotic sequence recovery throughout training, and Carbon-3B already **matches [GENERator-v2-prokaryote-3B](https://huggingface.co/GenerTeam/GENERator-v2-prokaryote-3b-base)**, a model trained specifically on prokaryotes, on the prokaryote sequence-recovery split. We expect that a short continued-training phase on prokaryotic data would deliver substantially stronger prokaryote-specific performance.
- **YaRN beyond 64 k tokens degrades.** YaRN reliably extends Carbon-3B to 64 k tokens (≈ 384 kbp) at inference time with `factor=4`. Pushing further to 128 k tokens (≈ 786 kbp) causes retrieval quality to drop sharply in our long-context benchmarks.
## License
Apache 2.0.