metadata
library_name: transformers
license: apache-2.0
language:
  - dna
tags:
  - dna
  - genomic
  - transformers

Carbon-3B

A generative DNA foundation model from the Carbon family.

TODO: add a banner

Table of Contents

  1. Model Summary
  2. How to use
  3. Evaluation
  4. Training
  5. Limitations
  6. License
  7. Citation

Model Summary

Carbon-3B is a 3B-parameter decoder-only autoregressive genomic foundation model trained on DNA and RNA sequences, with a primary focus on eukaryotes. It has a native context length of 32,768 6-mer tokens (≈ 197k DNA base pairs) and extends to 64,000 tokens (≈ 384 kbp) at inference time via YaRN. Carbon-3B is designed to be both strong and efficient: on generative tasks (sequence recovery), variant-effect prediction, and motif-perturbation discrimination, it matches the capability of substantially larger single-nucleotide baselines such as Evo2-7B while running several times faster.

Carbon-3B is the flagship model of the Carbon family. We also release Carbon-8B for users who need additional capability at higher inference cost, and Carbon-500M, a small generative model intended for speculative decoding alongside Carbon-3B (or Carbon-8B).

Key features

  • 3B parameters, 30 layers, hidden size 3072, GQA (32 heads, 4 KV groups), SwiGLU, RMSNorm.
  • Hybrid tokenizer: non-overlapping 6-mer tokenization for DNA combined with the Qwen3 BPE vocabulary for English text. We found 6-mer tokenization to work substantially better than BPE for DNA. The hybrid setup therefore keeps 6-mer tokens for DNA while preserving a BPE vocabulary for future joint English + DNA training, the direction in which we believe genomic foundation models should converge. Each DNA token encodes 6 nucleotides, so 1 DNA token = 6 bp.
  • Native context: 32,768 tokens ≈ 197 kbp. Extendable to 64 k tokens (≈ 384 kbp) at inference time using YaRN.
  • Trained with a Cross-Entropy → Factorised Nucleotide Supervision (FNS) objective schedule to bridge coarse tokenization and single-nucleotide resolution (see the Carbon technical report).
  • Metadata-conditioned: optional species-type and gene-type metadata tokens enable conditional generation.
  • Efficient inference: TODO
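The token and base-pair figures quoted in the list above follow directly from the 6-mer tokenization. As a quick sanity check, here is a minimal sketch; the helper name `tokens_to_bp` is ours for illustration and is not part of the Carbon codebase:

```python
BASES_PER_TOKEN = 6  # each DNA token is one non-overlapping 6-mer


def tokens_to_bp(n_tokens: int) -> int:
    """Number of DNA base pairs covered by `n_tokens` 6-mer tokens."""
    return n_tokens * BASES_PER_TOKEN


native_bp = tokens_to_bp(32_768)    # 196,608 bp, quoted as about 197 kbp
extended_bp = tokens_to_bp(64_000)  # 384,000 bp, quoted as about 384 kbp
print(native_bp, extended_bp)
```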

Our zero-shot evaluation suite covers sequence recovery, four variant-effect-prediction (VEP) benchmarks (ClinVar coding, ClinVar non-coding, BRCA2, TraitGym Mendelian), and two sequence-level perturbation tasks (TATA-box and synonymous codon). Across this suite, Carbon-3B is competitive with Evo2-7B. It also handles long context well, retrieving needles reliably from up to ≈ 384 kbp of distal context on the Genome-NIAH long-context benchmark, while remaining several times faster than Evo2-7B.

For full design rationale and ablations, see the Carbon technical report and the Carbon GitHub repository.

How to use

Carbon-3B is a standard Hugging Face causal LM. The custom DNA tokenizer requires trust_remote_code=True on the tokenizer; the model itself is stock LlamaForCausalLM and does not require it.

pip install -U transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "hf-carbon/Carbon-3B"

# Tokenizer needs trust_remote_code for the DNA-specific logic
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# Model is standard Llama-family: no trust_remote_code needed
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
).cuda().eval()

# Wrap a DNA prompt with the <dna> tag (the model is trained with this format).
# DNA length should be a multiple of 6 (see the Tokenizer section below).
dna_prompt = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"   # 42 bp = 7 × 6-mer
prompt = f"<dna>{dna_prompt}"

inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
)
# NOTE: do not pass skip_special_tokens=True; the hybrid tokenizer currently mis-handles it (TODO: fix)
print(tok.decode(out[0][inputs.input_ids.shape[1]:]))

TODO: fix skip_special_tokens=True

Tokenizer: working with DNA inputs

The Carbon tokenizer is a hybrid of BPE (for English text) and a fixed 6-mer scheme (for DNA). 6-mer tokenization is the central DNA-modeling design choice in Carbon: we found it works substantially better than BPE for DNA (see the Carbon technical report for the analysis). It does, however, come with a few practical constraints when feeding input to the model. The tokenizer only switches into 6-mer mode when it sees the <dna> tag, and it emits the <oov> token for any 6-mer containing characters outside [ACGT].

1. Always wrap DNA sequences with <dna>: this is critical

If you pass a raw DNA sequence without the <dna> tag, the tokenizer treats it as English text and applies BPE. BPE-tokenized DNA is essentially a different language for Carbon-3B, and performance collapses across every benchmark. Always prepend <dna> before any DNA content. Use </dna> to close the DNA block if you intend to follow it with non-DNA tokens.

# ❌ Wrong: tokenized as BPE, performance collapses
prompt = "ATGCGCTAGCTACGATCGATCGTAGCTAGC"   # 30 bp = 5 × 6-mer

# ✅ Correct: `<dna>` flips the tokenizer into 6-mer mode, for generation
prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGC"

# ✅ Also correct: explicitly close the DNA block
prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGC</dna>"

The <dna> and </dna> tags tell the tokenizer where to switch modes.

2. Only uppercase A, C, G, T are in the 6-mer vocab

Anything else inside a <dna>... block (lowercase bases, IUPAC ambiguity codes such as N, Y, R, or any other character) is mapped to the <oov> token. Filter to canonical uppercase ACGT before passing input if you don't want <oov>.
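One way to do that filtering is sketched below. The helper `clean_dna` is hypothetical (it is not shipped with the tokenizer), and it makes two assumptions that may not suit every use case: lowercase (soft-masked) bases are simply uppercased, and any remaining non-ACGT character is dropped.

```python
import re

_NON_ACGT = re.compile(r"[^ACGT]")


def clean_dna(seq: str, uppercase: bool = True) -> str:
    """Return `seq` restricted to canonical uppercase A/C/G/T.

    Assumptions (ours, not the tokenizer's): lowercase bases are uppercased
    when `uppercase=True`, and every other character is removed.
    """
    if uppercase:
        seq = seq.upper()
    return _NON_ACGT.sub("", seq)


print(clean_dna("acgtNNacgt"))
```

Note that dropping characters shifts every downstream 6-mer frame and changes sequence coordinates. For position-sensitive work such as variant scoring, rejecting or separately handling sequences with ambiguity codes may be the safer choice.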

3. DNA length should be a multiple of 6

Each DNA token encodes 6 nucleotides, so the tokenizer groups input into non-overlapping 6-mer blocks. If the sequence length is not a multiple of 6, the current tokenizer right-pads the trailing partial block with As (e.g. ...CTAG → token TAGAAA). We recommend truncating to a multiple of 6 before passing the sequence in:

def truncate_to_6mer(seq: str) -> str:
    return seq[: (len(seq) // 6) * 6]

prompt = f"<dna>{truncate_to_6mer(seq)}"
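To make the padding behaviour concrete, the following pure-Python sketch mimics the grouping described above. It is an illustration only, not the tokenizer's actual implementation, and the function name `six_mers` is ours:

```python
from typing import List

K = 6  # bases per DNA token


def six_mers(seq: str, pad_base: str = "A") -> List[str]:
    """Split `seq` into non-overlapping 6-mers, right-padding the last block."""
    remainder = len(seq) % K
    if remainder:
        seq = seq + pad_base * (K - remainder)
    return [seq[i:i + K] for i in range(0, len(seq), K)]


# 9 bp input: the trailing "TAG" is padded to "TAGAAA"
print(six_mers("ATGCGCTAG"))
```

The padded As are indistinguishable from real bases once tokenized, which is why truncating first is the safer default.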

TODO add Kashif's PR for left padding and the auto dna tags flag. TODO edit text and example to say left padding instead of right padding

Likelihood-based scoring

For variant-effect or perturbation tasks, score sequences with the model's per-token log-probabilities. A minimal single-sequence helper:

import torch
import torch.nn.functional as F

@torch.no_grad()
def score(seq: str) -> float:
    """Mean log-prob per DNA token of `seq` (single sequence, no padding)."""
    text = f"<dna>{seq}</dna>"
    ids = tok(text, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    logits = model(ids).logits[:, :-1, :]
    targets = ids[:, 1:]
    logp = F.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logp.mean().item()

For batched scoring with attention masking and full reproducible evaluation pipelines (sequence recovery, ClinVar / BRCA2 / TraitGym VEP, TATA / synonymous-codon perturbation, Genome-NIAH), use the official scripts in the Carbon evaluation directory: see perturbation_tasks.py for the canonical score_hf implementation and README.md for run instructions across all tasks.
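For a quick look at a single-nucleotide variant, a scorer such as the `score` helper above can be turned into a reference-versus-alternate delta. The sketch below is a simplified illustration, not the official VEP pipeline (which may use different windows and scoring modes); `apply_snv` and `variant_delta` are hypothetical names, and the scorer is passed in as a plain callable so the logic can be checked without loading the model.

```python
from typing import Callable


def apply_snv(ref_seq: str, pos: int, alt: str) -> str:
    """Return `ref_seq` with the base at 0-based `pos` replaced by `alt`."""
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single uppercase A/C/G/T base")
    if not 0 <= pos < len(ref_seq):
        raise IndexError("pos is outside the sequence")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def variant_delta(ref_seq: str, pos: int, alt: str,
                  score_fn: Callable[[str], float]) -> float:
    """score(alt sequence) - score(ref sequence).

    Negative values mean the model finds the variant sequence less likely.
    """
    return score_fn(apply_snv(ref_seq, pos, alt)) - score_fn(ref_seq)
```

With the model loaded, usage would look like `variant_delta(window, pos, "T", score)`. Because a substitution keeps the sequence length unchanged, comparing mean log-probabilities is equivalent to comparing sums over the same number of tokens.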

Long context: extending to 64 k tokens (≈ 384 kbp) with YaRN

The released config.json is configured for the native 32 k context. To extend to 64 k tokens (≈ 384 kbp) at inference time, override max_position_embeddings and add a YaRN rope_scaling block. We recommend a YaRN factor of 4, which we observed gives better retrieval quality at 64 k than a tighter factor of 2:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
config.max_position_embeddings = 65536    # 64 k tokens (65,536 × 6 bp ≈ 393 kbp)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    repo, config=config, torch_dtype=torch.bfloat16
).cuda().eval()

We do not recommend pushing beyond 64 k tokens: retrieval quality degrades sharply at 128 k context in our benchmarks.
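If you build this override in several places, a small helper can keep the values consistent. The sketch below only assembles the dictionary shown above; the function name `yarn_rope_scaling` is ours, and the guard that the target length must not exceed factor × native length is our assumption about how the YaRN factor relates to the reachable context, not a documented Carbon constraint.

```python
from typing import Dict, Union


def yarn_rope_scaling(target_len: int,
                      original_len: int = 32_768,
                      factor: float = 4.0) -> Dict[str, Union[str, float, int]]:
    """Build a YaRN `rope_scaling` dict for extending the context to `target_len`."""
    # Assumption: the scaled context should cover the requested length.
    if target_len > factor * original_len:
        raise ValueError(
            f"target_len={target_len} exceeds factor * original_len "
            f"({factor * original_len:.0f})"
        )
    return {
        "type": "yarn",
        "factor": float(factor),
        "original_max_position_embeddings": original_len,
    }


rope_scaling = yarn_rope_scaling(65_536)  # same values as the snippet above
print(rope_scaling)
```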

Speculative decoding with Carbon-500M

Carbon-3B and Carbon-500M share the same tokenizer and DNA template format, so Carbon-500M can be used as a draft model for speculative decoding with Carbon-3B (or Carbon-8B) as the target model, reducing wall-clock generation cost at no quality loss.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

draft = AutoModelForCausalLM.from_pretrained(
    "hf-carbon/carbon-500M",
    torch_dtype=torch.bfloat16,
).cuda().eval()
target = model  # Carbon-3B, loaded above

inputs = tok(f"<dna>{dna_prompt}", return_tensors="pt", add_special_tokens=False).to("cuda")
out = target.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    assistant_model=draft,
)
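To see why greedy speculative decoding does not change the output, the toy sketch below replaces both models with plain next-token functions. It is a conceptual illustration only: the names are ours, the "models" are arithmetic stand-ins rather than Carbon checkpoints, and a real implementation verifies the whole draft in one batched forward pass of the target instead of the per-token loop used here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: n sequential calls to the model."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; target keeps the matching prefix."""
    seq = list(prompt)
    end = len(prompt) + n
    while len(seq) < end:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, end - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each proposed token. Every token appended is
        #    the target's own greedy choice, so the output cannot differ.
        for proposed in proposal:
            expected = target(seq)
            seq.append(expected)
            if expected != proposed:
                break  # discard the rest of the draft and re-propose
    return seq[len(prompt):]


def toy_target(seq: List[int]) -> int:
    return (7 * sum(seq) + 3) % 11


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the length is a multiple of 3.
    return toy_target(seq) if len(seq) % 3 else 0


same = speculative_greedy(toy_target, toy_draft, [1, 2, 3], 20) == \
    greedy_decode(toy_target, [1, 2, 3], 20)
print(same)
```

The speed-up in practice comes from how often the draft agrees with the target, so the benefit depends on the prompt and on the draft model's quality.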

Conditional generation with metadata

The model is trained with a mixed-template objective; some examples are prefixed with species-type and/or gene-type metadata tokens. Generation can be conditioned on these by prepending the corresponding tokens:

prompt = "<vertebrate_mammalian><protein_coding_region><dna>ATGCGCTAG..."

The unconditional <dna>SEQUENCE</dna> format remains supported and is the default. See the Carbon technical report for the full list of supported metadata tags.
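A small prompt builder can keep the tag order and the <dna> wrapper consistent. This is a sketch with a hypothetical helper name (`build_prompt`); it assumes the tag order shown in the example above (species type, then gene type, then <dna>) and uses only the two tag names that appear there.

```python
from typing import Optional


def build_prompt(seq: str,
                 species_tag: Optional[str] = None,
                 gene_tag: Optional[str] = None,
                 close: bool = False) -> str:
    """Assemble `<species><gene><dna>SEQ[</dna>]`; both metadata tags are optional."""
    prefix = "".join(f"<{tag}>" for tag in (species_tag, gene_tag) if tag)
    suffix = "</dna>" if close else ""
    return f"{prefix}<dna>{seq}{suffix}"


# Conditional prompt, left open for generation
print(build_prompt("ATGCGC", "vertebrate_mammalian", "protein_coding_region"))
# Unconditional, closed block (e.g. for scoring)
print(build_prompt("ATGCGC", close=True))
```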

Evaluation

All evaluations are zero-shot and use the public Carbon evaluation pipeline. The suite covers eight tasks across four capability families:

| Family | Task | Metric |
|---|---|---|
| Generative | Sequence recovery (eukaryote, bacteria, others splits) | Per-base accuracy on the next 30 bp |
| Variant-effect prediction (VEP) | ClinVar coding | AUROC, AUPRC (right-end / next-token scoring) |
| | ClinVar non-coding | AUROC, AUPRC |
| | BRCA2 | AUROC, AUPRC, Spearman ρ (centered 8 kb window, full-LL delta) |
| | TraitGym Mendelian | AUROC, AUPRC, Spearman ρ |
| Sequence-level perturbation | TATA-box perturbation v2 | Pairwise discrimination accuracy |
| | Uniform synonymous-codon substitution v2 | Pairwise discrimination accuracy |
| Long-context retrieval | Genome-NIAH (4 task variants × 6 context lengths, up to 786 kbp) | gen_exact_match, ll_correct |

Below we highlight the three short-context probes for which we report headline numbers in this card. Full results, including all VEP benchmarks and Genome-NIAH heatmaps, are in the Carbon technical report.

Downstream tasks

| Category | Metric | Carbon 3B | GENERator-v2 3B | Evo2 7B (1M) |
|---|---|---|---|---|
| Generative | SR eukaryote % | 61.50 | 55.72 | 59.80 |
| Variant effect prediction | BRCA2 AUROC | 0.8464 | 0.8057 | 0.8352 |
| | TraitGym Mendelian AUPRC by-chrom | 0.3424 | 0.2068 | 0.3736 |
| | ClinVar coding AUROC (48 kb) | 0.9330 | 0.9198 | 0.9370 |
| | ClinVar non-coding AUROC (48 kb) | 0.9156 | 0.9061 | 0.9003 |
| Perturbation | TATA v2 % | 65.94 | 49.82 | 63.78 |
| | SYN v2 % | 82.78 | 74.08 | 84.90 |

Carbon-3B is competitive with Evo2-7B while being much faster to run.

TODO update TATA v2 and SYN v2 scores with the new results!

Long-context retrieval (Genome-NIAH)

Genome-NIAH is a long-context benchmark inspired by the NIAH and RULER benchmarks for English text. The model must retrieve a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.

Genome-NIAH scores:

| Context length | Carbon 3B 32k (native / YaRN 4×) | GENERator-v2 3B | Evo2-7B |
|---|---|---|---|
| 16 k tokens (98 kbp) | 0.73 / n/a | 0.74 | 0.97 |
| 32 k tokens (196 kbp) | 0.55 / 0.90 | n/a | 0.95 |
| 64 k tokens (393 kbp) | n/a / 0.79 | n/a | 0.80 |

Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20 at 64k due to the slow inference speed.
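Because the sample sizes differ so much, the scores above carry very different uncertainty. A rough way to gauge this is the binomial standard error, sketched below under the assumption (ours) that each score is the mean of independent 0/1 outcomes over n examples; the helper name `binomial_se` is illustrative.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion `p` estimated from `n` independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# 64 k tokens row of the table above
se_carbon = binomial_se(0.79, 500)  # roughly 0.018
se_evo2 = binomial_se(0.80, 20)     # roughly 0.089
print(f"Carbon-3B: 0.79 ± {se_carbon:.3f}   Evo2-7B: 0.80 ± {se_evo2:.3f}")
```

Under that assumption, the Evo2-7B estimate at 64 k tokens is about five times noisier than the Carbon-3B one, so the two scores are best read as statistically indistinguishable rather than as a precise tie.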

TODO try to run more 64k samples for Evo2 7B

  • 4× longer effective context than GENERator-v2-3B. GENERator-v2-3B caps at 16 k tokens (≈ 98 kbp). Carbon-3B has a native context of 32 k tokens (≈ 197 kbp) and extends to 64 k tokens (≈ 384 kbp) at inference time with YaRN. It matches GENERator-v2-3B on Genome-NIAH at 98 kbp.
  • Matches Evo2-7B (1 M context) on Genome-NIAH at 384 kbp (64 k tokens) under YaRN, despite being substantially smaller.
  • YaRN helps at the native-context boundary. At 32 k tokens (≈ 197 kbp), retrieval quality near the model's native maximum begins to degrade; applying YaRN at inference time smooths the boundary and recovers most of the lost accuracy.

Inference efficiency

TODO: add Ed's benchmarks

Training

Model

  • Architecture: decoder-only Transformer (Llama-style), 30 layers, hidden 3072, FFN 8448, 32 attention heads with GQA (4 KV groups), SwiGLU, RMSNorm.
  • Tokenizer: Carbon 6-mer hybrid (vocab ≈ 156 k, including DNA tags, metadata tokens, and BPE tokens for future English & DNA continual pretraining).
  • Precision: bfloat16
  • Positional embedding: RoPE, base θ = 5 × 10^6, max position 32,768.

Pre-training

Carbon-3B is pre-trained for 1T 6-mer tokens (≈ 6T DNA base pairs) at sequence length 8,192, with a global batch size of 256 sequences (≈ 2 M tokens / step). The optimizer is AdamW throughout.

The data mixture during the stable phases of pre-training (Phase 1 and the stable portion of Phase 2) is the one documented on the hf-carbon/carbon-pretraining-corpus dataset card: ≈ 70 % Generator-style eukaryotic genomic DNA, with mRNA, splice-enriched mRNA, and GTDB bacterial genomes alongside metadata-conditioned templates.

The training uses a staged objective and learning-rate schedule:

  • Phase 1: Cross-Entropy (0 → 100B tokens). WSD learning-rate schedule with peak LR = 3e-4 and a 2,000-step linear warmup, then stable at peak through the end of Phase 1.
  • Phase 2: Factorised Nucleotide Supervision (100B → 1T tokens). Switch to the hybrid FNS loss, lower the peak LR to 2e-5, and continue with a WSD schedule whose decay phase covers the last 20% of Phase-2 steps. During the decay phase, we rebalance the data mixture to upsample mRNA and prokaryotic data (we found that mRNA in particular meaningfully helps downstream tasks), using the following ratios: 50% Generator-style eukaryotic genes · 25% mature mRNA · 10% splice-enriched mRNA · 15% GTDB bacterial genomes.
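The phase boundaries above are given in tokens; converting them to optimizer steps is simple arithmetic on the stated sequence length and batch size. The sketch below does that conversion. The rounding and the variable names are ours, so the step counts are approximate and not taken from the training logs.

```python
SEQ_LEN = 8_192      # pre-training sequence length (tokens)
GLOBAL_BATCH = 256   # sequences per optimizer step

tokens_per_step = SEQ_LEN * GLOBAL_BATCH  # 2,097,152, i.e. about 2 M tokens / step


def steps_for(tokens: float) -> int:
    """Approximate number of optimizer steps needed to consume `tokens`."""
    return round(tokens / tokens_per_step)


phase1_steps = steps_for(100e9)                # Cross-Entropy phase
phase2_steps = steps_for(1e12 - 100e9)         # FNS phase
phase2_decay_steps = round(0.20 * phase2_steps)  # last 20% of Phase 2

print(tokens_per_step, phase1_steps, phase2_steps, phase2_decay_steps)
```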

See the Carbon technical report for the full pre-training recipe.

Long-context training

After pre-training, the model undergoes continued training for 50B additional tokens at sequence length 32,768, with the rotary base shifted from 5 × 10^5 to 5 × 10^6. The long-context training mixture is:

| Component | Fraction |
|---|---|
| Gener-style annotated genes (metadata-conditioned) | 35.0 % |
| Concatenated annotated genes (long-context-data) | 13.8 % |
| mRNA transcripts | 25.0 % |
| Splice-enriched mRNA | 10.0 % |
| GTDB bacterial genomes | 15.0 % |
| Promoter sequences | 1.2 % |

The optimizer is AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay = 0.1, gradient clipping = 1.0), with a WSD learning-rate schedule: a 2,000-step linear warmup from 0 to 3e-5, a stable phase, then a 4,000-step linear decay to 3e-6. Global batch size: 64 sequences × 32,768 tokens.
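The warmup-stable-decay (WSD) schedule described above can be written as a small piecewise function. This is a sketch, not the training code: the function name `wsd_lr` is ours, and the total step count is our estimate from 50B tokens at 64 × 32,768 tokens per step.

```python
def wsd_lr(step: int,
           total_steps: int,
           peak_lr: float = 3e-5,
           min_lr: float = 3e-6,
           warmup_steps: int = 2_000,
           decay_steps: int = 4_000) -> float:
    """Learning rate at `step` for a linear-warmup / stable / linear-decay schedule."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase at the peak learning rate
        return peak_lr
    # Linear decay from peak_lr to min_lr over the final decay_steps
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac


# Assumption: about 50B tokens at 64 x 32,768 tokens per step
total_steps = 50_000_000_000 // (64 * 32_768)
for s in (0, 1_000, 2_000, 10_000, total_steps - 2_000, total_steps):
    print(s, wsd_lr(s, total_steps))
```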

Software & hardware

  • GPUs: 128 H100 (16 nodes × 8 H100).
  • Wall clock (long-context phase): ≈ 35 hours.
  • Training framework: Megatron-LM with the Carbon patch Megatron-LM-Carbon.
  • Conversion: Megatron-Bridge (Megatron → Hugging Face).

Limitations

  • Primarily eukaryotic training. Carbon-3B is trained mostly on eukaryotic genomic and transcript sequence; the pre-training mixture deliberately includes a smaller prokaryotic component (GTDB bacterial genomes) so that continual pre-training on prokaryotic data remains straightforward. Despite this modest prokaryotic share, we observed consistent gains on prokaryotic sequence recovery throughout training, and Carbon-3B already matches GENERator-v2-prokaryote-3B, a model trained specifically on prokaryotes, on the prokaryote sequence-recovery split. We expect that a short continued-training phase on prokaryotic data would deliver substantially stronger prokaryote-specific performance.
  • YaRN beyond 64 k tokens degrades. YaRN reliably extends Carbon-3B to 64 k tokens (≈ 384 kbp) at inference time with factor=4. Pushing further to 128 k tokens (≈ 786 kbp) causes retrieval quality to drop sharply in our long-context benchmarks.

License

Apache 2.0.