---
library_name: transformers
license: apache-2.0
language:
  - dna
tags:
  - dna
  - genomic
  - transformers
---

![](figures/carbon-8b-banner.png)

# Carbon-8B

A larger, higher-capacity member of the **Carbon** family of generative DNA foundation models.

Carbon-8B is the 8B-parameter sibling of [Carbon-3B](https://huggingface.co/HuggingFaceBio/Carbon-3B). It is intended for users who can afford additional inference cost in exchange for stronger downstream performance. For the full design rationale, tokenizer specification, evaluation protocol, and usage details, please refer to the **[Carbon-3B model card](https://huggingface.co/HuggingFaceBio/Carbon-3B)** and the Carbon technical report — this card focuses only on what is specific to Carbon-8B.

- Technical report: https://github.com/huggingface/carbon/blob/main/tech-report.pdf
- Demo: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

## Model Summary

- **8B-parameter decoder-only autoregressive model** trained on DNA and RNA sequences with a primary focus on eukaryotes.
- **Same hybrid tokenizer** as Carbon-3B (non-overlapping 6-mer for DNA + Qwen3 BPE for English text). Each DNA token encodes 6 bp. Wrap DNA inputs with `<dna>...</dna>` — see the Carbon-3B card for tokenizer details and usage caveats.
- **Native context: 32,768 tokens (≈ 196 kbp).** Carbon-8B was extended with a long-context decay stage from an 8 k-context base, so it natively handles 32 k tokens. You can apply YaRN at 4× to extrapolate up to 128 k tokens (≈ 786 kbp).
- Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
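
The context figures above follow from the 6 bp-per-token ratio. A minimal sketch of that arithmetic is shown below; the constant and helper names are illustrative and are not part of any Carbon API.

```python
BP_PER_TOKEN = 6          # each DNA token is a non-overlapping 6-mer
NATIVE_CONTEXT = 32_768   # native context length, in tokens
YARN_FACTOR = 4           # YaRN extrapolation factor quoted above


def tokens_to_bp(n_tokens: int) -> int:
    """Base pairs covered by n_tokens DNA tokens."""
    return n_tokens * BP_PER_TOKEN


def bp_to_tokens(n_bp: int) -> int:
    """6-mer tokens needed to cover n_bp base pairs, rounded up."""
    return -(-n_bp // BP_PER_TOKEN)  # ceiling division


native_bp = tokens_to_bp(NATIVE_CONTEXT)
yarn_bp = tokens_to_bp(NATIVE_CONTEXT * YARN_FACTOR)
print(f"native: {native_bp:,} bp, with YaRN 4x: {yarn_bp:,} bp")
```

`bp_to_tokens` is handy for sizing `max_new_tokens` when a target length is given in base pairs.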

## How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "HuggingFaceBio/Carbon-8B"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, dtype=torch.bfloat16,
).cuda().eval()

prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"   # multiple of 6 bp
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
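
The prompt above is a whole number of 6-mers. For sequences of arbitrary length, one option is to drop the trailing bases that do not fill a complete 6-mer before prepending `<dna>`. The sketch below does this; `make_dna_prompt` is a hypothetical helper, and truncation (rather than padding) is an assumption, so check the Carbon-3B card for the tokenizer's actual handling of partial 6-mers.

```python
K = 6  # bases per DNA token, per the Model Summary


def make_dna_prompt(seq: str, k: int = K) -> str:
    """Trim seq to a whole number of k-mers and prepend the <dna> tag.

    Hypothetical helper: truncating the tail is an assumption, not
    documented Carbon behaviour.
    """
    seq = seq.strip()
    usable = len(seq) - len(seq) % k  # largest multiple of k that fits
    if usable == 0:
        raise ValueError(f"need at least {k} bases, got {len(seq)}")
    return "<dna>" + seq[:usable]


# 44 bases in, 42 bases (7 tokens) out
prompt = make_dna_prompt("ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACGAT")
print(prompt)
```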

### Base-pair-level generation and scoring

The `fns` branch loads custom modeling code for Factorized Nucleotide Supervision (FNS). Carbon still uses its efficient 6-mer tokenizer, but during generation each selected 6-mer is assembled from six per-position nucleotide distributions, giving base-pair-level control over decoded DNA. Use this branch when you need exact base-pair counts, per-position masks, or temperature/top-p behavior applied at the nucleotide level rather than over the 4,096-way 6-mer distribution:

```py
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceBio/Carbon-8B"
revision = "fns"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()

context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
n_bp = 60

inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=math.ceil(n_bp / tokenizer.k),
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]

print(generated_dna)
```
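
To make the factorization concrete, the toy sketch below assembles one 6-mer from six per-position nucleotide distributions and applies a per-position mask. It is not the code shipped in the `fns` branch: the helper names and the probabilities are made up for illustration, and only greedy decoding is shown.

```python
import math

BASES = "ACGT"


def assemble_kmer(position_probs):
    """Greedy decode: take the most likely base at each position.

    Returns the k-mer and its probability under the factorized
    (per-position product) distribution.
    """
    picks = [max(range(4), key=lambda i: p[i]) for p in position_probs]
    kmer = "".join(BASES[i] for i in picks)
    prob = math.prod(p[i] for p, i in zip(position_probs, picks))
    return kmer, prob


def apply_mask(probs, allowed):
    """Zero out disallowed bases at one position and renormalize."""
    masked = [p if b in allowed else 0.0 for b, p in zip(BASES, probs)]
    total = sum(masked)
    return [m / total for m in masked]


# Made-up distributions over A, C, G, T for the six positions of one token.
position_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]

kmer, prob = assemble_kmer(position_probs)

# Restrict the last position to C or G, then decode again.
constrained = position_probs[:5] + [apply_mask(position_probs[5], "CG")]
masked_kmer, _ = assemble_kmer(constrained)
print(kmer, masked_kmer)
```

Because each position has its own 4-way distribution, a constraint on a single base only touches that position instead of reweighting all 4,096 6-mers.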

The same per-base marginals are exposed through `score_sequence()`, which returns the probability assigned to the observed base at each position. Taking the mean log probability gives a base-pair-level sequence score, where higher values indicate higher model likelihood:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceBio/Carbon-8B"
revision = "fns"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to(device).eval()

reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"

with torch.no_grad():
    bp_probs, actual_probs = model.score_sequence([reference, perturbed])

scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]

print(f"reference mean bp logp: {scores[0]:.4f}")
print(f"perturbed mean bp logp: {scores[1]:.4f}")
print(f"reference preferred: {scores[0] > scores[1]}")
```
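
The mean log probability can also be read as a per-base perplexity, and the difference between two scores as a log-likelihood ratio. The sketch below works on plain lists of per-base probabilities; the numbers are made up, and using the score difference as a perturbation score is an illustrative convention rather than something this card prescribes.

```python
import math


def mean_bp_logp(probs, eps=1e-12):
    """Mean log probability of the observed bases (clamped at eps)."""
    return sum(math.log(max(p, eps)) for p in probs) / len(probs)


# Made-up per-base probabilities; the perturbed sequence differs at index 1.
ref_probs = [0.9, 0.8, 0.9, 0.7]
alt_probs = [0.9, 0.2, 0.9, 0.7]

ref_score = mean_bp_logp(ref_probs)
alt_score = mean_bp_logp(alt_probs)

delta = alt_score - ref_score          # negative: perturbed is less likely
ref_perplexity = math.exp(-ref_score)  # 1.0 = certain, 4.0 = uniform over ACGT

print(f"delta mean bp logp: {delta:.4f}")
print(f"reference per-base perplexity: {ref_perplexity:.3f}")
```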


## Training

Carbon-8B is pre-trained on the **[`HuggingFaceBio/carbon-pretraining-corpus`](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus)** with the same data mixture as Carbon-3B, for 1T DNA 6-mer tokens. The recipe largely follows Carbon-3B; its main ingredients are:

- **Learning-rate schedule: cosine** (instead of the WSD schedule used for Carbon-3B).
- **Loss schedule:** after 100B tokens the loss switches from cross-entropy to FNS loss until the end of training.
- **Pre-training:** 1T 6-mer tokens (≈ 6T DNA base pairs) with GBS=512 and seq=8192 (≈ 4.19 M tokens per step), on 32 nodes (TP=4, DP=64) in bfloat16 with AdamW. The training mixture stays the same through the decay phase: 70% Generator eukaryote data with metadata (with dropout), 16% mRNA, 4% splice mRNA, and 10% prokaryote data.
- **Long-context extension stage.** After pre-training, Carbon-8B undergoes a long-context decay phase that extends the native context from 8,192 to 32,768 tokens (≈ 196 kbp). You can apply YaRN at 4× to further extrapolate to 128 k tokens (≈ 786 kbp).

Training infrastructure, framework ([Megatron-LM-Carbon](https://github.com/huggingface/Megatron-LM-Carbon)), and conversion path ([Megatron-Bridge](https://github.com/NVIDIA/Megatron-Bridge)) are identical to Carbon-3B.
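
The batch and schedule figures above can be cross-checked with a few lines of arithmetic. This sketch takes "1T" and "100B" as exactly 10^12 and 10^11 tokens, and assumes that TP × DP accounts for every GPU (no other parallelism dimension); both are assumptions, so the step counts are approximate.

```python
import math

GLOBAL_BATCH_SIZE = 512        # sequences per optimizer step (GBS)
SEQ_LEN = 8_192                # tokens per sequence
TOTAL_TOKENS = 10**12          # "1T", assumed exact
FNS_SWITCH_TOKENS = 10**11     # "100B", assumed exact
BP_PER_TOKEN = 6
NODES, TP, DP = 32, 4, 64

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)
switch_step = math.ceil(FNS_SWITCH_TOKENS / tokens_per_step)
total_bp = TOTAL_TOKENS * BP_PER_TOKEN

# Assumes tensor and data parallelism are the only parallelism dimensions.
gpus = TP * DP
gpus_per_node = gpus // NODES

print(f"{tokens_per_step:,} tokens/step, about {total_steps:,} steps")
print(f"cross-entropy -> FNS loss around step {switch_step:,}")
print(f"{total_bp:.1e} bp seen, {gpus} GPUs ({gpus_per_node} per node)")
```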

## Evaluation

All evaluations are zero-shot and use the [public Carbon evaluation pipeline](https://github.com/huggingface/carbon/tree/main/evaluation). See the [Carbon-3B card](https://huggingface.co/HuggingFaceBio/Carbon-3B#evaluation) for the full task suite, metrics, and methodology.

### Downstream tasks

| Category | Metric (%) | Carbon 3B | Carbon 8B | Δ |
|---|---|---|---|---|
| Generative | Sequence Recovery eukaryote | 61.54 | **64.05** | +2.51 |
| Variant effect prediction | BRCA2 | 84.63 | **85.72** | +1.09 |
| | TraitGym Mendelian | 33.65 | **36.43** | +2.78 |
| | ClinVar coding (24 kb) | 92.89 | **93.11** | +0.22 |
| | ClinVar non-coding (24 kb) | 91.14 | **91.63** | +0.49 |
| Perturbation | Nucleotide triplet-expansion | 85.20 | **89.05** | +3.85 |
| | Synonymous codon replacement | 88.89 | **91.46** | +2.57 |
| Long-context retrieval | Genomic-NIAH @ 393 kbp | 79.00 | **86.00** | +7.00 |

### Genomic-NIAH (long-context retrieval)

Genomic-NIAH measures how well a DNA model actually *uses* its long context. See the [`HuggingFaceBio/genomic-niah` dataset card](https://huggingface.co/datasets/HuggingFaceBio/genomic-niah) for the benchmark design.

| Context length         | Carbon 3B (native / YaRN 4×) | Carbon 8B (native / YaRN 4×) | Evo2 7B |
|------------------------|------------------------------|------------------------------|---------|
| 16 k tokens (98 kbp)   | 0.73 / 0.91                  | 0.78 / 0.89                  | **0.97**    |
| 32 k tokens (196 kbp)  | 0.55 / 0.90                  | 0.69 / 0.87                  | **0.95**    |
| 64 k tokens (393 kbp)  | — / 0.79                     | — / **0.86**                     | 0.80    |
| 128 k tokens (786 kbp) | — / 0.27                     | — / **0.65**                     | 0.53 |

Carbon-8B retrieves natively up to its 32 k boundary (0.78 at 16 k tokens, 0.69 at 32 k); **YaRN 4×** raises retrieval within that range (0.89 and 0.87) and extends usable retrieval to 64 k tokens (0.86) and 128 k tokens (≈ 786 kbp, 0.65).

## Intended use

Generative modelling, variant-effect prediction, motif-perturbation analysis, and long-context retrieval on DNA sequences. For faster inference at shorter contexts, use **Carbon-3B**.

⚠️ **Genetic data is highly sensitive.** Depending on how this model is used (local download, inference API/endpoints, third-party inference providers, Spaces demos or others), input and output data may be processed or handled differently by different providers or space owners. Please make sure you understand and agree with how your data is handled before using the model.

## License

Apache 2.0.

## Acknowledgements

Carbon is a joint collaboration between the research teams at Hugging Face, Zhongguancun Academy, and TIGEM/University of Naples “Federico II”.