HuggingFaceBio/Carbon-3B

@@ -267,14 +267,14 @@ The data mixture during the stable phases of pre-training (Phase 1 and the stabl
 The training uses a **staged objective and learning-rate schedule**:
-- **Phase 1 — Cross-Entropy (0 → 100 B tokens).** WSD learning-rate schedule with peak LR $3 \times 10^{-4}$ and a 2 000-step linear warmup, then stable at peak through the end of Phase 1.
-- **Phase 2 — Factorised Nucleotide Supervision (100 B → 1 T tokens).** Switch to the hybrid FNS loss, lower the peak LR to $2 \times 10^{-5}$, and continue with a WSD schedule whose decay phase covers the last 20 % of Phase-2 steps. During the decay phase, we rebalance the data mixture to upsample mRNA and prokaryotic data — we found that mRNA in particular meaningfully helps downstream tasks — using the following ratios: 50 % Generator-style eukaryotic genes · 25 % mature mRNA · 10 % splice-enriched mRNA · 15 % GTDB bacterial genomes.
 See the Carbon technical report for the full pre-training recipe.
 ### Long-context training
-After pre-training, the model is continued-trained for **50 B additional tokens at sequence length 32 768**, with the rotary base shifted from $5 \times 10^{5}$ to $5 \times 10^{6}$. The long-context training mixture is:
 | Component | Fraction |
 |---|---|
@@ -285,7 +285,7 @@ After pre-training, the model is continued-trained for **50 B additional tokens
 | GTDB bacterial genomes | 15.0 % |
 | Promoter sequences | 1.2 % |
-The optimizer is AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$, weight decay 0.1, gradient clipping 1.0), with a WSD learning-rate schedule: 2 000 steps linear warmup from 0 to $3 \times 10^{-5}$, stable phase, then 4 000-step linear decay to $3 \times 10^{-6}$. Global batch size: 64 sequences × 32 768 tokens.
 ### Software & hardware

 The training uses a **staged objective and learning-rate schedule**:
+- **Phase 1 — Cross-Entropy (0 → 100B tokens)**. WSD learning-rate schedule with peak LR = 3e-4 and a 2,000-step linear warmup, then stable at peak through the end of Phase 1.
+- **Phase 2 — Factorised Nucleotide Supervision (100B → 1T tokens)**. Switch to the hybrid FNS loss, lower the peak LR to 2e-5, and continue with a WSD schedule whose decay phase covers the last 20% of Phase-2 steps. During the decay phase, we rebalance the data mixture to upsample mRNA and prokaryotic data — we found that mRNA in particular meaningfully helps downstream tasks — using the following ratios: 50% Generator-style eukaryotic genes · 25% mature mRNA · 10% splice-enriched mRNA · 15% GTDB bacterial genomes.
 See the Carbon technical report for the full pre-training recipe.
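The decay-phase rebalancing described above can be sketched as weighted sampling over data sources. This is a minimal illustration only: the source keys, the `sample_sources` helper, and the assumption that every Phase-2 step consumes the same number of tokens (so that "last 20 % of steps" maps directly to tokens) are hypothetical and are not taken from the Carbon training code.

```python
import random

# Decay-phase mixture ratios as listed in the Phase 2 description.
# The dictionary keys are hypothetical labels chosen for this sketch.
DECAY_MIXTURE = {
    "generator_style_eukaryotic_genes": 0.50,
    "mature_mrna": 0.25,
    "splice_enriched_mrna": 0.10,
    "gtdb_bacterial_genomes": 0.15,
}

PHASE2_START_TOKENS = 100_000_000_000    # 100B
PHASE2_END_TOKENS = 1_000_000_000_000    # 1T


def decay_start_token(decay_percent=20):
    """Token count at which the decay phase would begin.

    Assumes a constant number of tokens per step throughout Phase 2.
    """
    phase2_tokens = PHASE2_END_TOKENS - PHASE2_START_TOKENS
    return PHASE2_END_TOKENS - phase2_tokens * decay_percent // 100


def sample_sources(n, seed=0):
    """Draw n source labels according to the decay-phase ratios."""
    rng = random.Random(seed)
    names = list(DECAY_MIXTURE)
    weights = [DECAY_MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    print("decay phase starts at token", decay_start_token())
    draws = sample_sources(10_000)
    for name in DECAY_MIXTURE:
        print(name, draws.count(name) / len(draws))
```

Under the constant-batch assumption, the decay phase covers the final 180B of the 900B Phase-2 tokens.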
 ### Long-context training
+After pre-training, the model undergoes continued training for 50B additional tokens at sequence length 32,768, with the rotary base shifted from 5 × 10^5 to 5 × 10^6. The long-context training mixture is:
 | Component | Fraction |
 |---|---|
 | GTDB bacterial genomes | 15.0 % |
 | Promoter sequences | 1.2 % |
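To illustrate what the rotary base shift mentioned at the start of this section does, the sketch below computes the standard RoPE inverse frequencies, `base ** (-2i / d)`, for both bases. The head dimension of 128 is an assumption made for illustration; the model's actual head dimension is not stated in this section.

```python
import math

OLD_BASE = 5e5   # rotary base during pre-training
NEW_BASE = 5e6   # rotary base during long-context training
HEAD_DIM = 128   # hypothetical head dimension, not taken from the model card


def rope_inv_freq(base, head_dim):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base, head_dim):
    """Wavelength, in tokens, of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


if __name__ == "__main__":
    for base in (OLD_BASE, NEW_BASE):
        wavelength = longest_wavelength(base, HEAD_DIM)
        print(f"base={base:.0e}: longest wavelength ~ {wavelength:,.0f} tokens")
```

A larger base lowers every frequency except the first, so positions that are far apart rotate less relative to each other, which is the usual motivation for raising the base when extending the context length.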
+The optimizer is AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay = 0.1, gradient clipping = 1.0), with a WSD learning-rate schedule: 2,000 steps linear warmup from 0 to 3e-5, stable phase, then 4,000-step linear decay to 3e-6. Global batch size: 64 sequences × 32,768 tokens.
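The long-context schedule above can be sketched as a plain function of the step index. The warmup length, peak and final learning rates, decay length, and batch shape come from the text; the total step count is derived here by assuming the 50B-token budget is exact and that the decay ends on the final step, which the model card does not state. The function name `wsd_lr` is hypothetical.

```python
import math

SEQUENCES_PER_BATCH = 64
SEQUENCE_LENGTH = 32_768
TOKENS_PER_STEP = SEQUENCES_PER_BATCH * SEQUENCE_LENGTH  # 2,097,152 tokens

# Assumption: exactly 50B tokens, rounded up to whole steps.
TOTAL_STEPS = math.ceil(50_000_000_000 / TOKENS_PER_STEP)


def wsd_lr(step, total_steps=TOTAL_STEPS, peak_lr=3e-5, final_lr=3e-6,
           warmup_steps=2_000, decay_steps=4_000):
    """Warmup-stable-decay learning rate at a given step."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to final_lr over the last decay_steps steps.
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (final_lr - peak_lr)


if __name__ == "__main__":
    print("tokens per step:", TOKENS_PER_STEP)
    print("total steps:", TOTAL_STEPS)
    for s in (0, 1_000, 2_000, TOTAL_STEPS - 4_000, TOTAL_STEPS - 2_000, TOTAL_STEPS):
        print(s, wsd_lr(s))
```

Under these assumptions the run is roughly 23.8k optimizer steps, of which the stable phase occupies everything between step 2,000 and the final 4,000 steps.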
 ### Software & hardware