Commit 5f76c0a by danaaubakirova (HF Staff) · 1 Parent(s): b2b5137

Update README.md (#2)

- Update README.md (7074e5523216e156065f87ddeb964f060aada036)

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -267,14 +267,14 @@ The data mixture during the stable phases of pre-training (Phase 1 and the stabl
 
  The training uses a **staged objective and learning-rate schedule**:
 
- - **Phase 1 — Cross-Entropy (0 → 100 B tokens).** WSD learning-rate schedule with peak LR $3 \times 10^{-4}$ and a 2 000-step linear warmup, then stable at peak through the end of Phase 1.
- - **Phase 2 — Factorised Nucleotide Supervision (100 B → 1 T tokens).** Switch to the hybrid FNS loss, lower the peak LR to $2 \times 10^{-5}$, and continue with a WSD schedule whose decay phase covers the last 20 % of Phase-2 steps. During the decay phase, we rebalance the data mixture to upsample mRNA and prokaryotic data — we found that mRNA in particular meaningfully helps downstream tasks — using the following ratios: 50 % Generator-style eukaryotic genes · 25 % mature mRNA · 10 % splice-enriched mRNA · 15 % GTDB bacterial genomes.
 
  See the Carbon technical report for the full pre-training recipe.
 
  ### Long-context training
 
- After pre-training, the model is continued-trained for **50 B additional tokens at sequence length 32 768**, with the rotary base shifted from $5 \times 10^{5}$ to $5 \times 10^{6}$. The long-context training mixture is:
 
  | Component | Fraction |
  |---|---|
@@ -285,7 +285,7 @@ After pre-training, the model is continued-trained for **50 B additional tokens
  | GTDB bacterial genomes | 15.0 % |
  | Promoter sequences | 1.2 % |
 
- The optimizer is AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$, weight decay 0.1, gradient clipping 1.0), with a WSD learning-rate schedule: 2 000 steps linear warmup from 0 to $3 \times 10^{-5}$, stable phase, then 4 000-step linear decay to $3 \times 10^{-6}$. Global batch size: 64 sequences × 32 768 tokens.
 
  ### Software & hardware
 
 
  The training uses a **staged objective and learning-rate schedule**:
 
+ - **Phase 1 — Cross-Entropy (0 → 100B tokens)**. WSD learning-rate schedule with peak LR = 3e-4 and a 2,000-step linear warmup, then stable at peak through the end of Phase 1.
+ - **Phase 2 — Factorised Nucleotide Supervision (100B → 1T tokens)**. Switch to the hybrid FNS loss, lower the peak LR to 2e-5, and continue with a WSD schedule whose decay phase covers the last 20% of Phase-2 steps. During the decay phase, we rebalance the data mixture to upsample mRNA and prokaryotic data — we found that mRNA in particular meaningfully helps downstream tasks — using the following ratios: 50% Generator-style eukaryotic genes · 25% mature mRNA · 10% splice-enriched mRNA · 15% GTDB bacterial genomes.
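The decay-phase ratios in the Phase 2 bullet can be read as sampling weights over data sources. The following is a minimal sketch, assuming each training sequence is drawn independently from one source. The dictionary keys and the `sample_source` helper are hypothetical names chosen for illustration, not identifiers from the Carbon codebase.

```python
import random

# Decay-phase mixture from the Phase 2 bullet, written as fractions of 1.
# The key names are illustrative assumptions.
DECAY_MIXTURE = {
    "generator_style_eukaryotic_genes": 0.50,
    "mature_mrna": 0.25,
    "splice_enriched_mrna": 0.10,
    "gtdb_bacterial_genomes": 0.15,
}


def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training sequence."""
    names = list(DECAY_MIXTURE)
    weights = list(DECAY_MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in DECAY_MIXTURE}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    # Empirical counts should land close to the target fractions.
    print(counts)
```

A real pipeline would more likely mix at the shard or token level; per-sequence sampling is used here only because it makes the ratios easy to check.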
 
  See the Carbon technical report for the full pre-training recipe.
 
  ### Long-context training
 
+ After pre-training, the model undergoes continued training for 50B additional tokens at sequence length 32,768, with the rotary base shifted from 5 × 10^5 to 5 × 10^6. The long-context training mixture is:
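To illustrate what shifting the rotary base does, the sketch below computes the per-pair wavelengths under the standard RoPE definition, θ_i = base^(−2i/d). The head dimension of 64 is an assumption for illustration only (the README does not state it), and Carbon's actual rotary implementation may differ in detail.

```python
import math


def rope_wavelengths(base: float, head_dim: int) -> list[float]:
    """Wavelength, in tokens, of each rotary frequency pair.

    Standard RoPE uses theta_i = base ** (-2 * i / head_dim), so pair i
    completes one full rotation every 2 * pi / theta_i tokens.
    """
    return [
        2 * math.pi * base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


HEAD_DIM = 64  # assumption for illustration; not stated in the README

before = rope_wavelengths(5e5, HEAD_DIM)  # pre-training base
after = rope_wavelengths(5e6, HEAD_DIM)   # long-context base

if __name__ == "__main__":
    # The fastest pair is unchanged; slower pairs get longer wavelengths.
    print(f"shortest wavelength: {before[0]:.3f} -> {after[0]:.3f} tokens")
    print(f"longest wavelength:  {before[-1]:.3e} -> {after[-1]:.3e} tokens")
```

Under this definition, a tenfold larger base stretches the slowest frequency pair by a factor of 10^((d−2)/d), which is the usual motivation for raising the base when extending sequence length.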
 
  | Component | Fraction |
  |---|---|
@@ -285,7 +285,7 @@
  | GTDB bacterial genomes | 15.0 % |
  | Promoter sequences | 1.2 % |
 
+ The optimizer is AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay = 0.1, gradient clipping = 1.0), with a WSD learning-rate schedule: 2,000 steps linear warmup from 0 to 3e-5, stable phase, then 4,000-step linear decay to 3e-6. Global batch size: 64 sequences × 32,768 tokens.
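The long-context schedule described in the optimizer paragraph can be sketched as a plain function of the optimizer step. This assumes the 4,000-step decay occupies the final steps of the run and that the run covers exactly 50B tokens (the README figure is likely rounded); `wsd_lr` is a hypothetical helper, not the project's scheduler.

```python
def wsd_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-5,
    final_lr: float = 3e-6,
    warmup_steps: int = 2_000,
    decay_steps: int = 4_000,
) -> float:
    """Warmup-stable-decay learning rate at a given optimizer step."""
    decay_start = total_steps - decay_steps  # assumption: decay ends the run
    if step < warmup_steps:
        # Linear warmup from 0 to the peak.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase at the peak.
        return peak_lr
    # Linear decay from the peak to the final value.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


# Global batch: 64 sequences of 32,768 tokens each.
TOKENS_PER_STEP = 64 * 32_768
# Steps needed for 50B tokens, rounded down.
TOTAL_STEPS = 50_000_000_000 // TOKENS_PER_STEP

if __name__ == "__main__":
    print(f"tokens per step: {TOKENS_PER_STEP:,}")
    print(f"total steps:     {TOTAL_STEPS:,}")
    for s in (0, 1_000, 2_000, 10_000, TOTAL_STEPS - 2_000, TOTAL_STEPS):
        print(f"step {s:>6}: lr = {wsd_lr(s, TOTAL_STEPS):.2e}")
```

With these numbers each step consumes about 2.1M tokens, so the run is roughly 23.8k steps, of which warmup and decay together account for 6,000.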
 
  ### Software & hardware