kashif (HF Staff) committed on
Commit ddc127f · verified · 1 Parent(s): 098b919

model card: fix typos, remove TODOs, 48->49 kbp, add skip_special_tokens=True
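The 48 -> 49 kbp correction follows from the card's own figures (8,192 tokens, each DNA token ≈ 6 bp). A minimal arithmetic check, assuming every token in the window is a full 6-mer and that "kbp" means 1,000 bp; the variable names are illustrative only:

```python
# Figures taken from the model card: context length and bases per DNA token.
SEQ_LEN_TOKENS = 8_192
BP_PER_TOKEN = 6  # assumption: every token in the window is a full 6-mer

approx_bp = SEQ_LEN_TOKENS * BP_PER_TOKEN
print(f"{approx_bp:,} bp ~ {approx_bp / 1000:.0f} kbp")  # 49,152 bp ~ 49 kbp

# The same conversion applied to the pre-training budget (600B tokens).
pretrain_bp = 600 * 10**9 * BP_PER_TOKEN
print(f"{pretrain_bp / 1e12:.1f} T bp")  # 3.6 T bp
```

The old "48 kbp" figure matches 8,000 x 6 = 48,000 bp, that is, rounding the context to "8 k" tokens before multiplying.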

  1. README.md +6 -10
README.md CHANGED
@@ -18,18 +18,16 @@ A small generative DNA model from the **Carbon** family.
  For the full design rationale, tokenizer specification, evaluation protocol, and usage notes (DNA tag wrapping, 6-mer constraints, scoring helpers), please refer to the **[Carbon-3B model card](https://huggingface.co/HuggingFaceBio/Carbon-3B)** — this card focuses only on facts specific to Carbon-500M.

- > TODO: update teh tokenizer code
-
  ## Facts

  - **500M-parameter decoder-only autoregressive DNA model** (Llama-style architecture).
  - **Hybrid tokenizer** shared with the rest of the Carbon family (6-mer for DNA + Qwen3 BPE for English text; each DNA token ≈ 6 bp).
  - **Pre-training tokens:** 600B 6-mer tokens (≈ 3.6 T DNA base pairs).
- - **Sequence length:** 8 192tokens (≈ 48 kbp).
  - **Loss schedule:** cross-entropy 0 → 300 B tokens, then switch to the hybrid Factorised Nucleotide Supervision (FNS) loss from 300 B → 600 B tokens. The switch happens later than for Carbon-3B because Carbon-500M's training was very stable and tolerated the later transition.
  - **Data mixture:** identical to the **decay-phase mixture used by Carbon-3B** — 50 % Generator-style eukaryotic genes / 25 % mature mRNA / 10 % splice-enriched mRNA / 15 % GTDB bacterial genomes. Same weights across the whole 600 B run.
  - **Precision:** bfloat16. **Optimizer:** AdamW. **Positional embedding:** RoPE.
- - **No long-context training stage** — the model stays at its 8 k-token native context so 48kbp.
  - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).

  ## How to use
@@ -49,7 +47,7 @@ model = AutoModelForCausalLM.from_pretrained(
  prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
  inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
  out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
- print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
  ```

  ### Recommended use: speculative decoding with Carbon-3B / Carbon-8B
@@ -74,17 +72,15 @@ out = target.generate(
  **inputs, max_new_tokens=256, do_sample=False,
  assistant_model=draft,
  )
- print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
  ```

  Output is guaranteed identical to greedy decoding with the target model alone; only wall-clock latency is reduced.

  ## Evaluation

- Carbon-500M is benchmarked against ≈ 1B-parameter DNA models on the standard Carbon evaluation suite. See the [Carbon-3B card](https://huggingface.co/HuggingFaceBio/Carbon-3B#evaluation) for the task definitions and methodology.
-
- > TODO Loubna: add one downstream table comparing Carbon-500M to other 1B-class baselines. -->

  ## License

- Apache 2.0.
 
  For the full design rationale, tokenizer specification, evaluation protocol, and usage notes (DNA tag wrapping, 6-mer constraints, scoring helpers), please refer to the **[Carbon-3B model card](https://huggingface.co/HuggingFaceBio/Carbon-3B)** — this card focuses only on facts specific to Carbon-500M.

  ## Facts

  - **500M-parameter decoder-only autoregressive DNA model** (Llama-style architecture).
  - **Hybrid tokenizer** shared with the rest of the Carbon family (6-mer for DNA + Qwen3 BPE for English text; each DNA token ≈ 6 bp).
  - **Pre-training tokens:** 600B 6-mer tokens (≈ 3.6 T DNA base pairs).
+ - **Sequence length:** 8,192 tokens (≈ 49 kbp).
  - **Loss schedule:** cross-entropy 0 → 300 B tokens, then switch to the hybrid Factorised Nucleotide Supervision (FNS) loss from 300 B → 600 B tokens. The switch happens later than for Carbon-3B because Carbon-500M's training was very stable and tolerated the later transition.
  - **Data mixture:** identical to the **decay-phase mixture used by Carbon-3B** — 50 % Generator-style eukaryotic genes / 25 % mature mRNA / 10 % splice-enriched mRNA / 15 % GTDB bacterial genomes. Same weights across the whole 600 B run.
  - **Precision:** bfloat16. **Optimizer:** AdamW. **Positional embedding:** RoPE.
+ - **No long-context training stage** — the model stays at its 8,192-token native context (≈ 49 kbp).
  - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).

  ## How to use
 
  prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
  inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
  out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
  ```

  ### Recommended use: speculative decoding with Carbon-3B / Carbon-8B
 
  **inputs, max_new_tokens=256, do_sample=False,
  assistant_model=draft,
  )
+ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
  ```

  Output is guaranteed identical to greedy decoding with the target model alone; only wall-clock latency is reduced.

  ## Evaluation

+ Carbon-500M is benchmarked against ≈ 1B-parameter DNA models on the standard Carbon evaluation suite. See the [Carbon-3B card](https://huggingface.co/HuggingFaceBio/Carbon-3B#evaluation) for the task definitions and methodology.

  ## License

+ Apache 2.0.
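
A note on the card's claim that speculative decoding with `do_sample=False` returns the same output as greedy decoding with the target alone. The toy sketch below illustrates why that holds in exact arithmetic: every token that is kept is the target's own greedy choice, and the draft only decides how many positions are checked per round. The two "models" are hypothetical stand-in functions over integer tokens, not Carbon checkpoints, and this is not how `transformers` implements `assistant_model` internally (a real implementation scores all drafted positions in one target forward pass).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def toy_target(seq: List[int]) -> int:
    """Hypothetical stand-in for the large model's greedy next-token choice."""
    return (sum(seq) * 31 + len(seq)) % 4


def toy_draft(seq: List[int]) -> int:
    """Hypothetical small model: agrees with the target except at every third position."""
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 4


def greedy(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: append the model's choice n times."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; the target verifies them in order."""
    seq = list(prompt)
    end = len(prompt) + n
    while len(seq) < end:
        # The draft model runs ahead of the verified sequence.
        proposal: List[int] = []
        for _ in range(min(k, end - len(seq))):
            proposal.append(draft(seq + proposal))
        # Keep the target's token at each position; stop at the first disagreement.
        for drafted in proposal:
            expected = target(seq)
            seq.append(expected)
            if expected != drafted:
                break
    return seq[len(prompt):]


prompt = [0, 1, 2]
same = speculative_greedy(toy_target, toy_draft, prompt, 20) == greedy(toy_target, prompt, 20)
print(same)
```

With real models, small floating-point differences between batched and sequential forward passes can still change a near-tied argmax, so treat the equivalence as a property of the algorithm and not as a bitwise guarantee on every hardware setup.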