loubnabnl (HF Staff) committed
Commit c3a251e · verified · 1 Parent(s): fbabbaf

Update README.md

Files changed (1)
  1. README.md +86 -0
README.md CHANGED
@@ -1,3 +1,89 @@
  ---
+ library_name: transformers
  license: apache-2.0
+ language:
+ - dna
+ tags:
+ - dna
+ - genomic
+ - transformers
+ - draft-model
+ - speculative-decoding
  ---
+
+ # Carbon-500M
+
+ A small generative DNA model from the **Carbon** family.
+
+ **Carbon-500M is intended primarily as a draft model for speculative decoding.** It shares the tokenizer and DNA template format of [Carbon-3B](https://huggingface.co/hf-carbon/Carbon-3B) and [Carbon-8B](https://huggingface.co/hf-carbon/Carbon-8B), so it can be paired with either as the target model to reduce wall-clock generation cost at no quality loss. It is not designed to be competitive with the 3B/8B Carbon models on downstream benchmarks.
+
+ For the full design rationale, tokenizer specification, evaluation protocol, and usage notes (DNA tag wrapping, 6-mer constraints, scoring helpers), please refer to the **[Carbon-3B model card](https://huggingface.co/hf-carbon/Carbon-3B)**; this card covers only the facts specific to Carbon-500M.
+
+ ## Facts
+
+ - **500M-parameter decoder-only autoregressive DNA model** (Llama-style architecture).
+ - **Hybrid tokenizer** shared with the rest of the Carbon family (6-mer for DNA + Qwen3 BPE for English text; each DNA token ≈ 6 bp).
+ - **Pre-training tokens:** 600 B 6-mer tokens (≈ 3.6 T DNA base pairs).
+ - **Sequence length:** 8 192 tokens (≈ 48 kbp).
+ - **Loss schedule:** cross-entropy from 0 → 300 B tokens, then the hybrid Factorised Nucleotide Supervision (FNS) loss from 300 B → 600 B tokens. The switch happens later than for Carbon-3B because Carbon-500M's training was very stable and tolerated the later transition.
+ - **Data mixture:** identical to the **decay-phase mixture used by Carbon-3B**: 50 % Generator-style eukaryotic genes / 25 % mature mRNA / 10 % splice-enriched mRNA / 15 % GTDB bacterial genomes. The weights stay the same across the whole 600 B-token run.
+ - **Precision:** bfloat16. **Optimizer:** AdamW. **Positional embedding:** RoPE.
+ - **No long-context training stage:** the model stays at its native 8 k-token context (≈ 48 kbp).
+ - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
+
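The token and base-pair figures in the list above follow from the ratio of roughly 6 bp per DNA token. The sketch below redoes that arithmetic and restates the loss schedule and data mixture as plain Python. All names (`BP_PER_TOKEN`, `tokens_to_bp`, `loss_for_tokens_seen`, `MIXTURE`) are illustrative and not part of any Carbon code, and treating the 300 B boundary itself as the first FNS step is an assumption.

```python
# Illustrative constants restating the figures from the Facts list.
BP_PER_TOKEN = 6                    # each DNA token covers about 6 bp
PRETRAIN_TOKENS = 600 * 10**9       # 600 B 6-mer tokens
CONTEXT_TOKENS = 8192               # native sequence length in tokens
LOSS_SWITCH_TOKENS = 300 * 10**9    # cross-entropy before, FNS from here on

# Data mixture in percent (constant over the whole run).
MIXTURE = {
    "Generator-style eukaryotic genes": 50,
    "mature mRNA": 25,
    "splice-enriched mRNA": 10,
    "GTDB bacterial genomes": 15,
}


def tokens_to_bp(n_tokens: int) -> int:
    """Approximate number of base pairs covered by n_tokens DNA tokens."""
    return n_tokens * BP_PER_TOKEN


def loss_for_tokens_seen(tokens_seen: int) -> str:
    """Name of the loss in use after `tokens_seen` training tokens.

    Assumption: the boundary value (exactly 300 B) already uses FNS.
    """
    if not 0 <= tokens_seen <= PRETRAIN_TOKENS:
        raise ValueError("tokens_seen is outside the 0..600 B training run")
    return "cross_entropy" if tokens_seen < LOSS_SWITCH_TOKENS else "fns"


print(tokens_to_bp(PRETRAIN_TOKENS))   # 3600000000000 bp, i.e. 3.6 T
print(tokens_to_bp(CONTEXT_TOKENS))    # 49152 bp
print(sum(MIXTURE.values()))           # 100
```

The context window works out to 49 152 bp, which is 48 × 1024 and matches the ≈ 48 kbp quoted above.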
+ ## How to use
+
+ Wrap DNA in `<dna>...</dna>` exactly as for the larger models. See the [Carbon-3B card](https://huggingface.co/hf-carbon/Carbon-3B#tokenizer-working-with-dna-inputs) for tokenizer details.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ repo = "hf-carbon/Carbon-500M"
+ tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     repo, torch_dtype=torch.bfloat16,
+ ).cuda().eval()
+
+ # The <dna> tag is left open so the model continues the sequence.
+ prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+ inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
+ # Greedy decoding of 64 new tokens.
+ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ # Decode only the newly generated tokens, skipping the prompt.
+ print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
+ ```
+
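The prompt above opens a `<dna>` tag and leaves it unclosed, presumably so that generation continues the DNA sequence. A small helper for building such prompts could look like the sketch below. `wrap_dna` is a hypothetical name and not part of the Carbon tokenizer; the A/C/G/T/N alphabet check and the upper-casing are assumptions. The actual input rules, including the 6-mer constraints, are described in the Carbon-3B card.

```python
DNA_OPEN, DNA_CLOSE = "<dna>", "</dna>"
ALLOWED_BASES = set("ACGTN")  # assumption: the alphabet accepted in DNA spans


def wrap_dna(seq: str, close: bool = False) -> str:
    """Wrap a DNA string in <dna> tags.

    Whitespace is removed and the sequence is upper-cased. With
    close=False the tag stays open, which suits generation prompts;
    with close=True the span is closed, e.g. for a complete sequence.
    """
    cleaned = "".join(seq.split()).upper()
    if not cleaned:
        raise ValueError("empty DNA sequence")
    unexpected = sorted(set(cleaned) - ALLOWED_BASES)
    if unexpected:
        raise ValueError(f"unexpected characters in DNA sequence: {unexpected}")
    return f"{DNA_OPEN}{cleaned}{DNA_CLOSE if close else ''}"


print(wrap_dna("atgcgc tagcta"))          # <dna>ATGCGCTAGCTA
print(wrap_dna("ATGCGC", close=True))     # <dna>ATGCGC</dna>
```

The returned string can be passed to the tokenizer in place of the hand-written `prompt`.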
+ ### Recommended use: speculative decoding with Carbon-3B / Carbon-8B
+
+ Carbon-500M is most useful when paired with a larger Carbon model as the verifier. Hugging Face Transformers supports this natively through the `assistant_model` argument:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # The tokenizer is shared across the Carbon family, so one instance serves both models.
+ tok = AutoTokenizer.from_pretrained("hf-carbon/Carbon-3B", trust_remote_code=True)
+ draft = AutoModelForCausalLM.from_pretrained(
+     "hf-carbon/Carbon-500M", torch_dtype=torch.bfloat16
+ ).cuda().eval()
+ target = AutoModelForCausalLM.from_pretrained(
+     "hf-carbon/Carbon-3B", torch_dtype=torch.bfloat16
+ ).cuda().eval()
+
+ prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+ inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
+ # The draft model proposes tokens; the target model verifies them.
+ out = target.generate(
+     **inputs, max_new_tokens=256, do_sample=False,
+     assistant_model=draft,
+ )
+ # Decode only the newly generated tokens, skipping the prompt.
+ print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
+ ```
+
+ With greedy decoding (`do_sample=False`, as above), the output matches greedy decoding with the target model alone, up to floating-point rounding differences; only wall-clock latency is reduced.
+
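To illustrate why greedy speculative decoding reproduces the target's output, here is a toy sketch in pure Python. It is not how Transformers implements `assistant_model` internally: `target_next` and `draft_next` are made-up stand-ins for the two models' greedy next-token choices, and `speculative_greedy_decode` is a hypothetical name. A real implementation verifies all drafted positions in one batched forward pass of the target, which is where the speed-up comes from; the toy calls the target once per position for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function of a model


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: append the model's top token n_new times."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def speculative_greedy_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    n_new: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding with a draft window of up to k tokens."""
    seq = list(prompt)
    final_len = len(prompt) + n_new
    while len(seq) < final_len:
        # 1) The draft proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, final_len - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2) The target checks each proposed position in order.
        for i, token in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if token != expected:
                # First mismatch: keep the accepted prefix, take the
                # target's own token, and discard the rest of the draft.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every drafted token matched the target's greedy choice.
            seq.extend(proposal)
    return seq[len(prompt):]


def target_next(seq: List[int]) -> int:
    """Toy 'large model': a deterministic function of the context."""
    return (3 * seq[-1] + len(seq)) % 7


def draft_next(seq: List[int]) -> int:
    """Toy 'small model': agrees with the target except at some positions."""
    guess = target_next(seq)
    return (guess + 1) % 7 if len(seq) % 5 == 0 else guess


reference = greedy_decode(target_next, [1, 2, 3], 20)
speculative = speculative_greedy_decode(target_next, draft_next, [1, 2, 3], 20)
print(speculative == reference)  # True
```

Every token appended to the sequence is either a drafted token the target agreed with or the target's own choice, so the result equals plain greedy decoding regardless of how good the draft is. Draft quality affects only how many tokens are accepted per verification round.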
+ ## Evaluation
+
+ Carbon-500M is benchmarked against ≈ 1B-parameter DNA models on the standard Carbon evaluation suite. See the [Carbon-3B card](https://huggingface.co/hf-carbon/Carbon-3B#evaluation) for the task definitions and methodology.
+
+ <!-- TODO Loubna: add one downstream table comparing Carbon-500M to other 1B-class baselines. -->
+
+ ## License
+
+ Apache 2.0.