lewtun HF Staff committed on
Commit 4a06cee · verified · 1 Parent(s): f47e012

Update model card namespace references

Replace all hf-carbon namespace references with HuggingFaceBio.
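
The rename described here is a plain string substitution over the model card. A minimal sketch of how such a change could be scripted follows. The helper names `replace_namespace` and `update_file` are hypothetical and not part of this commit; only the two namespace strings come from the commit message.

```python
from pathlib import Path

OLD_NAMESPACE = "hf-carbon"       # namespace being retired (from the commit message)
NEW_NAMESPACE = "HuggingFaceBio"  # namespace that replaces it


def replace_namespace(
    text: str, old: str = OLD_NAMESPACE, new: str = NEW_NAMESPACE
) -> tuple[str, int]:
    """Return the updated text and the number of lines that changed."""
    old_lines = text.splitlines(keepends=True)
    new_lines = [line.replace(old, new) for line in old_lines]
    changed = sum(a != b for a, b in zip(old_lines, new_lines))
    return "".join(new_lines), changed


def update_file(path: str) -> int:
    """Rewrite one file in place (hypothetical helper, not from the commit)."""
    p = Path(path)
    updated, changed = replace_namespace(p.read_text(encoding="utf-8"))
    if changed:
        p.write_text(updated, encoding="utf-8")
    return changed
```

Counting changed lines rather than occurrences makes the result comparable to a diff summary such as `+6 -6`, since a single line may contain the namespace more than once.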

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -28,7 +28,7 @@ A generative DNA foundation model from the **Carbon** family.

  **Carbon-3B** is a 3B-parameter decoder-only autoregressive genomic foundation model trained on DNA and RNA sequences, with a primary focus on eukaryotes. It has a native context length of **32,768 6-mer tokens (≈ 197k DNA base pairs)** and extends to **64,000 tokens (≈ 384 kbp)** at inference time via YaRN. Carbon-3B is designed to be both strong and efficient: on generative tasks (sequence recovery), variant-effect prediction, and motif-perturbation discrimination, it matches the capability of substantially larger single-nucleotide baselines such as Evo2-7B while running several times faster.

- Carbon-3B is the **flagship** model of the Carbon family. We also release [**Carbon-8B**](https://huggingface.co/hf-carbon/) for users who need additional capability at higher inference cost, and [**Carbon-500M**](https://huggingface.co/hf-carbon/) — a small generative model intended for speculative decoding alongside Carbon-3B (or Carbon-8B).
+ Carbon-3B is the **flagship** model of the Carbon family. We also release [**Carbon-8B**](https://huggingface.co/HuggingFaceBio/) for users who need additional capability at higher inference cost, and [**Carbon-500M**](https://huggingface.co/HuggingFaceBio/) — a small generative model intended for speculative decoding alongside Carbon-3B (or Carbon-8B).

  ### Key features

@@ -55,7 +55,7 @@ pip install -U transformers
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

- repo = "hf-carbon/Carbon-3B"
+ repo = "HuggingFaceBio/Carbon-3B"

  # Tokenizer needs trust_remote_code for the DNA-specific logic
  tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
@@ -164,14 +164,14 @@ We do **not** recommend pushing beyond 64 k tokens: retrieval quality degrades s

  ### Speculative decoding with Carbon-500M

- Carbon-3B and Carbon-500M share the same tokenizer and DNA template format, so [Carbon-500M](https://huggingface.co/hf-carbon/) can be used as a draft model for speculative decoding with Carbon-3B (or Carbon-8B) as the target model, reducing wall-clock generation cost at no quality loss.
+ Carbon-3B and Carbon-500M share the same tokenizer and DNA template format, so [Carbon-500M](https://huggingface.co/HuggingFaceBio/) can be used as a draft model for speculative decoding with Carbon-3B (or Carbon-8B) as the target model, reducing wall-clock generation cost at no quality loss.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

  draft = AutoModelForCausalLM.from_pretrained(
- "hf-carbon/carbon-500M",
+ "HuggingFaceBio/carbon-500M",
  torch_dtype=torch.bfloat16,
  ).cuda().eval()
  target = model # Carbon-3B, loaded above
@@ -229,7 +229,7 @@ Carbon-3B is competitive with Evo2-7B while being much faster to run.
  >
  ### Long-context retrieval (Genome-NIAH)

- [Genome-NIAH](https://huggingface.co/datasets/hf-carbon/genome-niah) is a long context benchmark, inspired from NIAH and RULER benchmarks for English. The model needs to retrieves a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
+ [Genome-NIAH](https://huggingface.co/datasets/HuggingFaceBio/genome-niah) is a long context benchmark, inspired from NIAH and RULER benchmarks for English. The model needs to retrieves a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.

  Below are the scores on `niah`:
  | Context length | Carbon 3B 32k (native / YaRN 4×) | GENERator-v2 3B | Evo2-7B |
@@ -263,7 +263,7 @@ Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20

  Carbon-3B is pre-trained for **1T 6-mer tokens (≈ 6T DNA base pairs)** at sequence length 8 192, with a global batch size of 256 sequences (≈ 2 M tokens / step). The optimizer is AdamW throughout.

- The data mixture during the stable phases of pre-training (Phase 1 and the stable portion of Phase 2) is the one documented on the [`hf-carbon/carbon-pretraining-corpus` dataset card](https://huggingface.co/datasets/hf-carbon/carbon-pretraining-corpus): ≈ 70 % Generator-style eukaryotic genomic DNA, with mRNA, splice-enriched mRNA, and GTDB bacterial genomes alongside metadata-conditioned templates.
+ The data mixture during the stable phases of pre-training (Phase 1 and the stable portion of Phase 2) is the one documented on the [`HuggingFaceBio/carbon-pretraining-corpus` dataset card](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus): ≈ 70 % Generator-style eukaryotic genomic DNA, with mRNA, splice-enriched mRNA, and GTDB bacterial genomes alongside metadata-conditioned templates.

  The training uses a **staged objective and learning-rate schedule**:
