kashif HF Staff committed on
Commit deb09a9 · verified · 1 Parent(s): eeaa854

docs: fix dtype, remove trust_remote_code for model, clean up internal comments

Files changed (1)
  1. README.md +7 -45
README.md CHANGED
@@ -57,13 +57,10 @@ import torch
57
 
58
  repo = "HuggingFaceBio/Carbon-3B"
59
 
60
- # Tokenizer needs trust_remote_code for the DNA-specific logic
61
  tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
62
-
63
- # Model is standard Llama-family — no trust_remote_code needed
64
  model = AutoModelForCausalLM.from_pretrained(
65
  repo,
66
- torch_dtype=torch.bfloat16,
67
  ).cuda().eval()
68
 
69
  # Wrap a DNA prompt with the <dna> tag (the model is trained with this format).
@@ -133,41 +130,6 @@ def score(seq: str) -> float:
133
  targets = ids[:, 1:]
134
  logp = F.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
135
  return logp.mean().item()
136
-
137
- # QY: not sure if we still want to keep this per-token log-probabilities score function,
138
- # because we now have a more elegant one in modeling_carbon.py:
139
- import torch
140
- from transformers import AutoTokenizer, AutoModelForCausalLM
141
-
142
- repo = "HuggingFaceBio/Carbon-3B"
143
-
144
- # Load tokenizer and model
145
- tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
146
- model = AutoModelForCausalLM.from_pretrained(
147
- repo,
148
- torch_dtype=torch.bfloat16,
149
- trust_remote_code=True
150
- ).cuda().eval()
151
-
152
- # Setup tokenizer for bp-level scoring (required for score_sequence)
153
- model.setup_tokenizer(tok)
154
-
155
- # Score sequences - automatically handles BOS token and padding
156
- sequences = ["ATCG" * 1024, "ACAT" * 2048]
157
- bp_probs_list, actual_probs_list = model.score_sequence(sequences)
158
-
159
- # bp_probs_list: list of [seq_len_i, 4] tensors - probability distribution over A/T/C/G at each position
160
- # actual_probs_list: list of [seq_len_i] tensors - probability of the actual base at each position
161
-
162
- # Compute metrics for each sequence
163
- for i, (seq, actual_probs) in enumerate(zip(sequences, actual_probs_list)):
164
- log_likelihood = actual_probs.log().mean().item() # Total log-likelihood
165
- perplexity = torch.exp(-actual_probs.log().mean()).item() # Perplexity
166
-
167
- print(f"Sequence {i+1} (length {len(seq)}):")
168
- print(f" Mean log-likelihood: {log_likelihood:.2f}")
169
- print(f" Perplexity: {perplexity:.4f}")
170
- print(f" Mean probability: {actual_probs.mean().item():.4f}")
171
  ```
172
 
173
  For batched scoring with attention masking and full reproducible evaluation pipelines (sequence recovery, ClinVar / BRCA2 / TraitGym VEP, TATA / synonymous-codon perturbation, Genome-NIAH), use the official scripts in the [Carbon evaluation directory](https://github.com/huggingface/carbon/tree/main/evaluation) — see [`perturbation_tasks.py`](https://github.com/huggingface/carbon/blob/main/evaluation/perturbation_tasks.py) for the canonical `score_hf` implementation and [`README.md`](https://github.com/huggingface/carbon/blob/main/evaluation/README.md) for run instructions across all tasks.
@@ -187,7 +149,7 @@ config.rope_scaling = {
187
  "original_max_position_embeddings": 32768,
188
  }
189
  model = AutoModelForCausalLM.from_pretrained(
190
- repo, config=config, torch_dtype=torch.bfloat16
191
  ).cuda().eval()
192
  ```
193
 
@@ -202,8 +164,8 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
202
  import torch
203
 
204
  draft = AutoModelForCausalLM.from_pretrained(
205
- "HuggingFaceBio/carbon-500M",
206
- torch_dtype=torch.bfloat16,
207
  ).cuda().eval()
208
  target = model # Carbon-3B, loaded above
209
 
@@ -256,8 +218,8 @@ Below we highlight the three short-context probes for which we report headline n
256
  | | SYN v2 | <u>82.78</u> | 74.08 | **84.90** |
257
 
258
  Carbon-3B is competitive with Evo2-7B while being much faster to run.
259
- > TODO update TATA v2 and SYN v2 scores with teh new results!
260
- >
261
  ### Long-context retrieval (Genome-NIAH)
262
 
263
  [Genome-NIAH](https://huggingface.co/datasets/HuggingFaceBio/genome-niah) is a long context benchmark, inspired from NIAH and RULER benchmarks for English. The model needs to retrieves a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
@@ -270,7 +232,7 @@ Below are the scores on `niah`:
270
  | 64 k tokens (393 kbp) | — / 0.79 | — | 0.80 |
271
 
272
  Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20 at 64k due to the slow inference speed.
273
- > TODO try to run more 64k samples for Evo2 7B
274
 
275
  - **4× longer effective context than Generator-v2-3B.** Generator-v2-3B caps at 16 k tokens (≈ 98 kbp). Carbon-3B has a native context of 32 k tokens (≈ 197 kbp) and extends to 64 k tokens (≈ 384 kbp) at inference time with YaRN. It matches Generator-v2-3B on `niah` at 98 kbp.
276
  - **Matches Evo2-7B (1 M context) on `niah` at 384 kbp** (64 k tokens) under YaRN, despite being substantially smaller.
 
57
 
58
  repo = "HuggingFaceBio/Carbon-3B"
59
 
 
60
  tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
 
 
61
  model = AutoModelForCausalLM.from_pretrained(
62
  repo,
63
+ dtype=torch.bfloat16,
64
  ).cuda().eval()
65
 
66
  # Wrap a DNA prompt with the <dna> tag (the model is trained with this format).
 
130
  targets = ids[:, 1:]
131
  logp = F.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
132
  return logp.mean().item()
133
  ```
134
 
135
  For batched scoring with attention masking and full reproducible evaluation pipelines (sequence recovery, ClinVar / BRCA2 / TraitGym VEP, TATA / synonymous-codon perturbation, Genome-NIAH), use the official scripts in the [Carbon evaluation directory](https://github.com/huggingface/carbon/tree/main/evaluation) — see [`perturbation_tasks.py`](https://github.com/huggingface/carbon/blob/main/evaluation/perturbation_tasks.py) for the canonical `score_hf` implementation and [`README.md`](https://github.com/huggingface/carbon/blob/main/evaluation/README.md) for run instructions across all tasks.
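
As a rough illustration of why attention masking matters when scoring a padded batch, the sketch below averages per-token log-probabilities over real tokens only. It is plain Python with made-up numbers. `masked_mean_logp` is a hypothetical helper written for this note, not the `score_hf` implementation from the Carbon evaluation scripts.

```python
def masked_mean_logp(logps, mask):
    """Mean log-probability per sequence, ignoring padded positions.

    logps: list of rows, one per sequence, each a list of per-token log-probs
    mask:  same shape, 1 for a real token and 0 for padding
    """
    scores = []
    for row, row_mask in zip(logps, mask):
        # Keep only the log-probs that belong to real (unpadded) tokens.
        kept = [lp for lp, m in zip(row, row_mask) if m]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        scores.append(sum(kept) / len(kept))
    return scores


# Illustrative values only: the second sequence is padded from length 2 to 4.
logps = [[-1.0, -2.0, -3.0, -2.0], [-1.0, -1.0, 0.0, 0.0]]
mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
scores = masked_mean_logp(logps, mask)
```

Without the mask, the padded positions of the second sequence would be averaged in and its score would come out less negative than it should.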
 
149
  "original_max_position_embeddings": 32768,
150
  }
151
  model = AutoModelForCausalLM.from_pretrained(
152
+ repo, config=config, dtype=torch.bfloat16
153
  ).cuda().eval()
154
  ```
155
 
 
164
  import torch
165
 
166
  draft = AutoModelForCausalLM.from_pretrained(
167
+ "HuggingFaceBio/Carbon-500M",
168
+ dtype=torch.bfloat16,
169
  ).cuda().eval()
170
  target = model # Carbon-3B, loaded above
171
 
 
218
  | | SYN v2 | <u>82.78</u> | 74.08 | **84.90** |
219
 
220
  Carbon-3B is competitive with Evo2-7B while being much faster to run.
221
+ > TODO: update TATA v2 and SYN v2 scores with the new results
222
+
223
  ### Long-context retrieval (Genome-NIAH)
224
 
225
  [Genome-NIAH](https://huggingface.co/datasets/HuggingFaceBio/genome-niah) is a long context benchmark, inspired from NIAH and RULER benchmarks for English. The model needs to retrieves a random 24 bp VALUE planted in a real-genome haystack at one of five depths, evaluated at six context lengths from 24 kbp to 786 kbp. The benchmark contains 500 examples per (task, context) cell.
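
The needle-in-a-haystack setup described above can be sketched roughly as follows. This is not the benchmark's construction code. It assumes a random stand-in haystack instead of real genome, five evenly spaced depths, and that the 24 bp VALUE overwrites haystack bases rather than being inserted. `plant_needle` is a hypothetical helper.

```python
import random

BASES = "ACGT"


def plant_needle(haystack, needle, depth):
    """Overwrite part of `haystack` with `needle` at a relative depth in [0, 1].

    Returns the modified sequence and the start index of the needle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    start = round(depth * (len(haystack) - len(needle)))
    planted = haystack[:start] + needle + haystack[start + len(needle):]
    return planted, start


rng = random.Random(0)
haystack = "".join(rng.choice(BASES) for _ in range(1000))  # stand-in, not real genome
needle = "".join(rng.choice(BASES) for _ in range(24))      # random 24 bp VALUE
depths = [0.0, 0.25, 0.5, 0.75, 1.0]                        # assumed depth grid
planted = {d: plant_needle(haystack, needle, d) for d in depths}
```

Each entry of `planted` holds a sequence of the original length with the needle at the requested depth, which is the kind of input a retrieval probe would then be scored on.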
 
232
  | 64 k tokens (393 kbp) | — / 0.79 | — | 0.80 |
233
 
234
  Sample sizes: Carbon & GENERator n=500. Evo2-7B n=150 at 16k, n=100 at 32k, n=20 at 64k due to the slow inference speed.
235
+ > TODO: run more 64k samples for Evo2 7B
236
 
237
  - **4× longer effective context than Generator-v2-3B.** Generator-v2-3B caps at 16 k tokens (≈ 98 kbp). Carbon-3B has a native context of 32 k tokens (≈ 197 kbp) and extends to 64 k tokens (≈ 384 kbp) at inference time with YaRN. It matches Generator-v2-3B on `niah` at 98 kbp.
238
  - **Matches Evo2-7B (1 M context) on `niah` at 384 kbp** (64 k tokens) under YaRN, despite being substantially smaller.
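
The token and kbp figures quoted in the bullets above can be cross-checked with a small conversion sketch. The ratio of 6 bp per token is an assumption inferred from the "16 k tokens (≈ 98 kbp)" pairing and is not stated anywhere in this diff. `tokens_to_kbp` is a hypothetical helper.

```python
BP_PER_TOKEN = 6  # assumption inferred from "16 k tokens (~98 kbp)"; not stated in the diff


def tokens_to_kbp(n_tokens, bp_per_token=BP_PER_TOKEN):
    """Convert a token count to kilobase pairs (1 kbp = 1,000 bp)."""
    return n_tokens * bp_per_token / 1000


for label, n_tokens in [("16 k", 16 * 1024), ("32 k", 32 * 1024), ("64 k", 64 * 1024)]:
    print(f"{label} tokens = {n_tokens} tokens, about {tokens_to_kbp(n_tokens):.0f} kbp")
```

Under this assumption, 64 × 1024 tokens comes to about 393 kbp, which matches the table row, while the 384 kbp quoted in the bullets corresponds to 64,000 tokens.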