Commit 5c02535 (verified) · committed by lewtun (HF Staff) · 1 parent: f748d2a

Add FNS code

Files changed (1): README.md (+71 -0)
@@ -188,6 +188,77 @@ prompt = "<vertebrate_mammalian><protein_coding_region><dna>ATGCGCTAG..."
 
  The unconditional `<dna>SEQUENCE</dna>` format remains supported and is the default. See the Carbon technical report for the full list of supported metadata tags.
 
+ ### Base-pair-level generation and scoring
+
+ The `fns` branch loads custom modeling code for Factorized Nucleotide Supervision (FNS). Carbon still uses its efficient 6-mer tokenizer, but during generation each selected 6-mer is assembled from six per-position nucleotide distributions, giving base-pair-level control over decoded DNA. Use this branch when you need exact base-pair counts, per-position masks, or temperature/top-p behavior applied at the nucleotide level rather than over the 4,096-way 6-mer distribution:
+
+ ```py
+ import math
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "HuggingFaceBio/Carbon-3B"
+ revision = "fns"  # branch that ships the custom FNS modeling code
+ device = "cuda"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     revision=revision,
+     trust_remote_code=True,
+     dtype=torch.bfloat16,
+ ).to(device).eval()
+
+ context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+ n_bp = 60
+
+ inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)
+
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=math.ceil(n_bp / tokenizer.k),  # enough k-mer tokens to cover n_bp bases
+         do_sample=False,
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
+ generated_ids = output_ids[0, inputs.input_ids.shape[1]:]  # drop the prompt tokens
+ generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]  # trim to exactly n_bp bases
+
+ print(generated_dna)
+ ```
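
To make the factorized decoding idea concrete, here is a minimal pure-Python sketch. It is not Carbon's FNS implementation: the helper names (`tokens_needed`, `top_p_filter`, `greedy_kmer`), the `allowed` mask format, and the toy probabilities are all assumptions made up for illustration. It only shows how a 6-mer can be assembled from six 4-way nucleotide distributions, how a per-position mask or top-p cut acts on 4 bases rather than 4,096 k-mers, and how many tokens are needed to cover a requested base-pair count.

```python
import math

BASES = "ACGT"
K = 6  # assumed k-mer size, matching the 6-mer tokenizer described above


def tokens_needed(n_bp, k=K):
    """Number of k-mer tokens required to cover n_bp bases."""
    return math.ceil(n_bp / k)


def top_p_filter(probs, top_p):
    """Keep the most likely bases until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def greedy_kmer(per_position_probs, allowed=None):
    """Assemble a k-mer from per-position distributions, honoring optional base masks."""
    kmer = []
    for pos, probs in enumerate(per_position_probs):
        # `allowed` is a hypothetical {position: "permitted bases"} mapping.
        mask = allowed.get(pos, BASES) if allowed else BASES
        masked = [p if base in mask else 0.0 for base, p in zip(BASES, probs)]
        kmer.append(BASES[max(range(len(BASES)), key=lambda i: masked[i])])
    return "".join(kmer)


# Toy per-position distributions over A, C, G, T for one 6-mer (made-up numbers).
toy = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]

print(greedy_kmer(toy))                      # unconstrained greedy 6-mer
print(greedy_kmer(toy, allowed={4: "CG"}))   # position 4 restricted to C or G
print(top_p_filter(toy[4], top_p=0.6))       # nucleus cut over 4 bases, not 4**6 k-mers
print(tokens_needed(60), tokens_needed(50))  # tokens to request before trimming to n_bp
```

With these toy numbers the unconstrained 6-mer is `ATGCAG`, and masking position 4 to C or G yields `ATGCCG`. Requesting 50 bp needs 9 tokens (54 bases), which is why the README example slices the decoded string to `n_bp`.
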
+
+ The same per-base marginals are exposed through `score_sequence()`, which returns the probability assigned to the observed base at each position. Taking the mean log probability gives a base-pair-level sequence score, where higher values indicate higher model likelihood:
+
+ ```py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "HuggingFaceBio/Carbon-3B"
+ revision = "fns"  # branch that ships the custom FNS modeling code
+ device = "cuda"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     revision=revision,
+     trust_remote_code=True,
+     dtype=torch.bfloat16,
+ ).to(device).eval()
+
+ reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
+ perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"
+
+ with torch.no_grad():
+     bp_probs, actual_probs = model.score_sequence([reference, perturbed])
+
+ scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]  # clamp avoids log(0)
+
+ print(f"reference mean bp logp: {scores[0]:.4f}")
+ print(f"perturbed mean bp logp: {scores[1]:.4f}")
+ print(f"reference preferred: {scores[0] > scores[1]}")
+ ```
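
The scoring arithmetic can be checked without loading the model. The sketch below is an illustration only: `mean_bp_logp` and `differing_positions` are hypothetical helpers, and the probability lists are made up rather than produced by `score_sequence()`. It shows that a model with no preference among the four bases scores `log(0.25)` per base, which gives a baseline for reading the numbers printed above, and it locates the window where the two example sequences differ.

```python
import math


def mean_bp_logp(observed_probs, floor=1e-12):
    """Mean log-probability of the observed bases; the floor guards against log(0)."""
    return sum(math.log(max(p, floor)) for p in observed_probs) / len(observed_probs)


def differing_positions(a, b):
    """0-based positions where two equal-length sequences disagree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"

# Made-up per-base probabilities, standing in for real model output.
uniform = mean_bp_logp([0.25] * len(reference))   # no preference among A, C, G, T
confident = mean_bp_logp([0.9] * len(reference))  # strongly favors every observed base

print(differing_positions(reference, perturbed))
print(f"uniform baseline: {uniform:.4f}")
print(f"per-base perplexity at uniform: {math.exp(-uniform):.2f}")
print(f"confident beats uniform: {confident > uniform}")
```

The two sequences differ only at positions 4 through 9, where `TATAAA` is replaced by `GCGCGC`, so any gap between their mean scores comes from that window and from how it changes the model's predictions for the bases that follow.
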
+
  ## Evaluation
 
  All evaluations are zero-shot and use the [public Carbon evaluation pipeline](https://github.com/huggingface/carbon/tree/main/evaluation). The suite covers seven tasks across four capability families: