HuggingFaceBio/Carbon-3B

@@ -77,10 +77,9 @@ out = model.generate(
     max_new_tokens=64,
     do_sample=False,
 )
-# NOTE: do not pass skip_special_tokens=True — the hybrid tokenizer mis-handles TODO: fix
-print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
 ```
-> TODO: fix skip_special_tokens=True
 ### Tokenizer: working with DNA inputs
 The Carbon tokenizer is a **hybrid** of BPE (for English text) and a fixed 6-mer scheme (for DNA). 6-mer tokenization is the central DNA-modeling design choice in Carbon: we found it works substantially better than BPE for DNA (see the Carbon technical report for the analysis). It does come with a few practical constraints when feeding input to the model. The tokenizer only switches into 6-mer mode when it sees the `<dna>` tag, and it emits an `<oov>` token for any token containing letters outside `[ATCG]`.
@@ -117,9 +116,6 @@ def truncate_to_6mer(seq: str) -> str:
 prompt = f"<dna>{truncate_to_6mer(seq)}"
 ```
-> TODO add Kashif's PR for left padding and the auto dna tags flag.
-> TODO edit text and example to say left padding instead of right padding
 ### Likelihood-based scoring
 For variant-effect or perturbation tasks, score sequences with the model's per-token log-probabilities. A minimal single-sequence helper:

     max_new_tokens=64,
     do_sample=False,
 )
+print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ```
 ### Tokenizer: working with DNA inputs
 The Carbon tokenizer is a **hybrid** of BPE (for English text) and a fixed 6-mer scheme (for DNA). 6-mer tokenization is the central DNA-modeling design choice in Carbon: we found it works substantially better than BPE for DNA (see the Carbon technical report for the analysis). It does come with a few practical constraints when feeding input to the model. The tokenizer only switches into 6-mer mode when it sees the `<dna>` tag, and it emits an `<oov>` token for any token containing letters outside `[ATCG]`.
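
Because characters outside `[ATCG]` end up as `<oov>`, it can help to check a sequence before building the prompt. The sketch below is not part of the Carbon tokenizer. The helper names `find_invalid_bases` and `chunks_with_invalid_bases` are hypothetical. It also assumes that 6-mers are non-overlapping windows starting at offset 0, and that the check is case-sensitive (so lowercase bases are reported as invalid); neither is confirmed here.

```python
VALID_BASES = frozenset("ACGT")
K = 6  # 6-mer width, as described above


def find_invalid_bases(seq: str) -> list[tuple[int, str]]:
    """Return (index, character) for every character outside A/C/G/T."""
    return [(i, ch) for i, ch in enumerate(seq) if ch not in VALID_BASES]


def chunks_with_invalid_bases(seq: str, k: int = K) -> list[int]:
    """Return start offsets of k-wide windows that contain an invalid character.

    Assumption: windows are non-overlapping and start at offset 0.
    """
    return [
        start
        for start in range(0, len(seq), k)
        if any(ch not in VALID_BASES for ch in seq[start:start + k])
    ]


if __name__ == "__main__":
    seq = "ACGTACNNGTAC"
    print(find_invalid_bases(seq))         # positions of the two N characters
    print(chunks_with_invalid_bases(seq))  # offset of the affected window
```

If either function returns a non-empty list, the corresponding positions would presumably map to `<oov>` under the assumptions above, so it may be worth filtering or masking them before scoring.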
 prompt = f"<dna>{truncate_to_6mer(seq)}"
 ```
 ### Likelihood-based scoring
 For variant-effect or perturbation tasks, score sequences with the model's per-token log-probabilities. A minimal single-sequence helper: