kashif (HF Staff) committed
Commit ceed708 · verified · 1 parent: 1e17783

model card: fix skip_special_tokens decode (bug fixed), remove wrong left-padding TODO

Files changed (1)
  1. README.md +2 -6
README.md CHANGED
@@ -77,10 +77,9 @@ out = model.generate(
  max_new_tokens=64,
  do_sample=False,
  )
- # NOTE: do not pass skip_special_tokens=True — the hybrid tokenizer mis-handles TODO: fix
- print(tok.decode(out[0][inputs.input_ids.shape[1]:]))
+ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
  ```
- > TODO: fix skip_special_tokens=True
+
  ### Tokenizer: working with DNA inputs
 
  The Carbon tokenizer is a **hybrid** of BPE (for English text) and a fixed 6-mer scheme (for DNA). 6-mer tokenization is the central DNA-modeling design choice in Carbon — we found it works substantially better than BPE for DNA (see the Carbon technical report for the analysis) — but it comes with a few practical constraints when feeding input to the model. The tokenizer only switches into 6-mer mode when it sees the `<dna>` tag and emits `<oov>` token for any token with letters not in [ATCG].
@@ -117,9 +116,6 @@ def truncate_to_6mer(seq: str) -> str:
  prompt = f"<dna>{truncate_to_6mer(seq)}"
  ```
 
- > TODO add Kashif's PR for left padding and the auto dna tags flag.
- > TODO edit text and example to say left padding instead of right padding
-
  ### Likelihood-based scoring
 
  For variant-effect or perturbation tasks, score sequences with the model's per-token log-probabilities. A minimal single-sequence helper:
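
Note on the second hunk (not part of the commit): the hunk header references `truncate_to_6mer`, but its body lies outside the diff context shown here. The sketch below is a guess at what such a helper might do, assuming it simply drops trailing bases so the sequence length is a multiple of 6. The function body and the `has_only_dna_letters` check are hypothetical illustrations based on the README text about 6-mer mode and `<oov>`, not the model card's actual code.

```python
K = 6  # k-mer width of the DNA mode described in the README text
DNA_ALPHABET = set("ATCG")  # letters the README says do not trigger <oov>


def truncate_to_6mer(seq: str) -> str:
    """Hypothetical reconstruction: trim seq so len(seq) is a multiple of 6."""
    usable = len(seq) - (len(seq) % K)
    return seq[:usable]


def has_only_dna_letters(seq: str) -> bool:
    """Hypothetical pre-check: True if every character is one of A, T, C, G."""
    return set(seq) <= DNA_ALPHABET


seq = "ACGTACGTACGTAC"  # 14 bases; the last 2 do not fill a 6-mer
prompt = f"<dna>{truncate_to_6mer(seq)}"
print(prompt)  # <dna>ACGTACGTACGT
```

Under this assumption, a sequence shorter than 6 bases truncates to the empty string, so callers may want to check for that before building a prompt.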