martinsu committed
Commit 36d091b · verified · 1 Parent(s): 183763a

Fast tokenizer problem explained, use vLLM.

Files changed (1): README.md +29 -4
README.md CHANGED
@@ -35,8 +35,10 @@ I'll run and publish more tests, perhaps using quantization.
 
 On top of this fine-tune, one can use a lighter touch to nudge the model toward the right predictions.
 
+ ATM I'm running more RAG SFT on the model; it needs more grounding behaviour.
+
 # Run in prod:
- - **1) TGI official docker**
+ - **1) TGI official docker will NOT WORK - use the vLLM docker with `--tokenizer-mode slow` (launch sketch below)**
 - **2) proper system prompt - correct language**
 - **3) proper RAG - model is RAG-tuned**
 
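For illustration (not part of the commit itself), a minimal launch sketch for the vLLM route. It assumes the official `vllm/vllm-openai` image and the default port 8000; GPU flags and port mappings are deployment-specific.

```bash
# Sketch: serve the model with vLLM's OpenAI-compatible server, forcing the
# slow tokenizer so token IDs match what the model was trained on.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model martinsu/tildeopen-30b-mu-instruct \
  --tokenizer-mode slow
```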
@@ -57,7 +59,9 @@ For proper prod usage check out:
 https://huggingface.co/spaces/martinsu/tildeopen-30b-mu-instruct-space/blob/main/app.py
 That code works.
 
- Runs on official TGI docker image - typical prod usage.
+ Runs on the official vLLM docker image with **--tokenizer-mode slow** - typical prod usage.
+
+ **TGI will fail.** See the tokenizer comparison under Known Issues below.
 
 Use the correct prompt language and text as the system role; it helps accurate token prediction for all languages - these prompts are trained implicit control codes for the model, not random text. A request sketch follows.
 
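A hypothetical request against a server launched as above. The Latvian system prompt is only an example - the point is that the system role is written in the conversation's language:

```bash
# Sketch: the system role carries text in the target language (Latvian here),
# acting as the implicit control code described above.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "martinsu/tildeopen-30b-mu-instruct",
    "messages": [
      {"role": "system", "content": "Tu esi izpalīdzīgs asistents. Atbildi latviski."},
      {"role": "user", "content": "Īsi: kas ir Rīga?"}
    ],
    "max_tokens": 150
  }'
```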
@@ -161,11 +165,31 @@ On the EuroBlocks multilingual benchmark, the models performed as follows:
 **Interpretation**: Higher scores may partly reflect output length matching reference length. Verbose models get penalized by ROUGE-L. No statistical significance computed. Single benchmark only - take with appropriate grain of salt.
 
 ## Known Issues
-
- - **Token 179 (Tokenizer Bug)**: fast tokenizer inserts token 179 as word boundary separator, which decodes to extra spaces. Fix: use `use_fast=False`. This is a base model tokenizer quirk, not introduced by fine-tuning.
+ - **The base model does not ship a fast tokenizer.**
+ - **By default, `AutoTokenizer.from_pretrained()` fires up a fast tokenizer (TGI does this). Since this model doesn't have one, it cooks up a broken one on the fly, with tokens the model is mostly unfamiliar with (for example 179), which seriously degrades performance.**
+ - **The main problem is that the model fails silently: it recognizes some of the tokens and generates with degraded performance.**
+ - **Decoding with the broken tokenizer still yields sensible-looking output, because the model only generates tokens that exist in the vocabulary.**
 - **Phase 1 only**: SFT checkpoint, no tool use or DPO phases yet
 - **Use correct prompt language as system role**: It will scaffold the model to predict tokens in the given language
 
+ ## How vLLM (slow tokenizer) and TGI (default) tokenize: curl example
+ This applies to the base model too.
+
+ TGI docker - broken output:
+ ```bash
+ curl -X POST http://x:8081/tokenize -H 'Content-Type: application/json' -d '{"model":"tgi","inputs":" Hello world <|im_end|> ","add_special_tokens":true}'
+
+ [{"id":179,"text":" ","start":0,"stop":1},{"id":53914,"text":"Hello","start":1,"stop":6},{"id":179,"text":" ","start":6,"stop":7},{"id":8141,"text":"world","start":7,"stop":12},{"id":179,"text":" ","start":12,"stop":13},{"id":131074,"text":"<|im_end|>","start":13,"stop":23},{"id":179,"text":" ","start":23,"stop":24}]
+ ```
+ vLLM docker - correct output:
+
+ ```bash
+ curl -X POST http://x:8081/tokenize -H "Content-Type: application/json" -d '{"model": "martinsu/tildeopen-30b-mu-instruct", "prompt": " Hello world <|im_end|> ", "add_special_tokens":true}'
+ {"count":6,"max_model_len":65536,"tokens":[453,63484,8141,128948,131074,453],"token_strs":null}
+ ```
+
+ They differ: vLLM's slow tokenizer emits the tokens the model actually recognizes; TGI's on-the-fly fast tokenizer does not. A transformers-level repro sketch follows after this hunk.
+
 ## Limitations & Safety
 
 - **Not safety-tuned**: No RLHF, no red-teaming, no toxicity filtering
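And a hypothetical `transformers`-level repro of the mismatch the bullets above describe; exact fast-tokenizer behaviour can vary by `transformers` version, so treat this as a sketch:

```python
# Sketch: compare the slow tokenizer (what vLLM uses with --tokenizer-mode slow)
# against the fast one transformers builds on the fly (what TGI ends up using).
from transformers import AutoTokenizer

MODEL = "martinsu/tildeopen-30b-mu-instruct"

slow = AutoTokenizer.from_pretrained(MODEL, use_fast=False)
fast = AutoTokenizer.from_pretrained(MODEL, use_fast=True)  # converted on the fly

text = " Hello world "
print(slow.encode(text, add_special_tokens=False))
print(fast.encode(text, add_special_tokens=False))
# If the two ID lists differ (e.g. the fast side inserting id 179 as a word
# boundary), the serving stack is feeding the model token sequences it rarely
# saw in training - generation still "works", just with degraded quality.
```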
 
@@ -174,6 +198,7 @@ On the EuroBlocks multilingual benchmark, the models performed as follows:
 - **Standard LLM caveats**: It's a smart token predictor, not a legal or medical professional. Can hallucinate. Use responsibly.
 
 
+
 ## Why It (Probably) Works
 
 **English-dominant (72%)**: Preserves base model's English token distribution and reasoning chains (likely optimized on English-heavy pretraining/instruction data) while extending multilingual generalization