martinsu committed
Commit 36d091b · verified · 1 Parent(s): 183763a

Fast tokenizer problem explained, use vLLM.

Files changed (1): README.md +29 -4
README.md CHANGED
@@ -35,8 +35,10 @@ I'll run and publish more tests, perhaps using quantization.
 
 On top of this fine-tune, one can use a lighter touch to nudge the model toward the right predictions.
 
+ ATM I'm running more RAG SFT on the model; it needs more grounding behaviour.
+
 # Run in prod:
- - **1) TGI official docker**
+ - **1) TGI official docker will NOT WORK - use the vLLM docker with `--tokenizer-mode slow` (launch sketch below)**
 - **2) proper system prompt - correct language**
 - **3) proper RAG - model is RAG-tuned**
 
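For illustration (not part of the commit itself), a minimal launch sketch for the vLLM route. It assumes the official `vllm/vllm-openai` image and the default port 8000; GPU flags and port mappings are deployment-specific.

```bash
# Sketch: serve the model with vLLM's OpenAI-compatible server, forcing the
# slow tokenizer so token IDs match what the model was trained on.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model martinsu/tildeopen-30b-mu-instruct \
  --tokenizer-mode slow
```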
@@ -57,7 +59,9 @@ For proper prod usage check out:
 https://huggingface.co/spaces/martinsu/tildeopen-30b-mu-instruct-space/blob/main/app.py
 That code works.
 
- Runs on official TGI docker image - typical prod usage.
+ Runs on the official vLLM docker image with **--tokenizer-mode slow** - typical prod usage.
+
+ **TGI will fail.** See the tokenizer comparison under Known Issues below.
 
 Use the correct prompt language and text as the system role; it helps accurate token prediction for all languages - these prompts are trained implicit control codes for the model, not random text. A request sketch follows.
 
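A hypothetical request against a server launched as above. The Latvian system prompt is only an example - the point is that the system role is written in the conversation's language:

```bash
# Sketch: the system role carries text in the target language (Latvian here),
# acting as the implicit control code described above.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "martinsu/tildeopen-30b-mu-instruct",
    "messages": [
      {"role": "system", "content": "Tu esi izpalīdzīgs asistents. Atbildi latviski."},
      {"role": "user", "content": "Īsi: kas ir Rīga?"}
    ],
    "max_tokens": 150
  }'
```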
@@ -161,11 +165,31 @@ On the EuroBlocks multilingual benchmark, the models performed as follows:
 **Interpretation**: Higher scores may partly reflect output length matching reference length. Verbose models get penalized by ROUGE-L. No statistical significance computed. Single benchmark only - take with appropriate grain of salt.
 
 ## Known Issues
-
- - **Token 179 (Tokenizer Bug)**: fast tokenizer inserts token 179 as word boundary separator, which decodes to extra spaces. Fix: use `use_fast=False`. This is a base model tokenizer quirk, not introduced by fine-tuning.
+ - **The base model does not ship a fast tokenizer.**
+ - **By default, `AutoTokenizer.from_pretrained()` fires up a fast tokenizer (TGI does this). Since this model doesn't have one, it cooks up a broken one on the fly, with tokens the model is mostly unfamiliar with (for example 179), which seriously degrades performance.**
+ - **The main problem is that the model fails silently: it recognizes some of the tokens and generates with degraded performance.**
+ - **Decoding with the broken tokenizer still yields sensible-looking output, because the model only generates tokens that exist in the vocabulary.**
 - **Phase 1 only**: SFT checkpoint, no tool use or DPO phases yet
 - **Use correct prompt language as system role**: It will scaffold the model to predict tokens in the given language
 
+ ## How vLLM (slow tokenizer) and TGI (default) tokenize: curl example
+ This applies to the base model too.
+
+ TGI docker - broken output:
+ ```bash
+ curl -X POST http://x:8081/tokenize -H 'Content-Type: application/json' -d '{"model":"tgi","inputs":" Hello world <|im_end|> ","add_special_tokens":true}'
+
+ [{"id":179,"text":" ","start":0,"stop":1},{"id":53914,"text":"Hello","start":1,"stop":6},{"id":179,"text":" ","start":6,"stop":7},{"id":8141,"text":"world","start":7,"stop":12},{"id":179,"text":" ","start":12,"stop":13},{"id":131074,"text":"<|im_end|>","start":13,"stop":23},{"id":179,"text":" ","start":23,"stop":24}]
+ ```
+ vLLM docker - correct output:
+
+ ```bash
+ curl -X POST http://x:8081/tokenize -H "Content-Type: application/json" -d '{"model": "martinsu/tildeopen-30b-mu-instruct", "prompt": " Hello world <|im_end|> ", "add_special_tokens":true}'
+ {"count":6,"max_model_len":65536,"tokens":[453,63484,8141,128948,131074,453],"token_strs":null}
+ ```
+
+ They differ: vLLM's slow tokenizer emits the tokens the model actually recognizes; TGI's on-the-fly fast tokenizer does not. A transformers-level repro sketch follows after this hunk.
+
 ## Limitations & Safety
 
 - **Not safety-tuned**: No RLHF, no red-teaming, no toxicity filtering
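And a hypothetical `transformers`-level repro of the mismatch the bullets above describe; exact fast-tokenizer behaviour can vary by `transformers` version, so treat this as a sketch:

```python
# Sketch: compare the slow tokenizer (what vLLM uses with --tokenizer-mode slow)
# against the fast one transformers builds on the fly (what TGI ends up using).
from transformers import AutoTokenizer

MODEL = "martinsu/tildeopen-30b-mu-instruct"

slow = AutoTokenizer.from_pretrained(MODEL, use_fast=False)
fast = AutoTokenizer.from_pretrained(MODEL, use_fast=True)  # converted on the fly

text = " Hello world "
print(slow.encode(text, add_special_tokens=False))
print(fast.encode(text, add_special_tokens=False))
# If the two ID lists differ (e.g. the fast side inserting id 179 as a word
# boundary), the serving stack is feeding the model token sequences it rarely
# saw in training - generation still "works", just with degraded quality.
```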
 
@@ -174,6 +198,7 @@ On the EuroBlocks multilingual benchmark, the models performed as follows:
 - **Standard LLM caveats**: It's a smart token predictor, not a legal or medical professional. Can hallucinate. Use responsibly.
 
 
+
 ## Why It (Probably) Works
 
 **English-dominant (72%)**: Preserves base model's English token distribution and reasoning chains (likely optimized on English-heavy pretraining/instruction data) while extending multilingual generalization