Use a Russian-friendly tokenizer strategy
#17, opened by jbakerx
If the base tokenizer handles Cyrillic poorly (splitting Russian words into many small pieces), you'll see bloated token counts and weaker fluency. Options:
Keep the tokenizer, but do more Russian pre-adaptation (Stage A LoRA) to compensate.
Or consider a base model with a tokenizer proven for Cyrillic, then port your style LoRA recipe.
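One way to decide between the two options is to measure how badly the tokenizer bloats Russian text compared with English. Below is a minimal sketch, not a definitive benchmark: `tokens_per_word` and `utf8_byte_tokenize` are hypothetical helper names, the sample sentences are made up for illustration, and the byte-level stand-in only mimics the worst case where a tokenizer falls back to raw UTF-8 bytes (each Cyrillic letter takes two bytes).

```python
from typing import Callable, List, Sequence


def tokens_per_word(tokenize: Callable[[str], Sequence], texts: List[str]) -> float:
    """Average tokens per whitespace-separated word over a list of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def utf8_byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Illustrative samples only; use a representative slice of your own data.
russian = ["Привет, как дела?"]
english = ["Hello, how are you?"]

ru = tokens_per_word(utf8_byte_tokenize, russian)
en = tokens_per_word(utf8_byte_tokenize, english)
print(f"tokens/word  ru={ru:.2f}  en={en:.2f}  ratio={ru / en:.2f}")
```

To check a real base model, you could pass the `tokenize` method of a tokenizer loaded with Hugging Face `AutoTokenizer.from_pretrained(...)` in place of the stand-in. If the Russian figure comes out far above the English one, that is a hint that switching base models may pay off more than extra Stage A adaptation.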