---
base_model: TSjB/QM-4B-embeddings-only
library_name: transformers
model_name: QM-4B
tags:
- qarachay-malqar
- caucasian-languages
- turkic-languages
- karachay-balkar
- multilingual
- trl
- sft
- unsloth
language:
- krc
- ru
- en
license: cc-by-nc-sa-4.0
---

# QM-4B: Qwen3-4B with Qarachay-Malqar Language Support

A model based on Qwen3-4B-Instruct-2507, fine-tuned to support the Qarachay-Malqar language.

## Description

QM-4B is a language model built on Qwen3-4B-Instruct-2507 with an extended tokenizer, fine-tuned for Qarachay-Malqar language support (къарачай-малкъар тил).

### Training Stages

1. **Tokenizer expansion** — the vocabulary was rebuilt from 150k to 130k tokens, adding tokens for Qarachay-Malqar. The new tokenizer was trained on Qarachay-Malqar (76.5%), English (11.5%), Russian (11.5%), and Circassian (5%) text. For Qarachay-Malqar, the average number of characters per token rose from 1.78 with the original tokenizer to 5.38.
2. **Embeddings-only training** — training of the embedding layers only (3 epochs, LR=2e-4)
3. **Full fine-tune** — full fine-tuning of all model layers (1 epoch, LR=5e-6)

## Training Metrics

| Stage | Train Loss | Eval Loss | Trainable Parameters |
|-------|------------|-----------|----------------------|
| Embeddings-only | 4.27 | 4.49 | 8.4% (332M) |
| Full FT (1 epoch) | 4.16 | 4.36 | 100% (3.97B) |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "TSjB/QM-4B",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "TSjB/QM-4B",
    trust_remote_code=True
)

# With chat template. The system prompt says, in Qarachay-Malqar:
# "You are an assistant in the Qarachay-Malqar language. Give short, correct
# and clear answers to questions. If asked in Russian or English, answer in
# that language." The user asks: "What do you know about Karachay?"
messages = [
    {"role": "system", "content": "Сен къарачай-малкъар тилде болушлукъчуса. Соруўлагъа къысха, тюз эм ачыкъ джуўабла бер. Орусча неда ингилизче сорсала — ол тилде джуўаб бер."},
    {"role": "user", "content": "Не зат билесе Къарачай юсюнден?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)  # drop if present; generate() does not accept it

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.2,
    no_repeat_ngram_size=4,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Recommended Generation Parameters

```python
generation_config = {
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
    "repetition_penalty": 1.2,   # important to avoid repetitions
    "no_repeat_ngram_size": 3,   # optional
}
```

## Supported Languages

- Qarachay-Malqar (къарачай-малкъар тил)
- Russian
- English
- Other languages inherited from the base Qwen3 model

## Limitations

- The model was fine-tuned on plain text (continued pretraining), not on dialogues
- May switch between languages within a single response
- Additional instruction tuning is recommended for better instruction following

## Training Data

The model was trained on a multilingual text corpus including:

- Qarachay-Malqar texts
- Russian texts
- English texts

## License

cc-by-nc-sa-4.0

## Citation

```bibtex
@misc{qm4b2024,
  title={QM-4B: Qarachay-Malqar language support},
  author={TSjB},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/TSjB/QM-4B}
}
```

## Framework Versions

- TRL: 0.24.0
- Transformers: 4.57.3
- PyTorch: 2.9.0
- Unsloth: optimized training

## Authors

[Bogdan Tewunalany](https://t.me/bogdan_tewunalany), [Ali Berberov](https://t.me/ali_berberov)
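
## Checking Tokenizer Efficiency

The characters-per-token improvement reported in the training stages (1.78 to 5.38 for Qarachay-Malqar) can be checked with a short script. Below is a minimal sketch: the `chars_per_token` helper and the `WhitespaceTok` stand-in are illustrative, not part of this repository, and the commented-out lines assume the `Qwen/Qwen3-4B-Instruct-2507` and `TSjB/QM-4B` tokenizers can be downloaded from the Hub.

```python
sample = "Не зат билесе Къарачай юсюнден?"  # a short Qarachay-Malqar sentence

def chars_per_token(tokenizer, text: str) -> float:
    """Average number of characters covered by one token; a higher value
    means the tokenizer segments this language more efficiently."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return len(text) / len(ids) if ids else 0.0

# With the real tokenizers (downloads the vocabularies; repo ids assumed):
# from transformers import AutoTokenizer
# base = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
# qm = AutoTokenizer.from_pretrained("TSjB/QM-4B")
# print(chars_per_token(base, sample), chars_per_token(qm, sample))

# Toy stand-in so the sketch runs offline: one "token" per whitespace word.
class WhitespaceTok:
    def encode(self, text, add_special_tokens=False):
        return text.split()

print(f"{chars_per_token(WhitespaceTok(), sample):.2f}")  # → 6.20
```

For a meaningful comparison, run the helper over a larger corpus rather than one sentence, since single sentences give noisy estimates.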