---
base_model: TSjB/QM-4B-embeddings-only
library_name: transformers
model_name: QM-4B
tags:
- qarachay-malqar
- caucasian-languages
- turkic-languages
- karachay-balkar
- multilingual
- trl
- sft
- unsloth
language:
- krc
- ru
- en
license: cc-by-nc-sa-4.0
---
# QM-4B: Qwen3-4B with Qarachay-Malqar Language Support
A model based on Qwen3-4B-Instruct-2507, fine-tuned to support the Qarachay-Malqar language.
## Description
QM-4B is a language model built on Qwen3-4B-Instruct-2507 with an extended tokenizer and fine-tuning for Qarachay-Malqar language support (къарачай-малкъар тил).
### Training Stages
1. **Tokenizer expansion** — the vocabulary was rebuilt from 150k down to 130k tokens, with new tokens added for Qarachay-Malqar. The tokenizer was trained on Qarachay-Malqar (76.5%), English (11.5%), Russian (11.5%), and Circassian (5%) text; average characters per token on Qarachay-Malqar text improved from 1.78 to 5.38 compared to the original tokenizer.
2. **Embeddings-only training** — only the embedding layers were trained (3 epochs, LR=2e-4).
3. **Full fine-tune** — full fine-tuning of all model layers (1 epoch, LR=5e-6).
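The 1.78 → 5.38 figure in stage 1 is a characters-per-token ratio. A minimal sketch of how such a metric can be computed (the helper and the toy tokenizations below are hypothetical illustrations, not the project's actual evaluation code):

```python
# Hypothetical helper: average number of characters encoded per token.
# A higher value means the tokenizer compresses the language better.
def chars_per_token(text: str, tokens: list[str]) -> float:
    return len(text) / len(tokens)

# Toy illustration: the same word split character-by-character
# versus split into larger subword units.
word = "джуўабла"  # 8 characters
char_level = list(word)            # 8 tokens -> ratio 1.0
subword_level = ["джуўаб", "ла"]   # 2 tokens -> ratio 4.0

print(chars_per_token(word, char_level))     # 1.0
print(chars_per_token(word, subword_level))  # 4.0
```

In practice the ratio is averaged over a held-out corpus, tokenized once with the original tokenizer and once with the expanded one.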
## Training Metrics
| Stage | Train Loss | Eval Loss | Parameters |
|-------|------------|-----------|------------|
| Embeddings-only | 4.27 | 4.49 | 8.4% (332M) |
| Full FT (1 epoch) | 4.16 | 4.36 | 100% (3.97B) |
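The 8.4% (332M) trainable-parameter count in the embeddings-only stage corresponds to freezing every layer except the token embeddings and the output head. A minimal PyTorch sketch of that freezing pattern on a toy model (`TinyLM` is a made-up stand-in, not the project's training script; real Qwen3 modules use the same `embed_tokens` / `lm_head` names):

```python
import torch.nn as nn

# Toy stand-in for a transformer: an embedding table, a "body", and an output head.
class TinyLM(nn.Module):
    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, dim)
        self.body = nn.Linear(dim, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

model = TinyLM()

# Freeze everything, then leave only embedding-related parameters trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("embed_tokens", "lm_head"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total}")
```

The optimizer is then built only over `p for p in model.parameters() if p.requires_grad`, so gradient memory and updates are limited to the embedding matrices.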
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"TSjB/QM-4B",
dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"TSjB/QM-4B",
trust_remote_code=True
)
# With chat template
messages = [
    # System prompt (Qarachay-Malqar): "You are an assistant in the Qarachay-Malqar
    # language. Answer questions briefly, correctly, and clearly. If asked in Russian
    # or English, answer in that language."
    {"role": "system", "content": "Сен къарачай-малкъар тилде болушлукъчуса. Соруўлагъа къысха, тюз эм ачыкъ джуўабла бер. Орусча неда ингилизче сорсала — ол тилде джуўаб бер."},
    # User (Qarachay-Malqar): "What do you know about Karachay?"
    {"role": "user", "content": "Не зат билесе Къарачай юсюнден?"}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
if 'token_type_ids' in inputs:
inputs.pop('token_type_ids')
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.7,
top_p=0.9,
do_sample=True,
repetition_penalty=1.2,
no_repeat_ngram_size=4,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Recommended Generation Parameters
```python
generation_config = {
"max_new_tokens": 200,
"temperature": 0.7,
"top_p": 0.9,
"do_sample": True,
"repetition_penalty": 1.2, # important to avoid repetitions
"no_repeat_ngram_size": 3, # optional
}
```
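Since these are ordinary `generate` keyword arguments, the dict can be unpacked directly into the call, e.g. `model.generate(**inputs, **generation_config)`. A tiny self-contained illustration of the unpacking pattern, using a stand-in function rather than the real `model.generate`:

```python
generation_config = {
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
    "repetition_penalty": 1.2,
}

# Hypothetical stand-in for model.generate: records the keyword
# arguments it receives, to show the dict unpacks as if written inline.
def fake_generate(**kwargs):
    return kwargs

received = fake_generate(**generation_config)
print(received["temperature"], received["repetition_penalty"])  # 0.7 1.2
```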
## Supported Languages
- Qarachay-Malqar (къарачай-малкъар тил)
- Russian
- English
- Other languages from the base Qwen3 model
## Limitations
- The model was fine-tuned on text data (continued pretraining), not on dialogues
- May switch between languages within a single response
- Additional instruction tuning is recommended for better instruction following
## Training Data
The model was trained on a multilingual text corpus including:
- Qarachay-Malqar texts
- Russian texts
- English texts
## License
This model is released under the CC BY-NC-SA 4.0 license.
## Citation
```bibtex
@misc{qm4b2024,
title={QM-4B: Qarachay-Malqar language support},
author={TSjB},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/TSjB/QM-4B}
}
```
## Framework Versions
- TRL: 0.24.0
- Transformers: 4.57.3
- PyTorch: 2.9.0
- Unsloth (used for optimized training)
## Authors
[Bogdan Tewunalany](https://t.me/bogdan_tewunalany), [Ali Berberov](https://t.me/ali_berberov)