---
base_model: TSjB/QM-4B-embeddings-only
library_name: transformers
model_name: QM-4B
tags:
- qarachay-malqar
- caucasian-languages
- turkic-languages
- karachay-balkar
- multilingual
- trl
- sft
- unsloth
language:
- krc
- ru
- en
license: cc-by-nc-sa-4.0
---

# QM-4B: Qwen3-4B with Qarachay-Malqar Language Support

A model based on Qwen3-4B-Instruct-2507, fine-tuned to support the Qarachay-Malqar language.

## Description

QM-4B is a language model built on Qwen3-4B-Instruct-2507 with an extended tokenizer and fine-tuning for Qarachay-Malqar language support (къарачай-малкъар тил).

### Training Stages:

1. **Tokenizer expansion** — the vocabulary was rebuilt from 150k down to 130k tokens, adding coverage for Qarachay-Malqar. The new tokenizer was trained on Qarachay-Malqar (76.5%), English (11.5%), Russian (11.5%), and Circassian (5%) text; the average number of characters per token for Qarachay-Malqar rose from 1.78 with the original tokenizer to 5.38
2. **Embeddings-only Training** — training only embedding layers (3 epochs, LR=2e-4)
3. **Full Fine-Tune** — full fine-tuning of all model layers (1 epoch, LR=5e-6)
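The characters-per-token figure from stage 1 can be checked with a small helper. This is a sketch only: the authors' evaluation corpus is not published, and the sample sentence in the comment is illustrative.

```python
def chars_per_token(tok, text: str) -> float:
    """Average number of characters covered by one token of `text`;
    a higher value means the tokenizer compresses the language better."""
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return len(text) / n_tokens

# Usage (downloads the extended tokenizer; the sample sentence is
# illustrative, not the authors' evaluation corpus):
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("TSjB/QM-4B")
#   print(round(chars_per_token(tok, "Къарачай-малкъар тилде джазылгъан текст."), 2))
```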

## Training Metrics

| Stage | Train Loss | Eval Loss | Trainable Params |
|-------|------------|-----------|------------------|
| Embeddings-only | 4.27 | 4.49 | 8.4% (332M) |
| Full FT (1 epoch) | 4.16 | 4.36 | 100% (3.97B) |
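The 8.4% trainable fraction in the embeddings-only stage corresponds to freezing every layer except the token embeddings and the LM head. A minimal sketch, assuming Qwen-style parameter names (`embed_tokens`, `lm_head`); the authors' actual trainer setup is not published:

```python
import torch.nn as nn

def freeze_all_but_embeddings(model: nn.Module) -> float:
    """Leave only embedding and LM-head parameters trainable;
    return the trainable fraction of all parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = "embed_tokens" in name or "lm_head" in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Applied to QM-4B this should report roughly 0.084 (332M of 3.97B parameters).
```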

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "TSjB/QM-4B",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "TSjB/QM-4B",
    trust_remote_code=True
)

# With the chat template. The system prompt (in Qarachay-Malqar) tells the
# model to answer briefly, correctly and clearly, and to reply in Russian or
# English when asked in those languages.
messages = [
    {"role": "system", "content": "Сен къарачай-малкъар тилде болушлукъчуса. Соруўлагъа къысха, тюз эм ачыкъ джуўабла бер. Орусча неда ингилизче сорсала — ол тилде джуўаб бер."},
    {"role": "user", "content": "Не зат билесе Къарачай юсюнден?"},  # "What do you know about Karachay?"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Some tokenizers emit token_type_ids, which generate() does not accept
inputs.pop("token_type_ids", None)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.2,
    no_repeat_ngram_size=4,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Recommended Generation Parameters

```python
generation_config = {
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
    "repetition_penalty": 1.2,  # important to avoid repetitions
    "no_repeat_ngram_size": 3,  # optional
}
```

## Supported Languages

- Qarachay-Malqar (къарачай-малкъар тил)
- Russian
- English
- Other languages from the base Qwen3 model

## Limitations

- The model was fine-tuned on text data (continued pretraining), not on dialogues
- May switch between languages within a single response
- Additional instruction tuning is recommended for better instruction following

## Training Data

The model was trained on a multilingual text corpus including:
- Qarachay-Malqar texts
- Russian texts
- English texts

## License

cc-by-nc-sa-4.0

## Citation

```bibtex
@misc{qm4b2024,
  title={QM-4B: Qarachay-Malqar language support},
  author={TSjB},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/TSjB/QM-4B}
}
```

## Framework Versions

- TRL: 0.24.0
- Transformers: 4.57.3
- Pytorch: 2.9.0
- Unsloth (used for memory-efficient training)


## Authors
[Bogdan Tewunalany](https://t.me/bogdan_tewunalany), [Ali Berberov](https://t.me/ali_berberov)