|
|
--- |
|
|
base_model: TSjB/QM-4B-embeddings-only |
|
|
library_name: transformers |
|
|
model_name: QM-4B |
|
|
tags: |
|
|
- qarachay-malqar |
|
|
- caucasian-languages |
|
|
- turkic-languages |
|
|
- karachay-balkar |
|
|
- multilingual |
|
|
- trl |
|
|
- sft |
|
|
- unsloth |
|
|
language: |
|
|
- krc |
|
|
- ru |
|
|
- en |
|
|
license: cc-by-nc-sa-4.0 |
|
|
--- |
|
|
|
|
|
# QM-4B: Qwen3-4B with Qarachay-Malqar Language Support
|
|
|
|
|
A model based on Qwen3-4B-Instruct-2507, fine-tuned to support the Qarachay-Malqar language. |
|
|
|
|
|
## Description |
|
|
|
|
|
QM-4B is a language model built on Qwen3-4B-Instruct-2507 with an extended tokenizer and fine-tuning for Qarachay-Malqar language support (къарачай-малкъар тил). |
|
|
|
|
|
### Training Stages
|
|
|
|
|
1. **Tokenizer expansion** — the original 150k-token vocabulary was replaced with a 130k-token vocabulary trained on Qarachay-Malqar (76.5%), English (11.5%), Russian (11.5%), and Circassian (5%). Tokenizer efficiency on Qarachay-Malqar text improved from 1.78 to 5.38 characters per token compared to the original tokenizer.
|
|
2. **Embeddings-only Training** — training only embedding layers (3 epochs, LR=2e-4) |
|
|
3. **Full Fine-Tune** — full fine-tuning of all model layers (1 epoch, LR=5e-6) |
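The embeddings-only stage (stage 2) can be sketched in plain PyTorch: freeze every parameter, then re-enable gradients only for the input embeddings and the output head. The toy module below stands in for the real model and is purely illustrative; with the actual checkpoint you would use `model.get_input_embeddings()` and `model.get_output_embeddings()` instead.

```python
import torch.nn as nn

# Toy stand-in for the real model: embeddings + transformer body + lm_head.
class ToyLM(nn.Module):
    def __init__(self, vocab=1000, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)             # input embeddings
        self.body = nn.Linear(dim, dim)                   # stands in for the transformer layers
        self.lm_head = nn.Linear(dim, vocab, bias=False)  # output embeddings

model = ToyLM()

# Freeze everything, then unfreeze only the embedding matrices,
# so this stage updates just the input embeddings and the output head.
for p in model.parameters():
    p.requires_grad = False
for p in list(model.embed.parameters()) + list(model.lm_head.parameters()):
    p.requires_grad = True

frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
print(frozen)  # only the transformer-body parameters stay frozen
```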
|
|
|
|
|
## Training Metrics |
|
|
|
|
|
| Stage | Train Loss | Eval Loss | Parameters | |
|
|
|-------|------------|-----------|------------| |
|
|
| Embeddings-only | 4.27 | 4.49 | 8.4% (332M) | |
|
|
| Full FT (1 epoch) | 4.16 | 4.36 | 100% (3.97B) | |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"TSjB/QM-4B", |
|
|
dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained( |
|
|
"TSjB/QM-4B", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
# With chat template |
|
|
messages = [ |
|
|
{"role": "system", "content": "Сен къарачай-малкъар тилде болушлукъчуса. Соруўлагъа къысха, тюз эм ачыкъ джуўабла бер. Орусча неда ингилизче сорсала — ол тилде джуўаб бер."}, |
|
|
{"role": "user", "content": "Не зат билесе Къарачай юсюнден?"} |
|
|
] |
|
|
text = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True, |
|
|
enable_thinking=False |
|
|
) |
|
|
|
|
|
inputs = tokenizer(text, return_tensors="pt").to(model.device) |
|
|
|
|
|
if 'token_type_ids' in inputs: |
|
|
inputs.pop('token_type_ids') |
|
|
|
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=100, |
|
|
temperature=0.7, |
|
|
top_p=0.9, |
|
|
do_sample=True, |
|
|
repetition_penalty=1.2, |
|
|
no_repeat_ngram_size=4, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id, |
|
|
) |
|
|
|
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
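The characters-per-token figure quoted in the training stages can be checked with a small helper. This is a sketch: `chars_per_token` is a hypothetical helper, and the whitespace "tokenizer" below only illustrates the arithmetic; with the real model you would pass the `AutoTokenizer` instance and Qarachay-Malqar text.

```python
def chars_per_token(tokenizer, text: str) -> float:
    # Higher is better: more characters covered per emitted token.
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(text) / len(tokens)

# Illustration only: a fake tokenizer that splits on whitespace.
class WhitespaceTok:
    def encode(self, text, add_special_tokens=False):
        return text.split()

print(chars_per_token(WhitespaceTok(), "къарачай малкъар"))  # 16 chars / 2 tokens -> 8.0
```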
|
|
|
|
|
## Recommended Generation Parameters |
|
|
|
|
|
```python |
|
|
generation_config = { |
|
|
"max_new_tokens": 200, |
|
|
"temperature": 0.7, |
|
|
"top_p": 0.9, |
|
|
"do_sample": True, |
|
|
"repetition_penalty": 1.2, # important to avoid repetitions |
|
|
"no_repeat_ngram_size": 3, # optional |
|
|
} |
|
|
``` |
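Keeping these defaults in one dictionary makes per-call overrides easy: unpack the dict into `generate` and shadow individual keys. A minimal sketch (the `model.generate` line assumes the `model` and `inputs` objects from the usage example above, and the override value is illustrative):

```python
# Same defaults as above, kept in one place.
generation_config = {
    "max_new_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
    "repetition_penalty": 1.2,
    "no_repeat_ngram_size": 3,
}

# Per-call override: shorter answers for chat-style replies.
chat_config = {**generation_config, "max_new_tokens": 100}

# outputs = model.generate(**inputs, **chat_config)  # assumes `model`, `inputs`
print(chat_config["max_new_tokens"])
```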
|
|
|
|
|
## Supported Languages |
|
|
|
|
|
- Qarachay-Malqar (къарачай-малкъар тил) |
|
|
- Russian |
|
|
- English |
|
|
- Other languages from the base Qwen3 model |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- The model was fine-tuned on raw text (continued pretraining) rather than on dialogue data
|
|
- May switch between languages within a single response |
|
|
- Additional instruction tuning is recommended for better instruction following |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on a multilingual text corpus including: |
|
|
- Qarachay-Malqar texts |
|
|
- Russian texts |
|
|
- English texts |
|
|
|
|
|
## License |
|
|
|
|
|
CC BY-NC-SA 4.0 (`cc-by-nc-sa-4.0`)
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{qm4b2024, |
|
|
title={QM-4B: Qarachay-Malqar language support}, |
|
|
author={TSjB}, |
|
|
year={2024}, |
|
|
publisher={HuggingFace}, |
|
|
url={https://huggingface.co/TSjB/QM-4B} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Framework Versions |
|
|
|
|
|
- TRL: 0.24.0 |
|
|
- Transformers: 4.57.3 |
|
|
- Pytorch: 2.9.0 |
|
|
- Unsloth: used for optimized training
|
|
|
|
|
|
|
|
## Authors |
|
|
[Bogdan Tewunalany](https://t.me/bogdan_tewunalany), [Ali Berberov](https://t.me/ali_berberov) |