Darmm Text Generation Kazakh v2

Darmm Kazakh v2 (Darmm/darmm-text-generation-kazakh-v2) is a significantly improved iteration of our Kazakh language model series. Where v1 was built on synthetic data and mT5, v2 is trained on organic, real-world data on top of the 7B-parameter Qwen 2.5 architecture.

This model is fine-tuned to understand and generate high-quality Kazakh text, with a focus on news articles, encyclopedic explanations, and general instruction following.

Key Improvements vs v1

| Feature | v1 (Old) | v2 (New) |
|---|---|---|
| Base Model | google/mt5-base (580M) | Qwen/Qwen2.5-Coder-7B-Instruct (7B) |
| Data Source | Synthetic / templates | Organic (Wikipedia + Egemen Qazaqstan) |
| Dataset Size | ~5,000 synthetic pairs | ~5,000 real articles (4,895 samples) |
| Training | 3 epochs (CPU / small GPU) | 2.29 epochs (A100 80GB, QLoRA) |
| Focus | Short structured answers | Long-form content generation |

Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 1. Config for 4-bit loading (Efficient)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# 2. Load Base Model
base_model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# 3. Load Darmm v2 Adapter
adapter_name = "Darmm/darmm-text-generation-kazakh-v2"
model = PeftModel.from_pretrained(model, adapter_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# 4. Generate
prompt_text = "Жасанды интеллект туралы мақала жаз." # Write an article about AI
prompt = f"### Instruction:\n{prompt_text}\n\n### Response:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
```
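
If you prefer to serve the model without loading the adapter at runtime, the LoRA weights can also be merged into an unquantized copy of the base model. A minimal sketch (the output directory name is illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model unquantized in half precision (merging requires full weights)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the Darmm v2 adapter and fold its weights into the base model
merged = PeftModel.from_pretrained(base, "Darmm/darmm-text-generation-kazakh-v2")
merged = merged.merge_and_unload()

# Save a standalone checkpoint that no longer needs PEFT at load time
merged.save_pretrained("darmm-kazakh-v2-merged")  # output path is illustrative
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct").save_pretrained("darmm-kazakh-v2-merged")
```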

Model Description

  • Developed by: Darmm Lab
  • Language: Kazakh (kk)
  • Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
  • Fine-tuning Method: QLoRA (4-bit quantization with LoRA adapters); an illustrative adapter config is sketched after this list
  • Context Length: 1024 tokens
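
The card does not publish the adapter hyperparameters. The sketch below shows a typical QLoRA adapter configuration for a 7B Qwen model; the rank, alpha, dropout, and target modules are assumptions, not the released values.

```python
from peft import LoraConfig

# Hypothetical adapter settings (illustrative only; not published on this card)
lora_config = LoraConfig(
    r=16,                     # assumed LoRA rank
    lora_alpha=32,            # assumed scaling factor
    lora_dropout=0.05,        # assumed dropout
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice for Qwen attention blocks
)
```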

Training Details

The model was trained on a curated dataset scraped from high-quality Kazakh sources (a sketch of the instruction formatting used for these articles follows the list):

  • Wikipedia (kk): Encyclopedic knowledge, definitions, biographies.
  • Egemen Qazaqstan: Formal news style, economic and political vocabulary.
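
Articles from these sources were paired with instructions and rendered with the same template used in the Usage example above. A minimal sketch; the helper name and sample fields are illustrative, not the actual preprocessing code.

```python
def format_example(instruction: str, article_text: str) -> str:
    """Render one article as a training sample in the card's Instruction/Response template."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{article_text}"

# Hypothetical sample; the real dataset fields are not published
sample = {
    "instruction": "Жасанды интеллект туралы мақала жаз.",  # "Write an article about AI"
    "text": "Жасанды интеллект – ...",
}
print(format_example(sample["instruction"], sample["text"]))
```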

Hyperparameters

  • Epochs: 2.29
  • Batch size: 1 (Gradient Accumulation: 8) -> Effective Batch Size: 8
  • Learning rate: 2e-4
  • Scheduler: Cosine
  • Optimizer: AdamW
  • Hardware: NVIDIA A100 80GB
  • Final Loss: 0.7407
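
The listed hyperparameters map roughly onto the following transformers TrainingArguments; values not stated on the card (output directory, precision, warmup, logging cadence) are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="darmm-kazakh-v2",      # illustrative
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # effective batch size of 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    num_train_epochs=2.29,             # training stopped partway through the third epoch
    fp16=True,                         # assumption: half-precision compute, matching the 4-bit compute dtype above
    logging_steps=10,                  # assumption
)
```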

Training Loss

| Step | Loss | Learning Rate |
|---|---|---|
| 10 | 2.25 | 4e-5 |
| 100 | 1.13 | 1.99e-4 |
| 500 | 0.94 | 1.90e-4 |
| 1000 | 0.83 | 1.55e-4 |
| 1500 | 0.73 | 9.41e-5 |
| 1650 | 0.74 | 2.29e-5 |

Intended Use

  • Content Generation: Writing articles, summaries, and explanations in Kazakh.
  • Education: Generating study materials or answering questions about Kazakh history/culture.
  • Research: Baseline for further fine-tuning on specialized Kazakh domains (legal, medical); a continued fine-tuning sketch follows this list.
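
For the research use case, the released adapter can be loaded as a trainable starting point instead of initializing a fresh LoRA. A brief sketch (dataset preparation and the trainer loop are omitted):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# is_trainable=True keeps the adapter weights unfrozen for further fine-tuning
model = PeftModel.from_pretrained(
    base, "Darmm/darmm-text-generation-kazakh-v2", is_trainable=True
)
model.print_trainable_parameters()
```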

Limitations

  • Hallucination: As with all LLMs, the model may generate factually incorrect information despite linguistic fluency.
  • English Bias: In rare cases of confusion, the model may revert to English or produce code-mixed output, a consequence of the base model's predominantly English pre-training.

Citation

@misc{darmm_kazakh_v2,
  author = {Darmm Lab},
  title = {Darmm Text Generation Kazakh v2: Organic Data Scale-up},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/Darmm/darmm-text-generation-kazakh-v2}}
}