Turkish-LLM-14B-Instruct
An open-source 14.7-billion-parameter language model fine-tuned for native Turkish instruction following. It is built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.
Motivation
Turkish is spoken natively by over 80 million people and, as an agglutinative language with complex morphology, presents unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate only a small fraction of their training data to Turkish, leading to:
- Grammatical errors in suffix agreement and vowel harmony
- Hallucinated or culturally inaccurate content
- Code-switching to English or other languages mid-response
- Poor performance on Turkish-specific knowledge (history, geography, institutions)
This model was developed to provide a high-quality, open-source Turkish language model that treats Turkish as a first-class language rather than an afterthought.
Model Details
| Attribute | Value |
|---|---|
| Developer | Ogulcan Aydogan |
| Base model | Qwen2.5-14B-Instruct |
| Parameters | 14.7B |
| Architecture | Transformer (decoder-only, causal LM) |
| Context length | 4,096 tokens |
| Precision | bfloat16 |
| Fine-tuning method | Supervised Fine-Tuning (SFT) |
| License | Apache 2.0 |
| Language | Turkish (tr) |
| Release date | March 2026 |
Model Family
This model is part of the Turkish-LLM family:
| Model | Parameters | Base | Method | Use Case |
|---|---|---|---|---|
| Turkish-LLM-14B-Instruct (this) | 14.7B | Qwen2.5-14B-Instruct | SFT | Higher quality, complex reasoning |
| Turkish-LLM-7B-Instruct | 7B | Turkcell-LLM-7b-v1 | LoRA | Lightweight, faster inference |
Training
Dataset
Training data was sourced from the Turkish-LLM-v10-Training dataset, a curated collection of 144,000 Turkish instruction-response pairs, from which a focused SFT subset of approximately 2,600 high-quality examples was selected for alignment.
| Domain | Examples | Purpose |
|---|---|---|
| Science | Photosynthesis, water cycle, biology, physics, chemistry | Factual accuracy in Turkish scientific terminology |
| Turkish History | Ottoman Empire, War of Independence, Republic era | Culturally grounded historical knowledge |
| Geography | 7 geographical regions, rivers, lakes, climate | Location-specific Turkish knowledge |
| General Knowledge | Education, culture, daily life, technology | Broad conversational ability |
| Anti-Repetition | Specially crafted pairs | Fluent prose generation without output loops |
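A record in this kind of instruction-response corpus can be sketched as follows. The `instruction`/`response` field names and the validator are illustrative assumptions, not the published schema of Turkish-LLM-v10-Training:

```python
# Sketch of a minimal validator for instruction-response pairs, assuming a
# simple {"instruction", "response"} JSON schema; the actual field names in
# Turkish-LLM-v10-Training may differ.
import json

def is_valid_pair(record: dict) -> bool:
    """Check that a record has non-empty instruction and response strings."""
    return (
        isinstance(record.get("instruction"), str)
        and isinstance(record.get("response"), str)
        and bool(record["instruction"].strip())
        and bool(record["response"].strip())
    )

# "What is photosynthesis?" / "Photosynthesis is the process by which plants
# convert light energy into chemical energy."
example = json.loads(
    '{"instruction": "Fotosentez nedir?", '
    '"response": "Fotosentez, bitkilerin isik enerjisini kimyasal enerjiye donusturdugu surectir."}'
)
print(is_valid_pair(example))  # → True
```

Filtering on a check like this (plus deduplication) is one straightforward way to arrive at a small, high-quality SFT subset from a larger raw pool.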
Training Configuration
| Parameter | Value |
|---|---|
| Hardware | NVIDIA A100 80GB |
| Framework | PyTorch + Transformers |
| Precision | bfloat16 (mixed precision) |
| Method | Full-parameter supervised fine-tuning (SFT) |
| Optimizer | AdamW |
| Focus | Pure Turkish responses, reduced hallucination |
Training Pipeline
Training was orchestrated using LowResource-LLM-Forge, a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages. The pipeline handles data preprocessing, tokenization, training, and evaluation in a single workflow.
Raw Turkish Data ──▶ Preprocessing ──▶ SFT Training ──▶ Evaluation ──▶ Deployment
  (144K pairs)        (filtering,       (A100 80GB,      (manual +      (HF Hub,
                      dedup,            bf16 mixed       qualitative)   Spaces,
                      formatting)       precision)                      vLLM)
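The training configuration above can be summarized as a small set of hyperparameters. The learning rate, batch size, and epoch count below are illustrative assumptions for the sketch, not the published training recipe; only the base model, precision, optimizer, and sequence length come from the tables above:

```python
# Illustrative SFT hyperparameters consistent with the training table
# (A100 80GB, bf16, AdamW). Learning rate, batch sizes, and epochs are
# assumed values, not the actual recipe used for this model.
sft_config = {
    "model_name": "Qwen/Qwen2.5-14B-Instruct",
    "precision": "bfloat16",
    "optimizer": "adamw_torch",
    "learning_rate": 2e-5,            # assumption: a common full-SFT value
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "num_train_epochs": 3,            # assumption
    "max_seq_length": 4096,           # matches the model's context length
}

# Gradient accumulation trades step frequency for a larger effective batch,
# which is what makes full 14B SFT feasible on a single 80 GB GPU.
effective_batch = (
    sft_config["per_device_train_batch_size"]
    * sft_config["gradient_accumulation_steps"]
)
print(effective_batch)  # → 16
```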
Design Decisions
Why Qwen2.5-14B-Instruct as a base? Qwen2.5 has strong multilingual foundations with good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.
Why SFT over RLHF/DPO? For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference optimization methods.
Why 14B instead of 7B? The 7B model in the Turkish-LLM family performs well for general tasks, but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model significantly improves on these dimensions.
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the model's native precision; use float16 on GPUs without bf16 support
    device_map="auto",
)

messages = [
    # "You are a helpful Turkish AI assistant."
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    # "What are Turkey's geographical regions?"
    {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
vLLM (Production)
pip install vllm
vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
--dtype float16 \
--max-model-len 4096
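`vllm serve` exposes an OpenAI-compatible API (at `http://localhost:8000/v1` by default), so any OpenAI-style client can query it. A minimal sketch of the request payload; the helper function name and sampling values are illustrative:

```python
# Sketch of a request to vLLM's OpenAI-compatible chat endpoint. The server
# started above must be running for the commented-out request at the bottom.
import json

def build_chat_request(user_message: str) -> dict:
    """Build an OpenAI-style chat completion payload for the Turkish model."""
    return {
        "model": "ogulcanaydogan/Turkish-LLM-14B-Instruct",
        "messages": [
            # "You are a helpful Turkish AI assistant."
            {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
    }

# "What is the capital of Turkey?"
payload = build_chat_request("Turkiye'nin baskenti neresidir?")
print(json.dumps(payload, ensure_ascii=False)[:60])

# To actually send it (assumes the server is up on localhost:8000):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=json.dumps(payload).encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```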
Ollama (Local)
ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct
Chat Template
This model uses the ChatML format:
<|im_start|>system
Sen yardimci bir Turkce yapay zeka asistanisin.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
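The template above can be reproduced with a small helper, which is handy when constructing prompts outside the tokenizer. This is a sketch of the ChatML layout shown; in practice, prefer `tokenizer.apply_chat_template`, which is guaranteed to match the model's own template:

```python
# Minimal ChatML formatter matching the template shown above. Prefer
# tokenizer.apply_chat_template in real code; this sketch only illustrates
# the wire format.
def to_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    # "You are a helpful Turkish AI assistant." / "Hello!"
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Merhaba!"},
])
print(prompt)
```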
Hardware Requirements
| Precision | VRAM Required | Recommended GPUs |
|---|---|---|
| FP16 / BF16 | ~30 GB | A100 80GB, A100 40GB, A6000 |
| INT8 | ~15 GB | RTX 4090, A10G |
| INT4 (GPTQ/AWQ) | ~8 GB | RTX 3090, RTX 4080, Apple M-series (24GB) |
For consumer hardware, INT4 quantized versions provide the best balance of quality and accessibility.
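The VRAM figures in the table follow from a back-of-envelope rule: weight memory is roughly parameter count times bytes per parameter (this ignores activations and the KV cache, which add overhead on top):

```python
# Back-of-envelope check of the VRAM table: weights only, excluding
# activations and KV cache, so real requirements are somewhat higher.
def weight_vram_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

N = 14.7e9  # parameter count from the model details table

print(weight_vram_gb(N, 2.0))  # fp16/bf16 → 29.4  (matches the ~30 GB row)
print(weight_vram_gb(N, 1.0))  # int8      → 14.7  (matches the ~15 GB row)
print(weight_vram_gb(N, 0.5))  # int4      → 7.35  (matches the ~8 GB row)
```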
Intended Use
Recommended Applications
- Turkish chatbots and virtual assistants
- Turkish question answering systems
- Educational tools for Turkish-language content
- Turkish text summarization and generation
- Research on Turkish NLP and low-resource language modeling
Out-of-Scope Uses
- Medical, legal, or financial advice
- Production systems without additional safety alignment
- Generation of misleading or harmful content
- Tasks requiring high factual precision without human verification
Limitations and Risks
- Language drift: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts
- Hallucination: Like all LLMs, the model can generate plausible-sounding but incorrect information
- English degradation: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct
- Context length: Performance may degrade on inputs significantly exceeding 4,096 tokens
- Bias: The model may reflect biases present in its training data
- Safety: No explicit safety alignment (RLHF/DPO) has been applied; not suitable for unmoderated user-facing applications without additional safeguards
Ethical Considerations
This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying in user-facing products.
Related Resources
| Resource | Link |
|---|---|
| 7B Model | Turkish-LLM-7B-Instruct |
| Training Dataset (144K) | Turkish-LLM-v10-Training |
| Live Demo (14B) | Turkish-LLM-14B-Chat |
| Live Demo (7B) | Turkish-LLM-7B-Chat |
| Training Pipeline | LowResource-LLM-Forge |
| Project Repository | Turkish-LLM on GitHub |
Citation
@misc{aydogan2026turkishllm14b,
  title     = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
  author    = {Aydogan, Ogulcan},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
Contact
- Website: ogulcanaydogan.com
- GitHub: github.com/ogulcanaydogan
- Hugging Face: huggingface.co/ogulcanaydogan
- LinkedIn: linkedin.com/in/ogulcanaydogan