---
language:
- tr
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
tags:
- turkish
- qwen2
- sft
- 14b
- text-generation
- instruction-tuned
- low-resource
- nlp
pipeline_tag: text-generation
model-index:
- name: Turkish-LLM-14B-Instruct
  results: []
---

# Turkish-LLM-14B-Instruct

An open-source 14.7-billion-parameter language model fine-tuned for native Turkish instruction following. Built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.


---

## Motivation

Turkish is the native language of over **80 million speakers** and an agglutinative language whose complex morphology presents unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate a small fraction of their training data to Turkish, leading to:

- Grammatical errors in suffix agreement and vowel harmony
- Hallucinated or culturally inaccurate content
- Code-switching to English or other languages mid-response
- Poor performance on Turkish-specific knowledge (history, geography, institutions)

This model was developed to provide a **high-quality, open-source Turkish language model** that treats Turkish as a first-class language rather than an afterthought.

## Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | [Ogulcan Aydogan](https://ogulcanaydogan.com) |
| **Base model** | [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
| **Parameters** | 14.7B |
| **Architecture** | Transformer (decoder-only, causal LM) |
| **Context length** | 4,096 tokens |
| **Precision** | bfloat16 |
| **Fine-tuning method** | Supervised Fine-Tuning (SFT) |
| **License** | Apache 2.0 |
| **Language** | Turkish (tr) |
| **Release date** | March 2026 |

### Model Family

| Model | Parameters | Base | Method | Use Case |
|-------|-----------|------|--------|----------|
| **Turkish-LLM-14B-Instruct** (this) | 14.7B | Qwen2.5-14B-Instruct | SFT | Higher quality, complex reasoning |
| [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) | 14.7B | This model | GGUF quantized | Local/edge deployment |
| [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) | 7B | Turkcell-LLM-7b-v1 | LoRA | Lightweight, faster inference |

## Training

### Dataset

Training data consists of a curated collection of **144,000 Turkish instruction-response
pairs**, with a focused SFT subset of approximately 2,600 high-quality examples selected for alignment.

| Domain | Examples | Purpose |
|--------|----------|---------|
| Science | Photosynthesis, water cycle, biology, physics, chemistry | Factual accuracy in Turkish scientific terminology |
| Turkish History | Ottoman Empire, War of Independence, Republic era | Culturally grounded historical knowledge |
| Geography | 7 geographical regions, rivers, lakes, climate | Location-specific Turkish knowledge |
| General Knowledge | Education, culture, daily life, technology | Broad conversational ability |
| Anti-Repetition | Specially crafted pairs | Fluent prose generation without output loops |

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Hardware | NVIDIA A100 80GB |
| Framework | PyTorch + Transformers |
| Precision | bfloat16 (mixed precision) |
| Method | Full SFT alignment |
| Optimizer | AdamW |
| Focus | Pure Turkish responses, reduced hallucination |

### Training Pipeline

Training was orchestrated using [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge), a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages.

```
Raw Turkish Data --> Preprocessing --> SFT Training --> Evaluation   --> Deployment
   (144K pairs)      (filtering,      (A100 80GB,      (manual +        (HF Hub,
                      dedup,           bf16 mixed       qualitative)     Spaces,
                      formatting)      precision)                        vLLM)
```

### Design Decisions

**Why Qwen2.5-14B-Instruct as a base?** Qwen2.5 has strong multilingual foundations with good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.

**Why SFT over RLHF/DPO?** For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference optimization methods.
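The preprocessing stage of the pipeline (filtering, dedup, formatting) can be sketched roughly as below. This is an illustrative sketch only, not the actual LowResource-LLM-Forge code: the record fields (`instruction`, `response`) and the length threshold are assumptions.

```python
# Illustrative sketch of the preprocessing step: filter -> dedup -> ChatML formatting.
# Field names ("instruction", "response") and min_len are assumptions, not the real schema.

def preprocess(records, min_len=10):
    """Filter short pairs, drop exact duplicates, render ChatML training text."""
    seen = set()
    out = []
    for rec in records:
        instruction = rec["instruction"].strip()
        response = rec["response"].strip()
        if len(response) < min_len:      # filtering: drop trivially short answers
            continue
        key = (instruction, response)
        if key in seen:                  # dedup: exact match on the pair
            continue
        seen.add(key)
        out.append(                      # formatting: ChatML turns (see Chat Template below)
            "<|im_start|>user\n" + instruction + "<|im_end|>\n"
            "<|im_start|>assistant\n" + response + "<|im_end|>"
        )
    return out
```

Exact-match dedup is the simplest choice here; a real pipeline might also use near-duplicate detection.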
**Why 14B instead of 7B?** The 7B model in the Turkish-LLM family performs well for general tasks, but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model significantly improves on these dimensions.

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True
)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### vLLM (Production)

```bash
pip install vllm

vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
  --dtype float16 \
  --max-model-len 4096
```

### Ollama (Local)

```bash
ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF:Q4_K_M
```

### GGUF (llama.cpp / LM Studio)

Quantized GGUF versions (Q4_K_M, Q5_K_M, Q8_0, F16) are available at [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF).
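A vLLM server started as in the Usage section exposes an OpenAI-compatible chat-completions endpoint. A minimal stdlib-only client might look like the following sketch; the port (vLLM's default 8000) and the sampling parameters are assumptions:

```python
import json
from urllib import request


def build_chat_request(user_message,
                       model="ogulcanaydogan/Turkish-LLM-14B-Instruct",
                       system="Sen yardimci bir Turkce yapay zeka asistanisin."):
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
    }


if __name__ == "__main__":
    # Assumes `vllm serve` is running locally on its default port.
    payload = json.dumps(
        build_chat_request("Turkiye'nin cografi bolgeleri nelerdir?")
    ).encode("utf-8")
    req = request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible SDK works equally well; the raw-HTTP version is shown only to keep the sketch dependency-free.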
### Chat Template

This model uses the ChatML format:

```
<|im_start|>system
Sen yardimci bir Turkce yapay zeka asistanisin.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```

## Hardware Requirements

| Precision | VRAM Required | Recommended GPUs |
|-----------|---------------|------------------|
| FP16 / BF16 | ~30 GB | A100 80GB, A100 40GB, A6000 |
| INT8 | ~15 GB | RTX 4090, A10G |
| INT4 (GPTQ/AWQ) | ~8 GB | RTX 3090, RTX 4080, Apple M-series (24GB) |

For consumer hardware, use the [GGUF versions](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) for the best balance of quality and accessibility.

## Intended Use

### Recommended Applications

- Turkish chatbots and virtual assistants
- Turkish question answering systems
- Educational tools for Turkish-language content
- Turkish text summarization and generation
- Research on Turkish NLP and low-resource language modeling

### Out-of-Scope Uses

- Medical, legal, or financial advice
- Production systems without additional safety alignment
- Generation of misleading or harmful content
- Tasks requiring high factual precision without human verification

## Benchmark Results

Evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with 0-shot settings.

| Benchmark | Turkish-LLM-14B (SFT) | Qwen2.5-14B-Instruct (Base) |
|-----------|----------------------|----------------------------|
| **MMLU_TR** (57 subjects) | 59.38% | 59.47% |
| **XCOPA_TR** (causal reasoning) | 66.00% | 66.80% |
| **XNLI_TR** (natural language inference) | 42.97% | 41.53% |

> The SFT model maintains base model knowledge while gaining Turkish instruction-following capability. Benchmark scores are comparable; the real improvement is in conversational quality and cultural awareness.
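A 0-shot lm-evaluation-harness run of this kind can be launched with an invocation along the following lines. The exact Turkish task names (`mmlu_tr`, `xcopa_tr`, `xnli_tr`) are assumptions and depend on the harness version and task packs installed, so verify them first:

```shell
pip install lm-eval

# Task names below are assumptions; list what is actually available with:
#   lm_eval --tasks list
lm_eval --model hf \
  --model_args pretrained=ogulcanaydogan/Turkish-LLM-14B-Instruct,dtype=bfloat16 \
  --tasks mmlu_tr,xcopa_tr,xnli_tr \
  --num_fewshot 0 \
  --batch_size auto
```

Running the full suite requires a GPU with enough VRAM to hold the bf16 weights (see Hardware Requirements above).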
### MMLU_TR Highlights

| Top Subjects | Score | Weakest Subjects | Score |
|--------------|-------|------------------|-------|
| HS Computer Science | 84.0% | Moral Scenarios | 32.7% |
| Marketing | 77.9% | Abstract Algebra | 36.0% |
| HS US History | 77.7% | Prof. Accounting | 39.1% |
| HS European History | 76.7% | College Physics | 40.6% |
| HS Biology | 76.3% | Professional Law | 43.2% |

## Limitations and Risks

- **Language drift**: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts.
- **Hallucination**: Like all LLMs, the model can generate plausible-sounding but incorrect information.
- **English degradation**: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct.
- **Context length**: Performance may degrade on inputs significantly exceeding 4,096 tokens.
- **Bias**: The model may reflect biases present in its training data.
- **Safety**: No explicit safety alignment (RLHF/DPO) has been applied; the model is not suitable for unmoderated user-facing applications without additional safeguards.

## Ethical Considerations

This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying the model in user-facing products.
## Related Resources

| Resource | Link |
|----------|------|
| GGUF Versions | [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) |
| 7B Model | [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) |
| Live Demo (14B) | [Turkish-LLM-14B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat) |
| Live Demo (7B) | [Turkish-LLM-7B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-7B-Chat) |
| Training Pipeline | [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge) |
| Project Repository | [Turkish-LLM on GitHub](https://github.com/ogulcanaydogan/Turkish-LLM) |

## Citation

```bibtex
@misc{aydogan2026turkishllm14b,
  title     = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
  author    = {Aydogan, Ogulcan},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
```

## Contact

- Website: [ogulcanaydogan.com](https://ogulcanaydogan.com)
- GitHub: [github.com/ogulcanaydogan](https://github.com/ogulcanaydogan)
- Hugging Face: [huggingface.co/ogulcanaydogan](https://huggingface.co/ogulcanaydogan)
- LinkedIn: [linkedin.com/in/ogulcanaydogan](https://linkedin.com/in/ogulcanaydogan)