Turkish-LLM-14B-Instruct

An open-source 14.7 billion parameter language model fine-tuned for native Turkish instruction following. Built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.

Motivation

Turkish is the native language of over 80 million speakers and an agglutinative language with complex morphology that presents unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate a small fraction of their training data to Turkish, leading to:

  • Grammatical errors in suffix agreement and vowel harmony
  • Hallucinated or culturally inaccurate content
  • Code-switching to English or other languages mid-response
  • Poor performance on Turkish-specific knowledge (history, geography, institutions)

This model was developed to provide a high-quality, open-source Turkish language model that treats Turkish as a first-class language rather than an afterthought.

Model Details

Attribute           Value
Developer           Ogulcan Aydogan
Base model          Qwen2.5-14B-Instruct
Parameters          14.7B
Architecture        Transformer (decoder-only, causal LM)
Context length      4,096 tokens
Precision           bfloat16
Fine-tuning method  Supervised Fine-Tuning (SFT)
License             Apache 2.0
Language            Turkish (tr)
Release date        March 2026

Model Family

This model is part of the Turkish-LLM family:

Model                            Parameters  Base                  Method  Use Case
Turkish-LLM-14B-Instruct (this)  14.7B       Qwen2.5-14B-Instruct  SFT     Higher quality, complex reasoning
Turkish-LLM-7B-Instruct          7B          Turkcell-LLM-7b-v1    LoRA    Lightweight, faster inference

Training

Dataset

Training data was sourced from the Turkish-LLM-v10-Training dataset — a curated collection of 144,000 Turkish instruction-response pairs — with a focused SFT subset of approximately 2,600 high-quality examples selected for alignment.

  • Science: photosynthesis, the water cycle, biology, physics, chemistry (factual accuracy in Turkish scientific terminology)
  • Turkish History: Ottoman Empire, War of Independence, Republic era (culturally grounded historical knowledge)
  • Geography: the 7 geographical regions, rivers, lakes, climate (location-specific Turkish knowledge)
  • General Knowledge: education, culture, daily life, technology (broad conversational ability)
  • Anti-Repetition: specially crafted pairs (fluent prose generation without output loops)
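By way of illustration, a single instruction-response pair can be converted into chat format roughly as follows. The field names here are hypothetical; the actual schema of the Turkish-LLM-v10-Training dataset may differ.

```python
# Hypothetical record layout for one SFT pair; actual field names may differ.
record = {
    "instruction": "Fotosentez nedir?",       # "What is photosynthesis?"
    "response": "Fotosentez, bitkilerin ışık enerjisini kullanarak besin ürettiği süreçtir.",
    "domain": "science",
}

def to_chat_messages(rec: dict) -> list[dict]:
    """Convert an instruction-response pair into chat-format messages."""
    return [
        {"role": "user", "content": rec["instruction"]},
        {"role": "assistant", "content": rec["response"]},
    ]

messages = to_chat_messages(record)
print(messages[0]["role"], "->", messages[1]["role"])  # user -> assistant
```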

Training Configuration

Parameter  Value
Hardware   NVIDIA A100 80GB
Framework  PyTorch + Transformers
Precision  bfloat16 (mixed precision)
Method     Full SFT alignment
Optimizer  AdamW
Focus      Pure Turkish responses, reduced hallucination

Training Pipeline

Training was orchestrated using LowResource-LLM-Forge, a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages. The pipeline handles data preprocessing, tokenization, training, and evaluation in a single workflow.

Raw Turkish Data ──▶ Preprocessing ──▶ SFT Training ──▶ Evaluation ──▶ Deployment
  (144K pairs)        (filtering,       (A100 80GB,      (manual +       (HF Hub,
                       dedup,            bf16 mixed       qualitative)     Spaces,
                       formatting)       precision)                        vLLM)
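The preprocessing stage (filtering, dedup, formatting) can be sketched as below. The exact filters used by LowResource-LLM-Forge are not published, so the length thresholds and normalization choices here are illustrative.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Canonicalize Unicode and whitespace so near-identical pairs hash the same."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).lower()

def preprocess(pairs, min_len=20, max_len=4000):
    """Length-filter and exact-deduplicate instruction-response pairs."""
    seen, kept = set(), []
    for instruction, response in pairs:
        if not (min_len <= len(response) <= max_len):
            continue  # drop degenerate or overlong responses
        key = hashlib.sha256(normalize(instruction + response).encode()).hexdigest()
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        kept.append((instruction, response))
    return kept

sample = [
    ("Su döngüsü nedir?", "Su döngüsü, suyun buharlaşma, yoğunlaşma ve yağış yoluyla dolaşımıdır."),
    ("Su döngüsü nedir?", "Su döngüsü, suyun buharlaşma, yoğunlaşma ve yağış yoluyla dolaşımıdır."),
    ("Kısa", "Çok kısa."),  # dropped by the length filter
]
print(len(preprocess(sample)))  # 1
```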

Design Decisions

Why Qwen2.5-14B-Instruct as a base? Qwen2.5 has strong multilingual foundations with good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.

Why SFT over RLHF/DPO? For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference optimization methods.

Why 14B instead of 7B? The 7B model in the Turkish-LLM family performs well for general tasks, but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model significantly improves on these dimensions.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the released weights are stored in bfloat16
    device_map="auto"
)

messages = [
    {"role": "system", "content": "Sen yardımcı bir Türkçe yapay zeka asistanısın."},  # "You are a helpful Turkish AI assistant."
    {"role": "user", "content": "Türkiye'nin coğrafi bölgeleri nelerdir?"}  # "What are Turkey's geographical regions?"
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

vLLM (Production)

pip install vllm
vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
    --dtype float16 \
    --max-model-len 4096
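Once the server is up, vLLM exposes an OpenAI-compatible REST API. A minimal stdlib client sketch; the endpoint path and default port 8000 follow vLLM's conventions:

```python
import json
from urllib import request

# Request payload in the OpenAI chat-completions format served by vLLM.
payload = {
    "model": "ogulcanaydogan/Turkish-LLM-14B-Instruct",
    "messages": [{"role": "user", "content": "Türkiye'nin başkenti neresidir?"}],
    "max_tokens": 256,
    "temperature": 0.7,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Requires the server launched above to be running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```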

Ollama (Local)

ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct

Chat Template

This model uses the ChatML format:

<|im_start|>system
Sen yardımcı bir Türkçe yapay zeka asistanısın.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
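For reference, the layout above can be rendered by hand; in practice tokenizer.apply_chat_template produces it for you, so this sketch is only to make the format concrete:

```python
def apply_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in the ChatML layout shown above."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it.
        out += "<|im_start|>assistant\n"
    return out

prompt = apply_chatml([
    {"role": "system", "content": "Sen yardımcı bir Türkçe yapay zeka asistanısın."},
    {"role": "user", "content": "Merhaba!"},
])
print(prompt)
```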

Hardware Requirements

Precision        VRAM Required  Recommended GPUs
FP16 / BF16      ~30 GB         A100 80GB, A100 40GB, A6000
INT8             ~15 GB         RTX 4090, A10G
INT4 (GPTQ/AWQ)  ~8 GB          RTX 3090, RTX 4080, Apple M-series (24GB)

For consumer hardware, INT4 quantized versions provide the best balance of quality and accessibility.
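The table's figures can be sanity-checked with back-of-envelope arithmetic: the weights alone need roughly parameters × bytes-per-parameter, and the table rounds up to leave headroom for the KV cache and activations.

```python
PARAMS = 14.7e9  # parameter count from the model card

def weight_vram_gb(bytes_per_param: float) -> float:
    """VRAM needed for the weights alone, excluding KV cache and activations."""
    return PARAMS * bytes_per_param / 1e9

for name, bpp in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: ~{weight_vram_gb(bpp):.1f} GB")
```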

Intended Use

Recommended Applications

  • Turkish chatbots and virtual assistants
  • Turkish question answering systems
  • Educational tools for Turkish-language content
  • Turkish text summarization and generation
  • Research on Turkish NLP and low-resource language modeling

Out-of-Scope Uses

  • Medical, legal, or financial advice
  • Production systems without additional safety alignment
  • Generation of misleading or harmful content
  • Tasks requiring high factual precision without human verification

Limitations and Risks

  • Language drift: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts
  • Hallucination: Like all LLMs, the model can generate plausible-sounding but incorrect information
  • English degradation: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct
  • Context length: Performance may degrade on inputs significantly exceeding 4,096 tokens
  • Bias: The model may reflect biases present in its training data
  • Safety: No explicit safety alignment (RLHF/DPO) has been applied; not suitable for unmoderated user-facing applications without additional safeguards
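As a cheap, illustrative mitigation for the language-drift mode above, outputs can be screened for CJK codepoints before display. This heuristic is not part of the model and only catches drift into Chinese, not English:

```python
import unicodedata

def contains_cjk(text: str) -> bool:
    """Return True if any character is a CJK codepoint (per its Unicode name)."""
    return any("CJK" in unicodedata.name(ch, "") for ch in text)

print(contains_cjk("Merhaba, nasılsınız?"))  # False
print(contains_cjk("你好"))                   # True
```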

Ethical Considerations

This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying in user-facing products.

Related Resources

Resource                 Link
7B Model                 Turkish-LLM-7B-Instruct
Training Dataset (144K)  Turkish-LLM-v10-Training
Live Demo (14B)          Turkish-LLM-14B-Chat
Live Demo (7B)           Turkish-LLM-7B-Chat
Training Pipeline        LowResource-LLM-Forge
Project Repository       Turkish-LLM on GitHub

Citation

@misc{aydogan2026turkishllm14b,
  title     = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
  author    = {Aydogan, Ogulcan},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
