---
language:
- tr
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
tags:
- turkish
- qwen2
- sft
- 14b
- text-generation
- instruction-tuned
- low-resource
- nlp
pipeline_tag: text-generation
model-index:
- name: Turkish-LLM-14B-Instruct
  results: []
---

# Turkish-LLM-14B-Instruct

An open-source 14.7-billion-parameter language model fine-tuned for native Turkish instruction following. Built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.

<p align="center">
  <a href="https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat"><img src="https://img.shields.io/badge/Demo-Live_Chat-blue?style=for-the-badge&logo=huggingface" alt="Demo"></a>
  <a href="https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF"><img src="https://img.shields.io/badge/GGUF-Quantized_Versions-orange?style=for-the-badge&logo=huggingface" alt="GGUF"></a>
  <a href="https://github.com/ogulcanaydogan/Turkish-LLM"><img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub"></a>
  <a href="https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct"><img src="https://img.shields.io/badge/Also_Available-7B_Model-yellow?style=for-the-badge&logo=huggingface" alt="7B"></a>
</p>

---

## Motivation

Turkish is spoken natively by over **80 million people** and is an agglutinative language whose complex morphology poses unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate only a small fraction of their training data to Turkish, leading to:

- Grammatical errors in suffix agreement and vowel harmony
- Hallucinated or culturally inaccurate content
- Code-switching to English or other languages mid-response
- Poor performance on Turkish-specific knowledge (history, geography, institutions)

This model was developed to provide a **high-quality, open-source Turkish language model** that treats Turkish as a first-class language rather than an afterthought.

## Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | [Ogulcan Aydogan](https://ogulcanaydogan.com) |
| **Base model** | [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
| **Parameters** | 14.7B |
| **Architecture** | Transformer (decoder-only, causal LM) |
| **Context length** | 4,096 tokens |
| **Precision** | bfloat16 |
| **Fine-tuning method** | Supervised Fine-Tuning (SFT) |
| **License** | Apache 2.0 |
| **Language** | Turkish (tr) |
| **Release date** | March 2026 |
|
| | ### Model Family |
| |
|
| | | Model | Parameters | Base | Method | Use Case | |
| | |-------|-----------|------|--------|----------| |
| | | **Turkish-LLM-14B-Instruct** (this) | 14.7B | Qwen2.5-14B-Instruct | SFT | Higher quality, complex reasoning | |
| | | [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) | 14.7B | This model | GGUF quantized | Local/edge deployment | |
| | | [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) | 7B | Turkcell-LLM-7b-v1 | LoRA | Lightweight, faster inference | |
| |
|
| | ## Training |
| |
|
| | ### Dataset |
| |
|
| | Training data consists of a curated collection of **144,000 Turkish instruction-response pairs**, with a focused SFT subset of approximately 2,600 high-quality examples selected for alignment. |
| |
|
| | | Domain | Examples | Purpose | |
| | |--------|----------|---------| |
| | | Science | Photosynthesis, water cycle, biology, physics, chemistry | Factual accuracy in Turkish scientific terminology | |
| | | Turkish History | Ottoman Empire, War of Independence, Republic era | Culturally grounded historical knowledge | |
| | | Geography | 7 geographical regions, rivers, lakes, climate | Location-specific Turkish knowledge | |
| | | General Knowledge | Education, culture, daily life, technology | Broad conversational ability | |
| | | Anti-Repetition | Specially crafted pairs | Fluent prose generation without output loops | |
| |
|
| | ### Training Configuration |
| |
|
| | | Parameter | Value | |
| | |-----------|-------| |
| | | Hardware | NVIDIA A100 80GB | |
| | | Framework | PyTorch + Transformers | |
| | | Precision | bfloat16 (mixed precision) | |
| | | Method | Full SFT alignment | |
| | | Optimizer | AdamW | |
| | | Focus | Pure Turkish responses, reduced hallucination | |
| |
|
| | ### Training Pipeline |
| |
|
| | Training was orchestrated using [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge), a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages. |
| |
|
| | ``` |
| | Raw Turkish Data --> Preprocessing --> SFT Training --> Evaluation --> Deployment |
| | (144K pairs) (filtering, (A100 80GB, (manual + (HF Hub, |
| | dedup, bf16 mixed qualitative) Spaces, |
| | formatting) precision) vLLM) |
| | ``` |
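
The preprocessing stage (filtering, deduplication, formatting) can be sketched as follows. This is an illustrative example, not the actual LowResource-LLM-Forge code; the record keys `instruction` and `response` and the length threshold are assumptions.

```python
import hashlib

def preprocess(pairs, min_len=10):
    """Filter and deduplicate instruction-response pairs (illustrative sketch).

    `pairs` is a list of dicts with assumed keys "instruction" and "response".
    """
    seen = set()
    cleaned = []
    for pair in pairs:
        instruction = pair.get("instruction", "").strip()
        response = pair.get("response", "").strip()
        # Filtering: drop empty or very short examples
        if len(instruction) < min_len or len(response) < min_len:
            continue
        # Deduplication: hash the normalized pair text
        key = hashlib.sha256(
            f"{instruction}\n{response}".lower().encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned
```

The output records can then be formatted into the chat template shown under Usage before tokenization.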

### Design Decisions

**Why Qwen2.5-14B-Instruct as a base?** Qwen2.5 has strong multilingual foundations and good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.

**Why SFT over RLHF/DPO?** For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference-optimization methods.

**Why 14B instead of 7B?** The 7B model in the Turkish-LLM family performs well on general tasks but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model improves significantly on all three dimensions.
|
| | ## Usage |
| |
|
| | ### Transformers |
| |
|
| | ```python |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | import torch |
| | |
| | model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct" |
| | tokenizer = AutoTokenizer.from_pretrained(model_id) |
| | model = AutoModelForCausalLM.from_pretrained( |
| | model_id, |
| | torch_dtype=torch.float16, |
| | device_map="auto" |
| | ) |
| | |
| | messages = [ |
| | {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."}, |
| | {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"} |
| | ] |
| | |
| | text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| | inputs = tokenizer(text, return_tensors="pt").to(model.device) |
| | outputs = model.generate( |
| | **inputs, |
| | max_new_tokens=512, |
| | temperature=0.7, |
| | top_p=0.9, |
| | repetition_penalty=1.15, |
| | do_sample=True |
| | ) |
| | print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)) |
| | ``` |

### vLLM (Production)

```bash
pip install vllm
vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
  --dtype float16 \
  --max-model-len 4096
```
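
Once serving, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal stdlib-only client sketch, assuming the default base URL and a running server:

```python
import json
import urllib.request

def build_chat_request(user_message,
                       system_prompt="Sen yardimci bir Turkce yapay zeka asistanisin."):
    """Build an OpenAI-compatible /v1/chat/completions payload (sketch)."""
    return {
        "model": "ogulcanaydogan/Turkish-LLM-14B-Instruct",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
    }

def chat(user_message, base_url="http://localhost:8000"):
    """POST the request to a locally running vLLM server and return the reply text."""
    payload = json.dumps(build_chat_request(user_message)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way if pointed at the server's base URL.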

### Ollama (Local)

```bash
ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF:Q4_K_M
```

### GGUF (llama.cpp / LM Studio)

Quantized GGUF versions (Q4_K_M, Q5_K_M, Q8_0, F16) are available at [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF).

### Chat Template

This model uses the ChatML format:

```
<|im_start|>system
Sen yardimci bir Turkce yapay zeka asistanisin.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
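
When `tokenizer.apply_chat_template` is not available (for example, in a custom serving stack), the prompt can be assembled by hand from the template above. A minimal sketch:

```python
def build_chatml_prompt(messages):
    """Assemble a ChatML prompt from a list of {"role", "content"} dicts,
    following the template shown above. Ends with an open assistant turn
    so the model generates the response."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

In practice, prefer `apply_chat_template`, which reads the template shipped with the tokenizer and stays in sync with it.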

## Hardware Requirements

| Precision | VRAM Required | Recommended GPUs |
|-----------|---------------|------------------|
| FP16 / BF16 | ~30 GB | A100 80GB, A100 40GB, A6000 |
| INT8 | ~15 GB | RTX 4090, A10G |
| INT4 (GPTQ/AWQ) | ~8 GB | RTX 3090, RTX 4080, Apple M-series (24GB) |

For consumer hardware, use the [GGUF versions](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) for the best balance of quality and accessibility.

## Intended Use

### Recommended Applications

- Turkish chatbots and virtual assistants
- Turkish question answering systems
- Educational tools for Turkish-language content
- Turkish text summarization and generation
- Research on Turkish NLP and low-resource language modeling

### Out-of-Scope Uses

- Medical, legal, or financial advice
- Production systems without additional safety alignment
- Generation of misleading or harmful content
- Tasks requiring high factual precision without human verification

## Limitations and Risks

- **Language drift**: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts
- **Hallucination**: Like all LLMs, the model can generate plausible-sounding but incorrect information
- **English degradation**: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct
- **Context length**: Performance may degrade on inputs significantly exceeding 4,096 tokens
- **Bias**: The model may reflect biases present in its training data
- **Safety**: No explicit safety alignment (RLHF/DPO) has been applied; not suitable for unmoderated user-facing applications without additional safeguards

## Ethical Considerations

This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying in user-facing products.

## Related Resources

| Resource | Link |
|----------|------|
| GGUF Versions | [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) |
| 7B Model | [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) |
| Live Demo (14B) | [Turkish-LLM-14B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat) |
| Live Demo (7B) | [Turkish-LLM-7B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-7B-Chat) |
| Training Pipeline | [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge) |
| Project Repository | [Turkish-LLM on GitHub](https://github.com/ogulcanaydogan/Turkish-LLM) |

## Citation

```bibtex
@misc{aydogan2026turkishllm14b,
  title     = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
  author    = {Aydogan, Ogulcan},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
```

## Contact

- Website: [ogulcanaydogan.com](https://ogulcanaydogan.com)
- GitHub: [github.com/ogulcanaydogan](https://github.com/ogulcanaydogan)
- Hugging Face: [huggingface.co/ogulcanaydogan](https://huggingface.co/ogulcanaydogan)
- LinkedIn: [linkedin.com/in/ogulcanaydogan](https://linkedin.com/in/ogulcanaydogan)