---
language:
- tr
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
tags:
- turkish
- qwen2
- sft
- 14b
- text-generation
- instruction-tuned
- low-resource
- nlp
pipeline_tag: text-generation
model-index:
- name: Turkish-LLM-14B-Instruct
  results: []
---

# Turkish-LLM-14B-Instruct

An open-source 14.7-billion-parameter language model fine-tuned for native Turkish instruction following. Built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.


---

## Motivation

Turkish is the native language of over **80 million speakers** and an agglutinative language whose complex morphology presents unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate a small fraction of their training data to Turkish, leading to:

- Grammatical errors in suffix agreement and vowel harmony
- Hallucinated or culturally inaccurate content
- Code-switching to English or other languages mid-response
- Poor performance on Turkish-specific knowledge (history, geography, institutions)

This model was developed to provide a **high-quality, open-source Turkish language model** that treats Turkish as a first-class language rather than an afterthought.

## Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | [Ogulcan Aydogan](https://ogulcanaydogan.com) |
| **Base model** | [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
| **Parameters** | 14.7B |
| **Architecture** | Transformer (decoder-only, causal LM) |
| **Context length** | 4,096 tokens |
| **Precision** | bfloat16 |
| **Fine-tuning method** | Supervised Fine-Tuning (SFT) |
| **License** | Apache 2.0 |
| **Language** | Turkish (tr) |
| **Release date** | March 2026 |

### Model Family

| Model | Parameters | Base | Method | Use Case |
|-------|-----------|------|--------|----------|
| **Turkish-LLM-14B-Instruct** (this) | 14.7B | Qwen2.5-14B-Instruct | SFT | Higher quality, complex reasoning |
| [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) | 14.7B | This model | GGUF quantized | Local/edge deployment |
| [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) | 7B | Turkcell-LLM-7b-v1 | LoRA | Lightweight, faster inference |

## Training

### Dataset

Training data consists of a curated collection of **144,000 Turkish instruction-response
pairs**, with a focused SFT subset of approximately 2,600 high-quality examples selected for alignment.

| Domain | Examples | Purpose |
|--------|----------|---------|
| Science | Photosynthesis, water cycle, biology, physics, chemistry | Factual accuracy in Turkish scientific terminology |
| Turkish History | Ottoman Empire, War of Independence, Republic era | Culturally grounded historical knowledge |
| Geography | 7 geographical regions, rivers, lakes, climate | Location-specific Turkish knowledge |
| General Knowledge | Education, culture, daily life, technology | Broad conversational ability |
| Anti-Repetition | Specially crafted pairs | Fluent prose generation without output loops |

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Hardware | NVIDIA A100 80GB |
| Framework | PyTorch + Transformers |
| Precision | bfloat16 (mixed precision) |
| Method | Full SFT alignment |
| Optimizer | AdamW |
| Focus | Pure Turkish responses, reduced hallucination |

### Training Pipeline

Training was orchestrated using [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge), a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages.

```
Raw Turkish Data --> Preprocessing --> SFT Training --> Evaluation   --> Deployment
   (144K pairs)      (filtering,      (A100 80GB,      (manual +        (HF Hub,
                      dedup,           bf16 mixed       qualitative)     Spaces,
                      formatting)      precision)                        vLLM)
```

### Design Decisions

**Why Qwen2.5-14B-Instruct as a base?** Qwen2.5 has strong multilingual foundations with good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.

**Why SFT over RLHF/DPO?** For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference optimization methods.
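The preprocessing stage of the pipeline (filtering, dedup, formatting) can be sketched roughly as below. This is an illustrative sketch only, not the actual LowResource-LLM-Forge code: the record fields (`instruction`, `response`) and the length threshold are assumptions.

```python
# Illustrative sketch of the preprocessing step: filter -> dedup -> ChatML formatting.
# Field names ("instruction", "response") and min_len are assumptions, not the real schema.

def preprocess(records, min_len=10):
    """Filter short pairs, drop exact duplicates, render ChatML training text."""
    seen = set()
    out = []
    for rec in records:
        instruction = rec["instruction"].strip()
        response = rec["response"].strip()
        if len(response) < min_len:      # filtering: drop trivially short answers
            continue
        key = (instruction, response)
        if key in seen:                  # dedup: exact match on the pair
            continue
        seen.add(key)
        out.append(                      # formatting: ChatML turns (see Chat Template below)
            "<|im_start|>user\n" + instruction + "<|im_end|>\n"
            "<|im_start|>assistant\n" + response + "<|im_end|>"
        )
    return out
```

Exact-match dedup is the simplest choice here; a real pipeline might also use near-duplicate detection.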
**Why 14B instead of 7B?** The 7B model in the Turkish-LLM family performs well for general tasks, but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model significantly improves on these dimensions.

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True
)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### vLLM (Production)

```bash
pip install vllm

vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
  --dtype float16 \
  --max-model-len 4096
```

### Ollama (Local)

```bash
ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF:Q4_K_M
```

### GGUF (llama.cpp / LM Studio)

Quantized GGUF versions (Q4_K_M, Q5_K_M, Q8_0, F16) are available at [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF).
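A vLLM server started as in the Usage section exposes an OpenAI-compatible chat-completions endpoint. A minimal stdlib-only client might look like the following sketch; the port (vLLM's default 8000) and the sampling parameters are assumptions:

```python
import json
from urllib import request


def build_chat_request(user_message,
                       model="ogulcanaydogan/Turkish-LLM-14B-Instruct",
                       system="Sen yardimci bir Turkce yapay zeka asistanisin."):
    """Build an OpenAI-style chat-completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
    }


if __name__ == "__main__":
    # Assumes `vllm serve` is running locally on its default port.
    payload = json.dumps(
        build_chat_request("Turkiye'nin cografi bolgeleri nelerdir?")
    ).encode("utf-8")
    req = request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible SDK works equally well; the raw-HTTP version is shown only to keep the sketch dependency-free.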
### Chat Template

This model uses the ChatML format:

```
<|im_start|>system
Sen yardimci bir Turkce yapay zeka asistanisin.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```

## Hardware Requirements

| Precision | VRAM Required | Recommended GPUs |
|-----------|---------------|------------------|
| FP16 / BF16 | ~30 GB | A100 80GB, A100 40GB, A6000 |
| INT8 | ~15 GB | RTX 4090, A10G |
| INT4 (GPTQ/AWQ) | ~8 GB | RTX 3090, RTX 4080, Apple M-series (24GB) |

For consumer hardware, use the [GGUF versions](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) for the best balance of quality and accessibility.

## Intended Use

### Recommended Applications

- Turkish chatbots and virtual assistants
- Turkish question answering systems
- Educational tools for Turkish-language content
- Turkish text summarization and generation
- Research on Turkish NLP and low-resource language modeling

### Out-of-Scope Uses

- Medical, legal, or financial advice
- Production systems without additional safety alignment
- Generation of misleading or harmful content
- Tasks requiring high factual precision without human verification

## Benchmark Results

Evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with 0-shot settings.

| Benchmark | Turkish-LLM-14B (SFT) | Qwen2.5-14B-Instruct (Base) |
|-----------|----------------------|----------------------------|
| **MMLU_TR** (57 subjects) | 59.38% | 59.47% |
| **XCOPA_TR** (causal reasoning) | 66.00% | 66.80% |
| **XNLI_TR** (natural language inference) | 42.97% | 41.53% |

> The SFT model maintains base model knowledge while gaining Turkish instruction-following capability. Benchmark scores are comparable; the real improvement is in conversational quality and cultural awareness.
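A 0-shot lm-evaluation-harness run of this kind can be launched with an invocation along the following lines. The exact Turkish task names (`mmlu_tr`, `xcopa_tr`, `xnli_tr`) are assumptions and depend on the harness version and task packs installed, so verify them first:

```shell
pip install lm-eval

# Task names below are assumptions; list what is actually available with:
#   lm_eval --tasks list
lm_eval --model hf \
  --model_args pretrained=ogulcanaydogan/Turkish-LLM-14B-Instruct,dtype=bfloat16 \
  --tasks mmlu_tr,xcopa_tr,xnli_tr \
  --num_fewshot 0 \
  --batch_size auto
```

Running the full suite requires a GPU with enough VRAM to hold the bf16 weights (see Hardware Requirements above).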
### MMLU_TR Highlights

| Top Subjects | Score | Weakest Subjects | Score |
|--------------|-------|------------------|-------|
| HS Computer Science | 84.0% | Moral Scenarios | 32.7% |
| Marketing | 77.9% | Abstract Algebra | 36.0% |
| HS US History | 77.7% | Prof. Accounting | 39.1% |
| HS European History | 76.7% | College Physics | 40.6% |
| HS Biology | 76.3% | Professional Law | 43.2% |

## Limitations and Risks

- **Language drift**: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts.
- **Hallucination**: Like all LLMs, the model can generate plausible-sounding but incorrect information.
- **English degradation**: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct.
- **Context length**: Performance may degrade on inputs significantly exceeding 4,096 tokens.
- **Bias**: The model may reflect biases present in its training data.
- **Safety**: No explicit safety alignment (RLHF/DPO) has been applied; the model is not suitable for unmoderated user-facing applications without additional safeguards.

## Ethical Considerations

This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying the model in user-facing products.
## Related Resources

| Resource | Link |
|----------|------|
| GGUF Versions | [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) |
| 7B Model | [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) |
| Live Demo (14B) | [Turkish-LLM-14B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat) |
| Live Demo (7B) | [Turkish-LLM-7B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-7B-Chat) |
| Training Pipeline | [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge) |
| Project Repository | [Turkish-LLM on GitHub](https://github.com/ogulcanaydogan/Turkish-LLM) |

## Citation

```bibtex
@misc{aydogan2026turkishllm14b,
  title     = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
  author    = {Aydogan, Ogulcan},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
```

## Contact

- Website: [ogulcanaydogan.com](https://ogulcanaydogan.com)
- GitHub: [github.com/ogulcanaydogan](https://github.com/ogulcanaydogan)
- Hugging Face: [huggingface.co/ogulcanaydogan](https://huggingface.co/ogulcanaydogan)
- LinkedIn: [linkedin.com/in/ogulcanaydogan](https://linkedin.com/in/ogulcanaydogan)