---
language:
- tr
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
tags:
- turkish
- qwen2
- sft
- 14b
- text-generation
- instruction-tuned
- low-resource
- nlp
pipeline_tag: text-generation
model-index:
- name: Turkish-LLM-14B-Instruct
results: []
---
# Turkish-LLM-14B-Instruct
An open-source 14.7 billion parameter language model fine-tuned for native Turkish instruction following. Built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.
<p align="center">
<a href="https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat"><img src="https://img.shields.io/badge/Demo-Live_Chat-blue?style=for-the-badge&logo=huggingface" alt="Demo"></a>
<a href="https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF"><img src="https://img.shields.io/badge/GGUF-Quantized_Versions-orange?style=for-the-badge&logo=huggingface" alt="GGUF"></a>
<a href="https://github.com/ogulcanaydogan/Turkish-LLM"><img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct"><img src="https://img.shields.io/badge/Also_Available-7B_Model-yellow?style=for-the-badge&logo=huggingface" alt="7B"></a>
</p>
---
## Motivation
Turkish is the native language of over **80 million speakers** and an agglutinative language with complex morphology that presents unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate a small fraction of their training data to Turkish, leading to:
- Grammatical errors in suffix agreement and vowel harmony
- Hallucinated or culturally inaccurate content
- Code-switching to English or other languages mid-response
- Poor performance on Turkish-specific knowledge (history, geography, institutions)
This model was developed to provide a **high-quality, open-source Turkish language model** that treats Turkish as a first-class language rather than an afterthought.
## Model Details
| Attribute | Value |
|-----------|-------|
| **Developer** | [Ogulcan Aydogan](https://ogulcanaydogan.com) |
| **Base model** | [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
| **Parameters** | 14.7B |
| **Architecture** | Transformer (decoder-only, causal LM) |
| **Context length** | 4,096 tokens |
| **Precision** | bfloat16 |
| **Fine-tuning method** | Supervised Fine-Tuning (SFT) |
| **License** | Apache 2.0 |
| **Language** | Turkish (tr) |
| **Release date** | March 2026 |
### Model Family
| Model | Parameters | Base | Method | Use Case |
|-------|-----------|------|--------|----------|
| **Turkish-LLM-14B-Instruct** (this model) | 14.7B | Qwen2.5-14B-Instruct | SFT | Higher quality, complex reasoning |
| [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) | 14.7B | This model | GGUF quantized | Local/edge deployment |
| [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) | 7B | Turkcell-LLM-7b-v1 | LoRA | Lightweight, faster inference |
## Training
### Dataset
Training data consists of a curated collection of **144,000 Turkish instruction-response pairs**, with a focused SFT subset of approximately 2,600 high-quality examples selected for alignment.
| Domain | Examples | Purpose |
|--------|----------|---------|
| Science | Photosynthesis, water cycle, biology, physics, chemistry | Factual accuracy in Turkish scientific terminology |
| Turkish History | Ottoman Empire, War of Independence, Republic era | Culturally grounded historical knowledge |
| Geography | 7 geographical regions, rivers, lakes, climate | Location-specific Turkish knowledge |
| General Knowledge | Education, culture, daily life, technology | Broad conversational ability |
| Anti-Repetition | Specially crafted pairs | Fluent prose generation without output loops |
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Hardware | NVIDIA A100 80GB |
| Framework | PyTorch + Transformers |
| Precision | bfloat16 (mixed precision) |
| Method | Full SFT alignment |
| Optimizer | AdamW |
| Focus | Pure Turkish responses, reduced hallucination |
### Training Pipeline
Training was orchestrated using [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge), a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages.
```
Raw Turkish Data --> Preprocessing --> SFT Training --> Evaluation --> Deployment
(144K pairs) (filtering, (A100 80GB, (manual + (HF Hub,
dedup, bf16 mixed qualitative) Spaces,
formatting) precision) vLLM)
```
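The exact filters used in the preprocessing stage are not published. As an illustration only, the deduplication step can be sketched roughly as follows; the `{"instruction", "response"}` record format here is an assumption for the example, not the pipeline's actual schema:

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so near-identical prompts hash the same.
    return " ".join(text.lower().split())

def dedup_pairs(pairs):
    """Drop duplicate instruction-response pairs (exact match after normalization)."""
    seen = set()
    unique = []
    for pair in pairs:
        key = hashlib.sha256(normalize(pair["instruction"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

pairs = [
    {"instruction": "Fotosentez nedir?", "response": "..."},
    {"instruction": "Fotosentez  nedir?", "response": "..."},  # duplicate up to spacing
    {"instruction": "Su dongusu nedir?", "response": "..."},
]
print(len(dedup_pairs(pairs)))  # 2
```

A real pipeline would typically add near-duplicate detection (e.g. MinHash) and quality filtering on top of exact deduplication.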
### Design Decisions
**Why Qwen2.5-14B-Instruct as a base?** Qwen2.5 has strong multilingual foundations with good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.
**Why SFT over RLHF/DPO?** For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference optimization methods.
**Why 14B instead of 7B?** The 7B model in the Turkish-LLM family performs well for general tasks, but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model significantly improves on these dimensions.
## Usage
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision; float16 also works
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### vLLM (Production)
```bash
pip install vllm
vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
--dtype float16 \
--max-model-len 4096
```
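The vLLM server exposes an OpenAI-compatible API, by default at `http://localhost:8000/v1`. A minimal client request using only the standard library (the endpoint and port assume vLLM defaults):

```python
import json
import urllib.request

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "ogulcanaydogan/Turkish-LLM-14B-Instruct",
    "messages": [
        {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server from the command above running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library (e.g. the `openai` Python package pointed at the local base URL) works the same way.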
### Ollama (Local)
```bash
ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF:Q4_K_M
```
### GGUF (llama.cpp / LM Studio)
Quantized GGUF versions (Q4_K_M, Q5_K_M, Q8_0, F16) are available at [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF).
### Chat Template
This model uses the ChatML format:
```
<|im_start|>system
Sen yardimci bir Turkce yapay zeka asistanisin.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
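If you are not going through `tokenizer.apply_chat_template`, the same ChatML string can be assembled by hand. A minimal renderer mirroring the template above (the tokenizer's built-in template may add details such as a default system prompt, so prefer it when available):

```python
def build_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages in the ChatML format shown above."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model completes it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Fotosentez nedir?"},
])
print(prompt)
```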
## Hardware Requirements
| Precision | VRAM Required | Recommended GPUs |
|-----------|--------------|------------------|
| FP16 / BF16 | ~30 GB | A100 80GB, A100 40GB, A6000 |
| INT8 | ~15 GB | RTX 4090, A10G |
| INT4 (GPTQ/AWQ) | ~8 GB | RTX 3090, RTX 4080, Apple M-series (24GB) |
For consumer hardware, use the [GGUF versions](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) for the best balance of quality and accessibility.
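The VRAM figures in the table follow roughly from weight size alone: parameter count times bytes per parameter, plus headroom for the KV cache and activations. A quick back-of-the-envelope check:

```python
params = 14.7e9  # parameter count from the model card

for name, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{weights_gb:.1f} GB for weights alone")
# FP16/BF16: ~29.4 GB for weights alone
# INT8: ~14.7 GB for weights alone
# INT4: ~7.3 GB for weights alone
```

The gap between these raw weight sizes and the table's recommendations is the working memory needed at inference time, which grows with batch size and sequence length.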
## Intended Use
### Recommended Applications
- Turkish chatbots and virtual assistants
- Turkish question answering systems
- Educational tools for Turkish-language content
- Turkish text summarization and generation
- Research on Turkish NLP and low-resource language modeling
### Out-of-Scope Uses
- Medical, legal, or financial advice
- Production systems without additional safety alignment
- Generation of misleading or harmful content
- Tasks requiring high factual precision without human verification
## Limitations and Risks
- **Language drift**: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts
- **Hallucination**: Like all LLMs, the model can generate plausible-sounding but incorrect information
- **English degradation**: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct
- **Context length**: Performance may degrade on inputs significantly exceeding 4,096 tokens
- **Bias**: The model may reflect biases present in its training data
- **Safety**: No explicit safety alignment (RLHF/DPO) has been applied; not suitable for unmoderated user-facing applications without additional safeguards
## Ethical Considerations
This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying in user-facing products.
## Related Resources
| Resource | Link |
|----------|------|
| GGUF Versions | [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) |
| 7B Model | [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) |
| Live Demo (14B) | [Turkish-LLM-14B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat) |
| Live Demo (7B) | [Turkish-LLM-7B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-7B-Chat) |
| Training Pipeline | [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge) |
| Project Repository | [Turkish-LLM on GitHub](https://github.com/ogulcanaydogan/Turkish-LLM) |
## Citation
```bibtex
@misc{aydogan2026turkishllm14b,
title = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
author = {Aydogan, Ogulcan},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
```
## Contact
- Website: [ogulcanaydogan.com](https://ogulcanaydogan.com)
- GitHub: [github.com/ogulcanaydogan](https://github.com/ogulcanaydogan)
- Hugging Face: [huggingface.co/ogulcanaydogan](https://huggingface.co/ogulcanaydogan)
- LinkedIn: [linkedin.com/in/ogulcanaydogan](https://linkedin.com/in/ogulcanaydogan)