---
language:
- tr
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
tags:
- turkish
- qwen2
- sft
- 14b
- text-generation
- instruction-tuned
- low-resource
- nlp
pipeline_tag: text-generation
model-index:
- name: Turkish-LLM-14B-Instruct
results: []
---
# Turkish-LLM-14B-Instruct
An open-source 14.7 billion parameter language model fine-tuned for native Turkish instruction following. Built on Qwen2.5-14B-Instruct using supervised fine-tuning (SFT) on a curated corpus of Turkish-language examples spanning science, history, geography, and general knowledge.
<p align="center">
<a href="https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat"><img src="https://img.shields.io/badge/Demo-Live_Chat-blue?style=for-the-badge&logo=huggingface" alt="Demo"></a>
<a href="https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF"><img src="https://img.shields.io/badge/GGUF-Quantized_Versions-orange?style=for-the-badge&logo=huggingface" alt="GGUF"></a>
<a href="https://github.com/ogulcanaydogan/Turkish-LLM"><img src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct"><img src="https://img.shields.io/badge/Also_Available-7B_Model-yellow?style=for-the-badge&logo=huggingface" alt="7B"></a>
</p>
---
## Motivation
Turkish is the native language of over **80 million speakers** and an agglutinative language with complex morphology that presents unique challenges for language models. Despite this, Turkish remains significantly underrepresented in the open-source LLM ecosystem. Most multilingual models allocate a small fraction of their training data to Turkish, leading to:
- Grammatical errors in suffix agreement and vowel harmony
- Hallucinated or culturally inaccurate content
- Code-switching to English or other languages mid-response
- Poor performance on Turkish-specific knowledge (history, geography, institutions)
This model was developed to provide a **high-quality, open-source Turkish language model** that treats Turkish as a first-class language rather than an afterthought.
## Model Details
| Attribute | Value |
|-----------|-------|
| **Developer** | [Ogulcan Aydogan](https://ogulcanaydogan.com) |
| **Base model** | [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) |
| **Parameters** | 14.7B |
| **Architecture** | Transformer (decoder-only, causal LM) |
| **Context length** | 4,096 tokens |
| **Precision** | bfloat16 |
| **Fine-tuning method** | Supervised Fine-Tuning (SFT) |
| **License** | Apache 2.0 |
| **Language** | Turkish (tr) |
| **Release date** | March 2026 |
### Model Family
| Model | Parameters | Base | Method | Use Case |
|-------|-----------|------|--------|----------|
| **Turkish-LLM-14B-Instruct** (this model) | 14.7B | Qwen2.5-14B-Instruct | SFT | Higher quality, complex reasoning |
| [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) | 14.7B | This model | GGUF quantized | Local/edge deployment |
| [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) | 7B | Turkcell-LLM-7b-v1 | LoRA | Lightweight, faster inference |
## Training
### Dataset
Training data consists of a curated collection of **144,000 Turkish instruction-response pairs**, with a focused SFT subset of approximately 2,600 high-quality examples selected for alignment.
| Domain | Examples | Purpose |
|--------|----------|---------|
| Science | Photosynthesis, water cycle, biology, physics, chemistry | Factual accuracy in Turkish scientific terminology |
| Turkish History | Ottoman Empire, War of Independence, Republic era | Culturally grounded historical knowledge |
| Geography | 7 geographical regions, rivers, lakes, climate | Location-specific Turkish knowledge |
| General Knowledge | Education, culture, daily life, technology | Broad conversational ability |
| Anti-Repetition | Specially crafted pairs | Fluent prose generation without output loops |
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Hardware | NVIDIA A100 80GB |
| Framework | PyTorch + Transformers |
| Precision | bfloat16 (mixed precision) |
| Method | Full SFT alignment |
| Optimizer | AdamW |
| Focus | Pure Turkish responses, reduced hallucination |
### Training Pipeline
Training was orchestrated using [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge), a custom pipeline built for efficient fine-tuning of LLMs for low-resource languages.
```
Raw Turkish Data --> Preprocessing --> SFT Training --> Evaluation --> Deployment
(144K pairs) (filtering, (A100 80GB, (manual + (HF Hub,
dedup, bf16 mixed qualitative) Spaces,
formatting) precision) vLLM)
```
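The exact filters used in the preprocessing stage are not published. As an illustration only, the deduplication step can be sketched roughly as follows; the `{"instruction", "response"}` record format here is an assumption for the example, not the pipeline's actual schema:

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so near-identical prompts hash the same.
    return " ".join(text.lower().split())

def dedup_pairs(pairs):
    """Drop duplicate instruction-response pairs (exact match after normalization)."""
    seen = set()
    unique = []
    for pair in pairs:
        key = hashlib.sha256(normalize(pair["instruction"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

pairs = [
    {"instruction": "Fotosentez nedir?", "response": "..."},
    {"instruction": "Fotosentez  nedir?", "response": "..."},  # duplicate up to spacing
    {"instruction": "Su dongusu nedir?", "response": "..."},
]
print(len(dedup_pairs(pairs)))  # 2
```

A real pipeline would typically add near-duplicate detection (e.g. MinHash) and quality filtering on top of exact deduplication.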
### Design Decisions
**Why Qwen2.5-14B-Instruct as a base?** Qwen2.5 has strong multilingual foundations with good initial Turkish tokenization coverage. The 14B parameter count provides enough capacity for Turkish morphological complexity without being prohibitively expensive to fine-tune or serve.
**Why SFT over RLHF/DPO?** For an initial release targeting factual accuracy and instruction following, SFT provides a reliable baseline. Future versions will explore preference optimization methods.
**Why 14B instead of 7B?** The 7B model in the Turkish-LLM family performs well for general tasks, but struggles with complex reasoning, multi-step explanations, and nuanced Turkish grammar. The 14B model significantly improves on these dimensions.
## Usage
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ogulcanaydogan/Turkish-LLM-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision; float16 also works
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### vLLM (Production)
```bash
pip install vllm
vllm serve ogulcanaydogan/Turkish-LLM-14B-Instruct \
--dtype float16 \
--max-model-len 4096
```
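The vLLM server exposes an OpenAI-compatible API, by default at `http://localhost:8000/v1`. A minimal client request using only the standard library (the endpoint and port assume vLLM defaults):

```python
import json
import urllib.request

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "ogulcanaydogan/Turkish-LLM-14B-Instruct",
    "messages": [
        {"role": "user", "content": "Turkiye'nin cografi bolgeleri nelerdir?"},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server from the command above running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library (e.g. the `openai` Python package pointed at the local base URL) works the same way.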
### Ollama (Local)
```bash
ollama run hf.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF:Q4_K_M
```
### GGUF (llama.cpp / LM Studio)
Quantized GGUF versions (Q4_K_M, Q5_K_M, Q8_0, F16) are available at [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF).
### Chat Template
This model uses the ChatML format:
```
<|im_start|>system
Sen yardimci bir Turkce yapay zeka asistanisin.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
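If you are not going through `tokenizer.apply_chat_template`, the same ChatML string can be assembled by hand. A minimal renderer mirroring the template above (the tokenizer's built-in template may add details such as a default system prompt, so prefer it when available):

```python
def build_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages in the ChatML format shown above."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model completes it.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml([
    {"role": "system", "content": "Sen yardimci bir Turkce yapay zeka asistanisin."},
    {"role": "user", "content": "Fotosentez nedir?"},
])
print(prompt)
```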
## Hardware Requirements
| Precision | VRAM Required | Recommended GPUs |
|-----------|--------------|------------------|
| FP16 / BF16 | ~30 GB | A100 80GB, A100 40GB, A6000 |
| INT8 | ~15 GB | RTX 4090, A10G |
| INT4 (GPTQ/AWQ) | ~8 GB | RTX 3090, RTX 4080, Apple M-series (24GB) |
For consumer hardware, use the [GGUF versions](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) for the best balance of quality and accessibility.
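The VRAM figures in the table follow roughly from weight size alone: parameter count times bytes per parameter, plus headroom for the KV cache and activations. A quick back-of-the-envelope check:

```python
params = 14.7e9  # parameter count from the model card

for name, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{weights_gb:.1f} GB for weights alone")
# FP16/BF16: ~29.4 GB for weights alone
# INT8: ~14.7 GB for weights alone
# INT4: ~7.3 GB for weights alone
```

The gap between these raw weight sizes and the table's recommendations is the working memory needed at inference time, which grows with batch size and sequence length.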
## Intended Use
### Recommended Applications
- Turkish chatbots and virtual assistants
- Turkish question answering systems
- Educational tools for Turkish-language content
- Turkish text summarization and generation
- Research on Turkish NLP and low-resource language modeling
### Out-of-Scope Uses
- Medical, legal, or financial advice
- Production systems without additional safety alignment
- Generation of misleading or harmful content
- Tasks requiring high factual precision without human verification
## Limitations and Risks
- **Language drift**: The model may occasionally switch to English or Chinese (inherited from the base model) on ambiguous prompts
- **Hallucination**: Like all LLMs, the model can generate plausible-sounding but incorrect information
- **English degradation**: English capabilities are reduced compared to the base Qwen2.5-14B-Instruct
- **Context length**: Performance may degrade on inputs significantly exceeding 4,096 tokens
- **Bias**: The model may reflect biases present in its training data
- **Safety**: No explicit safety alignment (RLHF/DPO) has been applied; not suitable for unmoderated user-facing applications without additional safeguards
## Ethical Considerations
This model is released under Apache 2.0 to support open research and development for the Turkish-speaking community. Users are responsible for ensuring appropriate use in their specific applications and jurisdictions. The developer recommends implementing additional safety measures before deploying in user-facing products.
## Related Resources
| Resource | Link |
|----------|------|
| GGUF Versions | [Turkish-LLM-14B-Instruct-GGUF](https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct-GGUF) |
| 7B Model | [Turkish-LLM-7B-Instruct](https://huggingface.co/ogulcanaydogan/Turkish-LLM-7B-Instruct) |
| Live Demo (14B) | [Turkish-LLM-14B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-14B-Chat) |
| Live Demo (7B) | [Turkish-LLM-7B-Chat](https://huggingface.co/spaces/ogulcanaydogan/Turkish-LLM-7B-Chat) |
| Training Pipeline | [LowResource-LLM-Forge](https://github.com/ogulcanaydogan/LowResource-LLM-Forge) |
| Project Repository | [Turkish-LLM on GitHub](https://github.com/ogulcanaydogan/Turkish-LLM) |
## Citation
```bibtex
@misc{aydogan2026turkishllm14b,
title = {Turkish-LLM-14B-Instruct: An Open-Source Turkish Language Model},
author = {Aydogan, Ogulcan},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/ogulcanaydogan/Turkish-LLM-14B-Instruct}
}
```
## Contact
- Website: [ogulcanaydogan.com](https://ogulcanaydogan.com)
- GitHub: [github.com/ogulcanaydogan](https://github.com/ogulcanaydogan)
- Hugging Face: [huggingface.co/ogulcanaydogan](https://huggingface.co/ogulcanaydogan)
- LinkedIn: [linkedin.com/in/ogulcanaydogan](https://linkedin.com/in/ogulcanaydogan)