---
|
|
language: tr |
|
|
license: mit |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- turkish |
|
|
- gpt |
|
|
- experimental |
|
|
- research |
|
|
- hallucination-analysis |
|
|
datasets: |
|
|
- wikipedia |
|
|
- mc4 |
|
|
--- |
|
|
|
|
|
# 🔬 Why Small Turkish GPTs Hallucinate Facts |
|
|
### An experimental 85M model trained from scratch |
|
|
|
|
|
**tl;dr:** This model demonstrates a critical lesson in language modeling: **loss ↓ ≠ factual accuracy ↑**. Despite achieving a validation PPL of 42.7, it confidently generates wrong facts. This repo documents why.
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 The Core Problem |
|
|
|
|
|
After 9,000 training steps on 500K Turkish documents:
|
|
|
|
|
| Metric | Start | End | Improvement | |
|
|
|--------|-------|-----|-------------| |
|
|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ | |
|
|
| Validation PPL | 397 | 42.7 | 90% better ✅ | |
|
|
| Factual Accuracy | ❌ | ❌ | Still inconsistent | |
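The loss and PPL columns are tied by the identity PPL = exp(loss); a quick sanity check (the small gaps versus the reported values come from rounding in the logged loss):

```python
import math

# Perplexity is the exponential of the cross-entropy loss.
start_ppl = math.exp(6.0)   # table start: reported as 397
end_ppl = math.exp(3.75)    # table end: reported as 42.7

print(f"start: {start_ppl:.1f}, end: {end_ppl:.2f}")
```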
|
|
|
|
|
--- |
|
|
|
|
|
## 📉 Loss vs Factuality Divergence |
|
|
|
|
|
### Training Progression for prompt "Türkiye'nin başkenti" |
|
|
|
|
|
| Step | Val Loss | Val PPL | Generated Capital | Correct? | |
|
|
|------|----------|---------|-------------------|----------| |
|
|
| 1000 | 5.98 | 397.3 | Ankara | ✅ | |
|
|
| 3000 | 3.94 | 51.7 | Ankara | ✅ | |
|
|
| 5000 | 4.02 | 56.2 | Random city | ❌ | |
|
|
| 6500 | 3.90 | 49.6 | Bolu | ❌ | |
|
|
| 7500 | 3.83 | 46.1 | Konya | ❌ | |
|
|
| 8000 | 3.80 | 44.8 | Bursa | ❌ | |
|
|
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ | |
|
|
|
|
|
**Key observation:** Loss steadily decreases, but capital city prediction remains unstable. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Concrete Examples |
|
|
|
|
|
### Prompt: "Türkiye'nin başkenti" |
|
|
|
|
|
**Step 6500 output:** |
|
|
|
|
|
"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..." |
|
|
|
|
|
- ❌ Wrong: Bolu is not the capital |
|
|
- ✅ Right: Date format, legal language, formal tone, grammar |
|
|
|
|
|
**Step 7500 output:** |
|
|
|
|
|
"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..." |
|
|
|
|
|
- ❌ Wrong: Konya is not the capital |
|
|
- ✅ Right: Geographic context, date ranges, economic terminology |
|
|
|
|
|
**Step 9000 output:** |
|
|
|
|
|
"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..." |
|
|
|
|
|
- ✅ Finally correct! |
|
|
|
|
|
--- |
|
|
|
|
|
## 🤔 Why This Happens |
|
|
|
|
|
### What the Model Actually Learns |
|
|
|
|
|
Cross-entropy loss optimizes for: "What token is likely in this context?" |
|
|
|
|
|
In training data distribution: |
|
|
- "Türkiye'nin başkenti Ankara..." appears in ~60% of matching contexts
- "Başkent Bursa/Konya/İzmir..." patterns make up the remaining ~40% (from various contexts)
|
|
|
|
|
The model learns **distributional probabilities**, not **factual truth**. |
|
|
|
|
|
From the model's perspective: |
|
|
- Sometimes generate "Ankara" (most frequent) |
|
|
- Sometimes generate other cities (contextually plausible) |
|
|
- Both reduce loss, so long as they match patterns seen in the training data
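This indifference can be made concrete with a toy calculation (the 60/40 split mirrors the rough estimate above; a two-token vocabulary is of course a simplification). A model that mirrors the data distribution gets *lower* expected loss than one that is confidently correct:

```python
import math

# Hypothetical next-token distribution after "Türkiye'nin başkenti ..."
data_dist = {"Ankara": 0.6, "other_city": 0.4}

def expected_ce(model_dist):
    """Expected cross-entropy of a model's distribution against the data."""
    return -sum(p_data * math.log(model_dist[tok])
                for tok, p_data in data_dist.items())

# Model A mirrors the training distribution exactly.
mirror = {"Ankara": 0.6, "other_city": 0.4}
# Model B is confident in the *correct* fact (epsilon mass elsewhere).
factual = {"Ankara": 0.99, "other_city": 0.01}

print(f"mirror:  {expected_ce(mirror):.3f}")   # 0.673
print(f"factual: {expected_ce(factual):.3f}")  # 1.848
```

Cross-entropy training therefore rewards reproducing the corpus distribution, not asserting the single true answer.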
|
|
|
|
|
### Why Loss Still Decreases |
|
|
|
|
|
Even with wrong facts, the model improves at: |
|
|
- ✅ Grammar (Turkish morphology) |
|
|
- ✅ Syntax (sentence structure) |
|
|
- ✅ Style (formal/informal tone matching) |
|
|
- ✅ Context coherence (topic consistency) |
|
|
- ✅ Pattern matching (Wikipedia-style text) |
|
|
|
|
|
**Loss measures linguistic fluency, NOT factual correctness.** |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 What 85M Parameters Can vs Cannot Do |
|
|
|
|
|
### ✅ Successfully Learned |
|
|
|
|
|
- Linguistic patterns: Grammar, morphology, syntax |
|
|
- Contextual coherence: Topic-appropriate vocabulary |
|
|
- Format mimicry: News articles, formal documents |
|
|
- Statistical associations: Common word pairings |
|
|
|
|
|
### ❌ Failed to Learn |
|
|
|
|
|
- Factual grounding: "Ankara = capital" as deterministic rule |
|
|
- Logical consistency: Same prompt should give same fact |
|
|
- Knowledge retrieval: Reliable information recall |
|
|
- Fact vs pattern: Distinguishing truth from plausibility |
|
|
|
|
|
### Model Size Comparison |
|
|
|
|
|
| Model | Parameters | Factual Reliability | |
|
|
|-------|------------|---------------------| |
|
|
| Kayra (this) | 85M | Poor - hallucinations common | |
|
|
| GPT-2 Small | 124M | Poor - similar issues | |
|
|
| GPT-2 Medium | 355M | Better but still unreliable | |
|
|
| GPT-3 | 175B | Good consistency | |
|
|
| GPT-4 | ~1.7T | + RLHF = reliable | |
|
|
|
|
|
**Conclusion:** 85M learns language patterns, not a knowledge base. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔬 Technical Details |
|
|
|
|
|
### Architecture |
|
|
|
|
|
- Type: Transformer Decoder (GPT-style) |
|
|
- Layers: 10 |
|
|
- Hidden size: 640 |
|
|
- Attention heads: 10 |
|
|
- FFN size: 2560 |
|
|
- Vocabulary: 32,000 BPE tokens |
|
|
- Context: 512 tokens |
|
|
- Total: ~85M parameters |
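A back-of-the-envelope parameter count from these numbers (whether the LM head is tied to the embeddings, and the bias/layernorm terms, are assumptions not stated above; the reported ~85M falls between the two estimates):

```python
# Rough GPT-style parameter count from the architecture above.
vocab, d, ffn, layers, ctx = 32_000, 640, 2560, 10, 512

embed = vocab * d + ctx * d            # token + learned position embeddings
per_layer = 4 * d * d + 2 * d * ffn    # QKV+output projections, two FFN matrices
total = embed + layers * per_layer     # with a tied LM head
total_untied = total + vocab * d       # with a separate output matrix

print(f"tied: {total/1e6:.1f}M, untied: {total_untied/1e6:.1f}M")
```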
|
|
|
|
|
### Training Data |
|
|
|
|
|
- Wikipedia TR: 170K articles |
|
|
- mC4 Turkish: 330K web documents |
|
|
- Total: 500K deduplicated documents |
|
|
- Deduplication: MinHash LSH (85% threshold) |
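As an illustrative sketch of the MinHash idea (not the actual dedup pipeline): each document gets a signature of per-seed minimum shingle hashes, and the fraction of matching signature slots estimates Jaccard similarity. Pairs estimated above the 85% threshold would be dropped:

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(doc, num_hashes=64):
    """For each seed, keep the smallest (deterministic) hash over shingles."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles(doc))
            for seed in range(num_hashes)]

def estimated_jaccard(a, b):
    """Fraction of matching slots approximates Jaccard similarity."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(estimated_jaccard("Ankara is the capital of Turkey.",
                        "Ankara is the capital of Turkey!"))
```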
|
|
|
|
|
### Training Setup |
|
|
|
|
|
- Effective batch: 64 (micro-batch 4 × 16 gradient-accumulation steps)
|
|
- Learning rate: 1e-4 → 3e-4 (cosine with 2K warmup) |
|
|
- Optimizer: AdamW (β1=0.9, β2=0.95) |
|
|
- Hardware: NVIDIA T4 GPU (16GB) |
|
|
- Time: ~9 hours |
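A cosine-with-warmup schedule consistent with these numbers might look like the sketch below (the exact shape of the original schedule, including whether 1e-4 is the decay floor or the warmup start, is an assumption):

```python
import math

def lr_at(step, warmup=2000, total=9000, lr_min=1e-4, lr_max=3e-4):
    """Linear warmup to lr_max, then cosine decay to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(f"{lr_at(1000):.1e}")  # mid-warmup: 1.5e-04
print(f"{lr_at(2000):.1e}")  # peak:       3.0e-04
print(f"{lr_at(9000):.1e}")  # end:        1.0e-04
```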
|
|
|
|
|
--- |
|
|
|
|
|
## 📈 Evaluation Summary |
|
|
|
|
|
### Fluency: ✅ Good |
|
|
|
|
|
| Metric | Score | |
|
|
|--------|-------| |
|
|
| Grammatical Turkish | 95%+ | |
|
|
| Natural sentence flow | 90%+ | |
|
|
| Coherent paragraphs | 85%+ | |
|
|
|
|
|
### Factuality: ❌ Poor |
|
|
|
|
|
| Metric | Score | |
|
|
|--------|-------| |
|
|
| Correct capital city | ~50% (random) | |
|
|
| Correct historical dates | ~40% | |
|
|
| Consistent facts across runs | ~30% | |
|
|
|
|
|
--- |
|
|
|
|
|
## 💡 Key Learnings |
|
|
|
|
|
### 1. Pretraining ≠ Knowledge Encoding (at this scale) |
|
|
|
|
|
85M parameters learn **how to speak Turkish**, not **what is true about Turkey**. |
|
|
|
|
|
### 2. Solutions Require Additional Steps |
|
|
|
|
|
**Option A: Bigger Model (1B+)** |
|
|
More parameters = better fact retention, but still needs instruction tuning |
|
|
|
|
|
**Option B: Instruction Tuning** |
|
|
Explicit "correct answer" supervision with contrastive examples |
|
|
|
|
|
**Option C: Retrieval Augmentation (RAG)** |
|
|
External knowledge base for fact verification |
|
|
|
|
|
### 3. Validation Loss is Misleading |
|
|
|
|
|
Low perplexity ≠ factual correctness. Always manually test: |
|
|
- Same prompt → consistent facts? |
|
|
- Known facts → correct retrieval? |
|
|
- Hallucination rate → human evaluation |
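The first of these checks can be automated with a small probe that samples the same prompt repeatedly. Here `generate` is any prompt → text callable (e.g. a wrapper around `model.generate` with sampling enabled); a stub stands in for illustration:

```python
def fact_consistency(generate, prompt, expected, n=20):
    """Fraction of n sampled generations that contain the expected fact."""
    return sum(expected in generate(prompt) for _ in range(n)) / n

# Stub that always answers correctly, for illustration only.
stub = lambda p: p + " Ankara'da düzenlenen ..."
print(fact_consistency(stub, "Türkiye'nin başkenti", "Ankara"))  # 1.0
```

A real model with sampling would land somewhere between 0 and 1; the ~50% figure in the evaluation above was measured this way by hand.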
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 Appropriate Use Cases |
|
|
|
|
|
### ✅ Recommended |
|
|
|
|
|
- Research on Turkish NLP limitations |
|
|
- Pretraining baseline comparisons |
|
|
- Hallucination pattern studies |
|
|
- Educational demonstrations |
|
|
- Understanding LLM failure modes |
|
|
|
|
|
### ❌ Not Recommended |
|
|
|
|
|
- Production applications |
|
|
- Factual question answering |
|
|
- Information retrieval systems |
|
|
- Educational content generation |
|
|
- Any task requiring accuracy |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Future: Kayra-v2 |
|
|
|
|
|
Planned improvements: |
|
|
- Larger model: 350M-750M parameters |
|
|
- Better tokenizer: NFC Unicode normalization |
|
|
- Instruction tuning: 10K QA pairs with verified answers |
|
|
- Alignment: RLHF or DPO for factual accuracy |
|
|
- Evaluation: Proper fact-checking benchmarks |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Usage |
|
|
|
|
|
**⚠️ Requires `trust_remote_code=True` (custom architecture)**
|
|
|
|
|
Load the model with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
```

Generate with a repetition penalty to reduce loops:

```python
inputs = tokenizer("Türkiye'nin başkenti", return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
**Expected behavior:** Fluent Turkish, possibly wrong facts. |
|
|
|
|
|
--- |
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
```bibtex
@misc{kayra2024hallucination,
  title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
  author={sixfingerdev},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
  note={Research on loss-factuality divergence in low-resource language models}
}
```
|
|
|
|
|
--- |
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- Inspiration: Eleuther AI's research on small model limitations |
|
|
- Data: Wikimedia Foundation, Common Crawl (mC4) |
|
|
- Framework: PyTorch, HuggingFace Transformers |
|
|
|
|
|
--- |
|
|
|
|
|
## 📜 License |
|
|
|
|
|
MIT License - Use freely for research and education. |
|
|
|
|
|
**Disclaimer:** This model is intentionally shared with its flaws documented. It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool. |
|
|
|
|
|
--- |
|
|
|
|
|
**Kayra-1-exp** - Teaching us what 85M parameters cannot do 🔬 |
|
|
|
|
|
--- |
|
|
|
|
|
**Discussion:** Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷 |
|
|
---

# 🌙 Kayra-1-exp
|
|
|
|
|
**Kayra** - The first experimental GPT model trained from scratch on Turkish.
|
|
|
|
|
## 📊 Model Details

- **Model type:** Decoder-only Transformer (GPT-style)
- **Parameters:** ~85 million
- **Validation PPL:** 42.7
- **Validation Loss:** 3.75
- **Language:** Turkish only
- **License:** MIT
|
|
|
|
|
## 🏗️ Architecture
|
|
|
|
|
- Layers: 10 |
|
|
- Hidden size: 640 |
|
|
- Attention heads: 10 |
|
|
- FFN size: 2560 |
|
|
- Vocabulary: 32,000 |
|
|
- Context length: 512 |
|
|
|
|
|
## 📚 Training Data

- **Wikipedia TR:** ~170K articles
- **mC4 Turkish:** ~330K web documents
- **Total:** ~500K deduplicated documents (MinHash LSH)

## 🚀 Usage

Example code:
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True  # ← IMPORTANT!
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
## ⚠️ Limitations (Experimental)

**This is an experimental prototype:**

- ❌ Rare Unicode artifacts (due to NFD normalization)
- ❌ May occasionally generate false information
- ❌ Not recommended for production use

### Examples:

- "stadyumu" → "stad yumu" (Unicode fragmentation)
|
|
|
|
|
## 🔮 Future (Kayra-1-stable)

In the fixed version:
|
|
- ✅ NFC Unicode normalization |
|
|
- ✅ Instruction fine-tuning |
|
|
- ✅ Production-ready |
|
|
|
|
|
## 📈 Training Details
|
|
|
|
|
- **Optimizer:** AdamW (lr: 1e-4 → 3e-4, warmup: 2000 steps) |
|
|
- **Batch size:** 4 × 16 (gradient accumulation) |
|
|
- **Precision:** Mixed FP16 |
|
|
- **Hardware:** Tesla T4 GPU |
|
|
- **Training time:** ~9 hours |
|
|
|
|
|
## 📜 License

MIT License - Free for commercial and academic use.

## 🙏 Acknowledgments

- **Data:** Wikimedia, Common Crawl (mC4)
- **Inspiration:** GPT-1, Kumru
|
|
|
|
|
--- |
|
|
|
|
|
**Kayra** - *The Intelligence That Creates Turkish* 🌙
|
|
|
|
|
Model: sixfingerdev/kayra-1-exp |
|
|
|