---
language: tr
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- turkish
- gpt
- experimental
- research
- hallucination-analysis
datasets:
- wikipedia
- mc4
---
# 🔬 Why Small Turkish GPTs Hallucinate Facts
### An experimental 85M model trained from scratch
**tl;dr:** This model demonstrates a critical lesson in language modeling: **loss ↓ ≠ factual accuracy ↑**. Despite reaching a validation perplexity of 42.7, it still confidently generates wrong facts. This repo documents why.
---
## 🎯 The Core Problem
After 9,000 training steps on 500K Turkish documents:
| Metric | Start | End | Improvement |
|--------|-------|-----|-------------|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | ❌ | ❌ | Still inconsistent |
---
## 📉 Loss vs Factuality Divergence
### Training Progression for prompt "Türkiye'nin başkenti"
| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|------|----------|---------|-------------------|----------|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |
**Key observation:** Loss steadily decreases, but capital city prediction remains unstable.
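The instability above can be quantified by sampling the same prompt repeatedly and counting how often the expected fact appears. A minimal sketch; `stub_generate` is a hypothetical stand-in imitating the checkpoint-9000 behaviour (a real run would call `model.generate()` and decode):

```python
import random
from collections import Counter

def fact_consistency(generate_fn, prompt, expected, n_samples=20):
    """Sample the same prompt repeatedly and measure how often the
    expected fact string appears in the continuation."""
    hits = Counter(expected in generate_fn(prompt) for _ in range(n_samples))
    return hits[True] / n_samples

# Hypothetical stand-in for model.generate() + decode: Ankara only "sometimes".
_rng = random.Random(0)
def stub_generate(prompt):
    city = _rng.choice(["Ankara", "Ankara", "Konya", "Bursa"])
    return f"{prompt} {city}'da düzenlenen etkinlikte..."

rate = fact_consistency(stub_generate, "Türkiye'nin başkenti", "Ankara")
```

A consistency rate well below 1.0 is exactly the "⚠️ sometimes" row in the table above, made measurable.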
---
## 🧪 Concrete Examples
### Prompt: "Türkiye'nin başkenti"
**Step 6500 output:**
"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."
- ❌ Wrong: Bolu is not the capital
- ✅ Right: Date format, legal language, formal tone, grammar
**Step 7500 output:**
"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."
- ❌ Wrong: Konya is not the capital
- ✅ Right: Geographic context, date ranges, economic terminology
**Step 9000 output:**
"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."
- ✅ Finally correct!
---
## 🤔 Why This Happens
### What the Model Actually Learns
Cross-entropy loss optimizes for: "What token is likely in this context?"
In the training data distribution:
- "Türkiye'nin başkenti Ankara..." accounts for roughly 60% of matching patterns
- "Başkent Bursa/Konya/İzmir..." and similar phrasings account for the remaining ~40% (drawn from various contexts)
The model learns **distributional probabilities**, not **factual truth**.
From the model's perspective:
- Sometimes generate "Ankara" (most frequent)
- Sometimes generate other cities (contextually plausible)
- Both reduce loss equally if they appear in training data
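This can be made concrete with a toy calculation: under a 60/40 next-token distribution, a model that mirrors the distribution achieves lower expected cross-entropy than one that (almost) always asserts the true fact. The 60/40 split is the illustrative figure from above, not a measured statistic:

```python
import math

def expected_ce(data_dist, model_dist):
    """Expected cross-entropy of model_dist measured against data_dist."""
    return -sum(p * math.log(q) for p, q in zip(data_dist, model_dist))

# Next-token distribution in the data after "Türkiye'nin başkenti":
# 60% "Ankara", 40% some other city (illustrative figures from above).
data = [0.6, 0.4]

ce_matching = expected_ce(data, [0.6, 0.4])    # mirrors the data
ce_truthful = expected_ce(data, [0.99, 0.01])  # (almost) always "Ankara"

# The distribution-matching model gets the lower loss, even though
# it emits the wrong city 40% of the time.
assert ce_matching < ce_truthful
```

In other words, the loss-minimizing model *is* the hallucinating one: the optimum of cross-entropy is the data distribution itself, not the truth.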
### Why Loss Still Decreases
Even with wrong facts, the model improves at:
- ✅ Grammar (Turkish morphology)
- ✅ Syntax (sentence structure)
- ✅ Style (formal/informal tone matching)
- ✅ Context coherence (topic consistency)
- ✅ Pattern matching (Wikipedia-style text)
**Loss measures linguistic fluency, NOT factual correctness.**
---
## 📊 What 85M Parameters Can vs Cannot Do
### ✅ Successfully Learned
- Linguistic patterns: Grammar, morphology, syntax
- Contextual coherence: Topic-appropriate vocabulary
- Format mimicry: News articles, formal documents
- Statistical associations: Common word pairings
### ❌ Failed to Learn
- Factual grounding: "Ankara = capital" as deterministic rule
- Logical consistency: Same prompt should give same fact
- Knowledge retrieval: Reliable information recall
- Fact vs pattern: Distinguishing truth from plausibility
### Model Size Comparison
| Model | Parameters | Factual Reliability |
|-------|------------|---------------------|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T (rumored) | Reliable (with RLHF) |
**Conclusion:** 85M learns language patterns, not a knowledge base.
---
## 🔬 Technical Details
### Architecture
- Type: Transformer Decoder (GPT-style)
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000 BPE tokens
- Context: 512 tokens
- Total: ~85M parameters
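As a sanity check, the ~85M figure can be roughly reproduced from the hyperparameters above. The card does not say whether the FFN is gated or the embeddings tied, so the back-of-the-envelope count below (biases and layer norms omitted) tries the plausible variants; a gated-FFN, tied-embedding configuration lands closest to the stated total:

```python
d, n_layers, d_ffn, vocab, ctx = 640, 10, 2560, 32000, 512

emb = vocab * d            # token embeddings
pos = ctx * d              # learned positional embeddings (assumed)
attn = 4 * d * d           # Q, K, V, O projections per layer
ffn_plain = 2 * d * d_ffn  # up + down projection
ffn_gated = 3 * d * d_ffn  # SwiGLU-style: gate + up + down

plain_tied   = emb + pos + n_layers * (attn + ffn_plain)
gated_tied   = emb + pos + n_layers * (attn + ffn_gated)
plain_untied = plain_tied + vocab * d  # separate LM head

# gated FFN + tied embeddings ≈ 86.3M, closest to the stated ~85M
for name, n in [("plain FFN, tied", plain_tied),
                ("gated FFN, tied", gated_tied),
                ("plain FFN, untied", plain_untied)]:
    print(f"{name}: {n / 1e6:.1f}M")
```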
### Training Data
- Wikipedia TR: 170K articles
- mC4 Turkish: 330K web documents
- Total: 500K deduplicated documents
- Deduplication: MinHash LSH (85% threshold)
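The MinHash idea behind the deduplication step can be illustrated in a few lines of pure Python. Production pipelines typically use a library such as `datasketch`; the shingle size and hashing scheme here are illustrative choices, not the card's actual settings:

```python
import hashlib

def shingles(words, n=5):
    """Word n-grams used as the document's feature set."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(features, num_perm=128):
    """Keep the minimum hash per seeded hash function."""
    return [min(int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
                for f in features)
            for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching min-hashes estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = [f"kelime{i}" for i in range(40)]  # a 40-word document
doc_b = doc_a[:-1] + ["farkli"]            # near-duplicate: one word changed
doc_c = [f"baska{i}" for i in range(40)]   # unrelated document

sig_a = minhash_signature(shingles(doc_a))
sim_dup = est_jaccard(sig_a, minhash_signature(shingles(doc_b)))
sim_unrel = est_jaccard(sig_a, minhash_signature(shingles(doc_c)))
# sim_dup lands near the true Jaccard (35/37 ≈ 0.95), above the 85%
# threshold, so the pair would be deduplicated; sim_unrel is near zero.
```

The LSH part of MinHash LSH then buckets signatures so that only likely-similar pairs are ever compared, avoiding the O(n²) all-pairs scan over 500K documents.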
### Training Setup
- Effective batch: 64 (4 × 16 gradient accumulation)
- Learning rate: 1e-4 → 3e-4 (cosine with 2K warmup)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Hardware: NVIDIA T4 GPU (16GB)
- Time: ~9 hours
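The learning-rate line above is terse; one plausible reading is linear warmup to the 3e-4 peak over the 2K warmup steps, then cosine decay toward a 1e-4 floor. A sketch under that assumption (the floor and total-step values are interpretations of the card, not confirmed settings):

```python
import math

def lr_at(step, peak=3e-4, floor=1e-4, warmup=2000, total=9000):
    """Linear warmup to the peak, then cosine decay down to the floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(f"step 0: {lr_at(0):.1e}, step 2000: {lr_at(2000):.1e}, "
      f"step 9000: {lr_at(9000):.1e}")
```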
---
## 📈 Evaluation Summary
### Fluency: ✅ Good
| Metric | Score |
|--------|-------|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |
### Factuality: ❌ Poor
| Metric | Score |
|--------|-------|
| Correct capital city | ~50% (random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |
---
## 💡 Key Learnings
### 1. Pretraining ≠ Knowledge Encoding (at this scale)
85M parameters learn **how to speak Turkish**, not **what is true about Turkey**.
### 2. Solutions Require Additional Steps
**Option A: Bigger Model (1B+)**
More parameters = better fact retention, but still needs instruction tuning
**Option B: Instruction Tuning**
Explicit "correct answer" supervision with contrastive examples
**Option C: Retrieval Augmentation (RAG)**
External knowledge base for fact verification
### 3. Validation Loss is Misleading
Low perplexity ≠ factual correctness. Always manually test:
- Same prompt → consistent facts?
- Known facts → correct retrieval?
- Hallucination rate → human evaluation
---
## 🎯 Appropriate Use Cases
### ✅ Recommended
- Research on Turkish NLP limitations
- Pretraining baseline comparisons
- Hallucination pattern studies
- Educational demonstrations
- Understanding LLM failure modes
### ❌ Not Recommended
- Production applications
- Factual question answering
- Information retrieval systems
- Educational content generation
- Any task requiring accuracy
---
## 🚀 Future: Kayra-v2
Planned improvements:
- Larger model: 350M-750M parameters
- Better tokenizer: NFC Unicode normalization
- Instruction tuning: 10K QA pairs with verified answers
- Alignment: RLHF or DPO for factual accuracy
- Evaluation: Proper fact-checking benchmarks
---
## 🔧 Usage
**⚠️ Requires trust_remote_code=True (custom architecture)**
Load the model with:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate with a repetition penalty to reduce loops:
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Expected behavior:** Fluent Turkish, possibly wrong facts.
---
## 📚 Citation
```bibtex
@misc{kayra2024hallucination,
  title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
  author={sixfingerdev},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
  note={Research on loss-factuality divergence in low-resource language models}
}
```
---
## 🙏 Acknowledgments
- Inspiration: Eleuther AI's research on small model limitations
- Data: Wikimedia Foundation, Common Crawl (mC4)
- Framework: PyTorch, HuggingFace Transformers
---
## 📜 License
MIT License - Use freely for research and education.
**Disclaimer:** This model is intentionally shared with its flaws documented. It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.
---
**Kayra-1-exp** - Teaching us what 85M parameters cannot do 🔬
---
**Discussion:** Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷
# 🌙 Kayra-1-exp
**Kayra** - The first experimental GPT model trained from scratch entirely on Turkish.
## 📊 Model Details
- **Model type:** Decoder-only Transformer (GPT-style)
- **Parameters:** ~85 million
- **Validation PPL:** 42.7
- **Validation Loss:** 3.75
- **Language:** Turkish only
- **License:** MIT
## 🏗️ Architecture
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000
- Context length: 512
## 📚 Training Data
- **Wikipedia TR:** ~170K articles
- **mC4 Turkish:** ~330K documents
- **Total:** ~500K deduplicated documents (MinHash LSH)
## 🚀 Usage
Example code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True  # ← IMPORTANT!
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## ⚠️ Limitations (Experimental)
**This is an experimental prototype:**
- ❌ Rare Unicode corruption occurs (NFD normalization)
- ❌ May generate incorrect information
- ❌ Not recommended for production use
### Examples:
- "stadyumu" → "stad yumu" (fragmented by Unicode handling)
## 🔮 Future (Kayra-1-stable)
In the corrected version:
- ✅ NFC Unicode normalization
- ✅ Instruction fine-tuning
- ✅ Production-ready
## 📈 Training Details
- **Optimizer:** AdamW (lr: 1e-4 → 3e-4, warmup: 2000 steps)
- **Batch size:** 4 × 16 (gradient accumulation)
- **Precision:** Mixed FP16
- **Hardware:** Tesla T4 GPU
- **Training time:** ~9 hours
## 📜 License
MIT License - Free for both commercial and academic use.
## 🙏 Acknowledgments
- **Data:** Wikimedia, Common Crawl (mC4)
- **Inspiration:** GPT-1, Kumru
---
**Kayra** - *"Türkçe'yi Yaratan Zeka"* ("The Intelligence That Creates Turkish") 🌙
Model: sixfingerdev/kayra-1-exp