---
language: tr
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- turkish
- gpt
- experimental
- research
- hallucination-analysis
datasets:
- wikipedia
- mc4
---

# 🔬 Why Small Turkish GPTs Hallucinate Facts

### An experimental 85M model trained from scratch

**tl;dr:** This model demonstrates a critical lesson in language modeling: **loss ↓ ≠ factual accuracy ↑**. Despite reaching a validation perplexity of 42.7, it confidently generates wrong facts. This repo documents why.

---

## 🎯 The Core Problem

After 9,000 training steps on 500K Turkish documents:

| Metric | Start | End | Improvement |
|--------|-------|-----|-------------|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | ❌ | ❌ | Still inconsistent |

---

## 📉 Loss vs. Factuality Divergence

### Training progression for the prompt "Türkiye'nin başkenti" ("The capital of Turkey")

| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|------|----------|---------|-------------------|----------|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |

**Key observation:** Loss decreases steadily, but the capital-city prediction remains unstable.

---

## 🧪 Concrete Examples

### Prompt: "Türkiye'nin başkenti"

**Step 6500 output:**

"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."

- ❌ Wrong: Bolu is not the capital
- ✅ Right: Date format, legal language, formal tone, grammar

**Step 7500 output:**

"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."
- ❌ Wrong: Konya is not the capital
- ✅ Right: Geographic context, date ranges, economic terminology

**Step 9000 output:**

"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."

- ✅ Finally correct!

---

## 🤔 Why This Happens

### What the Model Actually Learns

Cross-entropy loss optimizes for one question: "Which token is likely in this context?"

In the training data distribution:
- "Türkiye'nin başkenti Ankara..." accounts for ~60% of the relevant patterns
- "Başkent Bursa/Konya/İzmir..." accounts for ~40% (from various contexts)

The model learns **distributional probabilities**, not **factual truth**. From the model's perspective:
- Sometimes generate "Ankara" (most frequent)
- Sometimes generate other cities (contextually plausible)
- Both reduce loss, as long as they match patterns seen in training

### Why Loss Still Decreases

Even while getting facts wrong, the model keeps improving at:
- ✅ Grammar (Turkish morphology)
- ✅ Syntax (sentence structure)
- ✅ Style (formal/informal tone matching)
- ✅ Context coherence (topic consistency)
- ✅ Pattern matching (Wikipedia-style text)

**Loss measures linguistic fluency, NOT factual correctness.**

---

## 📊 What 85M Parameters Can vs. Cannot Do

### ✅ Successfully Learned
- Linguistic patterns: grammar, morphology, syntax
- Contextual coherence: topic-appropriate vocabulary
- Format mimicry: news articles, formal documents
- Statistical associations: common word pairings

### ❌ Failed to Learn
- Factual grounding: "Ankara = capital" as a deterministic rule
- Logical consistency: the same prompt should yield the same fact
- Knowledge retrieval: reliable information recall
- Fact vs. pattern: distinguishing truth from plausibility

### Model Size Comparison

| Model | Parameters | Factual Reliability |
|-------|------------|---------------------|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T | + RLHF = reliable |
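The "same prompt → different facts" behavior documented above can be measured directly: sample the same prompt several times and count how often the expected fact appears. A minimal sketch of such a check (the `fact_consistency` helper is illustrative, not part of this repo; the commented generation call mirrors the usage section):

```python
from typing import Iterable


def fact_consistency(generations: Iterable[str], expected_fact: str) -> float:
    """Fraction of generated continuations that contain the expected fact."""
    texts = list(generations)
    if not texts:
        return 0.0
    return sum(expected_fact in t for t in texts) / len(texts)


# With the real model (downloads the checkpoint; see the usage section):
#
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
#   model = AutoModelForCausalLM.from_pretrained(
#       "sixfingerdev/kayra-1-exp", trust_remote_code=True)
#   ids = tok("Türkiye'nin başkenti", return_tensors="pt").input_ids
#   samples = [tok.decode(model.generate(ids, max_new_tokens=20, do_sample=True,
#                                        temperature=0.8, top_k=50)[0],
#                         skip_special_tokens=True)
#              for _ in range(20)]
#   print(fact_consistency(samples, "Ankara"))

# Offline illustration with outputs like those in the examples above:
mock_samples = [
    "Türkiye'nin başkenti Ankara'da düzenlenen şampiyonada...",
    "Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti...",
    "Türkiye'nin başkenti Ankara'da toplanan kurul...",
    "Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde...",
]
print(fact_consistency(mock_samples, "Ankara"))  # → 0.5
```

Substring matching is crude (it misses paraphrases and inflected forms), but it is enough to expose the ~50% "correct capital" rate reported in the evaluation summary.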
**Conclusion:** 85M parameters learn language patterns, not a knowledge base.

---

## 🔬 Technical Details

### Architecture
- Type: Transformer decoder (GPT-style)
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000 BPE tokens
- Context: 512 tokens
- Total: ~85M parameters

### Training Data
- Wikipedia TR: 170K articles
- mC4 Turkish: 330K web documents
- Total: 500K deduplicated documents
- Deduplication: MinHash LSH (85% similarity threshold)

### Training Setup
- Effective batch size: 64 (4 × 16 gradient accumulation)
- Learning rate: 1e-4 → 3e-4 (cosine schedule with 2K warmup steps)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Hardware: NVIDIA T4 GPU (16GB)
- Training time: ~9 hours

---

## 📈 Evaluation Summary

### Fluency: ✅ Good

| Metric | Score |
|--------|-------|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |

### Factuality: ❌ Poor

| Metric | Score |
|--------|-------|
| Correct capital city | ~50% (effectively random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |

---

## 💡 Key Learnings

### 1. Pretraining ≠ Knowledge Encoding (at this scale)

85M parameters learn **how to speak Turkish**, not **what is true about Turkey**.

### 2. Solutions Require Additional Steps

**Option A: Bigger model (1B+)**
More parameters mean better fact retention, but the model still needs instruction tuning.

**Option B: Instruction tuning**
Explicit "correct answer" supervision with contrastive examples.

**Option C: Retrieval augmentation (RAG)**
An external knowledge base for fact verification.

### 3. Validation Loss Is Misleading

Low perplexity ≠ factual correctness. Always test manually:
- Same prompt → consistent facts?
- Known facts → correct retrieval?
- Hallucination rate → human evaluation

---

## 🎯 Appropriate Use Cases

### ✅ Recommended
- Research on Turkish NLP limitations
- Pretraining baseline comparisons
- Hallucination pattern studies
- Educational demonstrations
- Understanding LLM failure modes

### ❌ Not Recommended
- Production applications
- Factual question answering
- Information retrieval systems
- Educational content generation
- Any task requiring accuracy

---

## 🚀 Future: Kayra-v2

Planned improvements:
- Larger model: 350M-750M parameters
- Better tokenizer: NFC Unicode normalization
- Instruction tuning: 10K QA pairs with verified answers
- Alignment: RLHF or DPO for factual accuracy
- Evaluation: proper fact-checking benchmarks

---

## 🔧 Usage

**⚠️ Requires `trust_remote_code=True` (custom architecture).**

Load the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
```

Generate with a repetition penalty to reduce loops:

```python
inputs = tokenizer("Türkiye'nin başkenti", return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Expected behavior:** Fluent Turkish, possibly wrong facts.

---

## 📚 Citation

```bibtex
@misc{kayra2024hallucination,
  title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
  author={sixfingerdev},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
  note={Research on loss-factuality divergence in low-resource language models}
}
```

---

## 🙏 Acknowledgments

- Inspiration: EleutherAI's research on small-model limitations
- Data: Wikimedia Foundation, Common Crawl (mC4)
- Frameworks: PyTorch, HuggingFace Transformers

---

## 📜 License

MIT License - free to use for research and education.

**Disclaimer:** This model is intentionally shared with its flaws documented.
It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.

---

**Kayra-1-exp** - Teaching us what 85M parameters cannot do 🔬

---

**Discussion:** Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷

# 🌙 Kayra-1-exp

**Kayra** - The first experimental GPT model trained from scratch entirely on Turkish.

## 📊 Model Details

- **Model type:** Decoder-only Transformer (GPT-style)
- **Parameters:** ~85 million
- **Validation PPL:** 42.7
- **Validation loss:** 3.75
- **Language:** Turkish only
- **License:** MIT

## 🏗️ Architecture

- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000
- Context length: 512

## 📚 Training Data

- **Wikipedia TR:** ~170K articles
- **mC4 Turkish:** ~330K documents
- **Total:** ~500K deduplicated documents (MinHash LSH)

## 🚀 Usage

Example code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True  # ← IMPORTANT!
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## ⚠️ Limitations (Experimental)

**This is an experimental prototype:**

- ❌ Very rare Unicode artifacts (NFD normalization)
- ❌ Can sometimes generate incorrect information
- ❌ Not recommended for production use

### Examples:

- "stadyumu" → "stad yumu" (fragmented by Unicode handling)

## 🔮 Future (Kayra-1-stable)

The fixed version will include:

- ✅ NFC Unicode normalization
- ✅ Instruction fine-tuning
- ✅ Production-ready quality

## 📈 Training Details

- **Optimizer:** AdamW (lr: 1e-4 → 3e-4, warmup: 2,000 steps)
- **Batch size:** 4 × 16 (gradient accumulation)
- **Precision:** Mixed FP16
- **Hardware:** Tesla T4 GPU
- **Training time:** ~9 hours

## 📜 License

MIT License - free for commercial and academic use.

## 🙏 Acknowledgments

- **Data:** Wikimedia, Common Crawl (mC4)
- **Inspiration:** GPT-1, Kumru

---

**Kayra** - *Türkçe'yi Yaratan Zeka* ("The Intelligence That Creates Turkish") 🌙

Model: sixfingerdev/kayra-1-exp
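**Addendum - the NFD issue illustrated:** The "stad yumu"-style splits attributed above to NFD normalization arise because Turkish letters like ş can be encoded either as one precomposed codepoint (NFC) or as a base letter plus a combining mark (NFD); a BPE tokenizer trained on mixed forms fragments words. A short sketch of the planned Kayra-1-stable fix, normalizing everything to NFC before tokenization (the example word `başkenti` is my own illustrative choice):

```python
import unicodedata

# "ş" can be one precomposed codepoint (NFC) or "s" + combining cedilla (NFD).
word_nfc = unicodedata.normalize("NFC", "başkenti")  # precomposed: 8 codepoints
word_nfd = unicodedata.normalize("NFD", word_nfc)    # decomposed: 9 codepoints

print(len(word_nfc), len(word_nfd))  # → 8 9
print(word_nfc == word_nfd)          # → False (visually identical, different bytes)

# NFC recomposes the combining marks, so applying it to all text before
# BPE training and inference guarantees one byte sequence per word:
assert unicodedata.normalize("NFC", word_nfd) == word_nfc
```

Without this step, the same word can map to two different token sequences, and the rarer (decomposed) one produces the fragmented outputs noted in the limitations.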