---
language: tr
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- turkish
- gpt
- experimental
- research
- hallucination-analysis
datasets:
- wikipedia
- mc4
---
# 🔬 Why Small Turkish GPTs Hallucinate Facts
### An experimental 85M model trained from scratch
**tl;dr:** This model demonstrates a critical lesson in language modeling: **loss ↓ ≠ factual accuracy ↑**. Despite reaching a validation perplexity of 42.7, it still confidently generates wrong facts. This repo documents why.
---
## 🎯 The Core Problem
After 9,000 training steps on 500K Turkish documents:
| Metric | Start | End | Improvement |
|--------|-------|-----|-------------|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | ❌ | ❌ | Still inconsistent |
---
## 📉 Loss vs Factuality Divergence
### Training Progression for prompt "Türkiye'nin başkenti"
| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|------|----------|---------|-------------------|----------|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |
**Key observation:** Loss steadily decreases, but capital city prediction remains unstable.
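The instability above can be quantified by sampling the same prompt repeatedly and counting how often the expected fact appears. A minimal sketch; `stub_generate` is a hypothetical stand-in imitating the checkpoint-9000 behaviour (a real run would call `model.generate()` and decode):

```python
import random
from collections import Counter

def fact_consistency(generate_fn, prompt, expected, n_samples=20):
    """Sample the same prompt repeatedly and measure how often the
    expected fact string appears in the continuation."""
    hits = Counter(expected in generate_fn(prompt) for _ in range(n_samples))
    return hits[True] / n_samples

# Hypothetical stand-in for model.generate() + decode: Ankara only "sometimes".
_rng = random.Random(0)
def stub_generate(prompt):
    city = _rng.choice(["Ankara", "Ankara", "Konya", "Bursa"])
    return f"{prompt} {city}'da düzenlenen etkinlikte..."

rate = fact_consistency(stub_generate, "Türkiye'nin başkenti", "Ankara")
```

A consistency rate well below 1.0 is exactly the "⚠️ sometimes" row in the table above, made measurable.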
---
## 🧪 Concrete Examples
### Prompt: "Türkiye'nin başkenti"
**Step 6500 output:**
"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."
- ❌ Wrong: Bolu is not the capital
- ✅ Right: Date format, legal language, formal tone, grammar
**Step 7500 output:**
"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."
- ❌ Wrong: Konya is not the capital
- ✅ Right: Geographic context, date ranges, economic terminology
**Step 9000 output:**
"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."
- ✅ Finally correct!
---
## 🤔 Why This Happens
### What the Model Actually Learns
Cross-entropy loss optimizes for: "What token is likely in this context?"
In the training data distribution:
- "Türkiye'nin başkenti Ankara..." accounts for roughly 60% of matching patterns
- "Başkent Bursa/Konya/İzmir..." and similar phrasings account for the remaining ~40% (drawn from various contexts)
The model learns **distributional probabilities**, not **factual truth**.
From the model's perspective:
- Sometimes generate "Ankara" (most frequent)
- Sometimes generate other cities (contextually plausible)
- Both reduce loss equally if they appear in training data
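This can be made concrete with a toy calculation: under a 60/40 next-token distribution, a model that mirrors the distribution achieves lower expected cross-entropy than one that (almost) always asserts the true fact. The 60/40 split is the illustrative figure from above, not a measured statistic:

```python
import math

def expected_ce(data_dist, model_dist):
    """Expected cross-entropy of model_dist measured against data_dist."""
    return -sum(p * math.log(q) for p, q in zip(data_dist, model_dist))

# Next-token distribution in the data after "Türkiye'nin başkenti":
# 60% "Ankara", 40% some other city (illustrative figures from above).
data = [0.6, 0.4]

ce_matching = expected_ce(data, [0.6, 0.4])    # mirrors the data
ce_truthful = expected_ce(data, [0.99, 0.01])  # (almost) always "Ankara"

# The distribution-matching model gets the lower loss, even though
# it emits the wrong city 40% of the time.
assert ce_matching < ce_truthful
```

In other words, the loss-minimizing model *is* the hallucinating one: the optimum of cross-entropy is the data distribution itself, not the truth.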
### Why Loss Still Decreases
Even with wrong facts, the model improves at:
- ✅ Grammar (Turkish morphology)
- ✅ Syntax (sentence structure)
- ✅ Style (formal/informal tone matching)
- ✅ Context coherence (topic consistency)
- ✅ Pattern matching (Wikipedia-style text)
**Loss measures linguistic fluency, NOT factual correctness.**
---
## 📊 What 85M Parameters Can vs Cannot Do
### ✅ Successfully Learned
- Linguistic patterns: Grammar, morphology, syntax
- Contextual coherence: Topic-appropriate vocabulary
- Format mimicry: News articles, formal documents
- Statistical associations: Common word pairings
### ❌ Failed to Learn
- Factual grounding: "Ankara = capital" as deterministic rule
- Logical consistency: Same prompt should give same fact
- Knowledge retrieval: Reliable information recall
- Fact vs pattern: Distinguishing truth from plausibility
### Model Size Comparison
| Model | Parameters | Factual Reliability |
|-------|------------|---------------------|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T (rumored) | Reliable (with RLHF) |
**Conclusion:** 85M learns language patterns, not a knowledge base.
---
## 🔬 Technical Details
### Architecture
- Type: Transformer Decoder (GPT-style)
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000 BPE tokens
- Context: 512 tokens
- Total: ~85M parameters
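As a sanity check, the ~85M figure can be roughly reproduced from the hyperparameters above. The card does not say whether the FFN is gated or the embeddings tied, so the back-of-the-envelope count below (biases and layer norms omitted) tries the plausible variants; a gated-FFN, tied-embedding configuration lands closest to the stated total:

```python
d, n_layers, d_ffn, vocab, ctx = 640, 10, 2560, 32000, 512

emb = vocab * d            # token embeddings
pos = ctx * d              # learned positional embeddings (assumed)
attn = 4 * d * d           # Q, K, V, O projections per layer
ffn_plain = 2 * d * d_ffn  # up + down projection
ffn_gated = 3 * d * d_ffn  # SwiGLU-style: gate + up + down

plain_tied   = emb + pos + n_layers * (attn + ffn_plain)
gated_tied   = emb + pos + n_layers * (attn + ffn_gated)
plain_untied = plain_tied + vocab * d  # separate LM head

# gated FFN + tied embeddings ≈ 86.3M, closest to the stated ~85M
for name, n in [("plain FFN, tied", plain_tied),
                ("gated FFN, tied", gated_tied),
                ("plain FFN, untied", plain_untied)]:
    print(f"{name}: {n / 1e6:.1f}M")
```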
### Training Data
- Wikipedia TR: 170K articles
- mC4 Turkish: 330K web documents
- Total: 500K deduplicated documents
- Deduplication: MinHash LSH (85% threshold)
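The MinHash idea behind the deduplication step can be illustrated in a few lines of pure Python. Production pipelines typically use a library such as `datasketch`; the shingle size and hashing scheme here are illustrative choices, not the card's actual settings:

```python
import hashlib

def shingles(words, n=5):
    """Word n-grams used as the document's feature set."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(features, num_perm=128):
    """Keep the minimum hash per seeded hash function."""
    return [min(int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
                for f in features)
            for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching min-hashes estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = [f"kelime{i}" for i in range(40)]  # a 40-word document
doc_b = doc_a[:-1] + ["farkli"]            # near-duplicate: one word changed
doc_c = [f"baska{i}" for i in range(40)]   # unrelated document

sig_a = minhash_signature(shingles(doc_a))
sim_dup = est_jaccard(sig_a, minhash_signature(shingles(doc_b)))
sim_unrel = est_jaccard(sig_a, minhash_signature(shingles(doc_c)))
# sim_dup lands near the true Jaccard (35/37 ≈ 0.95), above the 85%
# threshold, so the pair would be deduplicated; sim_unrel is near zero.
```

The LSH part of MinHash LSH then buckets signatures so that only likely-similar pairs are ever compared, avoiding the O(n²) all-pairs scan over 500K documents.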
### Training Setup
- Effective batch: 64 (4 × 16 gradient accumulation)
- Learning rate: 1e-4 → 3e-4 (cosine with 2K warmup)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Hardware: NVIDIA T4 GPU (16GB)
- Time: ~9 hours
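The learning-rate line above is terse; one plausible reading is linear warmup to the 3e-4 peak over the 2K warmup steps, then cosine decay toward a 1e-4 floor. A sketch under that assumption (the floor and total-step values are interpretations of the card, not confirmed settings):

```python
import math

def lr_at(step, peak=3e-4, floor=1e-4, warmup=2000, total=9000):
    """Linear warmup to the peak, then cosine decay down to the floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(f"step 0: {lr_at(0):.1e}, step 2000: {lr_at(2000):.1e}, "
      f"step 9000: {lr_at(9000):.1e}")
```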
---
## 📈 Evaluation Summary
### Fluency: ✅ Good
| Metric | Score |
|--------|-------|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |
### Factuality: ❌ Poor
| Metric | Score |
|--------|-------|
| Correct capital city | ~50% (random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |
---
## 💡 Key Learnings
### 1. Pretraining ≠ Knowledge Encoding (at this scale)
85M parameters learn **how to speak Turkish**, not **what is true about Turkey**.
### 2. Solutions Require Additional Steps
**Option A: Bigger Model (1B+)**
More parameters = better fact retention, but still needs instruction tuning
**Option B: Instruction Tuning**
Explicit "correct answer" supervision with contrastive examples
**Option C: Retrieval Augmentation (RAG)**
External knowledge base for fact verification
### 3. Validation Loss is Misleading
Low perplexity ≠ factual correctness. Always manually test:
- Same prompt → consistent facts?
- Known facts → correct retrieval?
- Hallucination rate → human evaluation
---
## 🎯 Appropriate Use Cases
### ✅ Recommended
- Research on Turkish NLP limitations
- Pretraining baseline comparisons
- Hallucination pattern studies
- Educational demonstrations
- Understanding LLM failure modes
### ❌ Not Recommended
- Production applications
- Factual question answering
- Information retrieval systems
- Educational content generation
- Any task requiring accuracy
---
## 🚀 Future: Kayra-v2
Planned improvements:
- Larger model: 350M-750M parameters
- Better tokenizer: NFC Unicode normalization
- Instruction tuning: 10K QA pairs with verified answers
- Alignment: RLHF or DPO for factual accuracy
- Evaluation: Proper fact-checking benchmarks
---
## 🔧 Usage
**⚠️ Requires trust_remote_code=True (custom architecture)**
Load the model with:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate with a repetition penalty to reduce loops:
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Expected behavior:** Fluent Turkish, possibly wrong facts.
---
## 📚 Citation
```bibtex
@misc{kayra2024hallucination,
  title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
  author={sixfingerdev},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
  note={Research on loss-factuality divergence in low-resource language models}
}
```
---
## 🙏 Acknowledgments
- Inspiration: Eleuther AI's research on small model limitations
- Data: Wikimedia Foundation, Common Crawl (mC4)
- Framework: PyTorch, HuggingFace Transformers
---
## 📜 License
MIT License - Use freely for research and education.
**Disclaimer:** This model is intentionally shared with its flaws documented. It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.
---
**Kayra-1-exp** - Teaching us what 85M parameters cannot do 🔬
---
**Discussion:** Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷
# 🌙 Kayra-1-exp
**Kayra** - The first experimental GPT model trained from scratch entirely on Turkish.
## 📊 Model Details
- **Model type:** Decoder-only Transformer (GPT-style)
- **Parameters:** ~85 million
- **Validation PPL:** 42.7
- **Validation Loss:** 3.75
- **Language:** Turkish only
- **License:** MIT
## 🏗️ Architecture
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000
- Context length: 512
## 📚 Training Data
- **Wikipedia TR:** ~170K articles
- **mC4 Turkish:** ~330K documents
- **Total:** ~500K deduplicated documents (MinHash LSH)
## 🚀 Usage
Example code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True  # ← IMPORTANT!
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## ⚠️ Limitations (Experimental)
**This is an experimental prototype:**
- ❌ Rare Unicode corruption occurs (NFD normalization)
- ❌ May generate incorrect information
- ❌ Not recommended for production use
### Examples:
- "stadyumu" → "stad yumu" (fragmented by Unicode handling)
## 🔮 Future (Kayra-1-stable)
In the corrected version:
- ✅ NFC Unicode normalization
- ✅ Instruction fine-tuning
- ✅ Production-ready
## 📈 Training Details
- **Optimizer:** AdamW (lr: 1e-4 → 3e-4, warmup: 2000 steps)
- **Batch size:** 4 × 16 (gradient accumulation)
- **Precision:** Mixed FP16
- **Hardware:** Tesla T4 GPU
- **Training time:** ~9 hours
## 📜 License
MIT License - Free for both commercial and academic use.
## 🙏 Acknowledgments
- **Data:** Wikimedia, Common Crawl (mC4)
- **Inspiration:** GPT-1, Kumru
---
**Kayra** - *"Türkçe'yi Yaratan Zeka"* ("The Intelligence That Creates Turkish") 🌙
Model: sixfingerdev/kayra-1-exp