---
language: tr
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- turkish
- gpt
- experimental
- research
- hallucination-analysis
datasets:
- wikipedia
- mc4
---

# 🔬 Why Small Turkish GPTs Hallucinate Facts

### An experimental 85M model trained from scratch

**tl;dr:** This model demonstrates a critical lesson in language modeling: **loss ↓ ≠ factual accuracy ↑**. Despite reaching a validation perplexity of 42.7, it confidently generates wrong facts. This repo documents why.

---

## 🎯 The Core Problem

After 9,000 training steps on 500K Turkish documents:

| Metric | Start | End | Improvement |
|--------|-------|-----|-------------|
| Validation Loss | 6.0 | 3.75 | 37% better ✅ |
| Validation PPL | 397 | 42.7 | 90% better ✅ |
| Factual Accuracy | ❌ | ❌ | Still inconsistent |

---

## 📉 Loss vs. Factuality Divergence

### Training progression for the prompt "Türkiye'nin başkenti" ("The capital of Turkey")

| Step | Val Loss | Val PPL | Generated Capital | Correct? |
|------|----------|---------|-------------------|----------|
| 1000 | 5.98 | 397.3 | Ankara | ✅ |
| 3000 | 3.94 | 51.7 | Ankara | ✅ |
| 5000 | 4.02 | 56.2 | Random city | ❌ |
| 6500 | 3.90 | 49.6 | Bolu | ❌ |
| 7500 | 3.83 | 46.1 | Konya | ❌ |
| 8000 | 3.80 | 44.8 | Bursa | ❌ |
| 9000 | 3.75 | 42.7 | Ankara (sometimes) | ⚠️ |

**Key observation:** Loss decreases steadily, but the capital-city prediction remains unstable.

---

## 🧪 Concrete Examples

### Prompt: "Türkiye'nin başkenti"

**Step 6500 output:**

"Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde toplanan ve 23. dönem TBMM'de kabul edilen 6360 sayılı kanun..."

- ❌ Wrong: Bolu is not the capital
- ✅ Right: Date format, legal language, formal tone, grammar

**Step 7500 output:**

"Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti Konya'dan Avrupa'ya yayılan ilk yatırımını 25-30 Mayıs 2020..."
- ❌ Wrong: Konya is not the capital
- ✅ Right: Geographic context, date ranges, economic terminology

**Step 9000 output:**

"Türkiye'nin başkenti Ankara'da düzenlenen Dünya Kadınlar Basketbol Şampiyonası'nda..."

- ✅ Finally correct!

---

## 🤔 Why This Happens

### What the Model Actually Learns

Cross-entropy loss optimizes for one question: "Which token is likely in this context?"

In the training data distribution:
- "Türkiye'nin başkenti Ankara..." accounts for ~60% of the relevant patterns
- "Başkent Bursa/Konya/İzmir..." accounts for ~40% (from various contexts)

The model learns **distributional probabilities**, not **factual truth**. From the model's perspective:
- Sometimes generate "Ankara" (most frequent)
- Sometimes generate other cities (contextually plausible)
- Both reduce loss, as long as they match patterns seen in training

### Why Loss Still Decreases

Even while getting facts wrong, the model keeps improving at:
- ✅ Grammar (Turkish morphology)
- ✅ Syntax (sentence structure)
- ✅ Style (formal/informal tone matching)
- ✅ Context coherence (topic consistency)
- ✅ Pattern matching (Wikipedia-style text)

**Loss measures linguistic fluency, NOT factual correctness.**

---

## 📊 What 85M Parameters Can vs. Cannot Do

### ✅ Successfully Learned
- Linguistic patterns: grammar, morphology, syntax
- Contextual coherence: topic-appropriate vocabulary
- Format mimicry: news articles, formal documents
- Statistical associations: common word pairings

### ❌ Failed to Learn
- Factual grounding: "Ankara = capital" as a deterministic rule
- Logical consistency: the same prompt should yield the same fact
- Knowledge retrieval: reliable information recall
- Fact vs. pattern: distinguishing truth from plausibility

### Model Size Comparison

| Model | Parameters | Factual Reliability |
|-------|------------|---------------------|
| Kayra (this) | 85M | Poor - hallucinations common |
| GPT-2 Small | 124M | Poor - similar issues |
| GPT-2 Medium | 355M | Better but still unreliable |
| GPT-3 | 175B | Good consistency |
| GPT-4 | ~1.7T | + RLHF = reliable |
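The "same prompt → different facts" behavior documented above can be measured directly: sample the same prompt several times and count how often the expected fact appears. A minimal sketch of such a check (the `fact_consistency` helper is illustrative, not part of this repo; the commented generation call mirrors the usage section):

```python
from typing import Iterable


def fact_consistency(generations: Iterable[str], expected_fact: str) -> float:
    """Fraction of generated continuations that contain the expected fact."""
    texts = list(generations)
    if not texts:
        return 0.0
    return sum(expected_fact in t for t in texts) / len(texts)


# With the real model (downloads the checkpoint; see the usage section):
#
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
#   model = AutoModelForCausalLM.from_pretrained(
#       "sixfingerdev/kayra-1-exp", trust_remote_code=True)
#   ids = tok("Türkiye'nin başkenti", return_tensors="pt").input_ids
#   samples = [tok.decode(model.generate(ids, max_new_tokens=20, do_sample=True,
#                                        temperature=0.8, top_k=50)[0],
#                         skip_special_tokens=True)
#              for _ in range(20)]
#   print(fact_consistency(samples, "Ankara"))

# Offline illustration with outputs like those in the examples above:
mock_samples = [
    "Türkiye'nin başkenti Ankara'da düzenlenen şampiyonada...",
    "Türkiye'nin başkenti Konya, Türkiye'nin en büyük kenti...",
    "Türkiye'nin başkenti Ankara'da toplanan kurul...",
    "Türkiye'nin başkenti Bolu'da 24 Mayıs 2012 tarihinde...",
]
print(fact_consistency(mock_samples, "Ankara"))  # → 0.5
```

Substring matching is crude (it misses paraphrases and inflected forms), but it is enough to expose the ~50% "correct capital" rate reported in the evaluation summary.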
**Conclusion:** 85M parameters learn language patterns, not a knowledge base.

---

## 🔬 Technical Details

### Architecture
- Type: Transformer decoder (GPT-style)
- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000 BPE tokens
- Context: 512 tokens
- Total: ~85M parameters

### Training Data
- Wikipedia TR: 170K articles
- mC4 Turkish: 330K web documents
- Total: 500K deduplicated documents
- Deduplication: MinHash LSH (85% similarity threshold)

### Training Setup
- Effective batch size: 64 (4 × 16 gradient accumulation)
- Learning rate: 1e-4 → 3e-4 (cosine schedule with 2K warmup steps)
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Hardware: NVIDIA T4 GPU (16GB)
- Training time: ~9 hours

---

## 📈 Evaluation Summary

### Fluency: ✅ Good

| Metric | Score |
|--------|-------|
| Grammatical Turkish | 95%+ |
| Natural sentence flow | 90%+ |
| Coherent paragraphs | 85%+ |

### Factuality: ❌ Poor

| Metric | Score |
|--------|-------|
| Correct capital city | ~50% (effectively random) |
| Correct historical dates | ~40% |
| Consistent facts across runs | ~30% |

---

## 💡 Key Learnings

### 1. Pretraining ≠ Knowledge Encoding (at this scale)

85M parameters learn **how to speak Turkish**, not **what is true about Turkey**.

### 2. Solutions Require Additional Steps

**Option A: Bigger model (1B+)**
More parameters mean better fact retention, but the model still needs instruction tuning.

**Option B: Instruction tuning**
Explicit "correct answer" supervision with contrastive examples.

**Option C: Retrieval augmentation (RAG)**
An external knowledge base for fact verification.

### 3. Validation Loss Is Misleading

Low perplexity ≠ factual correctness. Always test manually:
- Same prompt → consistent facts?
- Known facts → correct retrieval?
- Hallucination rate → human evaluation

---

## 🎯 Appropriate Use Cases

### ✅ Recommended
- Research on Turkish NLP limitations
- Pretraining baseline comparisons
- Hallucination pattern studies
- Educational demonstrations
- Understanding LLM failure modes

### ❌ Not Recommended
- Production applications
- Factual question answering
- Information retrieval systems
- Educational content generation
- Any task requiring accuracy

---

## 🚀 Future: Kayra-v2

Planned improvements:
- Larger model: 350M-750M parameters
- Better tokenizer: NFC Unicode normalization
- Instruction tuning: 10K QA pairs with verified answers
- Alignment: RLHF or DPO for factual accuracy
- Evaluation: proper fact-checking benchmarks

---

## 🔧 Usage

**⚠️ Requires `trust_remote_code=True` (custom architecture).**

Load the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")
```

Generate with a repetition penalty to reduce loops:

```python
inputs = tokenizer("Türkiye'nin başkenti", return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Expected behavior:** Fluent Turkish, possibly wrong facts.

---

## 📚 Citation

```bibtex
@misc{kayra2024hallucination,
  title={Why Small Turkish GPTs Hallucinate Facts: An Experimental 85M Model},
  author={sixfingerdev},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/sixfingerdev/kayra-1-exp}},
  note={Research on loss-factuality divergence in low-resource language models}
}
```

---

## 🙏 Acknowledgments

- Inspiration: EleutherAI's research on small-model limitations
- Data: Wikimedia Foundation, Common Crawl (mC4)
- Frameworks: PyTorch, HuggingFace Transformers

---

## 📜 License

MIT License - free to use for research and education.

**Disclaimer:** This model is intentionally shared with its flaws documented.
It serves as a learning resource demonstrating why small LMs hallucinate, not as a production tool.

---

**Kayra-1-exp** - Teaching us what 85M parameters cannot do 🔬

---

**Discussion:** Found interesting hallucination patterns? Share your findings in the community discussions tab. Let's learn together why small LMs hallucinate. 🇹🇷

# 🌙 Kayra-1-exp

**Kayra** - The first experimental GPT model trained from scratch entirely on Turkish.

## 📊 Model Details

- **Model type:** Decoder-only Transformer (GPT-style)
- **Parameters:** ~85 million
- **Validation PPL:** 42.7
- **Validation loss:** 3.75
- **Language:** Turkish only
- **License:** MIT

## 🏗️ Architecture

- Layers: 10
- Hidden size: 640
- Attention heads: 10
- FFN size: 2560
- Vocabulary: 32,000
- Context length: 512

## 📚 Training Data

- **Wikipedia TR:** ~170K articles
- **mC4 Turkish:** ~330K documents
- **Total:** ~500K deduplicated documents (MinHash LSH)

## 🚀 Usage

Example code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "sixfingerdev/kayra-1-exp",
    trust_remote_code=True  # ← IMPORTANT!
)
tokenizer = AutoTokenizer.from_pretrained("sixfingerdev/kayra-1-exp")

prompt = "Türkiye'nin başkenti"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## ⚠️ Limitations (Experimental)

**This is an experimental prototype:**

- ❌ Very rare Unicode artifacts (NFD normalization)
- ❌ Can sometimes generate incorrect information
- ❌ Not recommended for production use

### Examples:

- "stadyumu" → "stad yumu" (fragmented by Unicode handling)

## 🔮 Future (Kayra-1-stable)

The fixed version will include:

- ✅ NFC Unicode normalization
- ✅ Instruction fine-tuning
- ✅ Production-ready quality

## 📈 Training Details

- **Optimizer:** AdamW (lr: 1e-4 → 3e-4, warmup: 2,000 steps)
- **Batch size:** 4 × 16 (gradient accumulation)
- **Precision:** Mixed FP16
- **Hardware:** Tesla T4 GPU
- **Training time:** ~9 hours

## 📜 License

MIT License - free for commercial and academic use.

## 🙏 Acknowledgments

- **Data:** Wikimedia, Common Crawl (mC4)
- **Inspiration:** GPT-1, Kumru

---

**Kayra** - *Türkçe'yi Yaratan Zeka* ("The Intelligence That Creates Turkish") 🌙

Model: sixfingerdev/kayra-1-exp
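**Addendum - the NFD issue illustrated:** The "stad yumu"-style splits attributed above to NFD normalization arise because Turkish letters like ş can be encoded either as one precomposed codepoint (NFC) or as a base letter plus a combining mark (NFD); a BPE tokenizer trained on mixed forms fragments words. A short sketch of the planned Kayra-1-stable fix, normalizing everything to NFC before tokenization (the example word `başkenti` is my own illustrative choice):

```python
import unicodedata

# "ş" can be one precomposed codepoint (NFC) or "s" + combining cedilla (NFD).
word_nfc = unicodedata.normalize("NFC", "başkenti")  # precomposed: 8 codepoints
word_nfd = unicodedata.normalize("NFD", word_nfc)    # decomposed: 9 codepoints

print(len(word_nfc), len(word_nfd))  # → 8 9
print(word_nfc == word_nfd)          # → False (visually identical, different bytes)

# NFC recomposes the combining marks, so applying it to all text before
# BPE training and inference guarantees one byte sequence per word:
assert unicodedata.normalize("NFC", word_nfd) == word_nfc
```

Without this step, the same word can map to two different token sequences, and the rarer (decomposed) one produces the fragmented outputs noted in the limitations.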