# Turkish Lemmatizer with spaCy + Transformers

This is a fine-tuned spaCy pipeline for Turkish lemmatization with a transformer backbone (`dbmdz/bert-base-turkish-cased`).
It combines a `transformer` component with a `trainable_lemmatizer` to produce high-quality lemmas for Turkish text.


## ⚡ Dependencies

Install spaCy and spaCy Transformers:

```bash
pip install -U spacy spacy-transformers
```

---
## Usage

```python
import spacy
from huggingface_hub import snapshot_download

# Download the model from Hugging Face
model_path = snapshot_download("umit/turkish-lemmatizer")

# Load with spaCy
nlp = spacy.load(model_path)

# Test lemmatization
doc = nlp("Ayşe kitapları masanın üzerine koydu.")
print([token.lemma_ for token in doc])
# Expected output: ['Ayşe', 'kitap', 'masa', 'üzer', 'koy', '.']
```

---

## Training Details

- **Language:** Turkish (`tr`)  
- **Pipeline components:** `transformer`, `trainable_lemmatizer`  
- **Transformer backbone:** `dbmdz/bert-base-turkish-cased`  
- **Batch size:** 128  
- **Max epochs:** 10 (or 200,000 steps)  
- **Lemmatizer:** trainable with `orth` backoff
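
The details above correspond to a spaCy training config along these lines. This is a sketch, not the exact file shipped with the model; the section layout follows spaCy's standard config schema, and the batcher settings in particular are assumptions:

```ini
[nlp]
lang = "tr"
pipeline = ["transformer", "trainable_lemmatizer"]
batch_size = 128

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "dbmdz/bert-base-turkish-cased"

[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"

[training]
max_epochs = 10
max_steps = 200000
```

With a config like this, training is driven by `python -m spacy train config.cfg` in the usual spaCy workflow.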

---

## Training Performance

The model was trained on **3,402,790** tokens (128,663 unique tokens).
It was evaluated on a **held-out test set** of 854,544 tokens that were **not seen during training**.

| Step  | Lemma Accuracy (%) | Score |
|-------|------------------|-------|
| 0     | 55.66            | 0.56  |
| 2000  | 87.72            | 0.88  |
| 4000  | 94.28            | 0.94  |
| 6000  | 95.89            | 0.96  |
| 8000  | 96.57            | 0.97  |
| 10000 | 96.99            | 0.97  |
| 13000 | 97.29            | 0.97  |

> ✅ The model reached over **97% lemmatization accuracy** during training.
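
Lemma accuracy in the table is token-level exact match between predicted and gold lemmas. A minimal sketch of how such a score is computed (the gold/predicted lemmas below are hypothetical, not drawn from the actual test set):

```python
def lemma_accuracy(gold, pred):
    """Exact-match lemma accuracy over aligned token sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Hypothetical example: 2 of 3 lemmas match exactly
gold = ["kitap", "masa", "koy"]
pred = ["kitap", "masa", "koymak"]
print(lemma_accuracy(gold, pred))  # → 0.6666666666666666
```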

## Citation / Contributors

If you use this model, please credit the following contributors for the data and model development:

**Contributors:**  
Ümit Atlamaz, Yasin Demirtaş, Oğuz Özgür Uğur, Feyza Budan, Özkan Yavuz

```bibtex
@misc{umit2026turkishlemmatizer,
  title={Turkish Lemmatizer with spaCy + Transformers},
  author={Ümit Atlamaz and Yasin Demirtaş and Oğuz Özgür Uğur and Feyza Budan and Özkan Yavuz},
  year={2026},
  howpublished={\url{https://huggingface.co/umit/turkish-lemmatizer}}
}
```