MemoirNER-BERTurk
Model Description
MemoirNER-BERTurk is a fine-tuned Named Entity Recognition (NER) model based on BERTurk, trained specifically on Ottoman Turkish memoirs from the 1900-1950 period. It identifies historical figures (PERSON), locations (LOC), and organizations (ORG) mentioned in memoir texts from the late Ottoman Empire and the early Turkish Republic era.
Model Details
- Model Type: Named Entity Recognition (NER)
- Base Model: dbmdz/bert-base-turkish-cased
- Language: Turkish (Ottoman Turkish memoirs)
- Training Data: Historical memoirs and biographical texts (1900-1950)
- License: CC BY-NC 4.0
Training Data
The model was trained on a carefully curated dataset consisting of:
- Total text segments: 10,431
- Total annotated entities: 15,688
- Entity distribution:
  - PERSON: 11,498 entities (73.3%)
  - LOC (Location): 2,259 entities (14.4%)
  - ORG (Organization): 1,931 entities (12.3%)
The dataset focuses on memoirs from the transitional period of Ottoman Empire to Turkish Republic, capturing historical figures, places, and institutions of that era.
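The skew toward PERSON entities (about 73% of all annotations) is what motivates the class weighting mentioned under Training Details below. The authors' actual weighting scheme is not documented; the following is a minimal sketch of one plausible choice, inverse-frequency weights derived from the counts above:

```python
# Entity counts from the card; inverse-frequency weighting is only one
# plausible scheme, not necessarily the one used by the authors.
counts = {"PERSON": 11_498, "LOC": 2_259, "ORG": 1_931}
total = sum(counts.values())

weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(weights)  # PERSON is down-weighted, LOC and ORG are up-weighted
```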
Performance
| Entity Type | Precision | Recall | F1 Score |
|---|---|---|---|
| Person | 96.85% | 93.80% | 95.30% |
| Location | 66.14% | 89.60% | 76.10% |
| Organization | 66.05% | 90.26% | 76.28% |
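The scores above are entity-level metrics; the card does not state which evaluation tooling was used. As an assumption, the seqeval library is a common way to obtain such per-class precision, recall, and F1 from BIO-tagged sequences:

```python
from seqeval.metrics import classification_report

# Toy BIO-tagged example; label names mirror the card's entity types,
# but the model's actual tag strings may differ.
y_true = [["B-PERSON", "I-PERSON", "O", "B-LOC", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-LOC", "O"]]

# Entity-level precision, recall and F1 per class, as in the table above.
print(classification_report(y_true, y_pred, digits=4))
```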
Intended Use
Primary Use Cases
- Historical text analysis and digitization projects
- Named entity extraction from Ottoman Turkish and early Republican memoirs
- Academic research on late Ottoman and early Republican periods
- Digital humanities projects focusing on Turkish historical texts
Out-of-Scope Use
- Modern Turkish NER tasks (use contemporary Turkish NER models)
- Non-Turkish language texts
- Real-time news or social media text processing
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbbiyte/MemoirNER-BERTurk")
model = AutoModelForTokenClassification.from_pretrained("dbbiyte/MemoirNER-BERTurk")

# Create NER pipeline with entity-span aggregation
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example usage
text = "Leyla Hanım'ın Beşiktaş'taki dârülacezede verdiği musiki resitalinde, nağmelerinin ruhuma işlediğini söyledim."
entities = ner_pipeline(text)
print(entities)
```
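With aggregation_strategy="simple", the pipeline returns one dictionary per detected entity span (fields such as entity_group, word, score, start, end). Building on the entities variable from the snippet above, here is a small sketch of grouping results by entity type; the exact entity_group strings come from the model's label configuration, which the card lists as PERSON, LOC, and ORG:

```python
from collections import defaultdict

# Group detected spans by entity type; field names follow the standard
# transformers NER pipeline output for aggregated results.
by_type = defaultdict(list)
for ent in entities:
    by_type[ent["entity_group"]].append((ent["word"], round(float(ent["score"]), 3)))

for label, spans in by_type.items():
    print(label, spans)
```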
Training Details
Training Procedure
- Base Model: BERTurk (Turkish BERT)
- Training Framework: PyTorch with Transformers
- Optimization: AdamW optimizer with learning rate scheduling
- Loss Function: Focal Loss with class weighting for the imbalanced label distribution (see the sketch after this list)
- Batch Size: Adaptive based on GPU memory (16-32)
- Max Sequence Length: 512 tokens
- Training Epochs: 10 epochs with early stopping
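The card names Focal Loss with class weighting but does not include the implementation. Below is a minimal sketch of what such a loss could look like for token classification, assuming BERT-style logits, labels with -100 marking ignored sub-token/padding positions, and a hypothetical 7-label BIO scheme (O plus B-/I- tags for PERSON, LOC, ORG); the weight values are illustrative, not the authors':

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, labels, class_weights, gamma=2.0, ignore_index=-100):
    """Class-weighted focal loss for token classification.

    logits: (batch, seq_len, num_labels); labels: (batch, seq_len).
    class_weights: 1-D tensor of per-label weights (hypothetical values).
    """
    logits = logits.view(-1, logits.size(-1))
    labels = labels.view(-1)
    mask = labels != ignore_index
    logits, labels = logits[mask], labels[mask]
    # Per-token cross-entropy gives -log p_t for the true class.
    ce = F.cross_entropy(logits, labels, reduction="none")
    pt = torch.exp(-ce)
    # Focal term down-weights easy tokens; class weights counteract label imbalance.
    alpha = class_weights[labels]
    return (alpha * (1 - pt) ** gamma * ce).mean()

# Illustrative usage with a hypothetical 7-label BIO scheme.
weights = torch.tensor([0.1, 0.5, 0.5, 2.3, 2.3, 2.7, 2.7])  # e.g., inverse-frequency
logits = torch.randn(2, 16, 7)
labels = torch.randint(0, 7, (2, 16))
labels[:, 0] = -100  # special tokens are ignored
print(weighted_focal_loss(logits, labels, weights))
```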
Training Infrastructure
- Hardware: CUDA-enabled GPU
- Software: PyTorch, Transformers, scikit-learn
Limitations and Bias
Limitations
- The model is specifically trained on historical memoirs (1900-1950) and may not perform well on modern Turkish texts
- Performance on location and organization entities is lower than on person entities due to class imbalance in the training data
- Limited to three entity types (PERSON, LOC, ORG)
Bias Considerations
- The training data reflects the perspective and language use of memoir writers from 1900-1950
- May have geographical bias towards regions frequently mentioned in available memoirs
- Historical context may affect entity recognition for modern equivalents of historical places/organizations
👥 Authors
İzmir Institute of Technology - Digital Humanities and AI Laboratory:
- Dr. Mustafa İLTER - İzmir Institute of Technology
- Dr. Doğan EVECEN - İzmir Institute of Technology
- Dr. Buket ERŞAHİN - İzmir Institute of Technology
- Dr. Yasemin ÖZCAN GÖNÜLAL - İzmir Institute of Technology
- Assoc. Prof. Selma TEKİR - İzmir Institute of Technology
Pamukkale University:
- Assoc. Prof. Sezen KARABULUT - Pamukkale University
- İbrahim BERCİ - Pamukkale University
- Emre ONUÇ - Pamukkale University
🏦 Funding & Acknowledgments
This work was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under project number 323K372. We thank TÜBİTAK for their support.
📚 BERTurk Reference
This model uses BERTurk, a BERT model pre-trained on 35GB of Turkish text by Stefan Schweter and optimized for Turkish natural language processing tasks.
📄 License and Usage Terms
This model is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
✅ Permitted Uses:
- Academic research (citation required)
- Educational purposes
- Non-profit projects
- Personal experimental studies
❌ Prohibited Uses:
- Commercial applications
- Profit-driven projects
- Commercial product/service development
📄 Citation Requirement
When using this model, please cite as:
```bibtex
@misc{ilter2025memoirner,
  author       = {İlter, Mustafa and Onuç, Emre and Evecen, Doğan and Erşahin, Buket and Özcan Gönülal, Yasemin and Karabulut, Sezen and Berci, İbrahim and Tekir, Selma},
  title        = {MemoirNER-BERTurk: Named Entity Recognition for Ottoman Turkish Memoirs},
  howpublished = {Deep Learning Model},
  doi          = {10.57967/hf/6141},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/dbbiyte/MemoirNER-BERTurk},
  year         = {2025},
}
```
Model Version: 1.0
Last Updated: August 2025