MemoirNER-BERTurk
Model Description
MemoirNER-BERTurk is a fine-tuned Named Entity Recognition (NER) model based on BERTurk, trained specifically on Ottoman Turkish memoirs from the 1900-1950 period. It identifies historical figures (PERSON), locations (LOC), and organizations (ORG) mentioned in memoir texts from the late Ottoman Empire and the early Turkish Republic era.
Model Details
- Model Type: Named Entity Recognition (NER)
- Base Model: dbmdz/bert-base-turkish-cased
- Language: Turkish (Ottoman Turkish memoirs)
- Training Data: Historical memoirs and biographical texts (1900-1950)
- License: CC BY-NC 4.0
Training Data
The model was trained on a carefully curated dataset consisting of:
- Total text segments: 10,431
- Total annotated entities: 15,688
- Entity distribution:
  - PERSON: 11,498 entities (73.3%)
  - LOC (Location): 2,259 entities (14.4%)
  - ORG (Organization): 1,931 entities (12.3%)
The dataset focuses on memoirs from the transitional period of Ottoman Empire to Turkish Republic, capturing historical figures, places, and institutions of that era.
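The skew toward PERSON entities (about 73% of all annotations) is what motivates the class weighting mentioned under Training Details below. The authors' actual weighting scheme is not documented; the following is a minimal sketch of one plausible choice, inverse-frequency weights derived from the counts above:

```python
# Entity counts from the card; inverse-frequency weighting is only one
# plausible scheme, not necessarily the one used by the authors.
counts = {"PERSON": 11_498, "LOC": 2_259, "ORG": 1_931}
total = sum(counts.values())

weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(weights)  # PERSON is down-weighted, LOC and ORG are up-weighted
```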
Performance
| Entity Type | Precision | Recall | F1 Score |
|---|---|---|---|
| Person | 96.85% | 93.80% | 95.30% |
| Location | 66.14% | 89.60% | 76.10% |
| Organization | 66.05% | 90.26% | 76.28% |
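The scores above are entity-level metrics; the card does not state which evaluation tooling was used. As an assumption, the seqeval library is a common way to obtain such per-class precision, recall, and F1 from BIO-tagged sequences:

```python
from seqeval.metrics import classification_report

# Toy BIO-tagged example; label names mirror the card's entity types,
# but the model's actual tag strings may differ.
y_true = [["B-PERSON", "I-PERSON", "O", "B-LOC", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-LOC", "O"]]

# Entity-level precision, recall and F1 per class, as in the table above.
print(classification_report(y_true, y_pred, digits=4))
```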
Intended Use
Primary Use Cases
- Historical text analysis and digitization projects
- Named entity extraction from Ottoman Turkish and early Republican memoirs
- Academic research on late Ottoman and early Republican periods
- Digital humanities projects focusing on Turkish historical texts
Out-of-Scope Use
- Modern Turkish NER tasks (use contemporary Turkish NER models)
- Non-Turkish language texts
- Real-time news or social media text processing
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbbiyte/MemoirNER-BERTurk")
model = AutoModelForTokenClassification.from_pretrained("dbbiyte/MemoirNER-BERTurk")

# Create NER pipeline with entity-span aggregation
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example usage
text = "Leyla Hanım'ın Beşiktaş'taki dârülacezede verdiği musiki resitalinde, nağmelerinin ruhuma işlediğini söyledim."
entities = ner_pipeline(text)
print(entities)
```
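With aggregation_strategy="simple", the pipeline returns one dictionary per detected entity span (fields such as entity_group, word, score, start, end). Building on the entities variable from the snippet above, here is a small sketch of grouping results by entity type; the exact entity_group strings come from the model's label configuration, which the card lists as PERSON, LOC, and ORG:

```python
from collections import defaultdict

# Group detected spans by entity type; field names follow the standard
# transformers NER pipeline output for aggregated results.
by_type = defaultdict(list)
for ent in entities:
    by_type[ent["entity_group"]].append((ent["word"], round(float(ent["score"]), 3)))

for label, spans in by_type.items():
    print(label, spans)
```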
Training Details
Training Procedure
- Base Model: BERTurk (Turkish BERT)
- Training Framework: PyTorch with Transformers
- Optimization: AdamW optimizer with learning rate scheduling
- Loss Function: Focal Loss with class weighting for the imbalanced label distribution (see the sketch after this list)
- Batch Size: Adaptive based on GPU memory (16-32)
- Max Sequence Length: 512 tokens
- Training Epochs: 10 epochs with early stopping
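The card names Focal Loss with class weighting but does not include the implementation. Below is a minimal sketch of what such a loss could look like for token classification, assuming BERT-style logits, labels with -100 marking ignored sub-token/padding positions, and a hypothetical 7-label BIO scheme (O plus B-/I- tags for PERSON, LOC, ORG); the weight values are illustrative, not the authors':

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, labels, class_weights, gamma=2.0, ignore_index=-100):
    """Class-weighted focal loss for token classification.

    logits: (batch, seq_len, num_labels); labels: (batch, seq_len).
    class_weights: 1-D tensor of per-label weights (hypothetical values).
    """
    logits = logits.view(-1, logits.size(-1))
    labels = labels.view(-1)
    mask = labels != ignore_index
    logits, labels = logits[mask], labels[mask]
    # Per-token cross-entropy gives -log p_t for the true class.
    ce = F.cross_entropy(logits, labels, reduction="none")
    pt = torch.exp(-ce)
    # Focal term down-weights easy tokens; class weights counteract label imbalance.
    alpha = class_weights[labels]
    return (alpha * (1 - pt) ** gamma * ce).mean()

# Illustrative usage with a hypothetical 7-label BIO scheme.
weights = torch.tensor([0.1, 0.5, 0.5, 2.3, 2.3, 2.7, 2.7])  # e.g., inverse-frequency
logits = torch.randn(2, 16, 7)
labels = torch.randint(0, 7, (2, 16))
labels[:, 0] = -100  # special tokens are ignored
print(weighted_focal_loss(logits, labels, weights))
```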
Training Infrastructure
- Hardware: CUDA-enabled GPU
- Software: PyTorch, Transformers, scikit-learn
Limitations and Bias
Limitations
- The model is specifically trained on historical memoirs (1900-1950) and may not perform well on modern Turkish texts
- Performance on location and organization entities is lower than on person entities due to class imbalance in the training data
- Limited to three entity types (PERSON, LOC, ORG)
Bias Considerations
- The training data reflects the perspective and language use of memoir writers from 1900-1950
- May have geographical bias towards regions frequently mentioned in available memoirs
- Historical context may affect entity recognition for modern equivalents of historical places/organizations
👥 Authors
İzmir Institute of Technology - Digital Humanities and AI Laboratory:
- Dr. Mustafa İLTER - İzmir Institute of Technology
- Dr. Doğan EVECEN - İzmir Institute of Technology
- Dr. Buket ERŞAHİN - İzmir Institute of Technology
- Dr. Yasemin ÖZCAN GÖNÜLAL - İzmir Institute of Technology
- Assoc. Prof. Selma TEKİR - İzmir Institute of Technology
Pamukkale University:
- Assoc. Prof. Sezen KARABULUT - Pamukkale University
- İbrahim BERCİ - Pamukkale University
- Emre ONUÇ - Pamukkale University
🏦 Funding & Acknowledgments
This work was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under project number 323K372. We thank TÜBİTAK for their support.
📚 BERTurk Reference
This model uses BERTurk, a BERT model pre-trained on 35GB of Turkish text by Stefan Schweter and optimized for Turkish natural language processing tasks.
📄 License and Usage Terms
This model is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
✅ Permitted Uses:
- Academic research (citation required)
- Educational purposes
- Non-profit projects
- Personal experimental studies
❌ Prohibited Uses:
- Commercial applications
- Profit-driven projects
- Commercial product/service development
📄 Citation Requirement
When using this model, please cite as:
```bibtex
@misc{ilter2025memoirner,
  author       = {İlter, Mustafa and Onuç, Emre and Evecen, Doğan and Erşahin, Buket and Özcan Gönülal, Yasemin and Karabulut, Sezen and Berci, İbrahim and Tekir, Selma},
  title        = {MemoirNER-BERTurk: Named Entity Recognition for Ottoman Turkish Memoirs},
  howpublished = {Deep Learning Model},
  doi          = {10.57967/hf/6141},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/dbbiyte/MemoirNER-BERTurk},
  year         = {2025},
}
```
Model Version: 1.0
Last Updated: August 2025