
MemoirNER-BERTurk

Model Description

MemoirNER-BERTurk is a fine-tuned Named Entity Recognition (NER) model based on BERTurk, specifically trained on Ottoman Turkish memoirs from the 1900-1950 period. The model excels at identifying historical figures, locations, and organizations mentioned in memoir texts from the late Ottoman Empire and early Turkish Republic era.

Model Details

  • Model Type: Named Entity Recognition (NER)
  • Base Model: dbmdz/bert-base-turkish-cased
  • Language: Turkish (Ottoman Turkish memoirs)
  • Training Data: Historical memoirs and biographical texts (1900-1950)
  • License: CC BY-NC 4.0

Training Data

The model was trained on a carefully curated dataset consisting of:

  • Total segments: 10,431 text segments
  • Total entities: 15,688 annotated entities
  • Entity distribution:
    • PERSON: 11,498 entities (73.3%)
    • LOC (Location): 2,259 entities (14.4%)
    • ORG (Organization): 1,931 entities (12.3%)

The dataset focuses on memoirs from the transition from the Ottoman Empire to the Turkish Republic, capturing the historical figures, places, and institutions of that era.
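
The card lists three entity types, but does not spell out the tagging scheme (e.g., whether B-/I- prefixes are used). The label inventory can be read directly from the published checkpoint's configuration, as in this short sketch:

from transformers import AutoConfig

# Load only the configuration to inspect the label set used by the classification head
config = AutoConfig.from_pretrained("dbbiyte/MemoirNER-BERTurk")
print(config.num_labels)
print(config.id2label)  # mapping from label ids to tag names, e.g. B-PERSON / I-PERSON (illustrative)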

Performance

Entity Type     Precision   Recall    F1 Score
Person          96.85%      93.80%    95.30%
Location        66.14%      89.60%    76.10%
Organization    66.05%      90.26%    76.28%
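
The card does not state how these span-level scores were computed. A common choice for NER evaluation is seqeval over BIO-tagged sequences; the snippet below is only an illustrative sketch with made-up tag sequences, not the project's evaluation script.

# pip install seqeval
from seqeval.metrics import classification_report

# Hypothetical gold and predicted BIO tag sequences, for illustration only
y_true = [["B-PERSON", "I-PERSON", "O", "B-LOC", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O"]]

# Per-entity precision, recall, and F1 at the span level
print(classification_report(y_true, y_pred, digits=4))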

Intended Use

Primary Use Cases

  • Historical text analysis and digitization projects
  • Named entity extraction from Ottoman Turkish and early Republican memoirs
  • Academic research on late Ottoman and early Republican periods
  • Digital humanities projects focusing on Turkish historical texts

Out-of-Scope Use

  • Modern Turkish NER tasks (use contemporary Turkish NER models)
  • Non-Turkish language texts
  • Real-time news or social media text processing

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbbiyte/MemoirNER-BERTurk")
model = AutoModelForTokenClassification.from_pretrained("dbbiyte/MemoirNER-BERTurk")

# Create NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example usage
text = "Leyla Hanım'ın Beşiktaş'taki dârülacezede verdiği musiki resitalinde, nağmelerinin ruhuma işlediğini söyledim."
entities = ner_pipeline(text)
print(entities)
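
With aggregation_strategy="simple", the pipeline merges word pieces and returns one dictionary per entity span; the fields entity_group, score, word, start, and end follow the standard Transformers token-classification pipeline output. A minimal way to inspect the results:

# Print each detected entity span with its label and confidence score
for ent in entities:
    print(f"{ent['entity_group']:>6}  {ent['word']}  ({ent['score']:.2f})")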

Training Details

Training Procedure

  • Base Model: BERTurk (Turkish BERT)
  • Training Framework: PyTorch with Transformers
  • Optimization: AdamW optimizer with learning rate scheduling
  • Loss Function: Focal Loss with class weighting for the imbalanced dataset (see the sketch after this list)
  • Batch Size: Adaptive based on GPU memory (16-32)
  • Max Sequence Length: 512 tokens
  • Training Epochs: 10 epochs with early stopping
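
The training script itself is not included in this card. As a rough illustration of the class-weighted focal loss mentioned above, the sketch below applies a focusing term to the per-token cross-entropy; the weights, gamma value, and label count are placeholders, not the values used for this model.

import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, labels, class_weights, gamma=2.0, ignore_index=-100):
    """Class-weighted focal loss for token classification (illustrative sketch)."""
    num_labels = logits.size(-1)
    logits = logits.view(-1, num_labels)        # (batch * seq_len, num_labels)
    labels = labels.view(-1)                    # (batch * seq_len,)

    # Unweighted per-token cross-entropy; padding tokens (ignore_index) are masked out below
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction="none")
    pt = torch.exp(-ce)                         # probability the model assigns to the gold label

    # Per-class weight (alpha) looked up from the gold label; clamp avoids indexing with -100
    alpha = class_weights[labels.clamp(min=0)]

    focal = alpha * (1.0 - pt) ** gamma * ce    # down-weights easy, well-classified tokens
    mask = labels != ignore_index
    return focal[mask].mean()

# Example with placeholder class weights for 7 BIO labels (values are illustrative)
# class_weights = torch.tensor([0.2, 1.0, 1.0, 2.5, 2.5, 3.0, 3.0])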

Training Infrastructure

  • Hardware: CUDA-enabled GPU
  • Software: PyTorch, Transformers, scikit-learn

Limitations and Bias

Limitations

  • The model is specifically trained on historical memoirs (1900-1950) and may not perform well on modern Turkish texts
  • Performance on location and organization entities is lower than on person entities because of class imbalance in the training data
  • Limited to three entity types (PERSON, LOC, ORG)

Bias Considerations

  • The training data reflects the perspective and language use of memoir writers from 1900-1950
  • May have geographical bias towards regions frequently mentioned in available memoirs
  • Historical context may affect entity recognition for modern equivalents of historical places/organizations

👥 Authors

İzmir Institute of Technology - Digital Humanities and AI Laboratory:

  • Dr. Mustafa İLTER - İzmir Institute of Technology
  • Dr. Doğan EVECEN - İzmir Institute of Technology
  • Dr. Buket ERŞAHİN - İzmir Institute of Technology
  • Dr. Yasemin ÖZCAN GÖNÜLAL - İzmir Institute of Technology
  • Assoc. Prof. Selma TEKİR - İzmir Institute of Technology

Pamukkale University:

  • Assoc. Prof. Sezen KARABULUT - Pamukkale University
  • İbrahim BERCİ - Pamukkale University
  • Emre ONUÇ - Pamukkale University

🏦 Funding & Acknowledgments

This work was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under project number 323K372. We thank TÜBİTAK for their support.

📚 BERTurk Reference

This model is based on BERTurk, developed by Stefan Schweter: a BERT model pre-trained on 35GB of Turkish text and optimized for Turkish natural language processing tasks.

📄 License and Usage Terms

This model is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

✅ Permitted Uses:

  • Academic research (citation required)
  • Educational purposes
  • Non-profit projects
  • Personal experimental studies

❌ Prohibited Uses:

  • Commercial applications
  • Profit-driven projects
  • Commercial product/service development

📄 Citation Requirement

When using this model, please cite as:

@misc{ilter2025memoirner,
  author = {İlter, Mustafa and Onuç, Emre and Evecen, Doğan and Erşahin, Buket and Özcan Gönülal, Yasemin and Karabulut, Sezen and Berci, İbrahim and Tekir, Selma},
  title = {MemoirNER-BERTurk: Named Entity Recognition for Ottoman Turkish Memoirs},
  howpublished = {Deep Learning Model},
  doi = {10.57967/hf/6141},
  publisher = {Hugging Face},
  url = {https://huggingface.co/dbbiyte/MemoirNER-BERTurk},
  year = {2025},
}

Model Version: 1.0
Last Updated: August 2025
