---
license: cc-by-nc-4.0
---

# CISA-BERTurk-Sentiment: Cross-Individual Sentiment Analysis for Historical Turkish

This model performs **Cross-Individual Sentiment Analysis (CISA)** on historical Turkish texts (1900-1950), analyzing the **author's sentiment toward specific individuals** mentioned in the text, rather than the overall text sentiment.

## 🎯 Model Details

- **Model Name**: CISA-BERTurk-Sentiment
- **Base Model**: [BERTurk](https://huggingface.co/dbmdz/bert-base-turkish-cased) (dbmdz/bert-base-turkish-cased)
- **Architecture**: DECA-EBSA (Dual-Encoder Context-Aware Entity-Based Sentiment Analysis)
- **Language**: Turkish
- **Period**: Historical Turkish texts (1900-1950)
- **Task**: Cross-Individual Sentiment Analysis
- **Classes**:
  - 0: Negative
  - 1: Neutral
  - 2: Positive

## 🆚 CISA vs Standard Sentiment Analysis

### Example Comparison:

**Text**: *"Ali Bey'in vefatı bizleri elem-i azîme sevk etmişti"* (Ali Bey's death filled us all with sadness)

| Analysis Type | Result | Explanation |
|--------------|--------|-------------|
| **Standard SA** | ❌ Negative | Overall text tone is sad |
| **CISA** | ✅ Positive | Author's respect/love for Ali Bey |

### CISA Advantages:

- ✅ **Person-focused** sentiment detection
- ✅ **Author perspective** analysis
- ✅ **Entity-based** precision
- ✅ **Context-aware** evaluation

## 📊 Performance Metrics

| Metric | Value |
|--------|-------|
| **Accuracy** | **87.08%** |
| **Precision** | **87.07%** |
| **Recall** | **87.08%** |
| **F1-Score** | **87.05%** |

## 📈 Dataset Information

- **Total Texts**: 7,816
- **Total Entities**: 9,249
- **Average Entities per Text**: 1.18
- **Sentiment Distribution**:
  - Negative: 2,357 (25.5%)
  - Neutral: 3,563 (38.5%)
  - Positive: 3,329 (36.0%)

## 🚀 Usage

**Note**: This model uses a complex DECA-EBSA architecture with enhanced attention mechanisms, Turkish linguistic features, and contextual encoding. The full implementation requires the complete model architecture from the training code.
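
This card does not document the output format of the full architecture. As a minimal sketch, assuming the classification head returns one raw score (logit) per class in the id order listed under Model Details, the scores could be decoded into a label as shown here. `ID2LABEL` and `decode_prediction` are hypothetical names introduced for illustration, and the example scores are made up rather than real model output.

```python
import math

# Class ids as listed under Model Details.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def decode_prediction(logits):
    """Turn three raw class scores into a (label, probability) pair.

    Assumption: scores are ordered by class id (0, 1, 2).
    """
    if len(logits) != len(ID2LABEL):
        raise ValueError("expected one score per class")
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical scores for one (text, entity) pair; not real model output.
label, prob = decode_prediction([-1.2, 0.3, 2.1])
print(label, round(prob, 3))
```

Keeping the id-to-label mapping in one place makes it harder to mix up class order when the same scores are reported per entity for texts that mention several people.
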
### Model Loading

```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbbiyte/CISA-BERTurk-sentiment")

# Download model weights (returns the local path of the cached file)
weights_path = hf_hub_download(
    repo_id="dbbiyte/CISA-BERTurk-sentiment",
    filename="pytorch_model.bin",
)

print("Model weights downloaded successfully!")
print("For full CISA analysis, use the complete PositionAwareDualEncoderEBSA architecture from the training code.")
```

### Expected CISA Results

For the examples in our test set:

| Text | Entity | Standard SA | CISA Result |
|------|--------|-------------|-------------|
| "Ali Bey'in vefatı hepimizi hüzne boğmuştu" | Ali Bey | Negative | **Positive** |
| "Leyla Hanım'ın musiki resitalinde, nağmelerinin ruhuma işledi" | Leyla Hanım | Positive | **Positive** |

**CISA Key Insight**: The model analyzes the author's sentiment toward the mentioned person, not the overall text sentiment.

## 🏗️ DECA-EBSA Architecture

### Dual-Encoder Structure:

1. **Text Encoder**: Full text context processing
2. **Entity Encoder**: Entity + local context processing

### Key Features:

- **Enhanced Entity-Context Attention**: 12-head cross-attention
- **Position-Aware Modeling**: Entity position information
- **Turkish Linguistic Features**: Ottoman Turkish specific patterns
- **Context-Aware Classification**: Formal/informal distinction
- **Adaptive Focal Loss**: Focus on difficult examples
- **R-Drop Regularization**: Consistency enforcement

## 🔬 Research Contributions

### 1. Cross-Individual Sentiment Analysis (CISA)

- **First application** of CISA to historical Turkish
- **Author perspective** focused sentiment analysis
- **Entity-based approach** for person-specific emotions

### 2. DECA-EBSA Methodology

- **Dual-Encoder** architecture
- **Context-Aware** modeling
- **Entity-Based** attention mechanisms

### 3. Historical Turkish NLP Contributions

- **1900-1950 period** specialized dataset
- **Ottoman Turkish** linguistic features
- **Formal/informal** context distinction

## 👥 Authors

**İzmir Institute of Technology - Digital Humanities and AI Laboratory**:

- **Dr. Mustafa İLTER** - İzmir Institute of Technology
- **Dr. Doğan EVECEN** - İzmir Institute of Technology
- **Dr. Buket ERŞAHİN** - İzmir Institute of Technology
- **Dr. Yasemin ÖZCAN GÖNÜLAL** - İzmir Institute of Technology
- **Assoc. Prof. Selma TEKİR** - İzmir Institute of Technology

**Pamukkale University**:

- **Assoc. Prof. Sezen KARABULUT** - Pamukkale University
- **İbrahim BERCİ** - Pamukkale University
- **Emre ONUÇ** - Pamukkale University

## 🏦 Funding & Acknowledgments

This work was supported by **The Scientific and Technological Research Council of Turkey (TÜBİTAK)** under project number **323K372**. We thank TÜBİTAK for their support.

## 📚 BERTurk Reference

This model uses [BERTurk](https://github.com/stefan-it/turkish-bert), developed by Stefan Schweter: a BERT model pre-trained on 35GB of Turkish text and optimized for Turkish natural language processing tasks.

## 📄 License and Usage Terms

This model is released under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.
### ✅ Permitted Uses:

- **Academic research** (citation required)
- **Educational purposes**
- **Non-profit projects**
- **Personal experimental studies**

### ❌ Prohibited Uses:

- **Commercial applications**
- **Profit-driven projects**
- **Commercial product/service development**

### 📄 Citation Requirement:

When using this model, please cite as:

```bibtex
@misc{ilter2025cisa,
  author = {İlter, Mustafa and Evecen, Doğan and Erşahin, Buket and Özcan Gönülal, Yasemin and Karabulut, Sezen and Berci, İbrahim and Onuç, Emre and Tekir, Selma},
  title = {CISA-BERTurk-Sentiment: Cross-Individual Sentiment Analysis for Historical Turkish},
  howpublished = {Deep Learning Model},
  publisher = {Hugging Face},
  url = {https://huggingface.co/dbbiyte/CISA-BERTurk-sentiment},
  doi = {10.57967/hf/6142},
  year = {2025},
}
```

## 🚨 Limitations

- Model is optimized specifically for **1900-1950 period Turkish texts**
- Performance may vary on **modern Turkish texts**
- **Historical spelling conventions** and **archaic vocabulary** should be considered
- Maximum sequence length is **256 tokens**

## 🏷️ Model Tags

`turkish` `sentiment-analysis` `historical-texts` `entity-based` `cross-individual` `berturk` `bert` `1900-1950` `pytorch` `safetensors`