---
license: cc-by-nc-4.0
---
# CISA-BERTurk-Sentiment: Cross-Individual Sentiment Analysis for Historical Turkish
This model performs **Cross-Individual Sentiment Analysis (CISA)** on historical Turkish texts (1900-1950), analyzing the **author's sentiment toward specific individuals** mentioned in the text, rather than the overall text sentiment.
## 🎯 Model Details
- **Model Name**: CISA-BERTurk-Sentiment
- **Base Model**: [BERTurk](https://huggingface.co/dbmdz/bert-base-turkish-cased) (dbmdz/bert-base-turkish-cased)
- **Architecture**: DECA-EBSA (Dual-Encoder Context-Aware Entity-Based Sentiment Analysis)
- **Language**: Turkish
- **Period**: Historical Turkish texts (1900-1950)
- **Task**: Cross-Individual Sentiment Analysis
- **Classes**:
  - 0: Negative
  - 1: Neutral
  - 2: Positive
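The label ids can be written as a plain mapping (a convenience sketch; the names mirror the class list above):

```python
# Label ids used by CISA-BERTurk-Sentiment, as listed above
id2label = {0: "Negative", 1: "Neutral", 2: "Positive"}
label2id = {name: i for i, name in id2label.items()}

print(id2label[2])           # Positive
print(label2id["Negative"])  # 0
```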
## 🆚 CISA vs Standard Sentiment Analysis
### Example Comparison:
**Text**: *"Ali Bey'in vefatı bizleri elem-i azîme sevk etmişti"* (Ali Bey's death filled us all with sadness)
| Analysis Type | Result | Explanation |
|--------------|--------|-------------|
| **Standard SA** | ❌ Negative | Overall text tone is sad |
| **CISA** | ✅ Positive | Author's respect/love for Ali Bey |
### CISA Advantages:
- **Person-focused** sentiment detection
- **Author perspective** analysis
- **Entity-based** precision
- **Context-aware** evaluation
## 📊 Performance Metrics
| Metric | Value |
|--------|-------|
| **Accuracy** | **87.08%** |
| **Precision** | **87.07%** |
| **Recall** | **87.08%** |
| **F1-Score** | **87.05%** |
## 📈 Dataset Information
- **Total Texts**: 7,816
- **Total Entities**: 9,249
- **Average Entities per Text**: 1.18
- **Sentiment Distribution**:
  - Negative: 2,357 (25.5%)
  - Neutral: 3,563 (38.5%)
  - Positive: 3,329 (36.0%)
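The counts above are internally consistent; a quick sanity check (pure Python, no model needed):

```python
# Class counts from the dataset table above
counts = {"Negative": 2357, "Neutral": 3563, "Positive": 3329}

total_entities = sum(counts.values())
print(total_entities)                   # 9249, matching "Total Entities"
print(round(total_entities / 7816, 2))  # 1.18 entities per text

for name, n in counts.items():
    print(name, f"{100 * n / total_entities:.1f}%")
```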
## 🚀 Usage
**Note**: Because the DECA-EBSA architecture adds enhanced attention mechanisms, Turkish linguistic features, and contextual encoding on top of BERTurk, full CISA inference requires the complete model class from the training code. The snippet below loads only the tokenizer and the raw weights.
### Model Loading
```python
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbbiyte/CISA-BERTurk-sentiment")
# Download model weights
weights_path = hf_hub_download("dbbiyte/CISA-BERTurk-sentiment", "pytorch_model.bin")
print("Model weights downloaded successfully!")
print("For full CISA analysis, use the complete PositionAwareDualEncoderEBSA architecture from the training code.")
```
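Once the full architecture is assembled and produces logits, mapping them onto the three CISA labels is straightforward. The logits below are made-up placeholders for illustration; real logits come from the DECA-EBSA forward pass:

```python
import torch

# Map raw model logits to the three CISA labels.
# These logits are placeholders, not actual model output.
labels = ["Negative", "Neutral", "Positive"]
logits = torch.tensor([[-1.2, 0.3, 2.1]])

probs = torch.softmax(logits, dim=-1)
pred_id = int(probs.argmax(dim=-1))
print(labels[pred_id], round(float(probs[0, pred_id]), 2))  # Positive 0.83
```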
### Expected CISA Results
For the examples in our test set:
| Text | Entity | Standard SA | CISA Result |
|------|--------|-------------|-------------|
| "Ali Bey'in vefatı hepimizi hüzne boğmuştu" (Ali Bey's death plunged us all into sorrow) | Ali Bey | Negative | **Positive** |
| "Leyla Hanım'ın musiki resitalinde, nağmelerinin ruhuma işledi" (At Leyla Hanım's music recital, her melodies touched my soul) | Leyla Hanım | Positive | **Positive** |
**CISA Key Insight**: The model analyzes the author's sentiment toward the mentioned person, not the overall text sentiment.
## 🏗️ DECA-EBSA Architecture
### Dual-Encoder Structure:
1. **Text Encoder**: Full text context processing
2. **Entity Encoder**: Entity + local context processing
### Key Features:
- **Enhanced Entity-Context Attention**: 12-head cross-attention
- **Position-Aware Modeling**: Entity position information
- **Turkish Linguistic Features**: Ottoman Turkish specific patterns
- **Context-Aware Classification**: Formal/informal distinction
- **Adaptive Focal Loss**: Focus on difficult examples
- **R-Drop Regularization**: Consistency enforcement
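The dual-encoder idea can be sketched as a minimal PyTorch module. This is a structural illustration only, not the released PositionAwareDualEncoderEBSA: layer counts, pooling, and the positional/linguistic-feature inputs are placeholder assumptions; only the text-encoder/entity-encoder split and the 12-head cross-attention follow the description above.

```python
import torch
import torch.nn as nn

class DualEncoderSketch(nn.Module):
    """Hypothetical simplification of the dual-encoder + cross-attention idea."""

    def __init__(self, hidden=768, heads=12, num_labels=3):
        super().__init__()
        # One encoder stream for the full text, one for the entity span
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True), num_layers=1)
        self.entity_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True), num_layers=1)
        # 12-head cross-attention: entity representation attends to the text
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, text_emb, entity_emb):
        text_h = self.text_encoder(text_emb)      # (B, T, H)
        ent_h = self.entity_encoder(entity_emb)   # (B, E, H)
        fused, _ = self.cross_attn(ent_h, text_h, text_h)
        return self.classifier(fused.mean(dim=1))  # (B, num_labels)

model = DualEncoderSketch()
with torch.no_grad():
    logits = model(torch.randn(1, 32, 768), torch.randn(1, 4, 768))
print(logits.shape)  # torch.Size([1, 3])
```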
## 🔬 Research Contributions
### 1. Cross-Individual Sentiment Analysis (CISA)
- **First application** of CISA to historical Turkish
- **Author perspective** focused sentiment analysis
- **Entity-based approach** for person-specific emotions
### 2. DECA-EBSA Methodology
- **Dual-Encoder** architecture
- **Context-Aware** modeling
- **Entity-Based** attention mechanisms
### 3. Historical Turkish NLP Contributions
- **1900-1950 period** specialized dataset
- **Ottoman Turkish** linguistic features
- **Formal/informal** context distinction
## 👥 Authors
**İzmir Institute of Technology - Digital Humanities and AI Laboratory**:
- **Dr. Mustafa İLTER** - İzmir Institute of Technology
- **Dr. Doğan EVECEN** - İzmir Institute of Technology
- **Dr. Buket ERŞAHİN** - İzmir Institute of Technology
- **Dr. Yasemin ÖZCAN GÖNÜLAL** - İzmir Institute of Technology
- **Assoc. Prof. Selma TEKİR** - İzmir Institute of Technology
**Pamukkale University**:
- **Assoc. Prof. Sezen KARABULUT** - Pamukkale University
- **İbrahim BERCİ** - Pamukkale University
- **Emre ONUÇ** - Pamukkale University
## 🏦 Funding & Acknowledgments
This work was supported by **The Scientific and Technological Research Council of Turkey (TÜBİTAK)** under project number **323K372**. We thank TÜBİTAK for their support.
## 📚 BERTurk Reference
This model uses [BERTurk](https://github.com/stefan-it/turkish-bert) developed by Stefan Schweter, a BERT model pre-trained on 35GB of Turkish text, optimized for Turkish natural language processing tasks.
## 📄 License and Usage Terms
This model is released under **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.
### ✅ Permitted Uses:
- **Academic research** (citation required)
- **Educational purposes**
- **Non-profit projects**
- **Personal experimental studies**
### ❌ Prohibited Uses:
- **Commercial applications**
- **Profit-driven projects**
- **Commercial product/service development**
### 📄 Citation Requirement:
When using this model, please cite as:
```bibtex
@misc{ilter2025cisa,
  author       = {İlter, Mustafa and Evecen, Doğan and Erşahin, Buket and Özcan Gönülal, Yasemin and Karabulut, Sezen and Berci, İbrahim and Onuç, Emre and Tekir, Selma},
  title        = {CISA-BERTurk-Sentiment: Cross-Individual Sentiment Analysis for Historical Turkish},
  howpublished = {Deep Learning Model},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/dbbiyte/CISA-BERTurk-sentiment},
  doi          = {10.57967/hf/6142},
  year         = {2025}
}
```
## 🚨 Limitations
- Model is optimized specifically for **1900-1950 period Turkish texts**
- Performance may vary on **modern Turkish texts**
- **Historical spelling conventions** and **archaic vocabulary** should be considered
- Maximum sequence length is **256 tokens**
## 🏷️ Model Tags
`turkish` `sentiment-analysis` `historical-texts` `entity-based` `cross-individual` `berturk` `bert` `1900-1950` `pytorch` `safetensors`