---
language: tr
license: other
license_name: siriusai-premium-v1
license_link: LICENSE
tags:
- turkish
- text-classification
- bert
- nlp
- transformers
- siriusai
- production-ready
- enterprise
base_model: dbmdz/bert-base-turkish-uncased
datasets:
- custom
metrics:
- f1
- precision
- recall
- accuracy
- mcc
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: emotion-tr
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: f1
      value: 0.9744976471619214
      name: Macro F1
    - type: mcc
      value: 0.9610214790438847
---

# emotion-tr - Turkish Emotion Classification Model

<p align="center">
  <a href="https://huggingface.co/hayatiali/emotion-tr"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-emotion--tr-yellow" alt="Hugging Face"></a>
  <a href="https://huggingface.co/hayatiali/emotion-tr"><img src="https://img.shields.io/badge/Model-Production%20Ready-brightgreen" alt="Production Ready"></a>
  <img src="https://img.shields.io/badge/Language-Turkish-blue" alt="Turkish">
  <img src="https://img.shields.io/badge/Task-Text%20Classification-orange" alt="Text Classification">
</p>

This model classifies the **emotional sentiment** of Turkish text.

*Developed by the SiriusAI Tech Brain Team*

---

## Mission

> **To provide advanced sentiment analysis capabilities for Turkish text, empowering businesses and researchers to understand emotional tones effectively.**

The `emotion-tr` model builds on the **BERT architecture** to deliver high-performance text classification tailored to Turkish. By classifying text as negative, neutral, or positive, it supports a deeper understanding of customer feedback, social media interactions, and other textual data in sentiment-driven applications.

### Why This Model Matters

- **High Accuracy**: Achieves over **97% accuracy** on the held-out test set.
- **Robust Performance**: Performs consistently across all three sentiment categories.
- **Enterprise-Ready**: Designed for production environments with low inference latency.
- **Customizable**: Can be fine-tuned for domain-specific applications.
- **Comprehensive Documentation**: Extensive guidance for integration and usage.

---

## Model Overview

| Property | Value |
|----------|-------|
| **Architecture** | BertForSequenceClassification |
| **Base Model** | `dbmdz/bert-base-turkish-uncased` |
| **Task** | Text Classification |
| **Language** | Turkish (tr) |
| **Categories** | 3 labels |
| **Model Size** | ~110M parameters |
| **Inference Time** | ~10-15 ms (GPU) / ~40-50 ms (CPU) |

---

## Performance Metrics

### Final Evaluation Results

| Metric | Score | Description |
|--------|-------|-------------|
| **Macro F1** | **0.9745** | Unweighted mean of the per-class F1 scores |
| **MCC** | **0.9610** | Matthews Correlation Coefficient |
| **Accuracy** | **97.56%** | Share of test samples classified correctly |

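As a reference for how the headline metric is computed, the sketch below derives macro F1 from toy labels in pure Python; the labels are illustrative and unrelated to the model's actual predictions. In practice, `sklearn.metrics.f1_score(..., average="macro")` and `sklearn.metrics.matthews_corrcoef` compute the same quantities.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for lbl in labels:
        tp = sum(t == lbl and p == lbl for t, p in zip(y_true, y_pred))
        fp = sum(t != lbl and p == lbl for t, p in zip(y_true, y_pred))
        fn = sum(t == lbl and p != lbl for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example (illustrative labels only)
y_true = ["negatif", "notr", "pozitif", "notr"]
y_pred = ["negatif", "notr", "pozitif", "negatif"]
print(round(macro_f1(y_true, y_pred, ["negatif", "notr", "pozitif"]), 4))  # 0.7778
```

Because every class contributes equally regardless of its frequency, macro F1 penalizes weak performance on minority classes more than plain accuracy does.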
### Per-Class Performance

| Category | Accuracy | Correct | Total |
|----------|----------|---------|-------|
| **negatif** | 97.0% | 700 | 722 |
| **notr** | 98.0% | 1,069 | 1,091 |
| **pozitif** | 97.5% | 506 | 519 |

---

## Dataset

### Dataset Statistics

| Split | Samples | Purpose |
|-------|---------|---------|
| **Train** | 9,322 | Model training |
| **Test** | 2,332 | Model evaluation |
| **Total** | 11,654 | Complete dataset |

### Category Distribution

| Category | Samples | Percentage |
|----------|---------|------------|
| **sentiment_3class** | 11,654 | 100.0% |

### Subcategory Breakdown

| Category | Subcategories |
|----------|---------------|
| **sentiment_3class** | pozitif, negatif, notr |

---

## Label Definitions

| Label | ID | Description | Turkish Examples |
|-------|-----|-------------|------------------|
| **negatif** | 0 | Negative sentiment | "Bu çok kötü bir film." "Hizmet berbattı." |
| **notr** | 1 | Neutral sentiment | "Bugün hava güzel." "Toplantı yapıldı." |
| **pozitif** | 2 | Positive sentiment | "Harika bir deneyim!" "Çok memnun kaldım." |

### Important: Category Boundaries

The distinction between **notr** and **negatif** can be subtle: "Bu film sıradan" ("This film is ordinary") may be interpreted as neutral, while "Bu film kötü" ("This film is bad") is clearly negative.

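One practical way to handle such borderline cases is to route low-confidence predictions to human review. A minimal sketch, assuming the `{"category": ..., "confidence": ...}` output shape of the `predict` helper shown under Usage; the threshold is a hypothetical starting point and should be tuned on validation data:

```python
AMBIGUITY_THRESHOLD = 0.6  # hypothetical value; tune on a validation set

def route(prediction: dict) -> str:
    """Return the predicted label, or flag the text for human review."""
    if prediction["confidence"] < AMBIGUITY_THRESHOLD:
        return "human_review"
    return prediction["category"]

print(route({"category": "notr", "confidence": 0.48}))     # borderline case
print(route({"category": "negatif", "confidence": 0.93}))  # confident case
```

A well-chosen threshold trades a small amount of automation for fewer notr/negatif confusions reaching downstream systems.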
---

## Training Procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Base Model** | `dbmdz/bert-base-turkish-uncased` |
| **Max Sequence Length** | 128 tokens |
| **Batch Size** | 16 |
| **Learning Rate** | 2e-5 |
| **Epochs** | 3 |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **Loss Function** | CrossEntropyLoss / Focal Loss |
| **Problem Type** | Single-label Classification |

### Training Environment

| Resource | Specification |
|----------|---------------|
| **Hardware** | Apple Silicon (MPS) / CUDA GPU |
| **Framework** | PyTorch + Transformers |
| **Training Time** | Varies with dataset size |

---

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hayatiali/emotion-tr"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

LABELS = ["negatif", "notr", "pozitif"]

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]

    scores = {label: float(prob) for label, prob in zip(LABELS, probs)}
    primary = max(scores, key=scores.get)
    return {"category": primary, "confidence": scores[primary], "all_scores": scores}

# Example
print(predict("Bu film harika!"))
```

### Production Class

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class EmotionClassifier:
    LABELS = ["negatif", "notr", "pozitif"]

    def __init__(self, model_path="hayatiali/emotion-tr"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device).eval()

    def predict(self, text: str) -> dict:
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0].cpu().numpy()

        scores = dict(zip(self.LABELS, probs))
        return {
            "category": max(scores, key=scores.get),
            "confidence": float(max(scores.values())),
            "scores": scores,
        }
```

### Batch Inference

```python
# Assumes tokenizer, model, and LABELS from the Quick Start snippet, plus
# device = "cuda" if torch.cuda.is_available() else "cpu" and model.to(device).
def predict_batch(texts: list, batch_size: int = 32) -> list:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, max_length=128, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1).cpu().numpy()

        for prob in probs:
            results.append(dict(zip(LABELS, prob)))
    return results
```

---

## Limitations & Known Issues

### ⚠️ Model Limitations

| Limitation | Details | Impact |
|------------|---------|--------|
| **Context Sensitivity** | The model may misclassify sentiment in ambiguous contexts | Potentially inaccurate predictions |
| **Domain Adaptability** | Performance may vary across domains (e.g., social media vs. formal texts) | May require further fine-tuning for specific applications |
| **Language Nuances** | Subtle linguistic features unique to Turkish may not be fully captured | Possible classification errors in nuanced cases |

### ⚠️ Production Deployment Considerations

| Consideration | Details | Recommendation |
|---------------|---------|----------------|
| **Model Size** | ~110M parameters | Ensure adequate resources for deployment |
| **Latency** | Inference time varies with input length and server load | Use batching to improve throughput |

### Not Suitable For

- Legal document analysis
- Medical diagnosis based on text
- Any critical decision-making without human oversight

---

## Ethical Considerations

### Intended Use

- Sentiment analysis of customer feedback
- Emotional tone detection in social media posts
- Market research and analysis

### Risks

- **Bias in Data**: The model may reflect biases present in the training data, leading to skewed results.
- **Misinterpretation of Sentiments**: Incorrect sentiment classification could misguide business decisions.

### Recommendations

1. **Human Oversight**: Always pair model predictions with human judgment.
2. **Monitoring**: Regularly assess model performance and retrain as necessary.
3. **Updates**: Stay informed about model updates and fine-tune on new data as needed.

---

## Technical Specifications

### Model Architecture

```
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings
    (encoder): BertEncoder (12 layers)
    (pooler): BertPooler
  )
  (dropout): Dropout(p=0.1)
  (classifier): Linear(in_features=768, out_features=3)
)

Total Parameters: ~110M
```

### Input/Output

- **Input**: Turkish text (max 128 tokens)
- **Output**: 3-dimensional probability vector over (negatif, notr, pozitif)
- **Tokenizer**: BERTurk WordPiece (32k vocab)

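The probability vector is obtained by applying softmax to the classifier's three logits. A self-contained sketch with hypothetical logit values (not produced by the model):

```python
import math

LABELS = ["negatif", "notr", "pozitif"]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-1.2, 0.3, 2.9])  # hypothetical logits
print(dict(zip(LABELS, [round(p, 3) for p in probs])))
```

The three probabilities always sum to 1, and the predicted label is the index of the largest one.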
---

## Citation

```bibtex
@misc{emotion-tr-2025,
  title={emotion-tr - Turkish Text Classification Model},
  author={SiriusAI Tech Brain Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hayatiali/emotion-tr}},
  note={Fine-tuned from dbmdz/bert-base-turkish-uncased}
}
```

---

## Model Card Authors

**SiriusAI Tech Brain Team**

## Contact

- **Email**: info@siriusaitech.com
- **Repository**: [GitHub](https://github.com/sirius-tedarik)

---

## Changelog

### v1.0 (Current)

- Initial release
- 3-category text classification
- Macro F1: 0.9745, MCC: 0.9610

---

**License**: SiriusAI Tech Premium License v1.0

**Commercial Use**: Requires a Premium License. Contact: info@siriusaitech.com

**Free Use Allowed For**:

- Academic research and education
- Non-profit organizations (with approval)
- Evaluation (30 days)

**Disclaimer**: This model is designed for text classification applications. Always deploy it with appropriate safeguards and human oversight. Model predictions should inform decisions, not replace human judgment.