---
license: mit
datasets:
- TTA-DQA/hate_sentence
language:
- ko
metrics:
- accuracy
- f1
base_model:
- beomi/KcELECTRA-base-v2022
tags:
- Text-Classification
- Multi-Label-Classification
---

### Model Details

## 1. Overview

This model is trained to **detect whether a Korean sentence contains harmful expressions and to classify the type (category) of those expressions**.
It performs `multi-label classification`: it decides whether a sentence contains a harmful expression and, if so, **which type(s)** it belongs to.
As an AI task, it falls under `text-classification`.

The dataset used is [`TTA-DQA/hate_sentence`](https://huggingface.co/datasets/TTA-DQA/hate_sentence).

- **Classes**:
  - `"0"`: `insult`
  - `"1"`: `abuse`
  - `"2"`: `obscenity`
  - `"3"`: `TVPC(Threats of violence/promotion of crime)`
  - `"4"`: `sexuality`
  - `"5"`: `age`
  - `"6"`: `race and region`
  - `"7"`: `disabled`
  - `"8"`: `religion`
  - `"9"`: `politics`
  - `"10"`: `job`
  - `"11"`: `no_hate`
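
The class IDs above can be kept in a small lookup table. The following is a minimal sketch, not code from the model's repository: `ID2LABEL`, `LABEL2ID`, and `to_multi_hot` are hypothetical names introduced here, and it assumes that targets are multi-hot vectors over these 12 classes, which the card does not state explicitly.

```python
# Hypothetical helper names; the id/label pairs are copied from the class list above.
ID2LABEL = {
    0: "insult",
    1: "abuse",
    2: "obscenity",
    3: "TVPC",  # threats of violence / promotion of crime
    4: "sexuality",
    5: "age",
    6: "race and region",
    7: "disabled",
    8: "religion",
    9: "politics",
    10: "job",
    11: "no_hate",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def to_multi_hot(labels):
    """Encode a collection of label names as a 12-dimensional 0/1 vector."""
    vector = [0] * len(ID2LABEL)
    for label in labels:
        vector[LABEL2ID[label]] = 1
    return vector


# A sentence annotated with both "insult" and "age" sets positions 0 and 5.
print(to_multi_hot(["insult", "age"]))  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Because a sentence can carry several labels at once, a vector like this is a more natural target than a single class index.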

---

## 2. Training Details

- **Base Model**: KcELECTRA (a pre-trained Korean language model based on ELECTRA)
- **Source**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Model Type**: ELECTRA encoder with a sequence-classification head
- **Pre-training (Korean)**: approx. 17 GB (over 180 million sentences)
- **Fine-tuning (Hate Dataset)**: approx. 22.3 MB (`TTA-DQA/hate_sentence`)
- **Learning Rate**: `5e-6`
- **Weight Decay**: `0.01`
- **Epochs**: `30`
- **Batch Size**: `16`
- **Data Loader Workers**: `2`
- **Tokenizer**: `BertWordPieceTokenizer`
- **Model Size**: approx. `511MB`
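
The hyperparameters listed above map naturally onto a configuration object. The sketch below is an assumption about how such a run could be set up with the Hugging Face `Trainer`; the card does not say which training loop was actually used. `TRAINING_CONFIG`, `build_training_arguments`, and the `output_dir` value are hypothetical, and treating the batch size of 16 as a per-device value is also a guess.

```python
# Values copied from the list above. The key names follow
# transformers.TrainingArguments, which is an assumption about the tooling.
TRAINING_CONFIG = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "num_train_epochs": 30,
    "per_device_train_batch_size": 16,  # assumed to be per device
    "dataloader_num_workers": 2,
}


def build_training_arguments(output_dir="./hate-multilabel-out"):
    """Create TrainingArguments from the values above (output_dir is a placeholder)."""
    # Imported lazily so the config can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_CONFIG)
```

Keeping the values in a plain dictionary makes it easy to log them next to the evaluation results.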

---

## 3. Requirements

- `pytorch ~= 1.8.0`
- `transformers ~= 4.0.0`
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`

---

## 4. Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
model_name = "TTA-DQA/HateDetection_MultiLabel_KcElectra_FineTuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example inputs: "What should we have for lunch today?" and "You jerk."
sentences = ["오늘 점심 뭐 먹을까?", "이 나쁜 놈아."]
results = classifier(sentences)
print(results)  # one dict per sentence with the top "label" and its "score"
```
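
By default the `text-classification` pipeline reports only the highest-scoring label per sentence, which hides any additional labels in a multi-label setting. The sketch below shows one way to recover every label above a cutoff by applying a sigmoid to the raw logits. It rests on several assumptions that the card does not confirm: that output index `i` corresponds to class ID `i` in the class list, that the logits are meant to be scored independently, and that `0.5` is a sensible threshold. `LABELS`, `decode_multilabel`, and `classify` are hypothetical names.

```python
import math

MODEL_NAME = "TTA-DQA/HateDetection_MultiLabel_KcElectra_FineTuning"

# Label order copied from the class list in this card (assumed to match output indices).
LABELS = ["insult", "abuse", "obscenity", "TVPC", "sexuality", "age",
          "race and region", "disabled", "religion", "politics", "job", "no_hate"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_multilabel(logits, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score reaches the threshold."""
    scored = [(LABELS[i], sigmoid(v)) for i, v in enumerate(logits)]
    return [(label, p) for label, p in scored if p >= threshold]


def classify(texts, threshold=0.5):
    """Run the model on a list of sentences; downloads the weights on first use."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [decode_multilabel(row, threshold) for row in logits.tolist()]
```

`decode_multilabel` works on plain lists of floats, so the thresholding logic can be checked without loading the model; `classify` is only needed when real predictions are wanted.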

---

## 5. Citation

This model was built as part of the Hyperscale AI Training Data Quality Verification Project (2024 Hyperscale AI Training Data Quality Verification).

---

## 6. Bias, Risks, and Limitations

This model was not trained on data skewed toward any particular class; however, linguistic and cultural characteristics can lead to disagreement about the labels.
What counts as a harmful expression is partly subjective and depends on language, culture, application domain, and personal views, so the results may be biased or contested.

> Please note that this model's outputs are not an absolute standard for harmful expressions.

---

# Results

- Task: multi-label classification (text-classification)
- F1-score: 0.8279
- Accuracy: 0.7013
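
The card reports an F1-score and an accuracy but does not say how either was computed. One plausible reading, and only an assumption, is micro-averaged F1 together with exact-match (subset) accuracy. Under that reading accuracy can sit well below F1, because a sample counts as correct only when every label matches. The toy data and the function names `subset_accuracy` and `micro_f1` below are made up for illustration.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 from true/false positives pooled over all samples and labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy multi-hot labels (not taken from the evaluation set).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
print(subset_accuracy(y_true, y_pred))  # 0.5
print(micro_f1(y_true, y_pred))         # 0.8
```

In this example one missed label costs a whole sample in subset accuracy but only one false negative in micro F1.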