---
license: mit
datasets:
- TTA-DQA/hate_sentence
language:
- ko
metrics:
- accuracy
- f1
base_model:
- beomi/KcELECTRA-base-v2022
tags:
- Text-Classification
- Multi-Label-Classification
---
### Model Details
## 1. Overview
This model was trained to **detect whether a Korean sentence contains harmful (hate) expressions and to classify the type (category) of any harmful expression**.
It performs `multi-label classification`: it **judges (classifies)** whether a sentence contains a harmful expression and, if so, which type it belongs to.
As an AI task, it corresponds to `text-classification`.
The dataset used is [`TTA-DQA/hate_sentence`](https://huggingface.co/datasets/TTA-DQA/hate_sentence).
- **Class labels**:
- `"0"`: `insult`
- `"1"`: `abuse`
- `"2"`: `obscenity`
  - `"3"`: `TVPC (Threats of violence / promotion of crime)`
- `"4"`: `sexuality`
- `"5"`: `age`
- `"6"`: `race and region`
- `"7"`: `disabled`
- `"8"`: `religion`
- `"9"`: `politics`
- `"10"`: `job`
- `"11"`: `no_hate`
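When post-processing model outputs, the class ids above can be mirrored in a small lookup table. This is a sketch; the authoritative mapping is the `id2label` field in the model's `config.json`:

```python
# Class-id → category-name mapping, transcribed from the list above.
ID2LABEL = {
    0: "insult",
    1: "abuse",
    2: "obscenity",
    3: "TVPC (Threats of violence / promotion of crime)",
    4: "sexuality",
    5: "age",
    6: "race and region",
    7: "disabled",
    8: "religion",
    9: "politics",
    10: "job",
    11: "no_hate",
}

def label_name(label_id: int) -> str:
    """Return the human-readable category for a numeric class id."""
    return ID2LABEL[label_id]
```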
---
## 2. Training Details
- **Base Model**: KcELECTRA (a pre-trained Korean language model based on ELECTRA)
- **Source**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Model Type**: Transformer encoder (ELECTRA discriminator), fine-tuned for sequence classification
- **Pre-training Corpus (Korean)**: approx. 17 GB (over 180 million sentences)
- **Fine-tuning Data (Hate Dataset)**: approx. 22.3 MB (`TTA-DQA/hate_sentence`)
- **Learning Rate**: `5e-6`
- **Weight Decay**: `0.01`
- **Epochs**: `30`
- **Batch Size**: `16`
- **Data Loader Workers**: `2`
- **Tokenizer**: `BertWordPieceTokenizer`
- **Model Size**: approx. 511 MB
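For reproduction, the fine-tuning hyperparameters above can be collected in one place. The key names below follow Hugging Face `TrainingArguments` conventions, which is an assumption about the original training setup:

```python
# Fine-tuning hyperparameters transcribed from the list above,
# keyed with TrainingArguments-style names (an assumption).
TRAIN_CONFIG = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "num_train_epochs": 30,
    "per_device_train_batch_size": 16,
    "dataloader_num_workers": 2,
}
```

They can then be expanded into a trainer configuration, e.g. `TrainingArguments(output_dir="out", **TRAIN_CONFIG)`.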
---
## 3. Requirements
- `pytorch ~= 1.8.0`
- `transformers ~= 4.0.0`
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`
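Assuming a pip-based environment, the pins above translate to a single install command (a sketch; note that PyTorch's package name on PyPI is `torch`):

```shell
pip install "torch~=1.8.0" "transformers~=4.0.0" "emoji~=0.6.0" "soynlp~=0.0.493"
```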
---
## 4. Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "TTA-DQA/HateDetection_MultiLabel_KcElectra_FineTuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

sentences = ["오늘 점심 뭐 먹을까?", "이 나쁜 놈아."]  # illustrative example inputs
results = classifier(sentences)
print(results)
```
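Because the model is multi-label, the single top label returned by default may not be enough; `classifier(sentences, top_k=None)` returns a score for every class, and raw logits can instead be thresholded per class with a sigmoid. A minimal sketch of that thresholding step (the 0.5 cutoff is an assumption, not a documented choice):

```python
import math

def select_labels(logits, threshold=0.5):
    """Apply a per-class sigmoid to raw logits and keep classes above the threshold.

    In multi-label classification each class is scored independently,
    so a sigmoid per class (not a softmax across classes) is appropriate.
    """
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]

# Example: classes 0 and 2 clear the 0.5 threshold.
print(select_labels([2.0, -3.0, 1.5, -2.0]))  # → [0, 2]
```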
---
## 5. Citation
This model was built as part of the Hyperscale AI Training Data Quality Verification Project (2024 Hyperscale AI Training Data Quality Verification).
---
## 6. ⚠️ Bias, Risks, and Limitations
This model was not deliberately trained on biased data for any class, but linguistic and cultural factors can still lead to disagreement over labels.
What counts as a harmful expression is partly subjective, varying with language, culture, application domain, and personal views, so the model's outputs may be biased or contested.
> Please note that this model's outputs are not an absolute standard for what constitutes harmful expression.
---
## 7. Results
- Task: binary classification (text-classification)
- F1-score: 0.8279
- Accuracy: 0.7013 |