---
license: mit
datasets:
- TTA-DQA/hate_sentence
language:
- ko
metrics:
- accuracy
- f1
base_model:
- beomi/KcELECTRA-base-v2022
tags:
- Text-Classification
- Multi-Label-Classification
---
### 📌 Model Details
## 1. 🧾 Overview

This model was trained to **classify whether a Korean sentence contains harmful expressions and, if so, which type (category) of harmful expression it contains**.  
It performs `multi-label classification`, **determining (classifying)** whether a harmful expression is present and, if one is, its type.  
In terms of AI task, it falls under `text-classification`.  
The dataset used is [`TTA-DQA/hate_sentence`](https://huggingface.co/datasets/TTA-DQA/hate_sentence).

- **Classes**:  
  - `"0"`: `insult`  
  - `"1"`: `abuse`
  - `"2"`: `obscenity`
  - `"3"`: `TVPC(Threats of violence/promotion of crime)`
  - `"4"`: `sexuality`
  - `"5"`: `age`  
  - `"6"`: `race and region`  
  - `"7"`: `disabled`  
  - `"8"`: `religion`  
  - `"9"`: `politics`  
  - `"10"`: `job`  
  - `"11"`: `no_hate`  
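
The class indices above can be captured in a small lookup table. The sketch below is illustrative only: the names `ID2LABEL` and `LABEL2ID` are assumptions made for this example, and the published checkpoint's `config.json` may spell its labels differently.

```python
# Index-to-label mapping transcribed from the class list above.
# ID2LABEL / LABEL2ID are hypothetical names used only for this example.
ID2LABEL = {
    0: "insult",
    1: "abuse",
    2: "obscenity",
    3: "TVPC",  # Threats of violence / promotion of crime
    4: "sexuality",
    5: "age",
    6: "race and region",
    7: "disabled",
    8: "religion",
    9: "politics",
    10: "job",
    11: "no_hate",
}

# Reverse mapping, handy when turning label names back into indices.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```
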
---
## 2. 🧠 Training Information

- **Base Model**: KcElectra (a pre-trained Korean language model based on Electra)
- **Source**: [beomi/KcELECTRA](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Model Type**: Causal Language Model  
- **Pre-training (Korean)**: approx. 17GB (over 180 million sentences)
- **Fine-tuning (Hate Dataset)**: approx. 22.3MB (`TTA-DQA/hate_sentence`)  
- **Learning Rate**: `5e-6`  
- **Weight Decay**: `0.01`  
- **Epochs**: `30`  
- **Batch Size**: `16`  
- **Data Loader Workers**: `2`  
- **Tokenizer**: `BertWordPieceTokenizer`  
- **Model Size**: approx. `511MB`
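
For readers who want to set up a comparable fine-tuning run, the hyperparameters listed above can be gathered into keyword arguments. This is a sketch, not the authors' training script: the dict name `training_kwargs` is hypothetical, and the keys follow the parameter names of `transformers.TrainingArguments` on the assumption that the Hugging Face `Trainer` was used, which this card does not state.

```python
# Hyperparameters copied from the list above.
# Assumption: keys mirror transformers.TrainingArguments parameter names;
# the original training script is not part of this card.
training_kwargs = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "num_train_epochs": 30,
    "per_device_train_batch_size": 16,
    "dataloader_num_workers": 2,
}

# Print the settings so they can be checked against the list above.
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

Under that assumption, the dict could be unpacked into `TrainingArguments` together with an `output_dir` of your choice.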

---

## 3. 🧩 Requirements

- `pytorch ~= 1.8.0`  
- `transformers ~= 4.0.0`
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`

---

## 4. 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "TTA-DQA/HateDetection_MultiLabel_KcElectra_FineTuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example inputs (Korean): "What should I eat for lunch today?" and "You bad guy."
sentences = ["오늘 점심 뭐 먹을까?", "이 나쁜 놈아."]
results = classifier(sentences)
print(results)
```
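
Because the model is multi-label, more than one category can apply to a single sentence. A common way to decode such a model is to apply a sigmoid to each logit independently and keep every label whose probability clears a threshold. The sketch below illustrates that step in plain Python; the function name `decode_multilabel`, the example logits, and the `0.5` threshold are assumptions for illustration, not values taken from this model.

```python
import math

# Label order follows the class list in the Overview section.
LABELS = [
    "insult", "abuse", "obscenity", "TVPC", "sexuality", "age",
    "race and region", "disabled", "religion", "politics", "job", "no_hate",
]


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score >= threshold."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    scores = [(label, sigmoid(z)) for label, z in zip(LABELS, logits)]
    return [(label, p) for label, p in scores if p >= threshold]


# Made-up logits, for illustration only (not real model output).
example_logits = [2.0, 1.0, -3.0, -4.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -2.0]
print(decode_multilabel(example_logits))
```

In recent `transformers` releases, calling the pipeline with `top_k=None` returns a score for every label rather than only the top one, which gives the per-label probabilities needed for this kind of thresholding.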

---

## 5. 📚 Citation
This model was built under the Hyperscale AI Training Data Quality Verification Project (2024 Hyperscale AI Training Data Quality Verification).

---

## 6. ⚠️ Bias, Risks, and Limitations

Although this model was not trained on data skewed toward any particular class,  
linguistic and cultural characteristics may lead to disagreement over the labels.  
What counts as a harmful expression is partly subjective and depends on language, culture, application domain, and personal views,  
so the results may be biased or contested.  

> ❗ Please note that this model's output is not an absolute standard for what constitutes a harmful expression.

---

# 📈 Results
- Task: multi-label classification (text-classification)
- F1-score: 0.8279
- Accuracy: 0.7013
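
The card does not say how the two metrics are averaged. One reading that would explain an accuracy lower than the F1-score is exact-match (subset) accuracy reported alongside micro-averaged F1: a prediction with one wrong label out of several scores zero on exact match but still earns partial credit under F1. The sketch below computes both on toy data under that assumption; the function names and the example label sets are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label set matches exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if set(t) == set(p))
    return matches / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) decisions."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        tp += len(t & p)  # labels predicted and correct
        fp += len(p - t)  # labels predicted but wrong
        fn += len(t - p)  # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data, for illustration only: the first sample misses one of two labels.
y_true = [{"insult", "abuse"}, {"no_hate"}, {"politics"}, {"age", "job"}]
y_pred = [{"insult"}, {"no_hate"}, {"politics"}, {"age", "job"}]

acc = subset_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(f"subset accuracy = {acc:.4f}, micro F1 = {f1:.4f}")
```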