---
license: mit
datasets:
- TTA-DQA/hate_sentence
language:
- ko
metrics:
- accuracy
- f1
base_model:
- beomi/KcELECTRA-base-v2022
tags:
- Text-Classification
- Multi-Label-Classification
---
### 📌 Model Details
## 1. 🧾 Overview
This model is trained to **classify whether a Korean sentence contains harmful expressions and, if so, which type (category) of harmful expression it contains**.
It performs `multi-label classification`: it decides whether a harmful expression is present and, when one is, **determines (classifies)** its type.
As an AI task, it falls under `text-classification`.
The dataset used is [`TTA-DQA/hate_sentence`](https://huggingface.co/datasets/TTA-DQA/hate_sentence).
- **Classes**:
  - `"0"`: `insult`
  - `"1"`: `abuse`
  - `"2"`: `obscenity`
  - `"3"`: `TVPC(Threats of violence/promotion of crime)`
  - `"4"`: `sexuality`
  - `"5"`: `age`
  - `"6"`: `race and region`
  - `"7"`: `disabled`
  - `"8"`: `religion`
  - `"9"`: `politics`
  - `"10"`: `job`
  - `"11"`: `no_hate`
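
The class list above can be written down as a lookup table. The following is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `to_multi_hot` are hypothetical, and the multi-hot encoding is an assumption about how multi-label targets are typically represented, not something the dataset card is known to specify.

```python
# Class index -> label name, copied from the class list above.
ID2LABEL = {
    0: "insult",
    1: "abuse",
    2: "obscenity",
    3: "TVPC",  # threats of violence / promotion of crime
    4: "sexuality",
    5: "age",
    6: "race and region",
    7: "disabled",
    8: "religion",
    9: "politics",
    10: "job",
    11: "no_hate",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def to_multi_hot(label_names):
    """Encode a list of label names as a 12-dimensional 0/1 vector (assumed format)."""
    vector = [0] * len(ID2LABEL)
    for name in label_names:
        vector[LABEL2ID[name]] = 1
    return vector


# A sentence tagged with both "insult" and "age".
print(to_multi_hot(["insult", "age"]))
```

Because the task is multi-label, a single sentence may switch on more than one position in this vector.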
---
## 2. 🧠 Training Information
- **Base Model**: KcELECTRA (a pre-trained Korean language model based on ELECTRA)
- **Source**: [beomi/KcELECTRA-base-v2022](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Model Type**: ELECTRA discriminator (Transformer encoder), fine-tuned for sequence classification
- **Pre-training (Korean)**: approx. 17GB (over 180 million sentences)
- **Fine-tuning (Hate Dataset)**: approx. 22.3MB (`TTA-DQA/hate_sentence`)
- **Learning Rate**: `5e-6`
- **Weight Decay**: `0.01`
- **Epochs**: `30`
- **Batch Size**: `16`
- **Data Loader Workers**: `2`
- **Tokenizer**: `BertWordPieceTokenizer`
- **Model Size**: approx. `511MB`
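
For anyone reproducing the fine-tuning run, the hyperparameters listed above can be collected in one place. This is a sketch only: the card does not say which training loop was used, so mapping the values onto the keyword names of `transformers.TrainingArguments` is an assumption, as is treating the batch size as per-device.

```python
# Hyperparameters as listed above. The key names follow the keyword
# arguments of transformers.TrainingArguments; that pairing is an assumption.
hparams = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "num_train_epochs": 30,
    "per_device_train_batch_size": 16,  # assumed per-device; the card only says "Batch Size"
    "dataloader_num_workers": 2,
}

# If transformers is installed, these could be passed on like this:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **hparams)  # "out" is a placeholder path
for name, value in hparams.items():
    print(f"{name} = {value}")
```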
---
## 3. 🧩 Requirements
- `pytorch ~= 1.8.0`
- `transformers ~= 4.0.0`
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`
---
## 4. 🚀 Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
model_name = "TTA-DQA/HateDetection_MultiLabel_KcElectra_FineTuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# "What should I have for lunch today?" / "You bad guy."
sentences = ["오늘 점심 뭐 먹을까?", "이 나쁜 놈아."]
results = classifier(sentences)
print(results)
```
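
In a multi-label setting, each class is usually scored independently, so several labels (or none) can apply to one sentence. The sketch below shows one common way to turn raw logits into a label set: a per-class sigmoid followed by a threshold. The `decode` helper, the `0.5` threshold, and the example logits are all assumptions for illustration; they are not taken from this model's configuration or outputs.

```python
import math

# Label names in class-index order, as listed in the Overview.
LABELS = ["insult", "abuse", "obscenity", "TVPC", "sexuality", "age",
          "race and region", "disabled", "religion", "politics", "job", "no_hate"]


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Return (label, probability) for every class whose probability >= threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(LABELS[i], round(p, 3)) for i, p in enumerate(probs) if p >= threshold]


# Made-up logits for a single sentence, for illustration only.
example_logits = [2.0, 0.3, -3.0, -4.0, -2.5, -3.5, -4.0, -3.0, -4.5, -3.0, -2.0, -1.5]
print(decode(example_logits))
```

A lower threshold trades precision for recall; the right value depends on the application and would need to be tuned on held-out data.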
---
## 5. 📚 Citation
This model was built as part of the Hyperscale AI Training Data Quality Verification project (2024 Hyperscale AI Training Data Quality Verification).
---
## 6. ⚠️ Bias, Risks, and Limitations
The data for each class was not deliberately skewed during training,
but disagreements over labels may still arise from linguistic and cultural differences.
What counts as a harmful expression is partly subjective and depends on language, culture, application domain, and personal views,
so the model's results may be biased or contested.
> ❗ Please note that this model's results are not an absolute standard for harmful expressions.
---
# 📈 Results
- Task: multi-label classification (text-classification)
- F1-score: 0.8279
- Accuracy: 0.7013
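
The card does not state how F1 and accuracy were computed. As a hedged illustration, the sketch below assumes multi-hot label vectors, micro-averaged F1, and exact-match (subset) accuracy, and uses made-up three-class toy data. Under those assumptions it shows how accuracy can come out lower than F1: one missed label makes the whole sentence count as wrong for accuracy, while F1 still credits the labels that were found.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sentence, class) decisions."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t == 1 and p == 1
            fp += t == 0 and p == 1
            fn += t == 1 and p == 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match_accuracy(y_true, y_pred):
    """Fraction of sentences whose full label vector is predicted exactly."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy data (not from the model): 4 sentences, 3 classes.
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

print("micro F1:", micro_f1(y_true, y_pred))
print("exact-match accuracy:", exact_match_accuracy(y_true, y_pred))
```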