---
license: mit
datasets:
- TTA-DQA/hate_sentence
language:
- ko
metrics:
- accuracy
- f1
base_model:
- beomi/KcELECTRA-base-v2022
tags:
- Text-Classification
- Multi-Label-Classification
---
### Model Details
## 1. 🧾 Overview

This model is trained to **detect whether a Korean sentence contains harmful expressions and to classify the type (category) of those expressions**.
It performs `multi-label classification`: it decides whether a sentence contains a harmful expression and, if it does, **classifies** the type.
The corresponding AI task is `text-classification`.
The dataset used is [`TTA-DQA/hate_sentence`](https://huggingface.co/datasets/TTA-DQA/hate_sentence).

- **Classes**:
  - `"0"`: `insult`
  - `"1"`: `abuse`
  - `"2"`: `obscenity`
  - `"3"`: `TVPC(Threats of violence/promotion of crime)`
  - `"4"`: `sexuality`
  - `"5"`: `age`
  - `"6"`: `race and region`
  - `"7"`: `disabled`
  - `"8"`: `religion`
  - `"9"`: `politics`
  - `"10"`: `job`
  - `"11"`: `no_hate`
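
In a multi-label setup, one sentence can carry several of the classes above at once. The sketch below shows one common way to represent that as a multi-hot vector. It assumes a 12-dimensional 0/1 encoding in class-id order; the card does not state how the dataset actually stores its labels, and the helper names `to_multi_hot` and `from_multi_hot` are hypothetical.

```python
# Class ids and names as listed in the model card.
ID2LABEL = {
    0: "insult", 1: "abuse", 2: "obscenity",
    3: "TVPC",  # threats of violence / promotion of crime
    4: "sexuality", 5: "age", 6: "race and region", 7: "disabled",
    8: "religion", 9: "politics", 10: "job", 11: "no_hate",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def to_multi_hot(labels):
    """Encode a list of class names as a 12-dimensional 0/1 vector."""
    vector = [0] * len(ID2LABEL)
    for name in labels:
        vector[LABEL2ID[name]] = 1
    return vector


def from_multi_hot(vector):
    """Decode a 0/1 vector back into class names, in class-id order."""
    return [ID2LABEL[i] for i, flag in enumerate(vector) if flag]


encoded = to_multi_hot(["insult", "religion"])
print(encoded)                  # [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(from_multi_hot(encoded))  # ['insult', 'religion']
```
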
---
## 2. 🧠 Training Information

- **Base Model**: KrBERT-Medium (a pre-trained Korean language model based on BERT)
- **Source**: [snunlp/KR-Medium](https://huggingface.co/snunlp/KR-Medium)
- **Model Type**: Masked Language Model (BERT-style encoder)
- **Pre-training (Korean)**: approx. 12.37GB (91M sentences and 1.17B words)
- **Fine-tuning (Hate Dataset)**: approx. 22.3MB (`TTA-DQA/hate_sentence`)
- **Learning Rate**: `5e-6`
- **Weight Decay**: `0.01`
- **Epochs**: `50`
- **Batch Size**: `16`
- **Data Loader Workers**: `2`
- **Tokenizer**: `BertWordPieceTokenizer`
- **Model Size**: approx. `405MB`
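
The hyperparameters above could be written down roughly as follows. This is only a sketch: the card does not say which training loop was used, so mapping the values onto `transformers.TrainingArguments` is an assumption, and both `build_training_args` and the output directory are hypothetical.

```python
# Hyperparameter values copied from the list above.
hyperparams = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "num_train_epochs": 50,
    "per_device_train_batch_size": 16,
    "dataloader_num_workers": 2,
}


def build_training_args(output_dir="./krbert-hate-multilabel"):
    """Build TrainingArguments from the values above (assumes the Trainer API)."""
    # Imported lazily so the dictionary can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **hyperparams)
```
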

---

## 3. 🧩 Requirements

- `pytorch ~= 1.8.0`
- `transformers ~= 4.0.0`
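
The `~=` operator is the "compatible release" specifier: `~= 1.8.0` accepts `1.8.x` at or above `1.8.0` but not `1.9.0`. Note that PyTorch is published on PyPI under the name `torch`. The sketch below illustrates the rule for three-part versions only; `parse_version` and `is_compatible` are hypothetical helpers, and pre-release tags such as `rc1` are not handled.

```python
def parse_version(text):
    """Turn '1.8.1+cu111' into (1, 8, 1), ignoring any local build suffix."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split("."))


def is_compatible(installed, spec):
    """Check `installed ~= spec` for a three-part spec such as '1.8.0'."""
    have, want = parse_version(installed), parse_version(spec)
    # ~= X.Y.Z means: at least X.Y.Z, and the X.Y prefix must match.
    return have >= want and have[:len(want) - 1] == want[:-1]


print(is_compatible("1.8.1+cu111", "1.8.0"))  # True
print(is_compatible("1.9.0", "1.8.0"))        # False
print(is_compatible("4.0.1", "4.0.0"))        # True
```
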

---

## 4. Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "TTA-DQA/MultiLabel_KrBERT_Medium_Finetuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Korean inputs: "What should I have for lunch today?" and "You bastard."
sentences = ["오늘 점심 뭐 먹을까?", "이 나쁜 놈아."]
results = classifier(sentences)
print(results)
```
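
For multi-label output, a common approach is to apply a sigmoid to each class logit independently and keep every class whose probability reaches a threshold. The sketch below shows that step in plain Python. It assumes one logit per class id (for example taken from `model(**inputs).logits`) and a threshold of `0.5`; neither is confirmed by the card, and `select_labels` and the example logits are hypothetical.

```python
import math

# Class names in class-id order, as listed in the model card.
LABELS = [
    "insult", "abuse", "obscenity", "TVPC", "sexuality", "age",
    "race and region", "disabled", "religion", "politics", "job", "no_hate",
]


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, threshold=0.5):
    """Return (name, probability) pairs whose sigmoid score reaches the threshold."""
    scored = [(LABELS[i], sigmoid(z)) for i, z in enumerate(logits)]
    return [(name, prob) for name, prob in scored if prob >= threshold]


# Made-up logits for a single sentence, one value per class.
example_logits = [2.0, 0.5, -3.0] + [-4.0] * 9
for name, prob in select_labels(example_logits):
    print(f"{name}: {prob:.3f}")
```
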

---

## 5. Citation
This model was built as part of the Hyperscale AI Training Data Quality Verification Project (2024 Hyperscale AI Training Data Quality Verification).

---

## 6. ⚠️ Bias, Risks, and Limitations

This model was not trained on class-biased data, but linguistic and cultural characteristics may lead to disagreement over the labels.
What counts as a harmful expression is partly subjective and depends on language, culture, application domain, and personal views,
so the results may be biased or contested.

> Please note that this model's output is not an absolute standard for harmful expressions.

---

# Results
- Task: binary classification (text-classification)
- F1-score: 0.99005
- Accuracy: 0.99005
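
The card does not say how the F1-score was averaged. As an illustration only, the sketch below computes accuracy and micro-averaged F1 on made-up single-label predictions; under micro averaging with exactly one label per sample the two metrics coincide, which would be consistent with the identical figures above. The functions and data are hypothetical and do not reproduce the reported evaluation.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels before computing F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = harmful, 0 = not harmful.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]
print(accuracy(y_true, y_pred))
print(micro_f1(y_true, y_pred))
```
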