---
license: mit
datasets:
- TTA-DQA/hate_sentence
language:
- ko
metrics:
- accuracy
- f1
base_model:
- beomi/KcELECTRA-base-v2022
tags:
- Text-Classification
- Multi-Label-Classification
---

### 📌 Model Details

## 1. 🧾 Overview

This model is trained to **classify whether a Korean sentence contains harmful expressions and, if so, which type (category) of harmful expression it contains**.
It performs `multi-label classification`: it determines whether a sentence includes a harmful expression and, when it does, **classifies** its type.
As an AI task, it falls under `text-classification`.
The dataset used is [`TTA-DQA/hate_sentence`](https://huggingface.co/datasets/TTA-DQA/hate_sentence).

- **Classes**:
  - `"0"`: `insult`
  - `"1"`: `abuse`
  - `"2"`: `obscenity`
  - `"3"`: `TVPC(Threats of violence/promotion of crime)`
  - `"4"`: `sexuality`
  - `"5"`: `age`
  - `"6"`: `race and region`
  - `"7"`: `disabled`
  - `"8"`: `religion`
  - `"9"`: `politics`
  - `"10"`: `job`
  - `"11"`: `no_hate`

---

## 2. 🧠 Training Information

- **Base Model**: KcElectra (a pre-trained Korean language model based on Electra)
- **Source**: [beomi/KcELECTRA](https://huggingface.co/beomi/KcELECTRA-base-v2022)
- **Model Type**: Causal Language Model
- **Pre-training (Korean)**: approx. 17GB (over 180 million sentences)
- **Fine-tuning (Hate Dataset)**: approx. 22.3MB (`TTA-DQA/hate_sentence`)
- **Learning Rate**: `5e-6`
- **Weight Decay**: `0.01`
- **Epochs**: `30`
- **Batch Size**: `16`
- **Data Loader Workers**: `2`
- **Tokenizer**: `BertWordPieceTokenizer`
- **Model Size**: approx. `511MB`

---

## 3. 🧩 Requirements

- `pytorch ~= 1.8.0`
- `transformers ~= 4.0.0`
- `emoji ~= 0.6.0`
- `soynlp ~= 0.0.493`

---

## 4. 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "TTA-DQA/HateDetection_MultiLabel_KcElectra_FineTuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Korean inputs: "What should I have for lunch today?" and "You bad guy."
sentences = ["오늘 점심 뭐 먹을까?", "이 나쁜 놈아."]
results = classifier(sentences)
print(results)
```

---

## 5. 📚 Citation

This model was built under the Hyperscale AI Training Data Quality Verification project (2024 Hyperscale AI training data quality verification).

---

## 6. ⚠️ Bias, Risks, and Limitations

The model was not trained on data skewed toward any particular class, but linguistic and cultural factors mean that people may disagree about individual labels.
What counts as a harmful expression is partly subjective and depends on language, culture, application domain, and personal views, so the results may be biased or contested.

> ❗ Please note that this model's output is not an absolute standard for harmful expressions.

---

# 📈 Results

- Task: binary classification (text-classification)
- F1-score: 0.8279
- Accuracy: 0.7013
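
The Quick Start returns whatever the `text-classification` pipeline emits by default, while section 1 lists twelve class indices. The sketch below shows one way raw per-class scores could be turned into readable labels in a multi-label setting. It rests on assumptions that this model card does not confirm: that the model outputs one independent logit per class, that a sigmoid is the right activation, and that `0.5` is a sensible threshold. `ID2LABEL`, `decode_multilabel`, and `example_logits` are hypothetical names and values introduced only for illustration; the short `TVPC` label abbreviates the full name given in section 1.

```python
import math

# Index-to-label mapping copied from the class list in section 1.
ID2LABEL = {
    0: "insult",
    1: "abuse",
    2: "obscenity",
    3: "TVPC",
    4: "sexuality",
    5: "age",
    6: "race and region",
    7: "disabled",
    8: "religion",
    9: "politics",
    10: "job",
    11: "no_hate",
}


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold.

    Assumption: each logit is scored independently (multi-label),
    so zero, one, or several labels may be returned.
    """
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    scored = [(ID2LABEL[i], sigmoid(z)) for i, z in enumerate(logits)]
    return [(label, p) for label, p in scored if p >= threshold]


# Hypothetical logits, not real model output.
example_logits = [2.0, 1.0, -3.0, -4.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -2.0]
predicted = decode_multilabel(example_logits)
print([label for label, _ in predicted])
```

If the checkpoint turns out to be configured for single-label prediction instead, taking the highest-scoring class would be the appropriate decoding, and the threshold above would not apply.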