---
language: ko
license: apache-2.0
base_model: klue/bert-base
tags:
- klue-bert
- text-classification
- pytorch
- ko
- financial-domain
- text-difficulty
datasets:
- custom
metrics:
- f1
- accuracy
- mae
---
## Colab Notebook
[Open in Colab](https://colab.research.google.com/drive/112GWo0LrRls5B_uF6ghjZXzY6PxXzrnV?usp=sharing)
## Training Dataset
https://huggingface.co/combe4259/difficulty_klue/blob/main/training_data_difficulty_klue.json
# Financial Text Difficulty Classification Model

This model fine-tunes `klue/bert-base` to classify the difficulty of Korean financial sentences on a 10-level scale (1-10). It is designed to detect "difficult sentences" in real time and act as the trigger for an easy-sentence rewriting AI.
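The trigger role described above amounts to a threshold check on the predicted level. A minimal sketch; the cutoff value and function name are assumptions for illustration, not taken from the card:

```python
DIFFICULTY_THRESHOLD = 7  # hypothetical cutoff for "difficult"

def needs_simplification(difficulty: int, threshold: int = DIFFICULTY_THRESHOLD) -> bool:
    """Return True when a predicted difficulty should trigger the rewriting AI."""
    return difficulty >= threshold
```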
---
## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model from the Hugging Face Hub (or a saved local path)
MODEL_PATH = "combe4259/difficulty_klue"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Sentence to score
text = "신용파생결합증권의 CDS 스프레드 변동에 따른 수익구조"
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# The model predicts labels 0-9, so add 1 to map to the 1-10 scale
prediction = torch.argmax(logits, dim=-1).item()
difficulty = prediction + 1

print(f"Text: {text}")
print(f"Predicted difficulty: {difficulty}")
# Output: Predicted difficulty: 7
```
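The usage example returns only the argmax label. If the downstream trigger also needs a confidence score, the logits can be normalized with a softmax; a minimal sketch using hypothetical logits (not produced by the real model):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one sentence over the 10 difficulty classes
logits = torch.tensor([[0.1, 0.2, 0.1, 0.3, 0.5, 0.9, 1.2, 3.1, 0.4, 0.2]])

probs = F.softmax(logits, dim=-1)            # normalize to a probability distribution
confidence, prediction = probs.max(dim=-1)   # most likely class and its probability
difficulty = prediction.item() + 1           # shift 0-9 labels to the 1-10 scale
```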
## Training Data
- 2,880 self-collected financial sentences/paragraphs stored as JSON
- Split: Train (2,016) / Validation (432) / Test (432)
- Class imbalance: concentrated on difficulty 7 (28.2%) and 8 (18.6%); difficulty 10 has only a single sample
- Preprocessing: `klue/bert-base` tokenizer, padded and truncated to `max_length=512`
---
## Training Procedure
- Base model: `klue/bert-base` (`num_labels=10`)
- Optimizer: AdamW
- Loss function: weighted CrossEntropyLoss (per-class weights)
  - e.g. difficulty 10 (1 sample) → 10.0; difficulty 7 (568 samples) → 0.35
- Epochs: 10
- Batch size: 16
- Learning rate: 2e-5 (with 500 warmup steps)
- Best model: `metric_for_best_model='f1'` (the checkpoint with the highest F1 score is kept)
- Early stopping: patience=3 (training stops early if the F1 score does not improve for 3 consecutive evaluations)
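The weighted loss gives rare difficulty levels more influence during training. A minimal sketch of how inverse-frequency weights could be built and passed to PyTorch's `CrossEntropyLoss`; all per-class counts except difficulty 7 (568) and difficulty 10 (1) are made up, and the clipping at 10.0 is an assumption chosen to reproduce the weights stated above:

```python
import torch
import torch.nn as nn

# Hypothetical training-set counts for difficulties 1-10 (indices 0-9);
# only 568 (difficulty 7) and 1 (difficulty 10) are stated in the card.
counts = torch.tensor([120., 180., 200., 210., 220., 230., 568., 177., 110., 1.])

total, num_classes = counts.sum(), len(counts)
# Inverse-frequency weights, clipped so the singleton class does not dominate
weights = (total / (num_classes * counts)).clamp(max=10.0)

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(4, num_classes)   # a batch of 4 model outputs
labels = torch.tensor([6, 6, 9, 0])    # 0-indexed difficulty labels
loss = loss_fn(logits, labels)
```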
---
## Evaluation Results
Final performance on the test set (432 samples). The model is stable on both the F1 score and the ordinal metrics (MAE, Within-1 Accuracy).

| Metric | Score | Notes |
|-----------------------|-------|-------|
| F1 Score (weighted) | 0.607 | Primary metric; overall precision/recall |
| Accuracy | 0.604 | Probability of predicting the exact level (1 of 10) |
| MAE (mean absolute error) | 0.560 | Predictions deviate from the gold level by 0.56 on average |
| Within-1 Accuracy | 0.926 | Share of predictions within ±1 level (92.6%) |
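The two ordinal metrics can be computed directly from predicted and gold levels; a small sketch with made-up values on the 1-10 scale:

```python
import torch

preds = torch.tensor([7, 6, 8, 2, 1, 9])     # hypothetical predicted difficulties
labels = torch.tensor([7, 7, 6, 2, 1, 10])   # hypothetical gold difficulties

errors = (preds - labels).abs()
mae = errors.float().mean()                  # mean absolute error in levels
within_1 = (errors <= 1).float().mean()      # share of predictions within ±1 level
```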
---
## Sample Predictions

| Input Text | Predicted Difficulty (1-10) |
|------------|------------------------------|
| "은행에 돈을 맡겨요" (I deposit money at the bank) | 1 |
| "예금자보호법에 따라 5천만원까지 보호됩니다" (Protected up to 50 million won under the Depositor Protection Act) | 2 |
| "신용파생결합증권의 CDS 스프레드 변동에 따른 수익구조" (Payoff structure driven by CDS spread movements of credit derivative-linked securities) | 7 |