---
library_name: transformers
base_model:
- monologg/kobert
---
# KoBERT-Based Korean Emotion Classification Model

This project contains code for training and using a KoBERT-based model that **classifies the emotion of Korean text**. Given an input text, the model predicts which of five emotions it expresses: **Anger, Fear, Happy, Tender, or Sad**.

## 1. Model Training Process

### Colab Environment Setup and Data Preparation
1. **Install the required libraries**:
   Install the `transformers`, `datasets`, `torch`, `pandas`, and `scikit-learn` libraries.

2. **Load the data**:
   Load the emotion-classification CSV file derived from the Korean emotional dialogue data published on AI Hub.

3. **Prepare the dataset**:
   - **Train/validation split**: Use 80% of the data for training and 20% for validation.
   - **Convert to HuggingFace Dataset format**: Convert the Pandas DataFrame to a HuggingFace `Dataset`.
   - **Rename the label column**: Rename the `label_int` column, which holds the emotion labels, to `labels`.
   - **Tokenize the data**: Tokenize the input text with the `monologg/kobert` tokenizer.
   - **Set the format**: Keep only `input_ids`, `attention_mask`, and `labels` so the data is ready for training.

4. **Configure the model and training**:
   - **Model**: Load `monologg/kobert` and configure it to classify the 5 emotion labels.
   - **Training hyperparameters**:
     - `learning_rate=2e-5`, `num_train_epochs=10`, `batch_size=16`.
     - Save the best model based on F1 score.
     - Apply early stopping.
   
5. **Train and save the model**:
   - After training completes, save the model to Google Drive.
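
The dataset preparation in step 3 could look roughly like the sketch below. It is a minimal illustration, not the project's actual training script: the `text` column name, the random seed, `max_length=128`, and padding to a fixed length are all assumptions, and only the `label_int` to `labels` rename and the 80/20 split come from the steps above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_rename(df: pd.DataFrame, seed: int = 42):
    """Rename `label_int` to `labels`, then split 80/20 into train/validation."""
    # `seed` is an assumption; the README does not state one.
    df = df.rename(columns={"label_int": "labels"})
    train_df, val_df = train_test_split(df, test_size=0.2, random_state=seed)
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)


def to_hf_dataset(df: pd.DataFrame):
    """Convert a DataFrame to a HuggingFace `Dataset` (imported lazily)."""
    from datasets import Dataset

    return Dataset.from_pandas(df, preserve_index=False)


def tokenize_dataset(dataset, tokenizer, text_column: str = "text", max_length: int = 128):
    """Tokenize `text_column` and keep only the tensors needed for training."""
    # `text_column` and `max_length` are hypothetical defaults.

    def tokenize(batch):
        return tokenizer(
            batch[text_column],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )

    dataset = dataset.map(tokenize, batched=True)
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    return dataset
```

Each helper covers one bullet of step 3, so the split can be checked on a small DataFrame before any tokenizer or model is downloaded.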

### Performance Evaluation and Testing
- **Evaluation metrics**: Compute accuracy and F1 score (macro and weighted).
- **Test data evaluation**: Evaluate the trained model on the test dataset.
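
As a rough sketch of how these metrics might be computed, the function below takes a `(logits, labels)` pair and returns accuracy plus macro and weighted F1 using scikit-learn. The dictionary key names are illustrative assumptions, not necessarily those used in the original training code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    """Return accuracy, macro F1, and weighted F1 for a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }
```

A function of this shape can be passed to the `transformers` `Trainer` through its `compute_metrics` argument, so the same code can serve for both validation during training and the final test evaluation.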

## 2. How to Use the Model

### Prerequisites
- The trained model can be loaded from the HuggingFace Hub.
- The model and tokenizer are based on `monologg/kobert`, and the classification labels are as follows:
  - **Anger**: 😡
  - **Fear**: 😨
  - **Happy**: 😊
  - **Tender**: 🥰
  - **Sad**: 😢

### Usage Examples
1. **Emotion analysis of a single sentence**:
   - The model predicts the emotion of the text entered by the user and prints the probability of each emotion alongside the prediction.

2. **Emotion analysis from an Excel file**:
   - The specified text column and row range are read from an Excel file, the emotion of each text is classified, and the results are printed.
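
The two usage modes above rely on two small steps that can be sketched independently of the model: turning logits into per-emotion probabilities with a softmax, and selecting a row range from a text column. The helper names, the 0-based inclusive row convention, and the handling of empty cells are assumptions for illustration. The label order follows the index-to-label mapping used in this README.

```python
import pandas as pd
import torch

# Index order follows this README's label mapping (0 = Anger ... 4 = Sad).
EMOTIONS = ["Anger", "Fear", "Happy", "Tender", "Sad"]


def emotion_probabilities(logits: torch.Tensor) -> dict:
    """Map the logits for one text (shape [1, 5] or [5]) to {emotion: probability}."""
    probs = torch.softmax(logits.reshape(-1), dim=-1)
    return {name: float(p) for name, p in zip(EMOTIONS, probs)}


def select_texts(df: pd.DataFrame, column: str, start_row: int, end_row: int) -> list:
    """Return the texts in `column` for 0-based rows start_row..end_row (inclusive)."""
    rows = df.iloc[start_row:end_row + 1]
    # Empty cells are skipped rather than classified (an assumption).
    return rows[column].dropna().astype(str).tolist()
```

For the Excel case, the DataFrame would typically come from `pd.read_excel(...)` on the user's file, and each selected text would then be tokenized and passed through the model before calling `emotion_probabilities` on its logits.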

### Code Example
```python
# Load the tokenizer and model
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the KoBERT tokenizer and the fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained("rkdaldus/ko-sent5-classification")

# Analyze the emotion of a user-supplied text
text = "오늘 정말 행복해!"  # "I'm so happy today!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
predicted_label = torch.argmax(outputs.logits, dim=1).item()

# Map label indices to emotion names and emojis
emotion_labels = {
    0: ("Anger", "😡"),
    1: ("Fear", "😨"),
    2: ("Happy", "😊"),
    3: ("Tender", "🥰"),
    4: ("Sad", "😢")
}

# Print the predicted emotion
print(f"Predicted emotion: {emotion_labels[predicted_label][0]} {emotion_labels[predicted_label][1]}")
```