---
language: ko
license: other
base_model:
- snunlp/KR-FinBert-SC
- skt/kogpt2-base-v2
tags:
- encoder-decoder
- seq2seq
- text-simplification
- financial-domain
- ko
- pytorch
datasets:
- combe4259/fin_simplifier_dataset
---
# Financial Text Simplifier (fin_simplifier)

## Model Description

[Open in Colab](https://colab.research.google.com/drive/19Q7kUWtHX2shLx6iGGoT66wEidOrvLCf?usp=sharing)

**fin_simplifier** is an encoder-decoder model that rewrites complex financial terminology and sentences into plain Korean that general readers can easily understand.
### Model Architecture (from config.json)

- **Model type**: EncoderDecoderModel
- **Encoder**: snunlp/KR-FinBert-SC (hidden size: 768)
- **Decoder**: skt/kogpt2-base-v2 (vocabulary size: 51,201)
- **Parameter count**: ~255M
- **File size**: 1.02 GB (safetensors format)
### Key Features

- Converts specialized financial terminology into plain everyday language
- Optimized for Korean financial documents
- Simplifies complex financial concepts (PER, ROE, derivatives, etc.)
- Suitable for bank consultations and financial education
## Intended Use

### Primary Use Cases

1. **Consultation support**: improve customer comprehension during bank consultations
2. **Financial education**: explain complex financial concepts in simple terms
3. **Document simplification**: make terms and conditions, product descriptions, etc. easier to understand
4. **Accessibility**: improve access to financial services for underserved groups
### Out-of-Scope Uses

- Generating legally binding documents
- Replacing investment advice or professional financial consulting
- Use cases that require exact figures or calculations
## How to Use

### Loading the Model

```python
from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Load the model and both tokenizers (the encoder and decoder use different vocabularies)
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# KoGPT2 ships without a pad token; fall back to the EOS token
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
```
### Inference Example

```python
def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
    # Tokenize the input with the encoder's tokenizer
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True,
    )

    # Generate the simplified text (beam search combined with sampling)
    with torch.no_grad():
        generated = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=6,
            repetition_penalty=1.2,
            length_penalty=0.8,
            early_stopping=True,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
        )

    # Decode the output with the decoder's tokenizer
    simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
    return simplified

# Example usage. The input reads: "The price-earnings ratio (PER) is an
# indicator obtained by dividing the share price by earnings per share."
complex_text = "주가수익비율(PER)은 주가를 주당순이익으로 나눈 지표입니다."
simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
print(f"Original: {complex_text}")
print(f"Simplified: {simple_text}")  # the model's simplified version of the input
```
## Training Details

### Training Dataset

[fin_simplifier_dataset](https://huggingface.co/datasets/combe4259/fin_simplifier_dataset/tree/main)

A custom-built dataset:
- Source: NH Nonghyup Bank
- Construction: NH Nonghyup Bank product descriptions were fed into a Gemma model and converted into simplified versions
### Training Configuration (from trainer_state.json)

- **Epochs**: 10
- **Batch size**: 4 (gradient accumulation steps: 2)
- **Peak learning rate**: 2.99e-05
- **Final learning rate**: 8.82e-09
- **Optimizer**: AdamW (warmup steps: 200)
- **Label smoothing**: 0.1
- **Dropout**: 0.2 (encoder and decoder)
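As a sanity check, the reported setup is internally consistent with the 3,600 total training steps listed under the evaluation results. A quick arithmetic sketch; note that the implied dataset size is an inference, not a reported figure:

```python
# Cross-check of the reported training setup (pure arithmetic)
per_device_batch = 4
grad_accum = 2
effective_batch = per_device_batch * grad_accum   # 8 examples per optimizer step

total_steps = 3600
epochs = 10
steps_per_epoch = total_steps // epochs           # 360 optimizer steps per epoch

# Implied number of training examples, assuming no partial final batch
approx_examples = steps_per_epoch * effective_batch
print(effective_batch, steps_per_epoch, approx_examples)  # 8 360 2880
```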
### Generation Hyperparameters
- **Beam Search**: 6 beams
- **Repetition Penalty**: 1.2
- **Length Penalty**: 0.8
- **Temperature**: 0.7
- **Top-k**: 50
- **Top-p**: 0.95
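For reuse, these decoding settings can be collected in a single dict and unpacked into `model.generate` instead of being repeated at every call site. A sketch mirroring the inference example above (the `GEN_KWARGS` name is illustrative):

```python
# Decoding settings from the list above; use as model.generate(**inputs, **GEN_KWARGS)
GEN_KWARGS = {
    "max_length": 128,
    "num_beams": 6,
    "repetition_penalty": 1.2,
    "length_penalty": 0.8,
    "early_stopping": True,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.7,
}
```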
## Evaluation Results

### Training Metrics (from trainer_state.json)

- **Initial loss**: 13.53
- **Final loss**: 3.76
- **Loss reduction**: 72.2%
- **Total training steps**: 3,600
- **Convergence**: loss stabilizes from around epoch 8
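The 72.2% figure follows directly from the initial and final losses:

```python
# Reproduce the reported loss-reduction figure
initial_loss, final_loss = 13.53, 3.76
reduction = (initial_loss - final_loss) / initial_loss
print(f"{reduction:.1%}")  # 72.2%
```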
### Average Loss per Epoch

| Epoch | Average Loss |
|-------|--------------|
| 1 | 8.98 |
| 2 | 6.93 |
| 3 | 5.95 |
| 4 | 5.28 |
| 5 | 4.81 |
| 6 | 4.44 |
| 7 | 4.17 |
| 8 | 3.97 |
| 9 | 3.82 |
| 10 | 3.73 |
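The epoch-to-epoch improvement shrinks monotonically, which backs the convergence claim above. A quick check over the table values:

```python
# Per-epoch loss improvements computed from the table above
avg_loss = [8.98, 6.93, 5.95, 5.28, 4.81, 4.44, 4.17, 3.97, 3.82, 3.73]
deltas = [round(earlier - later, 2) for earlier, later in zip(avg_loss, avg_loss[1:])]
print(deltas)  # [2.05, 0.98, 0.67, 0.47, 0.37, 0.27, 0.2, 0.15, 0.09]
```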
### Example Outputs

The examples below are in Korean, the model's working language. For instance, the first row rewrites "market capitalization is the share price multiplied by the number of issued shares, representing a company's market value" as "market capitalization is the combined price of all of a company's shares."

| Original (Complex) | Simplified |
|--------------------|------------|
| 시가총액은 발행주식수에 주가를 곱한 값으로 기업의 시장가치를 나타냅니다. | 시가총액은 회사의 모든 주식을 합친 가격입니다. |
| 파생결합증권은 기초자산의 가격변동에 연계하여 수익이 결정되는 증권입니다. | 파생결합증권은 다른 상품 가격에 따라 수익이 바뀌는 투자 상품입니다. |
| 환매조건부채권(RP)은 일정 기간 후 다시 매입하는 조건으로 매도하는 채권입니다. | RP는 나중에 다시 사겠다고 약속하고 일단 파는 채권입니다. |
| 유동성위험은 자산을 적정 가격에 현금화하지 못할 위험입니다. | 유동성위험은 급하게 팔 때 제값을 못 받을 위험입니다. |
| 원리금균등상환은 매월 동일한 금액으로 원금과 이자를 상환하는 방식입니다. | 원리금균등상환은 매달 같은 금액을 갚는 방식입니다. |
## Citation
```bibtex
@misc{fin_simplifier2024,
title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
author={combe4259},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/combe4259/fin_simplifier}
}
```
## Acknowledgements

- **KR-FinBert-SC**: financial-domain Korean encoder
- **SKT KoGPT2**: Korean generative language model
## Contact

- **HuggingFace**: [combe4259](https://huggingface.co/combe4259)
- **Model Card**: for questions, please use the HuggingFace discussion tab
---