---
language: ko
license: other
base_model:
- snunlp/KR-FinBert-SC
- skt/kogpt2-base-v2
tags:
- encoder-decoder
- seq2seq
- text-simplification
- financial-domain
- ko
- pytorch
datasets:
- combe4259/fin_simplifier_dataset
---

# Financial Text Simplifier (fin_simplifier)

## Model Description

[Open in Colab](https://colab.research.google.com/drive/19Q7kUWtHX2shLx6iGGoT66wEidOrvLCf?usp=sharing)

**fin_simplifier** is an encoder-decoder model that rewrites complex Korean financial terms and sentences into plain Korean that a general audience can easily understand.

### Model Architecture (from config.json)

- **Model type**: EncoderDecoderModel
- **Encoder**: snunlp/KR-FinBert-SC (hidden size: 768)
- **Decoder**: skt/kogpt2-base-v2 (vocabulary size: 51,201)
- **Parameters**: ~255M
- **File size**: 1.02 GB (safetensors format)

### Key Features

- Converts specialized financial terminology into plain everyday language
- Optimized for Korean financial documents
- Simplifies complex financial concepts (PER, ROE, derivatives, etc.)
- Suitable for bank customer consultations and financial education

## Intended Use

### Primary Use Cases

1. **Consultation support**: improve customer comprehension during bank consultations
2. **Financial education**: explain complex financial concepts in simple terms
3. **Document simplification**: make terms and conditions, product descriptions, and similar documents easier to understand
4. **Accessibility**: improve access to financial services for financially underserved groups

### Out-of-Scope Uses

- Drafting legally binding documents
- Substituting for investment advice or professional financial consultation
- Tasks that require precise figures or calculations

## How to Use

### Loading the Model

```python
from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Load the model and the tokenizers for the encoder and decoder
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# KoGPT2 has no pad token by default; reuse the EOS token
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
```

### Inference Example

```python
def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
    # Tokenize the input with the encoder tokenizer
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True
    )

    # Generate the simplified text
    with torch.no_grad():
        generated = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=6,
            repetition_penalty=1.2,
            length_penalty=0.8,
            early_stopping=True,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

    # Decode the output with the decoder tokenizer
    simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
    return simplified


# Example usage
complex_text = "주가수익비율(PER)은 주가를 주당순이익으로 나눈 지표입니다."
simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
print(f"Original: {complex_text}")
print(f"Simplified: {simple_text}")
# Output: the simplified text generated by the model
```

## Training Details

### Training Dataset

[Dataset](https://huggingface.co/datasets/combe4259/fin_simplifier_dataset/tree/main)

A self-built dataset:

- Source: NH NongHyup Bank
- Generated by feeding NH NongHyup Bank product descriptions through a Gemma model for simplification

### Training Configuration (from trainer_state.json)

- **Epochs**: 10
- **Batch size**: 4 (gradient accumulation steps: 2)
- **Peak learning rate**: 2.99e-05
- **Final learning rate**: 8.82e-09
- **Optimizer**: AdamW (warmup steps: 200)
- **Label smoothing**: 0.1
- **Dropout**: 0.2 (encoder and decoder)

### Generation Hyperparameters

- **Beam search**: 6 beams
- **Repetition penalty**: 1.2
- **Length penalty**: 0.8
- **Temperature**: 0.7
- **Top-k**: 50
- **Top-p**: 0.95

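For convenience, these settings can be collected into a single keyword-argument dict and unpacked into `model.generate(**gen_kwargs)`. Note that `max_length=128` below comes from the inference example earlier in this card, not from the hyperparameter list:

```python
# Generation settings from the list above, as keyword arguments for
# model.generate(); max_length follows the inference example.
gen_kwargs = {
    "max_length": 128,
    "num_beams": 6,
    "repetition_penalty": 1.2,
    "length_penalty": 0.8,
    "early_stopping": True,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.7,
}
```

Usage: `model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], **gen_kwargs)`.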
## Evaluation Results

### Training Performance (from trainer_state.json)

- **Initial loss**: 13.53
- **Final loss**: 3.76
- **Loss reduction**: 72.2%
- **Total training steps**: 3,600
- **Convergence**: stable convergence from epoch 8 onward

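As a quick sanity check, the reported figures are mutually consistent. The dataset-size estimate at the end is an inference (it assumes every logged step is one optimizer step), not a number documented in this card:

```python
# Reproduce the reported loss reduction from the initial and final losses
initial_loss, final_loss = 13.53, 3.76
reduction_pct = round((initial_loss - final_loss) / initial_loss * 100, 1)
# reduction_pct == 72.2, matching the figure above

# Effective batch size and steps per epoch from the training configuration
per_device_batch, grad_accum = 4, 2
effective_batch = per_device_batch * grad_accum   # 8 samples per optimizer step
total_steps, epochs = 3600, 10
steps_per_epoch = total_steps // epochs           # 360
# Implied training-set size under the one-optimizer-step-per-logged-step assumption
approx_examples = steps_per_epoch * effective_batch
```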
### Average Loss per Epoch

| Epoch | Avg. Loss |
|-------|-----------|
| 1 | 8.98 |
| 2 | 6.93 |
| 3 | 5.95 |
| 4 | 5.28 |
| 5 | 4.81 |
| 6 | 4.44 |
| 7 | 4.17 |
| 8 | 3.97 |
| 9 | 3.82 |
| 10 | 3.73 |

### Example Outputs

The examples below are the model's Korean inputs and outputs, shown as-is.

| Original (complex) | Simplified |
|--------------------|------------|
| 시가총액은 발행주식수에 주가를 곱한 값으로 기업의 시장가치를 나타냅니다. | 시가총액은 회사의 모든 주식을 합친 가격입니다. |
| 파생결합증권은 기초자산의 가격변동에 연계하여 수익이 결정되는 증권입니다. | 파생결합증권은 다른 상품 가격에 따라 수익이 바뀌는 투자 상품입니다. |
| 환매조건부채권(RP)은 일정 기간 후 다시 매입하는 조건으로 매도하는 채권입니다. | RP는 나중에 다시 사겠다고 약속하고 일단 파는 채권입니다. |
| 유동성위험은 자산을 적정 가격에 현금화하지 못할 위험입니다. | 유동성위험은 급하게 팔 때 제값을 못 받을 위험입니다. |
| 원리금균등상환은 매월 동일한 금액으로 원금과 이자를 상환하는 방식입니다. | 원리금균등상환은 매달 같은 금액을 갚는 방식입니다. |

## Citation

```bibtex
@misc{fin_simplifier2024,
  title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
  author={combe4259},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/combe4259/fin_simplifier}
}
```

## Acknowledgements

- **KR-FinBert-SC**: finance-domain-specific encoder
- **SKT KoGPT2**: Korean language generation model

## Contact

- **HuggingFace**: [combe4259](https://huggingface.co/combe4259)
- **Model Card**: for questions, please use the HuggingFace Discussions tab

---