---
language: ko
license: other
base_model:
- snunlp/KR-FinBert-SC
- skt/kogpt2-base-v2
tags:
- encoder-decoder
- seq2seq
- text-simplification
- financial-domain
- ko
- pytorch
datasets:
- combe4259/fin_simplifier_dataset
---
# Financial Text Simplifier (fin_simplifier)
## Model Description
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Q7kUWtHX2shLx6iGGoT66wEidOrvLCf?usp=sharing)
**fin_simplifier** is an encoder-decoder model that rewrites complex financial terminology and sentences into plain Korean that the general public can easily understand.
### Model Architecture (from config.json)
- **Model type**: EncoderDecoderModel
- **Encoder**: snunlp/KR-FinBert-SC (hidden size: 768)
- **Decoder**: skt/kogpt2-base-v2 (vocabulary size: 51,201)
- **Parameters**: approx. 255M
- **File size**: 1.02 GB (safetensors format)
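For reference, the same encoder-decoder skeleton can be assembled from the two base checkpoints with `EncoderDecoderModel.from_encoder_decoder_pretrained`. This is only a minimal sketch for inspecting the architecture; it does not recover the fine-tuned weights.

```python
from transformers import EncoderDecoderModel

# Assemble the KR-FinBert encoder + KoGPT2 decoder skeleton.
# Note: cross-attention layers are freshly initialized here, so this is
# useful only for inspecting the architecture, not for inference.
skeleton = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "snunlp/KR-FinBert-SC",  # encoder
    "skt/kogpt2-base-v2",    # decoder
)
print(f"{sum(p.numel() for p in skeleton.parameters()) / 1e6:.0f}M parameters")
```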
### Key Features
- Converts specialized financial terms into plain everyday language
- Optimized for Korean financial documents
- Simplifies complex financial concepts (PER, ROE, derivatives, etc.)
- Applicable to bank customer consultations and financial education
## Intended Use
### Primary Use Cases
1. **Financial consultation support**: improve customer comprehension during bank consultations
2. **Financial education**: explain complex financial concepts in plain language
3. **Document simplification**: make terms and conditions, product descriptions, and similar documents easier to understand
4. **Accessibility**: improve access to financial services for financially underserved groups
### Limitations
- Drafting legally binding documents
- Substituting for investment advice or professional financial consulting
- Cases requiring exact figures or calculations
## How to Use
### Setup
```python
from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Load model and tokenizers
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# Set special tokens
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
```
### Inference Example
```python
def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
    # Tokenize input
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True
    )

    # Generate simplified text
    with torch.no_grad():
        generated = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=6,
            repetition_penalty=1.2,
            length_penalty=0.8,
            early_stopping=True,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

    # Decode output
    simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
    return simplified

# Example usage
complex_text = "μ£Όκ°€μˆ˜μ΅λΉ„μœ¨(PER)은 μ£Όκ°€λ₯Ό μ£Όλ‹Ήμˆœμ΄μ΅μœΌλ‘œ λ‚˜λˆˆ μ§€ν‘œμž…λ‹ˆλ‹€."
simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
print(f"Original: {complex_text}")
print(f"Simplified: {simple_text}")
# Example output: the simplified text generated by the model
```
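For longer documents it can be convenient to simplify sentences in batches. The sketch below builds on the objects loaded above; the batch size and device handling are assumptions rather than part of the original recipe.

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

def simplify_batch(sentences, batch_size=8):
    """Simplify a list of Korean sentences in mini-batches (illustrative helper)."""
    results = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        inputs = encoder_tokenizer(
            batch,
            return_tensors="pt",
            max_length=128,
            padding=True,
            truncation=True,
        ).to(device)
        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=128,
                num_beams=6,
            )
        results.extend(
            decoder_tokenizer.decode(o, skip_special_tokens=True) for o in outputs
        )
    return results
```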
## Training Details
### Training Dataset
[Dataset](https://huggingface.co/datasets/combe4259/fin_simplifier_dataset/tree/main)
A custom-built dataset:
- Source: NH NongHyup Bank
- Created by feeding NH NongHyup Bank product descriptions into a Gemma model to generate simplified versions
### Training Configuration (from trainer_state.json)
- **Epochs**: 10
- **Batch size**: 4 (gradient accumulation steps: 2)
- **Peak learning rate**: 2.99e-05
- **Final learning rate**: 8.82e-09
- **Optimizer**: AdamW (warmup steps: 200)
- **Label smoothing**: 0.1
- **Dropout**: 0.2 (encoder and decoder)
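The original training script is not published, but the settings above map naturally onto `Seq2SeqTrainingArguments`. The sketch below is a reconstruction under assumptions; the output path and logging options are placeholders.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the reported training settings.
training_args = Seq2SeqTrainingArguments(
    output_dir="./fin_simplifier",      # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,      # effective batch size 8
    learning_rate=3e-5,                 # peak LR reported as 2.99e-05
    warmup_steps=200,
    label_smoothing_factor=0.1,
    logging_steps=50,                   # placeholder
)
```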
### Generation Hyperparameters
- **Beam Search**: 6 beams
- **Repetition Penalty**: 1.2
- **Length Penalty**: 0.8
- **Temperature**: 0.7
- **Top-k**: 50
- **Top-p**: 0.95
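These values can be bundled into a `GenerationConfig` so they do not have to be repeated on every call. This is a minimal sketch; the repository may already ship an equivalent `generation_config.json`.

```python
from transformers import GenerationConfig

# Default decoding settings reported above, collected in one place.
gen_config = GenerationConfig(
    max_length=128,
    num_beams=6,
    repetition_penalty=1.2,
    length_penalty=0.8,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    early_stopping=True,
)
# simplified = model.generate(**inputs, generation_config=gen_config)
```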
## Evaluation Results
### Training Metrics (from trainer_state.json)
- **Initial loss**: 13.53
- **Final loss**: 3.76
- **Loss reduction**: 72.2%
- **Total training steps**: 3,600
- **Convergence**: stable convergence from epoch 8 onward
### Average Loss per Epoch
| Epoch | Average loss |
|-------|--------------|
| 1 | 8.98 |
| 2 | 6.93 |
| 3 | 5.95 |
| 4 | 5.28 |
| 5 | 4.81 |
| 6 | 4.44 |
| 7 | 4.17 |
| 8 | 3.97 |
| 9 | 3.82 |
| 10 | 3.73 |
### Example Outputs
| Original (complex) | Simplified output |
|--------------------|-------------------|
| μ‹œκ°€μ΄μ•‘μ€ λ°œν–‰μ£Όμ‹μˆ˜μ— μ£Όκ°€λ₯Ό κ³±ν•œ κ°’μœΌλ‘œ κΈ°μ—…μ˜ μ‹œμž₯κ°€μΉ˜λ₯Ό λ‚˜νƒ€λƒ…λ‹ˆλ‹€. | μ‹œκ°€μ΄μ•‘μ€ νšŒμ‚¬μ˜ λͺ¨λ“  주식을 ν•©μΉœ κ°€κ²©μž…λ‹ˆλ‹€. |
| νŒŒμƒκ²°ν•©μ¦κΆŒμ€ κΈ°μ΄ˆμžμ‚°μ˜ 가격변동에 μ—°κ³„ν•˜μ—¬ 수읡이 κ²°μ •λ˜λŠ” μ¦κΆŒμž…λ‹ˆλ‹€. | νŒŒμƒκ²°ν•©μ¦κΆŒμ€ λ‹€λ₯Έ μƒν’ˆ 가격에 따라 수읡이 λ°”λ€ŒλŠ” 투자 μƒν’ˆμž…λ‹ˆλ‹€. |
| ν™˜λ§€μ‘°κ±΄λΆ€μ±„κΆŒ(RP)은 일정기간 ν›„ λ‹€μ‹œ λ§€μž…ν•˜λŠ” 쑰건으둜 λ§€λ„ν•˜λŠ” μ±„κΆŒμž…λ‹ˆλ‹€. | RPλŠ” λ‚˜μ€‘μ— λ‹€μ‹œ 사겠닀고 μ•½μ†ν•˜κ³  일단 νŒŒλŠ” μ±„κΆŒμž…λ‹ˆλ‹€. |
| μœ λ™μ„±μœ„ν—˜μ€ μžμ‚°μ„ 적정가격에 ν˜„κΈˆν™”ν•˜μ§€ λͺ»ν•  μœ„ν—˜μž…λ‹ˆλ‹€. | μœ λ™μ„±μœ„ν—˜μ€ κΈ‰ν•˜κ²Œ νŒ” λ•Œ μ œκ°’μ„ λͺ» 받을 μœ„ν—˜μž…λ‹ˆλ‹€. |
| μ›λ¦¬κΈˆκ· λ“±μƒν™˜μ€ λ§€μ›” λ™μΌν•œ κΈˆμ•‘μœΌλ‘œ μ›κΈˆκ³Ό 이자λ₯Ό μƒν™˜ν•˜λŠ” λ°©μ‹μž…λ‹ˆλ‹€. | μ›λ¦¬κΈˆκ· λ“±μƒν™˜μ€ 맀달 같은 κΈˆμ•‘μ„ κ°šλŠ” λ°©μ‹μž…λ‹ˆλ‹€. |
## Citation
```bibtex
@misc{fin_simplifier2024,
  title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
  author={combe4259},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/combe4259/fin_simplifier}
}
```
## κ°μ‚¬μ˜ 말
- **KR-FinBert-SC**: 금육 도메인 νŠΉν™” 인코더 제곡
- **SKT KoGPT2**: ν•œκ΅­μ–΄ 생성 λͺ¨λΈ 제곡
## μ—°λ½μ²˜
- **HuggingFace**: [combe4259](https://huggingface.co/combe4259)
- **Model Card**: λ¬Έμ˜μ‚¬ν•­μ€ HuggingFace ν† λ‘  탭을 μ΄μš©ν•΄μ£Όμ„Έμš”
---