Upload fin_simplifier_README.md
fin_simplifier_README.md
ADDED
@@ -0,0 +1,190 @@
# Financial Text Simplifier (financial text simplification model)

## Model Description

[Open in Colab](https://colab.research.google.com/drive/19Q7kUWtHX2shLx6iGGoT66wEidOrvLCf?usp=sharing)

**fin_simplifier** is an encoder-decoder model that rewrites complex financial terms and sentences into Korean that non-specialists can easily understand.

### Architecture
- **Encoder**: KR-FinBert-SC (BERT specialized for the financial domain)
- **Decoder**: SKT KoGPT2-base-v2 (Korean text-generation model)
- **Model Type**: Seq2Seq (Encoder-Decoder)
- **Parameters**: 255M

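For readers who want to see how a pairing like this can be wired up, here is a minimal sketch using `EncoderDecoderModel.from_encoder_decoder_pretrained` from Transformers. This README does not say how the released checkpoint was actually assembled, so treat the approach as an assumption. `CHECKPOINTS`, `build_encoder_decoder`, and `count_parameters` are hypothetical names introduced only for illustration.

```python
# Checkpoint names taken from the loading code in this README.
CHECKPOINTS = {
    "encoder": "snunlp/KR-FinBert-SC",
    "decoder": "skt/kogpt2-base-v2",
}


def build_encoder_decoder(checkpoints=CHECKPOINTS):
    """Pair a BERT-style encoder with a GPT-2-style decoder (assumed assembly)."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import EncoderDecoderModel

    # The decoder checkpoint has no cross-attention weights, so those layers
    # are newly initialized here and need fine-tuning before the model is useful.
    return EncoderDecoderModel.from_encoder_decoder_pretrained(
        checkpoints["encoder"], checkpoints["decoder"]
    )


def count_parameters(model):
    """Total number of parameters in a PyTorch-style model."""
    return sum(p.numel() for p in model.parameters())
```

Calling `count_parameters(build_encoder_decoder())` downloads both checkpoints and yields a figure that can be compared with the 255M listed above.
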
### Key Features
- Converts financial jargon into plain everyday language
- Optimized for Korean financial documents
- Simplifies complex concepts such as PER, ROE, and derivatives
- Usable for bank consultations and financial education

## Intended Use

### Primary Use Cases
1. **Financial consultation support**: Improve customer understanding during bank consultations
2. **Financial education**: Explain complex financial concepts in simple terms
3. **Document simplification**: Rewrite terms and conditions, product disclosures, and similar documents so they are easier to understand
4. **Accessibility**: Improve access to financial services for financially underserved groups

### Out-of-Scope Use
- Drafting legally binding documents
- Replacing investment advice or financial consultation
- Cases that require exact figures or calculations

## How to Use

### Installation
```python
# Requires the transformers and torch packages (e.g. pip install transformers torch)
from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Load the model and the two tokenizers (encoder and decoder use different vocabularies)
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# Set special tokens: fall back to EOS if the decoder tokenizer has no pad token
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
```

| 51 |
+
### Inference Example
|
| 52 |
+
```python
|
| 53 |
+
def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
|
| 54 |
+
# Tokenize input
|
| 55 |
+
inputs = encoder_tokenizer(
|
| 56 |
+
text,
|
| 57 |
+
return_tensors="pt",
|
| 58 |
+
max_length=128,
|
| 59 |
+
padding="max_length",
|
| 60 |
+
truncation=True
|
| 61 |
+
)
|
| 62 |
+
|
| 63 |
+
# Generate simplified text
|
| 64 |
+
with torch.no_grad():
|
| 65 |
+
generated = model.generate(
|
| 66 |
+
input_ids=inputs["input_ids"],
|
| 67 |
+
attention_mask=inputs["attention_mask"],
|
| 68 |
+
max_length=128,
|
| 69 |
+
num_beams=6,
|
| 70 |
+
repetition_penalty=1.2,
|
| 71 |
+
length_penalty=0.8,
|
| 72 |
+
early_stopping=True,
|
| 73 |
+
do_sample=True,
|
| 74 |
+
top_k=50,
|
| 75 |
+
top_p=0.95,
|
| 76 |
+
temperature=0.7
|
| 77 |
+
)
|
| 78 |
+
|
| 79 |
+
# Decode output
|
| 80 |
+
simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
|
| 81 |
+
return simplified
|
| 82 |
+
|
| 83 |
+
# Example usage
|
| 84 |
+
complex_text = "์ฃผ๊ฐ์์ต๋น์จ(PER)์ ์ฃผ๊ฐ๋ฅผ ์ฃผ๋น์์ด์ต์ผ๋ก ๋๋ ์งํ์
๋๋ค."
|
| 85 |
+
simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
|
| 86 |
+
print(f"์๋ฌธ: {complex_text}")
|
| 87 |
+
print(f"๊ฐ์ํ: {simple_text}")
|
| 88 |
+
# Output: "๊ฐ์ํ: PER์ ์ฃผ์ ๊ฐ๊ฒฉ์ด ํ์ฌ ์ด์ต ๋๋น ๋น์ผ์ง ์ผ์ง ๋ณด๋ ์ซ์์
๋๋ค."
|
| 89 |
+
```
|
| 90 |
+
|
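The `generate()` call above enables `do_sample=True` together with beam search, so repeated runs on the same sentence can return different text. When repeatable output matters, for example in regression tests, one option is to drop the sampling arguments and keep plain beam search. The sketch below is an assumption about how that might look, not part of the original model card. `SAMPLING_PRESET` and `deterministic_preset` are hypothetical names.

```python
# Values copied from the generate() call in this README.
SAMPLING_PRESET = {
    "max_length": 128,
    "num_beams": 6,
    "repetition_penalty": 1.2,
    "length_penalty": 0.8,
    "early_stopping": True,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.7,
}


def deterministic_preset(base=SAMPLING_PRESET):
    """Hypothetical variant: same beam-search settings, sampling switched off."""
    sampling_only = {"do_sample", "top_k", "top_p", "temperature"}
    # Keep every argument that is not specific to sampling.
    preset = {key: value for key, value in base.items() if key not in sampling_only}
    preset["do_sample"] = False
    return preset


print(deterministic_preset())
```

The result can be passed as keyword arguments, e.g. `model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], **deterministic_preset())`. Whether plain beam search gives better simplifications than the sampled setting is not something this README evaluates.
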
## Training Details

### Training Data
- **Size**: About 100 financial-term pairs (complex explanation → simple explanation)
- **Domain**: Korean financial terms and concepts
- **Categories**:
  - Basic financial indicators (PER, ROE, ROA, etc.)
  - Investment products (ETF, ELS, derivatives, etc.)
  - Loan and deposit terms
  - Risk management terms
  - Tax-related terms

### Training Procedure
- **Epochs**: 10
- **Batch Size**: 4 (with gradient accumulation steps: 2)
- **Learning Rate**: 3e-5
- **Optimizer**: AdamW with warmup
- **Label Smoothing**: 0.1
- **Dropout**: 0.2 (encoder and decoder)

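To make the numbers above concrete, the following sketch works out the effective batch size and a rough count of optimizer updates, and shows what a label-smoothing value of 0.1 does to a one-hot target. It assumes exactly 100 training pairs with no held-out split (the README only says "about 100"), and it assumes a leftover partial accumulation window still triggers an update. Frameworks differ on that last point. `training_schedule` and `smoothed_targets` are hypothetical helper names.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum_steps, epochs):
    """Derive step counts from the listed training hyperparameters."""
    effective_batch = batch_size * grad_accum_steps
    micro_batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: a trailing partial accumulation window still causes an update.
    updates_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
    return {
        "effective_batch": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


def smoothed_targets(vocab_size, true_index, smoothing):
    """Uniform label smoothing: spread `smoothing` mass over the whole vocabulary."""
    targets = [smoothing / vocab_size] * vocab_size
    targets[true_index] += 1.0 - smoothing
    return targets


# 100 examples is an assumed round figure, not a number confirmed by the README.
schedule = training_schedule(num_examples=100, batch_size=4, grad_accum_steps=2, epochs=10)
print(schedule)

# Toy 5-token vocabulary, chosen only to keep the output readable.
print(smoothed_targets(vocab_size=5, true_index=2, smoothing=0.1))
```

Under these assumptions the effective batch size is 4 × 2 = 8, and the whole run amounts to roughly 130 optimizer updates. That small number is consistent with the limited generalization noted in the Limitations section.
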
### Hyperparameters for Generation
- **Beam Search**: 6 beams
- **Repetition Penalty**: 1.2
- **Length Penalty**: 0.8
- **Temperature**: 0.7
- **Top-k**: 50
- **Top-p**: 0.95

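The effect of each setting is easier to see on a toy example. The sketch below re-implements the ideas in plain Python on a made-up 4-token vocabulary: a repetition penalty applied to already-generated tokens, temperature scaling, top-k followed by top-p filtering, and the length-normalized beam score. It is a simplified illustration, not the Transformers implementation, and all function names and numbers are hypothetical.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make tokens that were already generated less likely."""
    out = list(logits)
    for token_id in set(generated_ids):
        # Positive scores are divided, negative scores multiplied, so both shrink.
        out[token_id] = out[token_id] / penalty if out[token_id] > 0 else out[token_id] * penalty
    return out


def softmax_with_temperature(logits, temperature):
    """Temperatures below 1 sharpen the distribution; above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p(probs, top_k, top_p):
    """Keep the top_k tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    top_total = sum(probs[i] for i in ranked)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i] / top_total
        if mass >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a distribution to sample from.
    return {i: probs[i] / kept_total for i in kept}


def beam_score(sum_log_probs, length, length_penalty):
    """Beam hypotheses are ranked by log-probability divided by length**penalty."""
    return sum_log_probs / (length ** length_penalty)


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores, not model output
penalized = apply_repetition_penalty(toy_logits, generated_ids=[0, 3], penalty=1.2)
probs = softmax_with_temperature(penalized, temperature=0.7)
print(top_k_top_p(probs, top_k=3, top_p=0.95))
print(beam_score(sum_log_probs=-6.0, length=10, length_penalty=0.8))
```

Because summed log-probabilities are negative, a length-penalty exponent below 1 normalizes less aggressively than 1.0 would, which leans the ranking toward shorter hypotheses.
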
## Evaluation

### Example Outputs

The model reads and writes Korean; the examples below are shown in English translation.

| Original (Complex) | Output (Simplified) |
|---------------|---------------------|
| Market capitalization is the number of shares outstanding multiplied by the share price, and represents a company's market value. | Market capitalization is the combined price of all of a company's shares. |
| Derivative-linked securities are securities whose returns are determined by price movements in an underlying asset. | Derivative-linked securities are investment products whose returns change with the price of another product. |
| A repurchase agreement (RP) is a bond sold on the condition that it will be bought back after a set period. | An RP is a bond you sell for now while promising to buy it back later. |
| Liquidity risk is the risk of being unable to convert an asset to cash at a fair price. | Liquidity risk is the risk of not getting full value when you have to sell in a hurry. |
| Equal principal-and-interest repayment is a method of repaying principal and interest in identical monthly amounts. | Equal principal-and-interest repayment means paying back the same amount every month. |

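Several of the example sentences describe simple formulas: market capitalization, PER, and equal monthly repayment. A short sketch of that arithmetic follows, for readers who want to check a simplification against the underlying definition. The function names and all input figures are hypothetical and chosen only for illustration. The repayment function uses the standard fixed-payment annuity formula with monthly compounding, which is an assumption about the loan terms.

```python
def market_cap(shares_outstanding, share_price):
    """Market capitalization = shares outstanding x share price."""
    return shares_outstanding * share_price


def per(share_price, earnings_per_share):
    """Price-earnings ratio = share price / earnings per share."""
    return share_price / earnings_per_share


def equal_monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment covering principal and interest (annuity formula)."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal / months
    growth = (1 + monthly_rate) ** months
    return principal * monthly_rate * growth / (growth - 1)


# Made-up figures in KRW, not taken from any real company or loan.
print(market_cap(shares_outstanding=1_000_000, share_price=50_000))
print(per(share_price=50_000, earnings_per_share=5_000))
print(round(equal_monthly_payment(principal=12_000_000, annual_rate=0.06, months=12)))
```

A low PER in this sketch simply means the price is small relative to earnings. As the Limitations section notes, the model itself should not be relied on for exact figures or calculations.
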
## Limitations and Biases

### Limitations
1. **Training data size**: Trained on about 100 examples, so generalization is limited
2. **Domain specialization**: Performance may drop on specialist terms outside finance
3. **Context understanding**: Accuracy may be lower on long sentences or complex contexts
4. **Numerical information**: Not suited to converting exact figures or calculations

### Potential Biases
- The training data is centered on the Korean financial market
- Simplifications take a general-consumer perspective, so expert-level accuracy is not guaranteed

## Ethical Considerations

### Responsible Use
- ✅ Use for financial education and to improve comprehension
- ✅ Use as an assistive tool alongside human experts
- ❌ Do not use to draft documents with legal effect
- ❌ Do not use as the sole basis for investment decisions

### Privacy
- The model does not contain personal information
- Input text is not stored

## Citation

```bibtex
@misc{fin_simplifier2024,
  title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
  author={combe4259},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/combe4259/fin_simplifier}
}
```

## Acknowledgments

- **KR-FinBert-SC**: Provided the finance-domain encoder
- **SKT KoGPT2**: Provided the Korean text-generation model
- **NH Bank Text-Gaze-Tracker Project**: Real-world use cases and feedback

## Contact

- **HuggingFace**: [combe4259](https://huggingface.co/combe4259)
- **Model Card**: For questions, please use the HuggingFace Discussions tab

## License

This model is provided for research and educational purposes. Commercial use requires a separate inquiry.

## Updates

- **2024.01**: Initial release (v1.0)
  - Trained on 100 financial terms
  - KR-FinBert + KoGPT2 architecture

---

**Note**: This model was developed as part of a research project to make financial information more accessible. For actual financial consultation or investment decisions, be sure to seek advice from a professional.