combe4259 committed
Commit e204a58 · verified · 1 Parent(s): 73caaa8

Upload fin_simplifier_README.md

  1. fin_simplifier_README.md +190 -0
fin_simplifier_README.md ADDED
@@ -0,0 +1,190 @@
# Financial Text Simplifier (Korean Financial Text Simplification Model)

## Model Description

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Q7kUWtHX2shLx6iGGoT66wEidOrvLCf?usp=sharing)

**fin_simplifier** is an Encoder-Decoder model that rewrites complex financial terms and sentences in Korean that is easy for a general audience to understand.

### Architecture
- **Encoder**: KR-FinBert-SC (BERT specialized for the financial domain)
- **Decoder**: SKT KoGPT2-base-v2 (Korean text generation model)
- **Model Type**: Seq2Seq (Encoder-Decoder)
- **Parameters**: 255M

### Key Features
- Converts financial jargon into plain everyday language
- Optimized for Korean financial documents
- Simplifies complex concepts such as PER, ROE, and derivatives
- Usable for bank consultations and financial education

## Intended Use

### Primary Use Cases
1. **Financial consultation support**: Improve customer understanding during bank consultations
2. **Financial education**: Explain complex financial concepts in simple terms
3. **Document simplification**: Rewrite terms and conditions, product descriptions, and similar documents so they are easier to understand
4. **Accessibility**: Improve access to financial services for financially underserved groups

### Out-of-Scope Use
- Drafting legally binding documents
- Replacing investment advice or financial consultation
- Cases that require exact figures or calculations

## How to Use

### Installation
```python
from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Load the encoder-decoder model and one tokenizer for each side
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# Set special tokens: fall back to EOS if the decoder tokenizer has no pad token
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
```
50
+
51
+ ### Inference Example
52
+ ```python
53
+ def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
54
+ # Tokenize input
55
+ inputs = encoder_tokenizer(
56
+ text,
57
+ return_tensors="pt",
58
+ max_length=128,
59
+ padding="max_length",
60
+ truncation=True
61
+ )
62
+
63
+ # Generate simplified text
64
+ with torch.no_grad():
65
+ generated = model.generate(
66
+ input_ids=inputs["input_ids"],
67
+ attention_mask=inputs["attention_mask"],
68
+ max_length=128,
69
+ num_beams=6,
70
+ repetition_penalty=1.2,
71
+ length_penalty=0.8,
72
+ early_stopping=True,
73
+ do_sample=True,
74
+ top_k=50,
75
+ top_p=0.95,
76
+ temperature=0.7
77
+ )
78
+
79
+ # Decode output
80
+ simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
81
+ return simplified
82
+
83
+ # Example usage
84
+ complex_text = "์ฃผ๊ฐ€์ˆ˜์ต๋น„์œจ(PER)์€ ์ฃผ๊ฐ€๋ฅผ ์ฃผ๋‹น์ˆœ์ด์ต์œผ๋กœ ๋‚˜๋ˆˆ ์ง€ํ‘œ์ž…๋‹ˆ๋‹ค."
85
+ simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
86
+ print(f"์›๋ฌธ: {complex_text}")
87
+ print(f"๊ฐ„์†Œํ™”: {simple_text}")
88
+ # Output: "๊ฐ„์†Œํ™”: PER์€ ์ฃผ์‹ ๊ฐ€๊ฒฉ์ด ํšŒ์‚ฌ ์ด์ต ๋Œ€๋น„ ๋น„์‹ผ์ง€ ์‹ผ์ง€ ๋ณด๋Š” ์ˆซ์ž์ž…๋‹ˆ๋‹ค."
89
+ ```
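
Because the encoder input is truncated at `max_length=128` tokens, longer passages may be easier to handle one sentence at a time. The following is a minimal sketch under that assumption. `split_sentences` and `simplify_document` are hypothetical helpers that are not part of this repository, and `simplify_fn` stands in for a call to `simplify_text`. The regular expression is a naive splitter and will also break on abbreviations that end with a period.

```python
import re
from typing import Callable, List

# Hypothetical helper: split on sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text: str) -> List[str]:
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

def simplify_document(text: str, simplify_fn: Callable[[str], str]) -> str:
    """Apply simplify_fn to each sentence and join the results with spaces."""
    return " ".join(simplify_fn(s) for s in split_sentences(text))

# str.upper is a stand-in for the real model call so the sketch runs without downloads.
print(simplify_document("First sentence. Second one! Third?", str.upper))
```

With the real model, the callable could be `lambda s: simplify_text(s, model, encoder_tokenizer, decoder_tokenizer)`.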

## Training Details

### Training Data
- **Size**: About 100 financial term pairs (complex explanation → simple explanation)
- **Domain**: Korean financial terms and concepts
- **Categories**:
  - Basic financial indicators (PER, ROE, ROA, etc.)
  - Investment products (ETF, ELS, derivatives, etc.)
  - Loan and deposit terms
  - Risk management terms
  - Tax-related terms
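
The training pairs themselves are not published in this card, so the following is only a sketch of how such complex → simple pairs could be stored as JSON Lines. The class name, the field names `source`, `target`, and `category`, and the category label are all assumptions. The sample record reuses a pair from this card's example outputs and is not confirmed to be actual training data.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class SimplificationPair:
    source: str    # original financial explanation (assumed field name)
    target: str    # plain-language rewrite (assumed field name)
    category: str  # e.g. "risk_management" (assumed label)

def to_jsonl(pairs: List[SimplificationPair]) -> str:
    """Serialize pairs as one JSON object per line, keeping Korean text readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)

def from_jsonl(text: str) -> List[SimplificationPair]:
    """Parse JSON Lines back into SimplificationPair objects, skipping blank lines."""
    return [SimplificationPair(**json.loads(line))
            for line in text.splitlines() if line.strip()]

pairs = [
    SimplificationPair(
        source="유동성위험은 자산을 적정가격에 현금화하지 못할 위험입니다.",
        target="유동성위험은 급하게 팔 때 제값을 못 받을 위험입니다.",
        category="risk_management",
    )
]
print(to_jsonl(pairs))
```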

### Training Procedure
- **Epochs**: 10
- **Batch Size**: 4 (with gradient accumulation steps: 2)
- **Learning Rate**: 3e-5
- **Optimizer**: AdamW with warmup
- **Label Smoothing**: 0.1
- **Dropout**: 0.2 (encoder and decoder)
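
As a rough worked example of what these settings imply, a batch size of 4 with 2 gradient accumulation steps gives an effective batch size of 8. The sketch below assumes exactly 100 training pairs, a single device, and that a trailing partial batch still triggers an optimizer update. The real step count depends on the actual dataset size and training loop, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> dict:
    """Estimate effective batch size and optimizer updates (partial batches count as a step)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

# 100 examples is an assumption based on the "about 100" figure in this card.
print(optimizer_steps(num_examples=100, batch_size=4, grad_accum=2, epochs=10))
```

Under these assumptions the run amounts to 13 updates per epoch and 130 updates in total, which is a very short schedule for a model of this size.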

### Hyperparameters for Generation
- **Beam Search**: 6 beams
- **Repetition Penalty**: 1.2
- **Length Penalty**: 0.8
- **Temperature**: 0.7
- **Top-k**: 50
- **Top-p**: 0.95
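
To make the sampling settings concrete, the sketch below shows in plain Python how temperature, top-k, and top-p are commonly applied to one step's next-token scores. It illustrates the general technique only and is not the `transformers` implementation. The function name and the toy logits are made up for the example, and beam search, repetition penalty, and length penalty are left out.

```python
import math
from typing import Dict

def filter_next_token_probs(logits: Dict[str, float], temperature: float,
                            top_k: int, top_p: float) -> Dict[str, float]:
    """Return renormalized probabilities after temperature, top-k, then top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Top-k: keep only the k highest-scoring tokens, sorted from best to worst.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    max_score = kept[0][1]
    exps = [(tok, math.exp(score - max_score)) for tok, score in kept]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in nucleus)
    return {tok: p / norm for tok, p in nucleus}

# Toy scores for four candidate tokens (made-up values).
toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
print(filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.95))
```

In this toy case top-k drops `d`, top-p then drops `c`, and sampling would choose between `a` and `b` only.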
118
+
119
+ ## Evaluation
120
+
121
+ ### Example Outputs
122
+
123
+ | ์›๋ฌธ (Complex) | ๋ณ€ํ™˜ ๊ฒฐ๊ณผ (Simplified) |
124
+ |---------------|---------------------|
125
+ | ์‹œ๊ฐ€์ด์•ก์€ ๋ฐœํ–‰์ฃผ์‹์ˆ˜์— ์ฃผ๊ฐ€๋ฅผ ๊ณฑํ•œ ๊ฐ’์œผ๋กœ ๊ธฐ์—…์˜ ์‹œ์žฅ๊ฐ€์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. | ์‹œ๊ฐ€์ด์•ก์€ ํšŒ์‚ฌ์˜ ๋ชจ๋“  ์ฃผ์‹์„ ํ•ฉ์นœ ๊ฐ€๊ฒฉ์ž…๋‹ˆ๋‹ค. |
126
+ | ํŒŒ์ƒ๊ฒฐํ•ฉ์ฆ๊ถŒ์€ ๊ธฐ์ดˆ์ž์‚ฐ์˜ ๊ฐ€๊ฒฉ๋ณ€๋™์— ์—ฐ๊ณ„ํ•˜์—ฌ ์ˆ˜์ต์ด ๊ฒฐ์ •๋˜๋Š” ์ฆ๊ถŒ์ž…๋‹ˆ๋‹ค. | ํŒŒ์ƒ๊ฒฐํ•ฉ์ฆ๊ถŒ์€ ๋‹ค๋ฅธ ์ƒํ’ˆ ๊ฐ€๊ฒฉ์— ๋”ฐ๋ผ ์ˆ˜์ต์ด ๋ฐ”๋€Œ๋Š” ํˆฌ์ž ์ƒํ’ˆ์ž…๋‹ˆ๋‹ค. |
127
+ | ํ™˜๋งค์กฐ๊ฑด๋ถ€์ฑ„๊ถŒ(RP)์€ ์ผ์ •๊ธฐ๊ฐ„ ํ›„ ๋‹ค์‹œ ๋งค์ž…ํ•˜๋Š” ์กฐ๊ฑด์œผ๋กœ ๋งค๋„ํ•˜๋Š” ์ฑ„๊ถŒ์ž…๋‹ˆ๋‹ค. | RP๋Š” ๋‚˜์ค‘์— ๋‹ค์‹œ ์‚ฌ๊ฒ ๋‹ค๊ณ  ์•ฝ์†ํ•˜๊ณ  ์ผ๋‹จ ํŒŒ๋Š” ์ฑ„๊ถŒ์ž…๋‹ˆ๋‹ค. |
128
+ | ์œ ๋™์„ฑ์œ„ํ—˜์€ ์ž์‚ฐ์„ ์ ์ •๊ฐ€๊ฒฉ์— ํ˜„๊ธˆํ™”ํ•˜์ง€ ๋ชปํ•  ์œ„ํ—˜์ž…๋‹ˆ๋‹ค. | ์œ ๋™์„ฑ์œ„ํ—˜์€ ๊ธ‰ํ•˜๊ฒŒ ํŒ” ๋•Œ ์ œ๊ฐ’์„ ๋ชป ๋ฐ›์„ ์œ„ํ—˜์ž…๋‹ˆ๋‹ค. |
129
+ | ์›๋ฆฌ๊ธˆ๊ท ๋“ฑ์ƒํ™˜์€ ๋งค์›” ๋™์ผํ•œ ๊ธˆ์•ก์œผ๋กœ ์›๊ธˆ๊ณผ ์ด์ž๋ฅผ ์ƒํ™˜ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. | ์›๋ฆฌ๊ธˆ๊ท ๋“ฑ์ƒํ™˜์€ ๋งค๋‹ฌ ๊ฐ™์€ ๊ธˆ์•ก์„ ๊ฐš๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. |

## Limitations and Biases

### Limitations
1. **Training data size**: The model was trained on about 100 examples, so its ability to generalize is limited
2. **Domain specialization**: Performance may drop on technical terms outside finance
3. **Context understanding**: Accuracy may be low for long sentences or complex context
4. **Numerical information**: Not suitable for converting exact figures or formulas

### Potential Biases
- The training data is centered on the Korean financial market
- Simplification targets a general consumer's perspective, so expert-level accuracy is not guaranteed

## Ethical Considerations

### Responsible Use
- ✅ Use for financial education and for improving understanding
- ✅ Use as an assistive tool alongside human experts
- ❌ Do not use to draft legally effective documents
- ❌ Do not use as the sole basis for investment decisions

### Privacy
- The model does not contain personal information
- Input text is not stored

## Citation

```bibtex
@misc{fin_simplifier2024,
  title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
  author={combe4259},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/combe4259/fin_simplifier}
}
```

## Acknowledgments

- **KR-FinBert-SC**: Provided the financial-domain encoder
- **SKT KoGPT2**: Provided the Korean text generation model
- **NH Bank Text-Gaze-Tracker Project**: Real-world use cases and feedback

## Contact

- **HuggingFace**: [combe4259](https://huggingface.co/combe4259)
- **Model Card**: For questions, please use the HuggingFace discussion tab

## License

This model is provided for research and educational purposes. Commercial use requires a separate inquiry.

## Updates

- **2024.01**: Initial release (v1.0)
  - Trained on 100 financial terms
  - KR-FinBert + KoGPT2 architecture

---

**Note**: This model was developed as part of a research project to make financial information more accessible. For actual financial consultation or investment decisions, be sure to seek advice from a professional.