---
language: ko
license: other
base_model:
- snunlp/KR-FinBert-SC
- skt/kogpt2-base-v2
tags:
- encoder-decoder
- seq2seq
- text-simplification
- financial-domain
- ko
- pytorch
datasets:
- combe4259/fin_simplifier_dataset
---
# Financial Text Simplifier

## Model Description

`fin_simplifier` is an encoder-decoder model that rewrites complex financial terminology and sentences into Korean that the general public can easily understand.
## Model Architecture (from config.json)

- Model type: EncoderDecoderModel
- Encoder: snunlp/KR-FinBert-SC (hidden size: 768)
- Decoder: skt/kogpt2-base-v2 (vocabulary size: 51,201)
- Parameters: ~255M
- File size: 1.02 GB (safetensors format)
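The reported parameter count and file size are mutually consistent, assuming the safetensors file stores float32 weights (4 bytes per parameter); a quick sanity check:

```python
# ~255M parameters at 4 bytes each (float32) should occupy about 1.02 GB
params = 255_000_000
size_gb = params * 4 / 1e9  # decimal gigabytes
print(f"{size_gb:.2f} GB")  # -> 1.02 GB
```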
## Key Features

- Converts specialized financial terminology into plain everyday language
- Optimized for Korean financial documents
- Simplifies complex financial concepts (PER, ROE, derivatives, etc.)
- Suitable for bank consultations and financial education
## Intended Use

### Primary Use Cases

- Financial consultation support: improve customer comprehension during bank consultations
- Financial education: explain complex financial concepts in simple terms
- Document simplification: rewrite terms and conditions, product manuals, and similar documents into easier language
- Accessibility: widen access to financial services for financially underserved groups

### Out-of-Scope Uses

- Drafting legally binding documents
- Substituting for investment advice or professional financial counseling
- Tasks that require exact figures or calculations
## How to Use

### Loading the Model

```python
from transformers import EncoderDecoderModel, AutoTokenizer
import torch

# Load the model and both tokenizers (the encoder and decoder use different vocabularies)
model = EncoderDecoderModel.from_pretrained("combe4259/fin_simplifier")
encoder_tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert-SC")
decoder_tokenizer = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")

# KoGPT2 ships without a pad token; reuse EOS for padding
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token
```
### Inference Example

```python
def simplify_text(text, model, encoder_tokenizer, decoder_tokenizer):
    # Tokenize the input with the encoder's tokenizer
    inputs = encoder_tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True,
    )

    # Generate the simplified text; num_beams > 1 combined with
    # do_sample=True selects transformers' beam-sample mode
    with torch.no_grad():
        generated = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=6,
            repetition_penalty=1.2,
            length_penalty=0.8,
            early_stopping=True,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
        )

    # Decode with the decoder's tokenizer
    simplified = decoder_tokenizer.decode(generated[0], skip_special_tokens=True)
    return simplified


# Example usage (the model expects Korean input; the sentence means
# "The price-earnings ratio (PER) is the share price divided by earnings per share.")
complex_text = "주가수익비율(PER)은 주가를 주당순이익으로 나눈 지표입니다."
simple_text = simplify_text(complex_text, model, encoder_tokenizer, decoder_tokenizer)
print(f"Original: {complex_text}")
print(f"Simplified: {simple_text}")  # prints the simplified text generated by the model
```
## Training Details

### Training Dataset

- Custom-built dataset (source: NH NongHyup Bank)
- Generated by feeding NH NongHyup Bank product manuals into a Gemma model to produce simplified counterparts
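The card does not document the dataset's on-disk layout; a common format for such complex→simple sentence pairs is JSONL, sketched below (the field names `complex` and `simple` and the sample pair are illustrative assumptions, not confirmed by the dataset):

```python
import json

# Hypothetical record layout: one complex/simple pair per JSONL line
record = {
    "complex": "주가수익비율(PER)은 주가를 주당순이익으로 나눈 지표입니다.",
    "simple": "PER은 주가를 회사가 번 돈으로 나눈 값입니다.",
}
line = json.dumps(record, ensure_ascii=False)

# A loader would parse each line back into a pair
pair = json.loads(line)
print(pair["complex"])
```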
### Training Configuration (from trainer_state.json)

- Epochs: 10
- Batch size: 4 (gradient accumulation steps: 2)
- Peak learning rate: 2.99e-05
- Final learning rate: 8.82e-09
- Optimizer: AdamW (warmup steps: 200)
- Label smoothing: 0.1
- Dropout: 0.2 (encoder and decoder)
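Per the values above, the effective batch size is 4 × 2 = 8, and the reported peak and final learning rates are consistent with linear warmup over 200 steps followed by linear decay over the remaining 3,400 of the 3,600 steps (the exact scheduler is an assumption; the card only records the rates). A minimal sketch:

```python
def lr_at(step, peak=2.99e-5, warmup=200, total=3600):
    """Linear warmup to `peak`, then linear decay toward zero."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)

effective_batch = 4 * 2  # per-device batch size x gradient accumulation steps
print(effective_batch)   # -> 8
print(lr_at(200))        # peak: 2.99e-05
print(lr_at(3599))       # last step before the end: ~8.8e-09, near the reported 8.82e-09
```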
### Generation Hyperparameters
- Beam Search: 6 beams
- Repetition Penalty: 1.2
- Length Penalty: 0.8
- Temperature: 0.7
- Top-k: 50
- Top-p: 0.95
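For reference, top-k and top-p act as filters on the next-token distribution before sampling (and since `num_beams` is 6 with `do_sample=True`, transformers runs in beam-sample mode). A pure-Python sketch of the filtering step, illustrative only and not the transformers implementation:

```python
def top_k_top_p_filter(probs, k=50, p=0.95):
    """Keep the k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches p; renormalize the survivors."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for idx, pr in ranked:
        kept.append((idx, pr))
        mass += pr
        if mass >= p:
            break
    return {idx: pr / mass for idx, pr in kept}

# Tokens 3 and 4 fall outside the 0.95 nucleus and are filtered out
dist = top_k_top_p_filter([0.5, 0.3, 0.15, 0.04, 0.01], k=50, p=0.95)
print(sorted(dist))  # -> [0, 1, 2]
```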
## Evaluation Results

### Training Metrics (from trainer_state.json)

- Initial loss: 13.53
- Final loss: 3.76
- Loss reduction: 72.2%
- Total training steps: 3,600
- Convergence: stable from epoch 8 onward
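The reported 72.2% figure follows directly from the initial and final losses:

```python
initial_loss, final_loss = 13.53, 3.76
reduction = (initial_loss - final_loss) / initial_loss
print(f"{reduction:.1%}")  # -> 72.2%
```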
### Average Loss per Epoch

| Epoch | Average Loss |
|---|---|
| 1 | 8.98 |
| 2 | 6.93 |
| 3 | 5.95 |
| 4 | 5.28 |
| 5 | 4.81 |
| 6 | 4.44 |
| 7 | 4.17 |
| 8 | 3.97 |
| 9 | 3.82 |
| 10 | 3.73 |
## Example Outputs

The pairs below are English translations of the model's Korean inputs and outputs.

| Original (complex) | Simplified |
|---|---|
| Market capitalization is the share price multiplied by the number of shares outstanding, and represents a company's market value. | Market capitalization is the combined price of all of a company's shares. |
| A derivative-linked security is a security whose return is determined by price movements in an underlying asset. | A derivative-linked security is an investment product whose return changes with the price of another product. |
| A repurchase agreement (RP) is a bond sold on the condition that it will be bought back after a set period. | An RP is a bond that is sold now with a promise to buy it back later. |
| Liquidity risk is the risk of being unable to convert an asset into cash at a fair price. | Liquidity risk is the risk of not getting a fair price when you have to sell in a hurry. |
| Equal principal-and-interest repayment is a method of repaying principal and interest in the same amount every month. | Equal principal-and-interest repayment means paying back the same amount every month. |
## Citation

```bibtex
@misc{fin_simplifier2024,
  title={Financial Text Simplifier: Korean Financial Terms Simplification Model},
  author={combe4259},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/combe4259/fin_simplifier}
}
```
## Acknowledgements

- KR-FinBert-SC: provides the finance-domain encoder
- SKT KoGPT2: provides the Korean generation model
## Contact

- HuggingFace: combe4259
- Model card: please use the HuggingFace discussion tab for questions