---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---
# LQ-FSE-base: Korean Financial Sentence Extractor

A model released by LangQuant that extracts representative sentences from financial reports and finance-related news, and classifies each sentence's role (outlook, event, financial, risk).
## Model Description

- Base Model: klue/roberta-base
- Architecture: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- Task: Extractive Summarization + Role Classification (Multi-task)
- Language: Korean
- Domain: Financial reports (securities research), financial news
Input Constraints
| Parameter |
Value |
Description |
| Max sentence length |
128 tokens |
λ¬Έμ₯λΉ μ΅λ ν ν° μ (μ΄κ³Ό μ truncation) |
| Max sentences per document |
30 |
λ¬ΈμλΉ μ΅λ λ¬Έμ₯ μ (μ΄κ³Ό μ μ 30κ°λ§ μ¬μ©) |
| Input format |
Plain text |
λ¬Έμ₯ λΆνΈ(.!?) κΈ°μ€μΌλ‘ μλ λΆλ¦¬ |
- Input: Korean financial text (securities reports, financial news, etc.)
- Output: per-sentence representative-sentence score (0–1) + role classification (outlook/event/financial/risk)
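The splitting and truncation rules above can be applied with a small preprocessing helper. A minimal sketch (the `split_sentences` name is illustrative, not part of the released API):

```python
import re

MAX_SENTENCES = 30  # document limit from the constraints table


def split_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    """Split plain text on sentence-final punctuation (. ! ?)
    and keep at most `max_sentences` sentences."""
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return parts[:max_sentences]
```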
## Performance

| Metric | Score |
|---|---|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |
## Role Labels

| Label | Description |
|---|---|
| outlook | Outlook/forecast sentences |
| event | Event/incident sentences |
| financial | Financial/earnings sentences |
| risk | Risk-factor sentences |
## Usage

```python
import re

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "LangQuant/LQ-FSE-base"
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

text = (
    "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
    "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
    "HBM3E 생산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
)

# Split into sentences on sentence-final punctuation.
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

# Truncate/pad the sentence list to the model's document size.
max_len, max_sent = config.max_length, config.max_sentences
padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
    padded.append("")

# Tokenize each sentence; empty padding slots get all-zero tensors.
ids_list, mask_list = [], []
for s in padded:
    if s:
        enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    else:
        enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
               "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
    ids_list.append(enc["input_ids"])
    mask_list.append(enc["attention_mask"])

input_ids = torch.cat(ids_list).unsqueeze(0)        # (1, max_sent, max_len)
attention_mask = torch.cat(mask_list).unsqueeze(0)  # (1, max_sent, max_len)

# Document-level mask: 1 for real sentences, 0 for padding slots.
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1

with torch.no_grad():
    scores, role_logits = model(input_ids, attention_mask, doc_mask)

role_labels = config.role_labels
for i, sent in enumerate(sentences):
    score = scores[0, i].item()
    role = role_labels[role_logits[0, i].argmax().item()]
    marker = "*" if score >= 0.5 else " "
    print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```
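After inference, a common follow-up step is to keep only the sentences that clear the 0.5 extraction threshold, together with their predicted roles. A minimal post-processing sketch (the helper name is illustrative):

```python
def extract_summary(sentences, scores, roles, threshold=0.5):
    """Pair each sentence with its role and score, keeping only
    those whose extraction score clears the threshold."""
    return [
        (sent, role, score)
        for sent, score, role in zip(sentences, scores, roles)
        if score >= threshold
    ]
```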
## Model Architecture

```
Input Sentences
      │
[klue/roberta-base] → [CLS] embeddings per sentence
      │
[Inter-sentence Transformer] (2 layers, 8 heads)
      │
┌───────────────────┬─────────────────────┐
│ Binary Classifier │  Role Classifier    │
│ (representative?) │  (outlook/event/    │
│                   │   financial/risk)   │
└───────────────────┴─────────────────────┘
```
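The head structure above can be sketched in PyTorch. This is an illustrative reimplementation, not the released code (which lives in `model.py`): the class and attribute names are hypothetical, and the hidden size of 768 is assumed from klue/roberta-base.

```python
import torch
import torch.nn as nn


class DualHeadDocumentEncoder(nn.Module):
    """Sketch: sentence [CLS] embeddings -> inter-sentence
    Transformer (2 layers, 8 heads) -> two classifier heads."""

    def __init__(self, hidden_size: int = 768, num_roles: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True
        )
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden_size, 1)       # representative?
        self.role_head = nn.Linear(hidden_size, num_roles)  # outlook/event/financial/risk

    def forward(self, sent_embs, doc_mask):
        # sent_embs: (batch, num_sentences, hidden) [CLS] vectors from RoBERTa
        h = self.inter_sentence(sent_embs, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(h)).squeeze(-1)  # (batch, num_sent)
        role_logits = self.role_head(h)                           # (batch, num_sent, num_roles)
        return scores, role_logits
```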
## Training

- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30
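The multi-task objective combines the two heads' losses with the 0.5 role weight. A minimal sketch under the assumption that padded sentence slots are masked out of both terms (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

ROLE_WEIGHT = 0.5  # from the training recipe above


def multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask):
    """BCE on extraction scores plus weighted cross-entropy on role
    logits, computed over real (non-padding) sentences only."""
    mask = doc_mask.bool()
    bce = F.binary_cross_entropy(scores[mask], extract_labels[mask].float())
    ce = F.cross_entropy(role_logits[mask], role_labels[mask])
    return bce + ROLE_WEIGHT * ce
```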
## Files

- `model.py`: Model definition (`DocumentEncoderConfig`, `DocumentEncoderForExtractiveSummarization`)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with a usage example
- `convert_checkpoint.py`: Script to convert the original `.pt` checkpoint
## Disclaimer

- This model is provided for research and informational purposes only.
- Its output does not constitute investment advice, financial consulting, or a buy/sell recommendation.
- LangQuant and the developers accept no legal liability for investment decisions made on the basis of the model's predictions.
- No warranty is made as to the model's accuracy, completeness, or timeliness; always seek professional advice before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.
## Usage Restrictions

- Prohibited:
  - Using this model for unlawful purposes such as market manipulation or generating false information
  - Using it as the sole decision-making mechanism of an automated trading system
  - Presenting model outputs to third parties as if they were professional financial advice
- Permitted:
  - Academic research and educational use
  - Use as an auxiliary tool in financial text analysis pipelines
  - Use as reference material for internal research and analysis work
- For commercial use, contacting LangQuant in advance is recommended.
## Contributors