---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---
# LQ-FSE-base: Korean Financial Sentence Extractor
Released by LangQuant, this model extracts representative sentences from financial reports and finance-related news and classifies each sentence's role (outlook, event, financial, risk).
## Model Description
- **Base Model**: klue/roberta-base
- **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- **Task**: Extractive Summarization + Role Classification (Multi-task)
- **Language**: Korean
- **Domain**: Financial reports (securities research reports), financial news
### Input Constraints
| Parameter | Value | Description |
|-----------|-------|-------------|
| Max sentence length | 128 tokens | Maximum tokens per sentence (truncated beyond this) |
| Max sentences per document | 30 | Maximum sentences per document (only the first 30 are used) |
| Input format | Plain text | Sentences are split automatically on `.!?` punctuation |
- **Input**: Korean financial text (securities research reports, financial news, etc.)
- **Output**: a representativeness score (0–1) and a role label (outlook/event/financial/risk) for each sentence
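The splitting and truncation rules above can be sketched in plain Python (the function name is illustrative; the regex mirrors the `.!?` rule used in the usage example):

```python
import re

MAX_SENTENCES = 30  # documents longer than this are truncated


def split_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    """Split on sentence-final punctuation (.!?) followed by whitespace,
    then keep at most `max_sentences` sentences."""
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return parts[:max_sentences]
```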
### Performance
| Metric | Score |
|--------|-------|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |
### Role Labels
| Label | Description |
|-------|-------------|
| `outlook` | Outlook/forecast sentences |
| `event` | Event/incident sentences |
| `financial` | Financials/earnings sentences |
| `risk` | Risk-factor sentences |
## Usage
```python
import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
repo_id = "LangQuant/LQ-FSE-base"
# Load the model and tokenizer
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()
# Input text
text = (
    "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
    "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
    "HBM3E 양산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
)
# Split into sentences and tokenize
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences
padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
padded.append("")
ids_list, mask_list = [], []
for s in padded:
if s:
enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
else:
enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
"attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
ids_list.append(enc["input_ids"])
mask_list.append(enc["attention_mask"])
input_ids = torch.cat(ids_list).unsqueeze(0)
attention_mask = torch.cat(mask_list).unsqueeze(0)
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1
# Inference
with torch.no_grad():
scores, role_logits = model(input_ids, attention_mask, doc_mask)
role_labels = config.role_labels
for i, sent in enumerate(sentences):
score = scores[0, i].item()
role = role_labels[role_logits[0, i].argmax().item()]
marker = "*" if score >= 0.5 else " "
print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```
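The loop above only prints a `*` marker for sentences scoring at or above 0.5. To actually form the extractive summary, a common follow-up step (not part of the repo's example; names are illustrative) is to keep those sentences in document order:

```python
def extract_summary(sentences: list[str], scores: list[float], threshold: float = 0.5) -> list[str]:
    """Keep the sentences whose representativeness score clears the threshold,
    preserving the original document order."""
    return [s for s, sc in zip(sentences, scores) if sc >= threshold]
```

Raising the threshold yields a shorter, higher-precision summary; lowering it trades precision for coverage.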
## Model Architecture
```
Input Sentences
      │
[klue/roberta-base] → [CLS] embeddings per sentence
      │
[Inter-sentence Transformer] (2 layers, 8 heads)
      │
┌───────────────────┬─────────────────────┐
│ Binary Classifier │ Role Classifier     │
│ (representative?) │ (outlook/event/     │
│                   │  financial/risk)    │
└───────────────────┴─────────────────────┘
```
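The layers above the sentence encoder can be sketched roughly as follows. This is an illustrative stand-in, not the repo's `model.py`: it takes precomputed per-sentence [CLS] embeddings instead of running the RoBERTa encoder, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn


class InterSentenceHeadsSketch(nn.Module):
    """Sketch of the layers above the sentence encoder: a 2-layer, 8-head
    inter-sentence Transformer followed by the two classification heads."""

    def __init__(self, hidden: int = 768, num_roles: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden, 1)       # representative? (binary)
        self.role_head = nn.Linear(hidden, num_roles)  # outlook/event/financial/risk

    def forward(self, sent_embeddings: torch.Tensor, doc_mask: torch.Tensor):
        # sent_embeddings: (batch, num_sentences, hidden) [CLS] vectors
        # doc_mask: (batch, num_sentences), 1 for real sentences, 0 for padding
        ctx = self.inter_sentence(sent_embeddings, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(ctx)).squeeze(-1) * doc_mask
        return scores, self.role_head(ctx)
```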
## Training
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30
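Under these settings, the combined objective can be sketched as follows (function and variable names are illustrative; the repo does not include the training script):

```python
import torch
import torch.nn.functional as F


def multitask_loss(scores, role_logits, extract_labels, role_labels, role_weight=0.5):
    """BCE on the extraction scores plus weighted cross-entropy on the role logits."""
    bce = F.binary_cross_entropy(scores, extract_labels.float())
    ce = F.cross_entropy(role_logits.view(-1, role_logits.size(-1)), role_labels.view(-1))
    return bce + role_weight * ce
```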
## Files
- `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with usage example
- `convert_checkpoint.py`: Script to convert original .pt checkpoint
## Disclaimer
- This model is provided **for research and informational purposes only**.
- Its outputs **do not constitute investment advice, financial consulting, or trading recommendations.**
- LangQuant and the developers **accept no legal responsibility** for investment decisions made on the basis of the model's predictions.
- No warranty is given as to the model's accuracy, stability, or fitness for purpose; always seek professional advice before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.
## Usage Restrictions
- **Prohibited:**
  - Using the model for illegal purposes such as market manipulation or generating false information
  - Using it as the sole decision-making component of an automated trading system
  - Presenting its outputs to third parties as if they were professional financial advice
- **Permitted:**
  - Academic research and educational use
  - Use as an auxiliary tool in financial text analysis pipelines
  - Use as reference material for internal research and analysis work
- For commercial use, prior inquiry to LangQuant is recommended.
## Contributors
- **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my), Ecole 42
- **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu), DSSAL