---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---

# LQ-FSE-base: Korean Financial Sentence Extractor

LangQuant(λž­ν€€νŠΈ)μ—μ„œ κ³΅κ°œν•œ 금육 리포트, 금육 κ΄€λ ¨ λ‰΄μŠ€μ—μ„œ λŒ€ν‘œλ¬Έμž₯을 μΆ”μΆœν•˜κ³  μ—­ν• (outlook, event, financial, risk)을 λΆ„λ₯˜ν•˜λŠ” λͺ¨λΈμž…λ‹ˆλ‹€.

## Model Description

- **Base Model**: klue/roberta-base
- **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- **Task**: Extractive Summarization + Role Classification (Multi-task)
- **Language**: Korean
- **Domain**: Financial reports (brokerage research reports), financial news

### Input Constraints

| Parameter | Value | Description |
|-----------|-------|-------------|
| Max sentence length | 128 tokens | Maximum tokens per sentence (longer sentences are truncated) |
| Max sentences per document | 30 | Maximum sentences per document (only the first 30 are used) |
| Input format | Plain text | Sentences are split automatically on `.!?` punctuation |

- **Input**: Korean financial text (brokerage reports, financial news, etc.)
- **Output**: a representative-sentence score (0–1) per sentence, plus a role classification (outlook/event/financial/risk)

### Performance

| Metric | Score |
|--------|-------|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |

### Role Labels

| Label | Description |
|-------|-------------|
| `outlook` | Forecast/outlook sentences |
| `event` | Event/incident sentences |
| `financial` | Financial results/earnings sentences |
| `risk` | Risk-factor sentences |

## Usage

```python
import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "LangQuant/LQ-FSE-base"

# λͺ¨λΈ λ‘œλ“œ
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

# μž…λ ₯ ν…μŠ€νŠΈ
text = (
    "μ‚Όμ„±μ „μžμ˜ 2024λ…„ 4λΆ„κΈ° 싀적이 μ‹œμž₯ μ˜ˆμƒμ„ μƒνšŒν–ˆλ‹€. "
    "λ©”λͺ¨λ¦¬ λ°˜λ„μ²΄ 가격 μƒμŠΉμœΌλ‘œ μ˜μ—…μ΄μ΅μ΄ μ „λΆ„κΈ° λŒ€λΉ„ 30% μ¦κ°€ν–ˆλ‹€. "
    "HBM3E 양산이 λ³Έκ²©ν™”λ˜λ©΄μ„œ AI λ°˜λ„μ²΄ μ‹œμž₯ 점유율이 ν™•λŒ€λ  전망이닀."
)

# Split into sentences and tokenize
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences

padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
    padded.append("")

ids_list, mask_list = [], []
for s in padded:
    if s:
        enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    else:
        enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
               "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
    ids_list.append(enc["input_ids"])
    mask_list.append(enc["attention_mask"])

input_ids = torch.cat(ids_list).unsqueeze(0)
attention_mask = torch.cat(mask_list).unsqueeze(0)
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1

# Inference
with torch.no_grad():
    scores, role_logits = model(input_ids, attention_mask, doc_mask)

role_labels = config.role_labels
for i, sent in enumerate(sentences):
    score = scores[0, i].item()
    role = role_labels[role_logits[0, i].argmax().item()]
    marker = "*" if score >= 0.5 else " "
    print(f"  {marker} [{score:.4f}] [{role:10s}] {sent}")
```
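To turn the per-sentence scores into a short extractive summary, one common option is to keep the top-k highest-scoring sentences and restore document order. A minimal sketch (the `scores` tensor and sentences below are dummy data standing in for actual model output):

```python
import torch

# Dummy per-sentence scores standing in for model output (batch of 1).
scores = torch.tensor([[0.91, 0.23, 0.78]])
sentences = [
    "Sentence A about earnings.",
    "Sentence B with background detail.",
    "Sentence C about the outlook.",
]

k = 2
topk = torch.topk(scores[0, :len(sentences)], k=min(k, len(sentences)))
# Sort the selected indices so the summary keeps the original document order.
summary = [sentences[i] for i in sorted(topk.indices.tolist())]
print(summary)
```

With the dummy scores above, this keeps sentences A and C. A fixed threshold (e.g. `score >= 0.5`, as in the example output loop) is an alternative when the number of representative sentences varies per document.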

## Model Architecture

```
Input Sentences
    ↓
[klue/roberta-base] β†’ [CLS] embeddings per sentence
    ↓
[Inter-sentence Transformer] (2 layers, 8 heads)
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Binary Classifierβ”‚  Role Classifier    β”‚
β”‚ (representative?)β”‚  (outlook/event/    β”‚
β”‚                  β”‚   financial/risk)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
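The document-level part of this architecture can be sketched as follows. This is an illustrative reimplementation, not the repository's `model.py`; the class name `InterSentenceHead` and its exact layer details are assumptions based on the diagram (2-layer, 8-head inter-sentence Transformer over per-sentence [CLS] embeddings, feeding two linear heads):

```python
import torch
import torch.nn as nn

class InterSentenceHead(nn.Module):
    """Hypothetical sketch: inter-sentence Transformer + dual classifiers."""

    def __init__(self, hidden=768, n_layers=2, n_heads=8, n_roles=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True
        )
        self.inter = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.extract = nn.Linear(hidden, 1)      # representative-sentence score
        self.role = nn.Linear(hidden, n_roles)   # outlook/event/financial/risk

    def forward(self, sent_emb, doc_mask):
        # sent_emb: (batch, max_sentences, hidden) per-sentence [CLS] vectors
        # doc_mask: (batch, max_sentences), 1 for real sentences, 0 for padding
        h = self.inter(sent_emb, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract(h)).squeeze(-1)
        return scores, self.role(h)

# Shape check with random embeddings standing in for RoBERTa [CLS] outputs.
head = InterSentenceHead()
emb = torch.randn(1, 30, 768)
mask = torch.ones(1, 30)
scores, role_logits = head(emb, mask)
```

The sigmoid on the binary head matches the 0–1 extraction scores described under Input Constraints; the role head returns raw logits over the four labels.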

## Training

- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30
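The combined training objective above can be sketched like this; the tensors are dummy values for illustration, and the exact reduction/masking of the real training code is an assumption:

```python
import torch
import torch.nn as nn

# Multi-task loss as described: BCE on extraction scores plus
# cross-entropy on role logits, with role_weight = 0.5.
bce = nn.BCELoss()
ce = nn.CrossEntropyLoss()
role_weight = 0.5

# Dummy predictions and targets for 5 sentences (not real model output).
scores = torch.sigmoid(torch.randn(5))           # extraction probabilities in (0, 1)
ext_labels = torch.tensor([1., 0., 1., 0., 0.])  # 1 = representative sentence
role_logits = torch.randn(5, 4)                  # 4 roles: outlook/event/financial/risk
role_labels = torch.tensor([0, 1, 2, 3, 0])

loss = bce(scores, ext_labels) + role_weight * ce(role_logits, role_labels)
```

In practice the real training loop would also mask out padded sentences before computing either term.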

## Files

- `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with usage example
- `convert_checkpoint.py`: Script to convert original .pt checkpoint

## Disclaimer

- This model is provided **for research and informational purposes only**.
- Its outputs are **not investment advice, financial advice, or trading recommendations.**
- LangQuant and the developers **accept no legal liability** for investment decisions made on the basis of the model's predictions.
- No guarantee is made as to the model's accuracy, completeness, or timeliness; always consult a professional before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.

## Usage Restrictions

- **Prohibited:**
  - Using the model for illegal purposes such as market manipulation or generating false information
  - Using it as the sole decision-making component of an automated trading system
  - Presenting model outputs to third parties as if they were professional financial advice
- **Permitted:**
  - Academic research and educational use
  - Use as an auxiliary tool in financial text analysis pipelines
  - Use as reference material for internal research and analysis work
- For commercial use, contacting LangQuant in advance is recommended.

## Contributors

- **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my) β€” Ecole 42
- **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu) β€” DSSAL