---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---
# LQ-FSE-base: Korean Financial Sentence Extractor
LangQuant(λž­ν€€νŠΈ)μ—μ„œ κ³΅κ°œν•œ 금육 리포트, 금육 κ΄€λ ¨ λ‰΄μŠ€μ—μ„œ λŒ€ν‘œλ¬Έμž₯을 μΆ”μΆœν•˜κ³  μ—­ν• (outlook, event, financial, risk)을 λΆ„λ₯˜ν•˜λŠ” λͺ¨λΈμž…λ‹ˆλ‹€.
## Model Description
- **Base Model**: klue/roberta-base
- **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- **Task**: Extractive Summarization + Role Classification (Multi-task)
- **Language**: Korean
- **Domain**: Financial Reports (증ꢌ 리포트), Financial News (금육 λ‰΄μŠ€)
### Input Constraints
| Parameter | Value | Description |
|-----------|-------|-------------|
| Max sentence length | 128 tokens | Maximum tokens per sentence (longer sentences are truncated) |
| Max sentences per document | 30 | Maximum sentences per document (only the first 30 are used) |
| Input format | Plain text | Automatically split into sentences on punctuation (`.!?`) |
- **Input**: Korean financial text (securities research reports, financial news, etc.)
- **Output**: a representative-sentence score (0–1) for each sentence + a role classification (outlook/event/financial/risk)
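As a minimal illustration of how an input document is segmented under these constraints, the helper below (its name is hypothetical, not part of the model's code) splits plain text on sentence punctuation and caps the result at 30 sentences, mirroring the table above:

```python
import re

MAX_SENTENCES = 30  # documents longer than this are truncated to the first 30 sentences

def split_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    # Split on sentence-ending punctuation (. ! ?) followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Keep only the first `max_sentences` sentences, per the document limit.
    return sentences[:max_sentences]

print(split_sentences("First. Second! Third?"))
```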
### Performance
| Metric | Score |
|--------|-------|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |
### Role Labels
| Label | Description |
|-------|-------------|
| `outlook` | Forecast/outlook sentences |
| `event` | Event/incident sentences |
| `financial` | Financials/earnings sentences |
| `risk` | Risk-factor sentences |
## Usage
```python
import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
repo_id = "LangQuant/LQ-FSE-base"
# λͺ¨λΈ λ‘œλ“œ
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()
# μž…λ ₯ ν…μŠ€νŠΈ
text = (
"μ‚Όμ„±μ „μžμ˜ 2024λ…„ 4λΆ„κΈ° 싀적이 μ‹œμž₯ μ˜ˆμƒμ„ μƒνšŒν–ˆλ‹€. "
"λ©”λͺ¨λ¦¬ λ°˜λ„μ²΄ 가격 μƒμŠΉμœΌλ‘œ μ˜μ—…μ΄μ΅μ΄ μ „λΆ„κΈ° λŒ€λΉ„ 30% μ¦κ°€ν–ˆλ‹€. "
"HBM3E 양산이 λ³Έκ²©ν™”λ˜λ©΄μ„œ AI λ°˜λ„μ²΄ μ‹œμž₯ 점유율이 ν™•λŒ€λ  전망이닀."
)
# Split into sentences and tokenize
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences
padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
padded.append("")
ids_list, mask_list = [], []
for s in padded:
if s:
enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
else:
enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
"attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
ids_list.append(enc["input_ids"])
mask_list.append(enc["attention_mask"])
input_ids = torch.cat(ids_list).unsqueeze(0)
attention_mask = torch.cat(mask_list).unsqueeze(0)
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1
# Inference
with torch.no_grad():
scores, role_logits = model(input_ids, attention_mask, doc_mask)
role_labels = config.role_labels
for i, sent in enumerate(sentences):
score = scores[0, i].item()
role = role_labels[role_logits[0, i].argmax().item()]
marker = "*" if score >= 0.5 else " "
print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```
## Model Architecture
```
Input Sentences
↓
[klue/roberta-base] β†’ [CLS] embeddings per sentence
↓
[Inter-sentence Transformer] (2 layers, 8 heads)
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Binary Classifierβ”‚ Role Classifier β”‚
β”‚ (representative?)β”‚ (outlook/event/ β”‚
β”‚ β”‚ financial/risk) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
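The diagram above can be sketched in PyTorch as follows. This is an illustrative re-creation under stated assumptions (hidden size 768, 2 layers, 8 heads per the diagram; class and attribute names are hypothetical), not the actual `model.py` implementation; the per-sentence `[CLS]` embeddings are assumed to come from klue/roberta-base upstream:

```python
import torch
import torch.nn as nn

class DocumentEncoderSketch(nn.Module):
    """Sketch of the inter-sentence encoder and dual classifier heads."""

    def __init__(self, hidden: int = 768, num_roles: int = 4):
        super().__init__()
        # Inter-sentence Transformer: 2 layers, 8 attention heads.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden, 1)       # representative-sentence score
        self.role_head = nn.Linear(hidden, num_roles)  # outlook/event/financial/risk

    def forward(self, cls_embeddings: torch.Tensor, doc_mask: torch.Tensor):
        # cls_embeddings: (batch, num_sentences, hidden) from the sentence encoder
        # doc_mask: (batch, num_sentences), 1 for real sentences, 0 for padding
        h = self.inter_sentence(cls_embeddings, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(h)).squeeze(-1)  # (batch, num_sentences)
        role_logits = self.role_head(h)                           # (batch, num_sentences, num_roles)
        return scores, role_logits

m = DocumentEncoderSketch()
x = torch.randn(1, 30, 768)      # 30 sentence embeddings
mask = torch.ones(1, 30)
scores, roles = m(x, mask)
print(scores.shape, roles.shape)
```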
## Training
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30
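The combined objective described above can be sketched as follows. The `role_weight=0.5` value comes from the training setup; the function name, masking scheme, and the use of `-100` to mark padded sentences are illustrative assumptions, not details taken from the training code:

```python
import torch
import torch.nn.functional as F

role_weight = 0.5  # weight on the role-classification loss, per the setup above

def multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask):
    # BCE over representative-sentence scores, averaged over real sentences only.
    bce = F.binary_cross_entropy(scores, extract_labels, reduction="none")
    bce = (bce * doc_mask).sum() / doc_mask.sum()
    # Cross-entropy over role logits; padded sentences carry label -100 and are ignored.
    ce = F.cross_entropy(role_logits.view(-1, role_logits.size(-1)),
                         role_labels.view(-1), ignore_index=-100)
    return bce + role_weight * ce

# Tiny worked example: one document with two sentences.
scores = torch.tensor([[0.7, 0.2]])
extract_labels = torch.tensor([[1.0, 0.0]])
role_logits = torch.randn(1, 2, 4)
role_labels = torch.tensor([[0, -100]])  # second sentence has no role label
doc_mask = torch.tensor([[1.0, 1.0]])
loss = multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask)
print(loss.item())
```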
## Files
- `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with usage example
- `convert_checkpoint.py`: Script to convert original .pt checkpoint
## Disclaimer
- This model is provided **for research and informational purposes only**.
- Its outputs are **not investment advice, financial advice, or trading recommendations.**
- LangQuant and the developers accept **no legal liability** for investment decisions based on the model's predictions.
- No guarantee is made as to the model's accuracy, completeness, or timeliness; always consult a professional before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.
## Usage Restrictions
- **Prohibited:**
  - Use for illegal purposes, such as market manipulation or generating false information
  - Use as the sole decision-making component of an automated trading system
  - Presenting model outputs to third parties as if they were professional financial advice
- **Permitted:**
  - Academic research and educational use
  - Use as an auxiliary tool in financial text analysis pipelines
  - Use as reference material for in-house research and analysis work
- For commercial use, contacting LangQuant in advance is recommended.
## Contributors
- **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my) β€” Ecole 42
- **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu) β€” DSSAL