---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---
# LQ-FSE-base: Korean Financial Sentence Extractor
LangQuant(λž­ν€€νŠΈ)μ—μ„œ κ³΅κ°œν•œ 금육 리포트, 금육 κ΄€λ ¨ λ‰΄μŠ€μ—μ„œ λŒ€ν‘œλ¬Έμž₯을 μΆ”μΆœν•˜κ³  μ—­ν• (outlook, event, financial, risk)을 λΆ„λ₯˜ν•˜λŠ” λͺ¨λΈμž…λ‹ˆλ‹€.
## Model Description
- **Base Model**: klue/roberta-base
- **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- **Task**: Extractive Summarization + Role Classification (Multi-task)
- **Language**: Korean
- **Domain**: Financial Reports (증ꢌ 리포트), Financial News (금육 λ‰΄μŠ€)
### Input Constraints
| Parameter | Value | Description |
|-----------|-------|-------------|
| Max sentence length | 128 tokens | Maximum tokens per sentence (longer sentences are truncated) |
| Max sentences per document | 30 | Maximum sentences per document (only the first 30 are used) |
| Input format | Plain text | Automatically split into sentences on punctuation (`.!?`) |
- **Input**: Korean financial text (securities research reports, financial news, etc.)
- **Output**: a representative-sentence score (0–1) for each sentence + a role classification (outlook/event/financial/risk)
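As a minimal illustration of how an input document is segmented under these constraints, the helper below (its name is hypothetical, not part of the model's code) splits plain text on sentence punctuation and caps the result at 30 sentences, mirroring the table above:

```python
import re

MAX_SENTENCES = 30  # documents longer than this are truncated to the first 30 sentences

def split_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    # Split on sentence-ending punctuation (. ! ?) followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Keep only the first `max_sentences` sentences, per the document limit.
    return sentences[:max_sentences]

print(split_sentences("First. Second! Third?"))
```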
### Performance
| Metric | Score |
|--------|-------|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |
### Role Labels
| Label | Description |
|-------|-------------|
| `outlook` | Forecast/outlook sentences |
| `event` | Event/incident sentences |
| `financial` | Financials/earnings sentences |
| `risk` | Risk-factor sentences |
## Usage
```python
import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
repo_id = "LangQuant/LQ-FSE-base"
# λͺ¨λΈ λ‘œλ“œ
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()
# μž…λ ₯ ν…μŠ€νŠΈ
text = (
"μ‚Όμ„±μ „μžμ˜ 2024λ…„ 4λΆ„κΈ° 싀적이 μ‹œμž₯ μ˜ˆμƒμ„ μƒνšŒν–ˆλ‹€. "
"λ©”λͺ¨λ¦¬ λ°˜λ„μ²΄ 가격 μƒμŠΉμœΌλ‘œ μ˜μ—…μ΄μ΅μ΄ μ „λΆ„κΈ° λŒ€λΉ„ 30% μ¦κ°€ν–ˆλ‹€. "
"HBM3E 양산이 λ³Έκ²©ν™”λ˜λ©΄μ„œ AI λ°˜λ„μ²΄ μ‹œμž₯ 점유율이 ν™•λŒ€λ  전망이닀."
)
# Split into sentences and tokenize
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences
padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
padded.append("")
ids_list, mask_list = [], []
for s in padded:
if s:
enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
else:
enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
"attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
ids_list.append(enc["input_ids"])
mask_list.append(enc["attention_mask"])
input_ids = torch.cat(ids_list).unsqueeze(0)
attention_mask = torch.cat(mask_list).unsqueeze(0)
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1
# Inference
with torch.no_grad():
scores, role_logits = model(input_ids, attention_mask, doc_mask)
role_labels = config.role_labels
for i, sent in enumerate(sentences):
score = scores[0, i].item()
role = role_labels[role_logits[0, i].argmax().item()]
marker = "*" if score >= 0.5 else " "
print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```
## Model Architecture
```
Input Sentences
↓
[klue/roberta-base] β†’ [CLS] embeddings per sentence
↓
[Inter-sentence Transformer] (2 layers, 8 heads)
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Binary Classifierβ”‚ Role Classifier β”‚
β”‚ (representative?)β”‚ (outlook/event/ β”‚
β”‚ β”‚ financial/risk) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
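The diagram above can be sketched in PyTorch as follows. This is an illustrative re-creation under stated assumptions (hidden size 768, 2 layers, 8 heads per the diagram; class and attribute names are hypothetical), not the actual `model.py` implementation; the per-sentence `[CLS]` embeddings are assumed to come from klue/roberta-base upstream:

```python
import torch
import torch.nn as nn

class DocumentEncoderSketch(nn.Module):
    """Sketch of the inter-sentence encoder and dual classifier heads."""

    def __init__(self, hidden: int = 768, num_roles: int = 4):
        super().__init__()
        # Inter-sentence Transformer: 2 layers, 8 attention heads.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden, 1)       # representative-sentence score
        self.role_head = nn.Linear(hidden, num_roles)  # outlook/event/financial/risk

    def forward(self, cls_embeddings: torch.Tensor, doc_mask: torch.Tensor):
        # cls_embeddings: (batch, num_sentences, hidden) from the sentence encoder
        # doc_mask: (batch, num_sentences), 1 for real sentences, 0 for padding
        h = self.inter_sentence(cls_embeddings, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(h)).squeeze(-1)  # (batch, num_sentences)
        role_logits = self.role_head(h)                           # (batch, num_sentences, num_roles)
        return scores, role_logits

m = DocumentEncoderSketch()
x = torch.randn(1, 30, 768)      # 30 sentence embeddings
mask = torch.ones(1, 30)
scores, roles = m(x, mask)
print(scores.shape, roles.shape)
```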
## Training
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30
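The combined objective described above can be sketched as follows. The `role_weight=0.5` value comes from the training setup; the function name, masking scheme, and the use of `-100` to mark padded sentences are illustrative assumptions, not details taken from the training code:

```python
import torch
import torch.nn.functional as F

role_weight = 0.5  # weight on the role-classification loss, per the setup above

def multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask):
    # BCE over representative-sentence scores, averaged over real sentences only.
    bce = F.binary_cross_entropy(scores, extract_labels, reduction="none")
    bce = (bce * doc_mask).sum() / doc_mask.sum()
    # Cross-entropy over role logits; padded sentences carry label -100 and are ignored.
    ce = F.cross_entropy(role_logits.view(-1, role_logits.size(-1)),
                         role_labels.view(-1), ignore_index=-100)
    return bce + role_weight * ce

# Tiny worked example: one document with two sentences.
scores = torch.tensor([[0.7, 0.2]])
extract_labels = torch.tensor([[1.0, 0.0]])
role_logits = torch.randn(1, 2, 4)
role_labels = torch.tensor([[0, -100]])  # second sentence has no role label
doc_mask = torch.tensor([[1.0, 1.0]])
loss = multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask)
print(loss.item())
```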
## Files
- `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with usage example
- `convert_checkpoint.py`: Script to convert original .pt checkpoint
## Disclaimer
- This model is provided **for research and informational purposes only**.
- Its outputs are **not investment advice, financial advice, or trading recommendations.**
- LangQuant and the developers accept **no legal liability** for investment decisions based on the model's predictions.
- No guarantee is made as to the model's accuracy, completeness, or timeliness; always consult a professional before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.
## Usage Restrictions
- **Prohibited:**
  - Use for illegal purposes, such as market manipulation or generating false information
  - Use as the sole decision-making component of an automated trading system
  - Presenting model outputs to third parties as if they were professional financial advice
- **Permitted:**
  - Academic research and educational use
  - Use as an auxiliary tool in financial text analysis pipelines
  - Use as reference material for in-house research and analysis work
- For commercial use, contacting LangQuant in advance is recommended.
## Contributors
- **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my) β€” Ecole 42
- **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu) β€” DSSAL