---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---
# LQ-FSE-base: Korean Financial Sentence Extractor

A model released by LangQuant that extracts representative sentences from financial reports and finance-related news, and classifies each sentence's role (outlook, event, financial, risk).
## Model Description

- Base Model: klue/roberta-base
- Architecture: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- Task: Extractive Summarization + Role Classification (Multi-task)
- Language: Korean
- Domain: Financial reports (securities research), financial news
Input Constraints
| Parameter |
Value |
Description |
| Max sentence length |
128 tokens |
λ¬Έμ₯λΉ μ΅λ ν ν° μ (μ΄κ³Ό μ truncation) |
| Max sentences per document |
30 |
λ¬ΈμλΉ μ΅λ λ¬Έμ₯ μ (μ΄κ³Ό μ μ 30κ°λ§ μ¬μ©) |
| Input format |
Plain text |
λ¬Έμ₯ λΆνΈ(.!?) κΈ°μ€μΌλ‘ μλ λΆλ¦¬ |
- Input: Korean financial text (securities reports, financial news, etc.)
- Output: per-sentence representative-sentence score (0–1) + role classification (outlook/event/financial/risk)
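The splitting and truncation rules above can be applied with a small preprocessing helper. A minimal sketch (the `split_sentences` name is illustrative, not part of the released API):

```python
import re

MAX_SENTENCES = 30  # document limit from the constraints table


def split_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    """Split plain text on sentence-final punctuation (. ! ?)
    and keep at most `max_sentences` sentences."""
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return parts[:max_sentences]
```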
## Performance

| Metric | Score |
|---|---|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |
## Role Labels

| Label | Description |
|---|---|
| outlook | Outlook/forecast sentences |
| event | Event/incident sentences |
| financial | Financial/earnings sentences |
| risk | Risk-factor sentences |
## Usage

```python
import re

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "LangQuant/LQ-FSE-base"
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

text = (
    "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
    "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
    "HBM3E 생산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
)

# Split into sentences on sentence-final punctuation.
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

# Truncate/pad the sentence list to the model's document size.
max_len, max_sent = config.max_length, config.max_sentences
padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
    padded.append("")

# Tokenize each sentence; empty padding slots get all-zero tensors.
ids_list, mask_list = [], []
for s in padded:
    if s:
        enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    else:
        enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
               "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
    ids_list.append(enc["input_ids"])
    mask_list.append(enc["attention_mask"])

input_ids = torch.cat(ids_list).unsqueeze(0)        # (1, max_sent, max_len)
attention_mask = torch.cat(mask_list).unsqueeze(0)  # (1, max_sent, max_len)

# Document-level mask: 1 for real sentences, 0 for padding slots.
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1

with torch.no_grad():
    scores, role_logits = model(input_ids, attention_mask, doc_mask)

role_labels = config.role_labels
for i, sent in enumerate(sentences):
    score = scores[0, i].item()
    role = role_labels[role_logits[0, i].argmax().item()]
    marker = "*" if score >= 0.5 else " "
    print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```
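After inference, a common follow-up step is to keep only the sentences that clear the 0.5 extraction threshold, together with their predicted roles. A minimal post-processing sketch (the helper name is illustrative):

```python
def extract_summary(sentences, scores, roles, threshold=0.5):
    """Pair each sentence with its role and score, keeping only
    those whose extraction score clears the threshold."""
    return [
        (sent, role, score)
        for sent, score, role in zip(sentences, scores, roles)
        if score >= threshold
    ]
```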
## Model Architecture

```
Input Sentences
      │
[klue/roberta-base] → [CLS] embeddings per sentence
      │
[Inter-sentence Transformer] (2 layers, 8 heads)
      │
┌───────────────────┬─────────────────────┐
│ Binary Classifier │  Role Classifier    │
│ (representative?) │  (outlook/event/    │
│                   │   financial/risk)   │
└───────────────────┴─────────────────────┘
```
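The head structure above can be sketched in PyTorch. This is an illustrative reimplementation, not the released code (which lives in `model.py`): the class and attribute names are hypothetical, and the hidden size of 768 is assumed from klue/roberta-base.

```python
import torch
import torch.nn as nn


class DualHeadDocumentEncoder(nn.Module):
    """Sketch: sentence [CLS] embeddings -> inter-sentence
    Transformer (2 layers, 8 heads) -> two classifier heads."""

    def __init__(self, hidden_size: int = 768, num_roles: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True
        )
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden_size, 1)       # representative?
        self.role_head = nn.Linear(hidden_size, num_roles)  # outlook/event/financial/risk

    def forward(self, sent_embs, doc_mask):
        # sent_embs: (batch, num_sentences, hidden) [CLS] vectors from RoBERTa
        h = self.inter_sentence(sent_embs, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(h)).squeeze(-1)  # (batch, num_sent)
        role_logits = self.role_head(h)                           # (batch, num_sent, num_roles)
        return scores, role_logits
```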
## Training

- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30
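The multi-task objective combines the two heads' losses with the 0.5 role weight. A minimal sketch under the assumption that padded sentence slots are masked out of both terms (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

ROLE_WEIGHT = 0.5  # from the training recipe above


def multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask):
    """BCE on extraction scores plus weighted cross-entropy on role
    logits, computed over real (non-padding) sentences only."""
    mask = doc_mask.bool()
    bce = F.binary_cross_entropy(scores[mask], extract_labels[mask].float())
    ce = F.cross_entropy(role_logits[mask], role_labels[mask])
    return bce + ROLE_WEIGHT * ce
```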
## Files

- `model.py`: Model definition (`DocumentEncoderConfig`, `DocumentEncoderForExtractiveSummarization`)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with a usage example
- `convert_checkpoint.py`: Script to convert the original `.pt` checkpoint
## Disclaimer

- This model is provided for research and informational purposes only.
- Its output does not constitute investment advice, financial consulting, or a buy/sell recommendation.
- LangQuant and the developers accept no legal liability for investment decisions made on the basis of the model's predictions.
- No warranty is made as to the model's accuracy, completeness, or timeliness; always seek professional advice before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.
## Usage Restrictions

- Prohibited:
  - Using this model for unlawful purposes such as market manipulation or generating false information
  - Using it as the sole decision-making mechanism of an automated trading system
  - Presenting model outputs to third parties as if they were professional financial advice
- Permitted:
  - Academic research and educational use
  - Use as an auxiliary tool in financial text analysis pipelines
  - Use as reference material for internal research and analysis work
- For commercial use, contacting LangQuant in advance is recommended.
## Contributors