---
language:
- ko
license: mit
tags:
- finance
- extractive-summarization
- sentence-extraction
- role-classification
- korean
- roberta
pipeline_tag: text-classification
base_model: klue/roberta-base
metrics:
- f1
- accuracy
---

# LQ-FSE-base: Korean Financial Sentence Extractor

A model released by LangQuant that extracts representative sentences from financial reports and finance-related news and classifies each sentence's role (outlook, event, financial, risk).

## Model Description

- **Base Model**: klue/roberta-base
- **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- **Task**: Extractive Summarization + Role Classification (Multi-task)
- **Language**: Korean
- **Domain**: Financial reports (securities research) and financial news

### Input Constraints

| Parameter | Value | Description |
|-----------|-------|-------------|
| Max sentence length | 128 tokens | Maximum tokens per sentence (longer sentences are truncated) |
| Max sentences per document | 30 | Maximum sentences per document (only the first 30 are used) |
| Input format | Plain text | Automatically split into sentences on punctuation (`.!?`) |

- **Input**: Korean financial text (securities reports, financial news, etc.)
- **Output**: a representativeness score (0–1) and a role label (outlook/event/financial/risk) for each sentence

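The splitting and truncation rules above can be sketched with the standard library alone. The regex below mirrors the one used in the usage example; the helper name `split_sentences` is illustrative, not part of the released API:

```python
import re

MAX_SENTENCES = 30  # documents longer than this keep only their first 30 sentences

def split_sentences(text: str) -> list[str]:
    """Split plain text into sentences on `.`, `!`, or `?` followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

doc = "실적이 개선됐다. 주가가 올랐다! 전망은 어떤가?"
sentences = split_sentences(doc)[:MAX_SENTENCES]
print(sentences)  # three sentences, punctuation preserved
```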
### Performance

| Metric | Score |
|--------|-------|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |

### Role Labels

| Label | Description |
|-------|-------------|
| `outlook` | Outlook/forecast sentences |
| `event` | Event/incident sentences |
| `financial` | Financials/earnings sentences |
| `risk` | Risk-factor sentences |

## Usage

```python
import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "LangQuant/LQ-FSE-base"

# Load model
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

# Input text ("Samsung Electronics' Q4 2024 results beat market expectations.
# Operating profit rose 30% QoQ on higher memory chip prices. With HBM3E mass
# production ramping up, AI semiconductor market share is expected to expand.")
text = (
    "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
    "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
    "HBM3E 양산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
)

# Split into sentences and tokenize
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences

# Truncate to max_sent sentences, then pad with empty slots
padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
    padded.append("")

ids_list, mask_list = [], []
for s in padded:
    if s:
        enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    else:
        enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
               "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
    ids_list.append(enc["input_ids"])
    mask_list.append(enc["attention_mask"])

input_ids = torch.cat(ids_list).unsqueeze(0)        # (1, max_sent, max_len)
attention_mask = torch.cat(mask_list).unsqueeze(0)  # (1, max_sent, max_len)
doc_mask = torch.zeros(1, max_sent)                 # 1 = real sentence, 0 = padding
doc_mask[0, :num_real] = 1

# Inference
with torch.no_grad():
    scores, role_logits = model(input_ids, attention_mask, doc_mask)

role_labels = config.role_labels
for i, sent in enumerate(sentences):
    score = scores[0, i].item()
    role = role_labels[role_logits[0, i].argmax().item()]
    marker = "*" if score >= 0.5 else " "
    print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```

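To turn the per-sentence scores into an extractive summary, one option is to keep the highest-scoring sentences above the 0.5 threshold while preserving document order. A minimal sketch with dummy scores; the `select_summary` helper and its parameters are illustrative, not part of the released API:

```python
def select_summary(sentences, scores, threshold=0.5, top_k=3):
    """Pick up to `top_k` sentences scoring at or above `threshold`,
    returned in their original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = [i for i in ranked if scores[i] >= threshold][:top_k]
    return [sentences[i] for i in sorted(chosen)]

# Example with dummy scores (in practice, use `scores[0].tolist()` from the model)
summary = select_summary(["a", "b", "c", "d"], [0.9, 0.2, 0.7, 0.6], top_k=2)
print(summary)  # -> ['a', 'c']
```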
## Model Architecture

```
Input Sentences
       ↓
[klue/roberta-base] → [CLS] embedding per sentence
       ↓
[Inter-sentence Transformer] (2 layers, 8 heads)
       ↓
┌───────────────────┬─────────────────────┐
│ Binary Classifier │ Role Classifier     │
│ (representative?) │ (outlook/event/     │
│                   │  financial/risk)    │
└───────────────────┴─────────────────────┘
```

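The diagram corresponds roughly to the following PyTorch sketch. The class name, head layout, and dimensions are illustrative assumptions; the released `model.py` is authoritative:

```python
import torch
import torch.nn as nn

class DualHeadExtractor(nn.Module):
    """Sketch of the architecture above: per-sentence [CLS] embeddings are
    contextualized by a 2-layer inter-sentence transformer, then fed to a
    binary extraction head and a 4-way role head."""

    def __init__(self, hidden=768, heads=8, layers=2, num_roles=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=layers)
        self.extract_head = nn.Linear(hidden, 1)          # representative-sentence score
        self.role_head = nn.Linear(hidden, num_roles)     # role logits

    def forward(self, cls_embeddings, doc_mask):
        # cls_embeddings: (batch, num_sent, hidden); doc_mask: (batch, num_sent), 1 = real
        h = self.inter_sentence(cls_embeddings, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(h)).squeeze(-1)  # (batch, num_sent)
        role_logits = self.role_head(h)                           # (batch, num_sent, num_roles)
        return scores, role_logits

m = DualHeadExtractor()
cls = torch.randn(1, 30, 768)   # [CLS] embeddings for 30 sentences
mask = torch.ones(1, 30)
scores, roles = m(cls, mask)
print(scores.shape, roles.shape)  # torch.Size([1, 30]) torch.Size([1, 30, 4])
```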
## Training

- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30

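The combined loss above can be sketched as follows. Masking padded sentences out of both terms is an assumption about the training setup; `role_weight=0.5` matches the value listed:

```python
import torch
import torch.nn.functional as F

def multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask, role_weight=0.5):
    """BCE on extraction scores plus weighted cross-entropy on role logits,
    computed over real (non-padding) sentences only."""
    mask = doc_mask.bool()                         # (batch, num_sent)
    bce = F.binary_cross_entropy(scores[mask], extract_labels[mask].float())
    ce = F.cross_entropy(role_logits[mask], role_labels[mask])
    return bce + role_weight * ce

# Dummy batch: 5 sentence slots, the last 2 are padding
scores = torch.rand(1, 5)
role_logits = torch.randn(1, 5, 4)
extract_labels = torch.randint(0, 2, (1, 5))
role_labels = torch.randint(0, 4, (1, 5))
doc_mask = torch.tensor([[1, 1, 1, 0, 0]])
loss = multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask)
print(loss.item())
```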
## Files

- `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with usage example
- `convert_checkpoint.py`: Script to convert original .pt checkpoint

## Disclaimer

- This model is provided **for research and informational purposes only**.
- Its output is **not investment advice, financial consulting, or a trade recommendation.**
- LangQuant and the developers **accept no legal responsibility** for investment decisions made on the basis of the model's predictions.
- No warranty is given as to the model's accuracy, stability, or suitability; always seek professional advice before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.

## Usage Restrictions

- **Prohibited:**
  - Using the model for illegal purposes such as market manipulation or generating false information
  - Using it as the sole decision-making component of an automated trading system
  - Presenting its output to third parties as if it were professional financial advice
- **Permitted:**
  - Academic research and educational use
  - Use as an auxiliary tool in financial text-analysis pipelines
  - Use as reference material for internal research and analysis work
- Prior consultation with LangQuant is recommended for commercial use.

## Contributors

- **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my) — Ecole 42
- **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu) — DSSAL