langquantof committed
Commit a369fac · verified · 1 Parent(s): 3d36724

Upload README.md with huggingface_hub

Files changed (1): README.md (+154 -3)
README.md CHANGED @@ -1,3 +1,154 @@

- ---
- license: mit
- ---

# LQ-FSE-base: Korean Financial Sentence Extractor

A model that extracts representative sentences from financial reports and finance-related news and classifies each sentence's role (outlook, event, financial, risk).

## Model Description

- **Base Model**: klue/roberta-base
- **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- **Task**: Extractive Summarization + Role Classification (Multi-task)
- **Language**: Korean
- **Domain**: Financial reports (securities research), financial news

### Input Constraints

| Parameter | Value | Description |
|-----------|-------|-------------|
| Max sentence length | 128 tokens | Maximum tokens per sentence (longer sentences are truncated) |
| Max sentences per document | 30 | Maximum sentences per document (only the first 30 are used) |
| Input format | Plain text | Sentences are split automatically on sentence-final punctuation (`.!?`) |

- **Input**: Korean financial text (securities reports, financial news, etc.)
- **Output**: a representativeness score (0–1) for each sentence, plus a role classification (outlook/event/financial/risk)

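The splitting and capping rules in the table above can be illustrated with a small preprocessing sketch (the regex mirrors the one used in the Usage example below; the 128-token limit is enforced later by tokenizer truncation):

```python
import re

MAX_SENTENCES = 30  # sentences beyond the 30th are dropped
MAX_TOKENS = 128    # enforced later via tokenizer truncation

def split_sentences(text: str) -> list[str]:
    """Split plain text on sentence-final punctuation (. ! ?)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()][:MAX_SENTENCES]

sents = split_sentences("A rose. Is a rose! Is it a rose?")
print(sents)  # ['A rose.', 'Is a rose!', 'Is it a rose?']
```
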
### Performance

| Metric | Score |
|--------|-------|
| Extraction F1 | 0.705 |
| Role Accuracy | 0.851 |

### Role Labels

| Label | Description |
|-------|-------------|
| `outlook` | Forecast/outlook sentences |
| `event` | Event/incident sentences |
| `financial` | Financial results/earnings sentences |
| `risk` | Risk-factor sentences |

## Usage

```python
import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "LangQuant/LQ-FSE-base"

# Load the model
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

# Input text
text = (
    "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
    "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
    "HBM3E 양산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
)

# Split into sentences and tokenize
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences

padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
    padded.append("")

ids_list, mask_list = [], []
for s in padded:
    if s:
        enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    else:
        enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
               "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
    ids_list.append(enc["input_ids"])
    mask_list.append(enc["attention_mask"])

input_ids = torch.cat(ids_list).unsqueeze(0)
attention_mask = torch.cat(mask_list).unsqueeze(0)
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1

# Inference
with torch.no_grad():
    scores, role_logits = model(input_ids, attention_mask, doc_mask)

role_labels = config.role_labels
for i, sent in enumerate(sentences[:max_sent]):
    score = scores[0, i].item()
    role = role_labels[role_logits[0, i].argmax().item()]
    marker = "*" if score >= 0.5 else " "
    print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
```

## Model Architecture

```
Input Sentences
        ↓
[klue/roberta-base] → [CLS] embeddings per sentence
        ↓
[Inter-sentence Transformer] (2 layers, 8 heads)
        ↓
┌───────────────────┬─────────────────────┐
│ Binary Classifier │ Role Classifier     │
│ (representative?) │ (outlook/event/     │
│                   │  financial/risk)    │
└───────────────────┴─────────────────────┘
```

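The diagram's inter-sentence stage and dual heads can be sketched in PyTorch. This is an illustrative skeleton under stated assumptions, not the repository's actual `model.py`: the class name `ExtractorHeads` is hypothetical, the hidden size 768 follows roberta-base, and the sentence encoder that produces the [CLS] embeddings is omitted.

```python
import torch
import torch.nn as nn

class ExtractorHeads(nn.Module):
    """Inter-sentence transformer + dual heads over per-sentence [CLS] embeddings.

    Illustrative sketch only -- the real model also contains the
    klue/roberta-base sentence encoder that produces the embeddings.
    """

    def __init__(self, hidden: int = 768, num_roles: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden, 1)       # representative? (binary)
        self.role_head = nn.Linear(hidden, num_roles)  # outlook/event/financial/risk

    def forward(self, cls_embeds, doc_mask):
        # doc_mask: 1 for real sentences, 0 for padded slots
        ctx = self.inter_sentence(cls_embeds, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(ctx)).squeeze(-1)
        return scores, self.role_head(ctx)

heads = ExtractorHeads()
embeds = torch.randn(1, 30, 768)           # [batch, max_sentences, hidden]
mask = torch.zeros(1, 30)
mask[0, :3] = 1                            # 3 real sentences, 27 padded
scores, role_logits = heads(embeds, mask)
print(scores.shape, role_logits.shape)     # torch.Size([1, 30]) torch.Size([1, 30, 4])
```
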
## Training

- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10%)
- Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- Max sentence length: 128 tokens
- Max sentences per document: 30

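The combined loss listed above (BCE for extraction plus cross-entropy for roles, weighted by 0.5) can be written as a short sketch; the function name, label tensors, and masking convention here are illustrative assumptions, not the repository's training code:

```python
import torch
import torch.nn.functional as F

ROLE_WEIGHT = 0.5  # weight on the role-classification term, per the Training section

def multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask):
    """BCE over extraction scores + weighted CE over role logits.

    Padded sentence slots (doc_mask == 0) are excluded from both terms.
    """
    real = doc_mask.bool()
    bce = F.binary_cross_entropy(scores[real], extract_labels[real].float())
    ce = F.cross_entropy(role_logits[real], role_labels[real])
    return bce + ROLE_WEIGHT * ce

# Toy example: 1 document, 4 real sentences out of 30 slots
scores = torch.rand(1, 30)
role_logits = torch.randn(1, 30, 4)
extract_labels = torch.randint(0, 2, (1, 30))
role_labels = torch.randint(0, 4, (1, 30))
doc_mask = torch.zeros(1, 30)
doc_mask[0, :4] = 1
loss = multitask_loss(scores, role_logits, extract_labels, role_labels, doc_mask)
print(loss.item())
```
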
## Files

- `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `inference_example.py`: Inference helper with usage example
- `convert_checkpoint.py`: Script to convert the original .pt checkpoint

## Disclaimer

- This model is provided **for research and informational purposes only**.
- Its output is **not investment advice, financial counsel, or a trading recommendation.**
- LangQuant and the developers **accept no legal liability** for investment decisions based on the model's predictions.
- No warranty is made as to the model's accuracy, completeness, or timeliness; always consult a professional before making actual investment decisions.
- Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.

## Usage Restrictions

- **Prohibited:**
  - Using the model for illegal purposes such as market manipulation or generating false information
  - Using it as the sole decision-making component of an automated trading system
  - Presenting its output to third parties as professional financial advice
- **Permitted:**
  - Academic research and educational use
  - Use as an auxiliary tool in financial text-analysis pipelines
  - Use as reference material for internal research and analysis
- Contacting LangQuant in advance is recommended for commercial use.

## Contributors

- **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my) – Ecole 42
- **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu) – DSSAL