Commit 31d42b8 (verified) by langquantof
1 Parent(s): ca4cf80

Update README.md

Files changed (1)
  1. README.md +172 -172
README.md CHANGED
@@ -1,172 +1,172 @@
- ---
- language:
- - ko
- license: mit
- tags:
- - finance
- - extractive-summarization
- - sentence-extraction
- - role-classification
- - korean
- - roberta
- pipeline_tag: text-classification
- base_model: klue/roberta-base
- metrics:
- - f1
- - accuracy
- ---
-
- # LQ-FSE-base: Korean Financial Sentence Extractor
-
- A model that extracts representative sentences from financial reports and finance-related news and classifies each sentence's role (outlook, event, financial, risk).
-
- ## Model Description
-
- - **Base Model**: klue/roberta-base
- - **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
- - **Task**: Extractive Summarization + Role Classification (Multi-task)
- - **Language**: Korean
- - **Domain**: Financial Reports (securities reports), Financial News
-
- ### Input Constraints
-
- | Parameter | Value | Description |
- |-----------|-------|-------------|
- | Max sentence length | 128 tokens | Maximum number of tokens per sentence (longer sentences are truncated) |
- | Max sentences per document | 30 | Maximum number of sentences per document (only the first 30 are used if exceeded) |
- | Input format | Plain text | Automatically split into sentences on sentence-ending punctuation (`.!?`) |
-
- - **Input**: Korean financial text (securities reports, financial news, etc.)
- - **Output**: A representative-sentence score (0 to 1) for each sentence, plus a role classification (outlook/event/financial/risk)
-
- ### Performance
-
- | Metric | Score |
- |--------|-------|
- | Extraction F1 | 0.705 |
- | Role Accuracy | 0.851 |
-
- ### Role Labels
-
- | Label | Description |
- |-------|-------------|
- | `outlook` | Outlook/forecast sentences |
- | `event` | Event/incident sentences |
- | `financial` | Financial/earnings sentences |
- | `risk` | Risk-factor sentences |
-
- ## Usage
-
- ```python
- import re
- import torch
- from transformers import AutoConfig, AutoModel, AutoTokenizer
-
- repo_id = "LangQuant/LQ-FSE-base"
-
- # Load the model (trust_remote_code runs the custom model code shipped in the repo)
- config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
- model.eval()
-
- # Input text: three Korean sentences about Samsung Electronics' Q4 2024 results
- text = (
-     "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
-     "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
-     "HBM3E 양산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
- )
-
- # Split into sentences and tokenize
- sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
- max_len, max_sent = config.max_length, config.max_sentences
-
- # Keep at most max_sent sentences and pad the list with empty placeholders
- padded = sentences[:max_sent]
- num_real = len(padded)
- while len(padded) < max_sent:
-     padded.append("")
-
- # Real sentences are tokenized; padding slots get all-zero ids and masks
- ids_list, mask_list = [], []
- for s in padded:
-     if s:
-         enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
-     else:
-         enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
-                "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
-     ids_list.append(enc["input_ids"])
-     mask_list.append(enc["attention_mask"])
-
- # Shape (1, max_sent, max_len); doc_mask is 1 for real sentences, 0 for padding
- input_ids = torch.cat(ids_list).unsqueeze(0)
- attention_mask = torch.cat(mask_list).unsqueeze(0)
- doc_mask = torch.zeros(1, max_sent)
- doc_mask[0, :num_real] = 1
-
- # Inference
- with torch.no_grad():
-     scores, role_logits = model(input_ids, attention_mask, doc_mask)
-
- # Print only the sentences that were actually fed to the model
- role_labels = config.role_labels
- for i, sent in enumerate(sentences[:num_real]):
-     score = scores[0, i].item()
-     role = role_labels[role_logits[0, i].argmax().item()]
-     marker = "*" if score >= 0.5 else " "
-     print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
- ```
-
- ## Model Architecture
-
- ```
- Input Sentences
-        ↓
- [klue/roberta-base] → [CLS] embeddings per sentence
-        ↓
- [Inter-sentence Transformer] (2 layers, 8 heads)
-        ↓
- ┌──────────────────┬─────────────────────┐
- │ Binary Classifier│ Role Classifier     │
- │ (representative?)│ (outlook/event/     │
- │                  │  financial/risk)    │
- └──────────────────┴─────────────────────┘
- ```
-
- ## Training
-
- - Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- - Scheduler: Linear warmup (10%)
- - Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
- - Max sentence length: 128 tokens
- - Max sentences per document: 30
-
- ## Files
-
- - `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
- - `config.json`: Model configuration
- - `model.safetensors`: Model weights
- - `inference_example.py`: Inference helper with usage example
- - `convert_checkpoint.py`: Script to convert original .pt checkpoint
-
- ## Disclaimer
-
- - This model is provided for **research and informational purposes only**.
- - The model's output is **not investment advice, financial advice, or a trading recommendation.**
- - LangQuant and the developers accept **no legal liability whatsoever** for investment decisions based on the model's predictions.
- - No warranty is given as to the model's accuracy, completeness, or timeliness; always seek professional advice before making actual investment decisions.
- - Financial markets are inherently uncertain, and a model trained on past data does not guarantee future performance.
-
- ## Usage Restrictions
-
- - **Prohibited:**
-   - Using the model for illegal purposes such as market manipulation or generating false information
-   - Using it as the sole decision-making mechanism of an automated trading system
-   - Presenting model output to third parties as if it were professional financial advice
- - **Permitted:**
-   - Academic research and educational use
-   - Use as an auxiliary tool in financial text analysis pipelines
-   - Use as reference material for in-house research and analysis
- - For commercial use, contacting LangQuant in advance is recommended.
-
- ## Contributors
-
- - **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
- - **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my), Ecole 42
- - **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu), DSSAL
 
+ ---
+ language:
+ - ko
+ license: mit
+ tags:
+ - finance
+ - extractive-summarization
+ - sentence-extraction
+ - role-classification
+ - korean
+ - roberta
+ pipeline_tag: text-classification
+ base_model: klue/roberta-base
+ metrics:
+ - f1
+ - accuracy
+ ---
+
+ # LQ-FSE-base: Korean Financial Sentence Extractor
+
+ Released by LangQuant (랭퀀트), this model extracts representative sentences from financial reports and finance-related news and classifies each sentence's role (outlook, event, financial, risk).
+
+ ## Model Description
+
+ - **Base Model**: klue/roberta-base
+ - **Architecture**: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
+ - **Task**: Extractive Summarization + Role Classification (Multi-task)
+ - **Language**: Korean
+ - **Domain**: Financial Reports (securities reports), Financial News
+
+ ### Input Constraints
+
+ | Parameter | Value | Description |
+ |-----------|-------|-------------|
+ | Max sentence length | 128 tokens | Maximum number of tokens per sentence (longer sentences are truncated) |
+ | Max sentences per document | 30 | Maximum number of sentences per document (only the first 30 are used if exceeded) |
+ | Input format | Plain text | Automatically split into sentences on sentence-ending punctuation (`.!?`) |
+
+ - **Input**: Korean financial text (securities reports, financial news, etc.)
+ - **Output**: A representative-sentence score (0 to 1) for each sentence, plus a role classification (outlook/event/financial/risk)
+
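The sentence-level constraints above can be illustrated with a minimal sketch. It assumes the same punctuation-based regular expression that the Usage example in this README applies; the function name `split_sentences` and its `max_sentences` default are illustrative and not part of the released code. Token-level truncation to 128 tokens is left to the tokenizer and is not shown.

```python
import re


def split_sentences(text: str, max_sentences: int = 30) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' and keep the first max_sentences."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    return sentences[:max_sentences]


doc = "Revenue rose 12%. Margins narrowed! Will demand recover? Guidance was kept."
print(split_sentences(doc))

# A document with 40 sentences is cut down to the first 30.
print(len(split_sentences("A. " * 40)))
```

Because the split requires whitespace after the punctuation mark, a figure such as `3.5%` is not broken in the middle of the number.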
+ ### Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Extraction F1 | 0.705 |
+ | Role Accuracy | 0.851 |
+
+ ### Role Labels
+
+ | Label | Description |
+ |-------|-------------|
+ | `outlook` | Outlook/forecast sentences |
+ | `event` | Event/incident sentences |
+ | `financial` | Financial/earnings sentences |
+ | `risk` | Risk-factor sentences |
+
+ ## Usage
+
+ ```python
+ import re
+ import torch
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
+
+ repo_id = "LangQuant/LQ-FSE-base"
+
+ # Load the model (trust_remote_code runs the custom model code shipped in the repo)
+ config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(repo_id)
+ model.eval()
+
+ # Input text: three Korean sentences about Samsung Electronics' Q4 2024 results
+ text = (
+     "삼성전자의 2024년 4분기 실적이 시장 예상을 상회했다. "
+     "메모리 반도체 가격 상승으로 영업이익이 전분기 대비 30% 증가했다. "
+     "HBM3E 양산이 본격화되면서 AI 반도체 시장 점유율이 확대될 전망이다."
+ )
+
+ # Split into sentences and tokenize
+ sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
+ max_len, max_sent = config.max_length, config.max_sentences
+
+ # Keep at most max_sent sentences and pad the list with empty placeholders
+ padded = sentences[:max_sent]
+ num_real = len(padded)
+ while len(padded) < max_sent:
+     padded.append("")
+
+ # Real sentences are tokenized; padding slots get all-zero ids and masks
+ ids_list, mask_list = [], []
+ for s in padded:
+     if s:
+         enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
+     else:
+         enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
+                "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
+     ids_list.append(enc["input_ids"])
+     mask_list.append(enc["attention_mask"])
+
+ # Shape (1, max_sent, max_len); doc_mask is 1 for real sentences, 0 for padding
+ input_ids = torch.cat(ids_list).unsqueeze(0)
+ attention_mask = torch.cat(mask_list).unsqueeze(0)
+ doc_mask = torch.zeros(1, max_sent)
+ doc_mask[0, :num_real] = 1
+
+ # Inference
+ with torch.no_grad():
+     scores, role_logits = model(input_ids, attention_mask, doc_mask)
+
+ # Print only the sentences that were actually fed to the model
+ role_labels = config.role_labels
+ for i, sent in enumerate(sentences[:num_real]):
+     score = scores[0, i].item()
+     role = role_labels[role_logits[0, i].argmax().item()]
+     marker = "*" if score >= 0.5 else " "
+     print(f" {marker} [{score:.4f}] [{role:10s}] {sent}")
+ ```
+
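Once scores and role logits are available, a common next step is to keep only the sentences marked as representative. The sketch below is one possible post-processing helper, not part of the repository: `select_representatives`, `top_k`, and the hard-coded `ROLE_LABELS` order are assumptions for illustration (in practice the order should be read from `config.role_labels`). It works on plain Python lists, such as those obtained from `scores[0].tolist()` and `role_logits[0].tolist()`.

```python
# Assumed label order for illustration only; read config.role_labels in practice.
ROLE_LABELS = ["outlook", "event", "financial", "risk"]


def select_representatives(sentences, scores, role_logits, threshold=0.5, top_k=None):
    """Return (sentence, score, role) tuples with score >= threshold, highest score first."""
    picked = []
    # zip stops at the shortest input, so scores for padding slots are ignored.
    for sent, score, logits in zip(sentences, scores, role_logits):
        if score >= threshold:
            best = max(range(len(logits)), key=logits.__getitem__)  # argmax over role logits
            picked.append((sent, score, ROLE_LABELS[best]))
    picked.sort(key=lambda item: item[1], reverse=True)
    return picked if top_k is None else picked[:top_k]


# Made-up numbers standing in for model output.
sentences = ["s1", "s2", "s3"]
scores = [0.91, 0.32, 0.67]
role_logits = [
    [0.1, 0.2, 2.5, -1.0],
    [1.2, 0.0, 0.3, 0.1],
    [2.0, 0.5, 0.1, -0.3],
]

for sent, score, role in select_representatives(sentences, scores, role_logits):
    print(f"[{score:.2f}] [{role}] {sent}")
```

The 0.5 threshold mirrors the marker used in the Usage example; a `top_k` cap may be preferable when a fixed-length summary is needed.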
+ ## Model Architecture
+
+ ```
+ Input Sentences
+        ↓
+ [klue/roberta-base] → [CLS] embeddings per sentence
+        ↓
+ [Inter-sentence Transformer] (2 layers, 8 heads)
+        ↓
+ ┌──────────────────┬─────────────────────┐
+ │ Binary Classifier│ Role Classifier     │
+ │ (representative?)│ (outlook/event/     │
+ │                  │  financial/risk)    │
+ └──────────────────┴─────────────────────┘
+ ```
+
+ ## Training
+
+ - Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
+ - Scheduler: Linear warmup (10%)
+ - Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
+ - Max sentence length: 128 tokens
+ - Max sentences per document: 30
+
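The loss and scheduler settings listed above can be made concrete with a small pure-Python sketch. Two points are assumptions, since the README does not spell them out: that the two losses are combined as `extraction_loss + role_weight * role_loss`, and that the learning rate decays linearly to zero after the 10% warmup. All function names here are hypothetical and do not come from the training code.

```python
import math


def bce(prob: float, target: int) -> float:
    """Binary cross-entropy for one sentence's extraction probability."""
    eps = 1e-7
    prob = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def cross_entropy(logits: list[float], target: int) -> float:
    """Softmax cross-entropy for one sentence's role logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(prob, is_representative, role_logits, role, role_weight=0.5):
    # Assumed combination: extraction loss plus role_weight times the role loss.
    return bce(prob, is_representative) + role_weight * cross_entropy(role_logits, role)


def warmup_factor(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Learning-rate multiplier: linear ramp during warmup, then linear decay (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(multitask_loss(0.5, 1, [0.0, 0.0, 0.0, 0.0], 2))
print([warmup_factor(s, 1000) for s in (0, 50, 100, 550, 1000)])
```

With a base learning rate of 2e-5, the effective rate at a given step under these assumptions would be `2e-5 * warmup_factor(step, total_steps)`.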
+ ## Files
+
+ - `model.py`: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
+ - `config.json`: Model configuration
+ - `model.safetensors`: Model weights
+ - `inference_example.py`: Inference helper with usage example
+ - `convert_checkpoint.py`: Script to convert original .pt checkpoint
+
+ ## Disclaimer
+
+ - This model is provided for **research and informational purposes only**.
+ - The model's output is **not investment advice, financial advice, or a trading recommendation.**
+ - LangQuant and the developers accept **no legal liability whatsoever** for investment decisions based on the model's predictions.
+ - No warranty is given as to the model's accuracy, completeness, or timeliness; always seek professional advice before making actual investment decisions.
+ - Financial markets are inherently uncertain, and a model trained on past data does not guarantee future performance.
+
+ ## Usage Restrictions
+
+ - **Prohibited:**
+   - Using the model for illegal purposes such as market manipulation or generating false information
+   - Using it as the sole decision-making mechanism of an automated trading system
+   - Presenting model output to third parties as if it were professional financial advice
+ - **Permitted:**
+   - Academic research and educational use
+   - Use as an auxiliary tool in financial text analysis pipelines
+   - Use as reference material for in-house research and analysis
+ - For commercial use, contacting LangQuant in advance is recommended.
+
+ ## Contributors
+
+ - **[Taegyeong Lee](https://www.linkedin.com/in/taegyeong-lee/)** (taegyeong.leaf@gmail.com)
+ - **[Dong Young Kim](https://www.linkedin.com/in/dykim04/)** (dong-kim@student.42kl.edu.my), Ecole 42
+ - **[Seunghyun Hwang](https://www.linkedin.com/in/seung-hyun-hwang-53700124a/)** (hsh1030@g.skku.edu), DSSAL