intrect committed on
Commit 1bc7a0f · verified · 1 parent: de3e104

docs: convert the benchmark section to Korean

Files changed (1)
  1. README.md +53 -54
README.md CHANGED
@@ -194,77 +194,76 @@ Qwen/Qwen2.5-7B-Instruct
 
  ## Benchmarks
 
- ### Korean LLM Benchmark (KMMLU + HAE-RAE)
 
- All models evaluated with **Q4_K_M quantization**, **0-shot**, using `lm-evaluation-harness` v0.4.9 + `llama.cpp` on an Apple M1 Max 32GB.
 
- #### KMMLU (Korean MMLU, 10 subjects)
 
- | Subject | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
- |---------|:-----------:|:--------------------:|:---------------:|
- | Marketing | **75.7** | 72.5 | 75.6 |
- | Computer Science | **73.7** | 69.7 | 69.7 |
- | Management | 54.0 | 55.2 | **57.3** |
- | Political Science | 49.0 | 49.3 | **56.0** |
- | Economics | 45.4 | 47.7 | **51.5** |
- | Law | 43.4 | 46.1 | **49.9** |
- | Psychology | 39.2 | 39.3 | **45.7** |
- | Accounting | 38.0 | 33.0 | **42.0** |
- | Math | **33.0** | **33.7** | 27.7 |
- | Korean History | **31.0** | 29.0 | 22.0 |
- | **Average** | **48.2** | **47.6** | **49.7** |
 
- #### HAE-RAE Bench (Korean-specific)
 
- | Subtask | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
- |---------|:-----------:|:--------------------:|:---------------:|
- | Rare Words | 69.9 | 68.4 | **78.8** |
- | Standard Nomenclature | 64.7 | 66.0 | **71.9** |
- | Loan Words | 48.5 | 57.4 | **81.1** |
- | History | 45.7 | 42.6 | **77.7** |
- | General Knowledge | **44.3** | 42.1 | 44.3 |
- | **Average** | **54.5** | **55.3** | **70.7** |
 
- #### Key Findings
 
- - **No catastrophic forgetting**: VELA maintains baseline Qwen2.5 capabilities (KMMLU avg 48.2% vs 47.6%) despite domain-specific fine-tuning
- - **Domain transfer**: finance-related subjects improved over base Qwen2.5: Marketing (+3.2%), Computer Science (+4.0%), Accounting (+5.0%)
- - **Competitive with Korean-native models**: VELA beats EXAONE-3.5-7.8B (LG AI Research) in 4 of 10 KMMLU subjects, despite EXAONE being pre-trained on a large-scale Korean corpus
 
- ### Quantization Benchmark (GGUF)
 
  RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
 
- | Format | Speed | Chinese Leak | Quality |
- |--------|-------|--------------|---------|
- | **Q4_K_M (v6)** | **36 tok/s** | 0/5 CLEAN | RT + Report OK |
 
- > Stress test, 5 runs alternating Synthesis and 3K Reasoning Trace prompts: **zero Chinese leak** in every run
 
- ### MLX Benchmark (Apple Silicon)
 
- M1 Max 32GB, MLX 4-bit quantization
 
- | Config | Quantization | Load Time | Speed | Memory |
- |--------|-------------|-----------|-------|--------|
- | **MLX 4-bit** | 4-bit (4.5 bpw) | 0.59s | **15.93 tok/s** | 4.4 GB |
- | PyTorch (CPU) | BF16 | 0.10s | 4.93 tok/s | 0.3 GB |
- | PyTorch + LoRA (CPU) | BF16 | 1.64s | 4.22 tok/s | 14.1 GB |
 
  MLX 4-bit vs PyTorch CPU:
- - **3.2x** faster inference (15.93 vs 4.93 tok/s)
- - **73%** smaller model size (4 GB vs 15 GB)
- - **68%** less memory (4.4 vs 14.1 GB)
-
- ### DPO Quality Improvements
-
- | Metric | Before DPO | After DPO |
- |--------|-----------|-----------|
- | Chinese leak | Frequent | **0/10 CLEAN** |
- | English leak | Occasional | Minimal |
- | RT format compliance | ~80% | **~98%** |
- | Korean fluency | Good | **Excellent** |
-
  ---
 
  ## Usage
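As a sanity check on the `Average` rows quoted in the KMMLU table above, the per-subject scores can be averaged directly. A minimal sketch (scores copied verbatim from the table; this is only an arithmetic consistency check, not part of the committed README):

```python
# Verify that each reported KMMLU "Average" equals the mean of the ten
# per-subject scores, rounded to one decimal place.
KMMLU = {
    "VELA DPO v6":         [75.7, 73.7, 54.0, 49.0, 45.4, 43.4, 39.2, 38.0, 33.0, 31.0],
    "Qwen2.5-7B-Instruct": [72.5, 69.7, 55.2, 49.3, 47.7, 46.1, 39.3, 33.0, 33.7, 29.0],
    "EXAONE-3.5-7.8B":     [75.6, 69.7, 57.3, 56.0, 51.5, 49.9, 45.7, 42.0, 27.7, 22.0],
}
REPORTED = {"VELA DPO v6": 48.2, "Qwen2.5-7B-Instruct": 47.6, "EXAONE-3.5-7.8B": 49.7}

for model, scores in KMMLU.items():
    avg = sum(scores) / len(scores)
    # Allow half a rounding step of slack (47.55 rounds up to the reported 47.6).
    assert abs(avg - REPORTED[model]) < 0.06, (model, avg)
    print(f"{model}: mean {avg:.2f} -> reported {REPORTED[model]}")
```

All three reported averages check out against the per-subject means (48.24, 47.55, 49.74).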
 
  ## Benchmarks
 
+ ### 한국어 LLM 벤치마크 (KMMLU + HAE-RAE)
 
+ 모든 모델을 **Q4_K_M 양자화**, **0-shot** 조건으로 평가. `lm-evaluation-harness` v0.4.9 + `llama.cpp`, Apple M1 Max 32GB 환경.
 
+ #### KMMLU (한국어 MMLU, 10과목)
 
+ | 과목 | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
+ |------|:-----------:|:--------------------:|:---------------:|
+ | 마케팅 | **75.7** | 72.5 | 75.6 |
+ | 컴퓨터과학 | **73.7** | 69.7 | 69.7 |
+ | 경영학 | 54.0 | 55.2 | **57.3** |
+ | 정치사회학 | 49.0 | 49.3 | **56.0** |
+ | 경제학 | 45.4 | 47.7 | **51.5** |
+ | 법학 | 43.4 | 46.1 | **49.9** |
+ | 심리학 | 39.2 | 39.3 | **45.7** |
+ | 회계 | 38.0 | 33.0 | **42.0** |
+ | 수학 | **33.0** | **33.7** | 27.7 |
+ | 한국사 | **31.0** | 29.0 | 22.0 |
+ | **평균** | **48.2** | **47.6** | **49.7** |
 
+ #### HAE-RAE Bench (한국어 특화)
 
+ | 영역 | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
+ |------|:-----------:|:--------------------:|:---------------:|
+ | 희귀어 | 69.9 | 68.4 | **78.8** |
+ | 표준명칭 | 64.7 | 66.0 | **71.9** |
+ | 외래어 | 48.5 | 57.4 | **81.1** |
+ | 한국사 | 45.7 | 42.6 | **77.7** |
+ | 일반상식 | **44.3** | 42.1 | 44.3 |
+ | **평균** | **54.5** | **55.3** | **70.7** |
 
+ #### 주요 발견
 
+ - **Catastrophic forgetting 없음**: 도메인 특화 fine-tuning 후에도 베이스 모델(Qwen2.5)의 능력 유지 (KMMLU 평균 48.2% vs 47.6%)
+ - **도메인 전이 효과**: 금융 관련 과목에서 베이스 모델 대비 향상: 마케팅(+3.2%), 컴퓨터과학(+4.0%), 회계(+5.0%)
+ - **한국어 네이티브 모델과 경쟁**: 대규모 한국어 코퍼스로 사전학습된 EXAONE-3.5-7.8B (LG AI Research) 대비 KMMLU 10과목 중 4과목에서 우위
 
+ ### 양자화 벤치마크 (GGUF)
 
  RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
 
+ | 포맷 | 속도 | 중국어 Leak | 품질 |
+ |------|------|-------------|------|
+ | **Q4_K_M (v6)** | **36 tok/s** | 0/5 클린 | RT + 리포트 정상 |
 
+ > 스트레스 테스트 5회: Synthesis + 3K Reasoning Trace 교대 실행에서 **중국어 leak 제로**
 
+ ### MLX 벤치마크 (Apple Silicon)
 
+ M1 Max 32GB, MLX 4-bit 양자화
 
+ | 구성 | 양자화 | 로딩 시간 | 추론 속도 | 메모리 |
+ |------|--------|----------|----------|--------|
+ | **MLX 4-bit** | 4-bit (4.5 bpw) | 0.59초 | **15.93 tok/s** | 4.4 GB |
+ | PyTorch (CPU) | BF16 | 0.10초 | 4.93 tok/s | 0.3 GB |
+ | PyTorch + LoRA (CPU) | BF16 | 1.64초 | 4.22 tok/s | 14.1 GB |
 
  MLX 4-bit vs PyTorch CPU:
+ - 추론 속도 **3.2배** 향상 (15.93 vs 4.93 tok/s)
+ - 모델 크기 **73% 감소** (4 GB vs 15 GB)
+ - 메모리 **68% 절약** (4.4 vs 14.1 GB)
+
+ ### DPO 학습 품질 개선
+
+ | 지표 | DPO 이전 | DPO 이후 |
+ |------|----------|----------|
+ | 중국어 leak | 빈번 | **0/10 클린** |
+ | 영어 leak | 간헐적 | 최소화 |
+ | RT 형식 준수율 | ~80% | **~98%** |
+ | 한국어 유창성 | 양호 | **우수** |
 
  ---
 
  ## Usage
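The MLX-vs-PyTorch deltas quoted in the benchmark bullets follow directly from the table values. A minimal re-derivation (numbers copied from the diff above, nothing measured here; note that the 68% memory figure compares against the PyTorch + LoRA row at 14.1 GB, since the plain CPU row lists 0.3 GB):

```python
# Re-derive the "3.2x faster / 73% smaller / 68% less memory" claims
# from the MLX benchmark table and its summary bullets.
mlx_tok_s, torch_tok_s = 15.93, 4.93    # inference speed (tok/s)
mlx_size_gb, torch_size_gb = 4.0, 15.0  # model size (GB), from the bullets
mlx_mem_gb, lora_mem_gb = 4.4, 14.1     # memory (GB); 14.1 GB is the LoRA row

speedup = mlx_tok_s / torch_tok_s                          # ~3.23 -> "3.2x"
size_cut = int((1 - mlx_size_gb / torch_size_gb) * 100)    # 73   -> "73%"
mem_cut = int((1 - mlx_mem_gb / lora_mem_gb) * 100)        # 68   -> "68%"

assert round(speedup, 1) == 3.2
assert size_cut == 73 and mem_cut == 68
print(f"{speedup:.1f}x faster, {size_cut}% smaller, {mem_cut}% less memory")
```

The percentages in the README are truncated rather than rounded (68.8% memory reduction is reported as 68%), which the `int()` conversion reproduces.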