intrect committed on
Commit 4043357 · verified · 1 Parent(s): 8fa0fb0

docs: add Korean LLM benchmark results (KMMLU + HAE-RAE, 3-model comparison)

Files changed (1): README.md (+39 −2)
```diff
@@ -194,6 +194,43 @@ Qwen/Qwen2.5-7B-Instruct
 
 ## Benchmarks
 
+### Korean LLM Benchmark (KMMLU + HAE-RAE)
+
+All models evaluated with **Q4_K_M quantization**, **0-shot**, using `lm-evaluation-harness` v0.4.9 + `llama.cpp` on Apple M1 Max 32GB.
+
+#### KMMLU (Korean MMLU, 10 subjects)
+
+| Subject | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
+|---------|:-----------:|:-------------------:|:---------------:|
+| Marketing | **75.7** | 72.5 | 75.6 |
+| Computer Science | **73.7** | 69.7 | 69.7 |
+| Management | 54.0 | 55.2 | **57.3** |
+| Political Science | 49.0 | 49.3 | **56.0** |
+| Economics | 45.4 | 47.7 | **51.5** |
+| Law | 43.4 | 46.1 | **49.9** |
+| Psychology | 39.2 | 39.3 | **45.7** |
+| Accounting | 38.0 | 33.0 | **42.0** |
+| Math | 33.0 | **33.7** | 27.7 |
+| Korean History | **31.0** | 29.0 | 22.0 |
+| **Average** | **48.2** | **47.6** | **49.7** |
+
+#### HAE-RAE Bench (Korean-specific)
+
+| Subtask | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
+|---------|:-----------:|:-------------------:|:---------------:|
+| Rare Words | 69.9 | 68.4 | **78.8** |
+| Standard Nomenclature | 64.7 | 66.0 | **71.9** |
+| Loan Words | 48.5 | 57.4 | **81.1** |
+| History | 45.7 | 42.6 | **77.7** |
+| General Knowledge | **44.3** | 42.1 | 44.3 |
+| **Average** | **54.5** | **55.3** | **70.7** |
+
+#### Key Findings
+
+- **No catastrophic forgetting**: VELA maintains baseline Qwen2.5 capabilities (KMMLU avg 48.2% vs 47.6%) despite domain-specific fine-tuning
+- **Domain transfer**: finance-related subjects improved vs base Qwen2.5: Marketing (+3.2%), Computer Science (+4.0%), Accounting (+5.0%)
+- **Competitive with Korean-native models**: VELA beats EXAONE-3.5-7.8B (LG AI Research) in 4/10 KMMLU subjects, despite EXAONE being pre-trained on a large-scale Korean corpus
+
 ### Quantization Benchmark (GGUF)
 
 RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
@@ -202,11 +239,11 @@ RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
 |--------|-------|--------------|---------|
 | **Q4_K_M (v6)** | **36 tok/s** | 0/5 CLEAN | RT + Report OK |
 
-> Stress test 5: Synthesis + 3K Reasoning Trace alternating, **zero Chinese leak** on both sides
+> Stress test, 5 runs: Synthesis + 3K Reasoning Trace alternating, **zero Chinese leak** in all runs
 
 ### MLX Benchmark (Apple Silicon)
 
-M1 Max 32GB, MLX 4-bit quantization
+M1 Max 32GB, MLX 4-bit quantization
 
 | Config | Quantization | Load Time | Speed | Memory |
 |--------|-------------|-----------|-------|--------|
```
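The Average row and the arithmetic in the Key Findings can be re-derived from the per-subject KMMLU scores in the added tables. A minimal sanity-check sketch (the dictionary keys are informal shorthand, not official model identifiers; subject order follows the KMMLU table):

```python
# Sanity-check the reported KMMLU aggregates against the per-subject scores.
# Subject order: Marketing, Computer Science, Management, Political Science,
# Economics, Law, Psychology, Accounting, Math, Korean History.

kmmlu = {
    "vela_dpo_v6": [75.7, 73.7, 54.0, 49.0, 45.4, 43.4, 39.2, 38.0, 33.0, 31.0],
    "qwen2.5_7b":  [72.5, 69.7, 55.2, 49.3, 47.7, 46.1, 39.3, 33.0, 33.7, 29.0],
    "exaone_3.5":  [75.6, 69.7, 57.3, 56.0, 51.5, 49.9, 45.7, 42.0, 27.7, 22.0],
}

# Per-model means; these agree with the table's Average row
# (48.2 / 47.6 / 49.7) to within one-decimal rounding.
means = {m: sum(s) / len(s) for m, s in kmmlu.items()}

# "beats EXAONE-3.5-7.8B in 4/10 KMMLU subjects"
wins_vs_exaone = sum(
    v > e for v, e in zip(kmmlu["vela_dpo_v6"], kmmlu["exaone_3.5"])
)

# "Domain transfer" deltas vs base Qwen2.5 for Marketing, Computer Science,
# and Accounting (indices 0, 1, 7): roughly +3.2, +4.0, +5.0 points.
deltas = [kmmlu["vela_dpo_v6"][i] - kmmlu["qwen2.5_7b"][i] for i in (0, 1, 7)]
```

Note that a plain mean of the rounded per-subtask scores can differ from the published average in the last decimal when the latter was computed on unrounded accuracies.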