docs: add Korean LLM benchmark results (KMMLU + HAE-RAE, 3-model comparison)
Browse files
README.md
CHANGED
|
@@ -194,6 +194,43 @@ Qwen/Qwen2.5-7B-Instruct
|
|
| 194 |
|
| 195 |
## Benchmarks
|
| 196 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 197 |
### Quantization Benchmark (GGUF)
|
| 198 |
|
| 199 |
RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
|
|
@@ -202,11 +239,11 @@ RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
|
|
| 202 |
|--------|-------|--------------|---------|
|
| 203 |
| **Q4_K_M (v6)** | **36 tok/s** | 0/5 CLEAN | RT + Report OK |
|
| 204 |
|
| 205 |
-
> Stress test 5
|
| 206 |
|
| 207 |
### MLX Benchmark (Apple Silicon)
|
| 208 |
|
| 209 |
-
M1 Max 32GB, MLX 4-bit
|
| 210 |
|
| 211 |
| Config | Quantization | Load Time | Speed | Memory |
|
| 212 |
|--------|-------------|-----------|-------|--------|
|
|
|
|
| 194 |
|
| 195 |
## Benchmarks
|
| 196 |
|
| 197 |
+
### Korean LLM Benchmark (KMMLU + HAE-RAE)
|
| 198 |
+
|
| 199 |
+
All models evaluated with **Q4_K_M quantization**, **0-shot**, using `lm-evaluation-harness` v0.4.9 + `llama.cpp` on Apple M1 Max 32GB.
|
| 200 |
+
|
| 201 |
+
#### KMMLU (Korean MMLU, 10 subjects)
|
| 202 |
+
|
| 203 |
+
| Subject | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
|
| 204 |
+
|---------|:-----------:|:--------------------:|:---------------:|
|
| 205 |
+
| Marketing | **75.7** | 72.5 | 75.6 |
|
| 206 |
+
| Computer Science | **73.7** | 69.7 | 69.7 |
|
| 207 |
+
| Management | 54.0 | 55.2 | **57.3** |
|
| 208 |
+
| Political Science | 49.0 | 49.3 | **56.0** |
|
| 209 |
+
| Economics | 45.4 | 47.7 | **51.5** |
|
| 210 |
+
| Law | 43.4 | 46.1 | **49.9** |
|
| 211 |
+
| Psychology | 39.2 | 39.3 | **45.7** |
|
| 212 |
+
| Accounting | 38.0 | 33.0 | **42.0** |
|
| 213 |
+
| Math | **33.0** | **33.7** | 27.7 |
|
| 214 |
+
| Korean History | **31.0** | 29.0 | 22.0 |
|
| 215 |
+
| **Average** | **48.2** | **47.6** | **49.7** |
|
| 216 |
+
|
| 217 |
+
#### HAE-RAE Bench (Korean-specific)
|
| 218 |
+
|
| 219 |
+
| Subtask | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
|
| 220 |
+
|---------|:-----------:|:--------------------:|:---------------:|
|
| 221 |
+
| Rare Words | 69.9 | 68.4 | **78.8** |
|
| 222 |
+
| Standard Nomenclature | 64.7 | 66.0 | **71.9** |
|
| 223 |
+
| Loan Words | 48.5 | 57.4 | **81.1** |
|
| 224 |
+
| History | 45.7 | 42.6 | **77.7** |
|
| 225 |
+
| General Knowledge | **44.3** | 42.1 | 44.3 |
|
| 226 |
+
| **Average** | **54.5** | **55.3** | **70.7** |
|
| 227 |
+
|
| 228 |
+
#### Key Findings
|
| 229 |
+
|
| 230 |
+
- **No catastrophic forgetting**: VELA maintains baseline Qwen2.5 capabilities (KMMLU avg 48.2% vs 47.6%) despite domain-specific fine-tuning
|
| 231 |
+
- **Domain transfer**: Finance-related subjects improved — Marketing (+3.2%), Computer Science (+4.0%), Accounting (+5.0%) vs base Qwen2.5
|
| 232 |
+
- **Competitive with Korean-native models**: VELA beats EXAONE-3.5-7.8B (LG AI Research) in 4/10 KMMLU subjects despite EXAONE being pre-trained with large-scale Korean corpus
|
| 233 |
+
|
| 234 |
### Quantization Benchmark (GGUF)
|
| 235 |
|
| 236 |
RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`
|
|
|
|
| 239 |
|--------|-------|--------------|---------|
|
| 240 |
| **Q4_K_M (v6)** | **36 tok/s** | 0/5 CLEAN | RT + Report OK |
|
| 241 |
|
| 242 |
+
> Stress test 5 runs: Synthesis + 3K Reasoning Trace alternating — **zero Chinese leak** in all runs
|
| 243 |
|
| 244 |
### MLX Benchmark (Apple Silicon)
|
| 245 |
|
| 246 |
+
M1 Max 32GB, MLX 4-bit quantization
|
| 247 |
|
| 248 |
| Config | Quantization | Load Time | Speed | Memory |
|
| 249 |
|--------|-------------|-----------|-------|--------|
|