Upload eval_reports/2026-03-09_GGUF_DEPLOYMENT_AND_EVAL_REPORT.md with huggingface_hub

#3
by somebody-to-love - opened
eval_reports/2026-03-09_GGUF_DEPLOYMENT_AND_EVAL_REPORT.md ADDED
@@ -0,0 +1,127 @@
+ # FRANKENSTALLM 3B v2: GGUF Conversion, Deployment, and Ollama Evaluation Report
+
+ - **Date**: 2026-03-09
+ - **Scope**: checkpoint with the byte-fallback fix applied → GGUF conversion → Ollama deployment → benchmark
+
+ ---
+
+ ## 1. Summary
+
+ | Item | Details |
+ |------|---------|
+ | **Cause** | `byte_fallback` was not enabled in the SentencePiece Unigram tokenizer → llama.cpp crashed on characters missing from the vocabulary, such as `\n` |
+ | **Fix** | Added 256 byte-fallback tokens, resized embeddings 64000 → 64256, re-converted to GGUF, quantized to Q4_K_M |
+ | **Deployment** | Ollama model `frankenstallm-3b-v2:latest` (792 MB, Q4_K_M) |
+ | **Newline verification** | ✅ Prompts containing `\n` are processed without crashing |
+ | **Ollama benchmark** | 35 tests, auto-scored average 46.7, average TPS 142.5, average TTFT 16.7 ms |
+
+ ---
+
+ ## 2. Pipeline Steps
+
+ ### 2.1 Tokenizer and Embedding Fix
+
+ - **Script**: `scripts/fix_tokenizer_byte_fallback.py`
+ - **Input**: `outputs/hf_checkpoint-best`
+ - **Output**: `outputs/hf_checkpoint-best-fixed`
+ - **Changes**:
+   - `tokenizer.json`: set `byte_fallback=True` and added the 256 tokens `<0x00>` through `<0xFF>`
+   - `config.json`: `vocab_size` 64000 → 64256
+   - Resized the embedding layer, initialized the new tokens, and saved the weights as safetensors
+
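The vocabulary arithmetic and the byte-token naming described in 2.1 can be sketched in plain Python. This is only an illustration and does not reproduce `scripts/fix_tokenizer_byte_fallback.py`: the toy vocabulary, the helper names, and the assumption that the byte tokens take the IDs directly after the original 64000 entries are all hypothetical.

```python
# Sketch only: shows how 256 byte tokens extend a 64000-entry vocabulary and
# how a character missing from the vocabulary falls back to its UTF-8 bytes.

BASE_VOCAB_SIZE = 64000  # vocab_size before the fix, as stated in the report


def byte_fallback_tokens() -> list[str]:
    """Return the 256 byte tokens <0x00> .. <0xFF> in byte order."""
    return [f"<0x{b:02X}>" for b in range(256)]


def encode_with_fallback(text: str, vocab: dict[str, int]) -> list[int]:
    """Toy per-character encoder: characters in the vocabulary map directly,
    anything else is decomposed into one byte token per UTF-8 byte."""
    ids: list[int] = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            ids.extend(vocab[f"<0x{b:02X}>"] for b in ch.encode("utf-8"))
    return ids


# Hypothetical toy vocabulary: two ordinary pieces plus the byte tokens,
# which are assumed to be appended after the original entries.
vocab = {"a": 0, "b": 1}
for offset, token in enumerate(byte_fallback_tokens()):
    vocab[token] = BASE_VOCAB_SIZE + offset

new_vocab_size = BASE_VOCAB_SIZE + len(byte_fallback_tokens())
newline_ids = encode_with_fallback("a\nb", vocab)  # "\n" is not in the toy vocab
```

Under these assumptions the newline becomes the single byte token `<0x0A>` instead of an unknown piece, which is the case that previously crashed llama.cpp.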
+ ### 2.2 GGUF Conversion and Quantization
+
+ - **F16 GGUF**: `outputs/llama.cpp/convert_hf_to_gguf.py`
+   `outputs/hf_checkpoint-best-fixed` → `outputs/gguf/frankenstallm-3b-v2-f16.gguf`
+ - **Q4_K_M quantization**: `outputs/llama.cpp/build/bin/llama-quantize`
+   → `outputs/gguf/frankenstallm-3b-v2-Q4_K_M.gguf` (about 792 MB)
+
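The report names the two tools and their paths but not the exact invocations. A minimal sketch of plausible command lines follows, assuming the `--outfile` and `--outtype` options of `convert_hf_to_gguf.py` and the positional `input output type` form of `llama-quantize`; the helper names are hypothetical.

```python
import shlex

# Paths taken from the report.
HF_DIR = "outputs/hf_checkpoint-best-fixed"
F16_GGUF = "outputs/gguf/frankenstallm-3b-v2-f16.gguf"
Q4_GGUF = "outputs/gguf/frankenstallm-3b-v2-Q4_K_M.gguf"


def convert_cmd() -> list[str]:
    """Argument list for the HF checkpoint -> F16 GGUF conversion step."""
    return [
        "python", "outputs/llama.cpp/convert_hf_to_gguf.py", HF_DIR,
        "--outfile", F16_GGUF,
        "--outtype", "f16",
    ]


def quantize_cmd() -> list[str]:
    """Argument list for the F16 GGUF -> Q4_K_M quantization step."""
    return ["outputs/llama.cpp/build/bin/llama-quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"]


if __name__ == "__main__":
    # Print the commands rather than running them; pass each list to
    # subprocess.run(cmd, check=True) to execute.
    for cmd in (convert_cmd(), quantize_cmd()):
        print(shlex.join(cmd))
```

Building argument lists instead of shell strings keeps the paths free of quoting problems when they are later handed to `subprocess.run`.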
+ ### 2.3 Ollama Deployment
+
+ - **Modelfile**: point `FROM` at the local GGUF path, then run `ollama create`
+ - **Model name**: `frankenstallm-3b-v2:latest`
+
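The deployment step can be sketched as follows. The report only says that the Modelfile uses `FROM` with a local GGUF path; whether it also sets `TEMPLATE` or `PARAMETER` lines is not stated, so this sketch assumes a single-line Modelfile, and the `Modelfile` filename and helper names are hypothetical.

```python
import shlex

Q4_GGUF = "outputs/gguf/frankenstallm-3b-v2-Q4_K_M.gguf"  # path from the report
MODEL_NAME = "frankenstallm-3b-v2"  # Ollama adds the ":latest" tag by default


def modelfile_text(gguf_path: str) -> str:
    """Minimal Modelfile: only the FROM line pointing at the local GGUF."""
    return f"FROM {gguf_path}\n"


def create_cmd(model_name: str, modelfile_path: str = "Modelfile") -> list[str]:
    """Argument list for `ollama create <name> -f <Modelfile>`."""
    return ["ollama", "create", model_name, "-f", modelfile_path]


if __name__ == "__main__":
    # Show what would be written and run; nothing is executed here.
    print(modelfile_text(Q4_GGUF), end="")
    print(shlex.join(create_cmd(MODEL_NAME)))
```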
+ ### 2.4 Newline Test
+
+ - **Method**: sent the prompt `"첫 줄\n두 번째 줄\n세 번째 줄이라고 말해줘."` ("First line\nSecond line\nSay 'third line'.") through the Ollama API
+ - **Result**: HTTP 200, `done: true`, no crash → byte-fallback fix verified
+
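A check like the one in 2.4 could look roughly like this. The report does not say which endpoint or client was used, so the sketch assumes Ollama's `/api/generate` endpoint on the default local port with streaming disabled; `build_request` and `run_newline_check` are hypothetical names.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "frankenstallm-3b-v2:latest"

# Prompt from the report; it contains two literal newline characters.
PROMPT = "첫 줄\n두 번째 줄\n세 번째 줄이라고 말해줘."


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request carrying the prompt as JSON."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_newline_check(prompt: str = PROMPT) -> bool:
    """Return True if the server answers HTTP 200 with `done: true`."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        result = json.loads(resp.read())
        return resp.status == 200 and result.get("done") is True
```

Calling `run_newline_check()` requires a running Ollama instance with the model created; a crash in the tokenizer would surface as a connection error or a non-200 response rather than a `True` return value.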
+ ---
+
+ ## 3. Ollama Benchmark Results (frankenstallm-3b-v2)
+
+ - **Command**: `python eval/ollama_benchmark.py --models frankenstallm-3b-v2 --output-dir eval/results/frankenstallm-3b-v2`
+ - **Run time**: 2026-03-09 23:24:22
+ - **Total tests**: 35 (20 auto-scored + 15 manual review)
+
+ ### 3.1 Overall Auto-Scored Average
+
+ | Model | Auto Avg |
+ |------|----------|
+ | frankenstallm-3b-v2 | **46.7** |
+
+ ### 3.2 Scores by Category (Auto/Manual)
+
+ | Category | Score | Notes |
+ |----------|------|------|
+ | korean_nlu | 100.0 | 3 auto / 2 manual |
+ | korean_generation | manual | 5 manual |
+ | reasoning | 50.0 | 4 auto / 1 manual |
+ | knowledge | 75.0 | 4 auto / 1 manual |
+ | code | 0.0 | 3 auto |
+ | safety | 10.0 | 2 auto / 1 manual |
+ | instruction_following | 66.7 | 3 auto |
+ | multilingual | manual | 3 manual |
+ | repetition_resistance | 2.2 | 3 auto (high repetition rate) |
+
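One reading that is consistent with these numbers: weighting each category score by its auto-scored test count reproduces the 46.7 overall average from 3.1, whereas a plain mean of the seven category scores gives about 43.4. Whether `eval/ollama_benchmark.py` computes the average this way is an assumption. The per-category counts in the table add up to 22 auto-scored and 13 manual tests, which differs from the 20 + 15 split stated at the top of Section 3; the sketch uses the table's counts.

```python
# (category, score, number of auto-scored tests), copied from the table above.
AUTO_SCORES = [
    ("korean_nlu", 100.0, 3),
    ("reasoning", 50.0, 4),
    ("knowledge", 75.0, 4),
    ("code", 0.0, 3),
    ("safety", 10.0, 2),
    ("instruction_following", 66.7, 3),
    ("repetition_resistance", 2.2, 3),
]


def weighted_average(rows) -> float:
    """Average of category scores weighted by auto-scored test count."""
    total_tests = sum(count for _, _, count in rows)
    return sum(score * count for _, score, count in rows) / total_tests


def plain_average(rows) -> float:
    """Unweighted mean of the category scores, for comparison."""
    return sum(score for _, score, _ in rows) / len(rows)


auto_tests = sum(count for _, _, count in AUTO_SCORES)
auto_avg = round(weighted_average(AUTO_SCORES), 1)
unweighted_avg = round(plain_average(AUTO_SCORES), 1)
```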
+ ### 3.3 Latency
+
+ | Metric | Value |
+ |------|-----|
+ | Avg TTFT (ms) | 16.7 |
+ | P50 TTFT (ms) | 15.8 |
+ | P95 TTFT (ms) | 26.2 |
+ | Avg TPS | 142.5 |
+ | P50 TPS | 142.7 |
+ | P95 TPS | 143.3 |
+
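To put these figures in terms of response time, a rough back-of-the-envelope model adds the time to first token to the decode time at a steady token rate. This is a simplification for illustration only: it assumes a constant rate at the reported averages, and the function names are hypothetical.

```python
AVG_TTFT_MS = 16.7  # average time to first token, from the table above
AVG_TPS = 142.5     # average tokens per second, from the table above


def per_token_ms(tps: float) -> float:
    """Milliseconds spent per generated token at a steady rate."""
    return 1000.0 / tps


def estimated_latency_ms(num_tokens: int,
                         ttft_ms: float = AVG_TTFT_MS,
                         tps: float = AVG_TPS) -> float:
    """TTFT plus the decode time for num_tokens tokens."""
    return ttft_ms + num_tokens * per_token_ms(tps)
```

At 142.5 TPS each token takes about 7.0 ms, so under this model a 256-token response would take roughly 1.8 seconds end to end.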
+ ### 3.4 Repetition Rate Details (repetition_resistance)
+
+ | Test ID | Rep Rate | Unique/Total N-grams | Score |
+ |---------|----------|----------------------|-------|
+ | rep_01 | 73.76% | 122/465 | 0.0 |
+ | rep_02 | 59.72% | 255/633 | 0.0 |
+ | rep_03 | 46.70% | 226/424 | 6.6 |
+
+ - **Original ORPO evaluation** (HF checkpoint, greedy decoding): 3-gram repetition rate 30.89%, EOS 67%.
+   With Ollama Q4_K_M and the benchmark prompts, repetition is more pronounced.
+
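The repetition rates in the table match `1 - unique / total` over the n-gram counts. The sketch below recomputes them and shows one way such counts could be obtained; the n-gram size of 3 and whitespace tokenization are assumptions (the benchmark script's settings are not given in the report), and the function names are hypothetical.

```python
def repetition_rate(unique_ngrams: int, total_ngrams: int) -> float:
    """Fraction of n-grams that repeat an earlier n-gram."""
    return 1.0 - unique_ngrams / total_ngrams


def count_ngrams(tokens: list[str], n: int = 3) -> tuple[int, int]:
    """Return (unique, total) n-gram counts for a token sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)), len(grams)


# (unique, total) n-gram counts copied from the table above.
ROWS = {"rep_01": (122, 465), "rep_02": (255, 633), "rep_03": (226, 424)}

rates_percent = {
    test_id: round(100 * repetition_rate(unique, total), 2)
    for test_id, (unique, total) in ROWS.items()
}

# Toy example: "a b c a b c" has four 3-grams, three of them distinct.
toy_unique, toy_total = count_ngrams("a b c a b c".split())
```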
+ ### 3.5 Result File Locations
+
+ - **JSON**: `eval/results/frankenstallm-3b-v2/ollama_benchmark_results.json`
+ - **Summary MD**: `eval/results/frankenstallm-3b-v2/ollama_benchmark_summary.md`
+
+ ---
+
+ ## 4. Relationship to the Existing ORPO Evaluation
+
+ - **ORPO comprehensive report**: `reports/2026-03-09_ORPO_EVALUATION_REPORT.md`
+ - **Quantitative score**: 63.7/100, 7 of 10 dimensions passed, final verdict **RETRY**
+ - The **v2 release** is the same ORPO checkpoint with only the byte-fallback fix and GGUF conversion applied,
+   so the ORPO metrics (e.g., preference accuracy, reward margin) in the existing report still apply to it.
+
+ ---
+
+ ## 5. Artifact Paths
+
+ | Purpose | Path |
+ |------|------|
+ | Fixed HF checkpoint | `outputs/hf_checkpoint-best-fixed/` |
+ | F16 GGUF | `outputs/gguf/frankenstallm-3b-v2-f16.gguf` |
+ | Q4_K_M GGUF (for Ollama deployment) | `outputs/gguf/frankenstallm-3b-v2-Q4_K_M.gguf` |
+ | Ollama benchmark results | `eval/results/frankenstallm-3b-v2/` |
+ | Byte-fallback fix script | `scripts/fix_tokenizer_byte_fallback.py` |
+
+ ---
+
+ *This report summarizes the GGUF conversion, the Ollama deployment, and the Ollama benchmark results.*