---
language:
  - ko
  - en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
  - mamba2
  - hybrid
  - transformer
  - korean
  - from-scratch
  - dpo
  - slerp
  - orpo
  - nemotron-h
datasets:
  - heegyu/orca-math-korean-preference-cleaned
  - nayohan/preference-collection-ko-full
  - kuotient/orca-math-word-problems-193k-korean
  - FreedomIntelligence/alpaca-gpt4-korean
  - heegyu/orca_ko
  - HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
  - name: EVAFRILL-Mo-3B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: hellaswag
          name: HellaSwag (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 34.6
      - task:
          type: text-generation
        dataset:
          type: arc_easy
          name: ARC-Easy (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 32.0
      - task:
          type: text-generation
        dataset:
          type: belebele
          name: Belebele Korean (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 23.6
      - task:
          type: text-generation
        dataset:
          type: mmlu
          name: Global MMLU Korean (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 23.7
---

> [ํ•œ๊ตญ์–ด](#ํ•œ๊ตญ์–ด) | [English](#english)

---

# ํ•œ๊ตญ์–ด

## EVAFRILL-Mo 3B โ€” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer

### ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ

EVAFRILL-Mo 3B๋Š” NVIDIA [Nemotron-H](https://arxiv.org/abs/2504.03624) ์•„ํ‚คํ…์ฒ˜์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ **๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ์ง์ ‘ ๊ตฌํ˜„ํ•œ** 30์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

- 7ร— NVIDIA B200 GPU๋กœ 55B ํ† ํฐ ์‚ฌ์ „ํ•™์Šต (์•ฝ 60์‹œ๊ฐ„)
- ํ•œ๊ตญ์–ดยท์˜์–ดยท์ฝ”๋“œยท์ˆ˜ํ•™ ํ˜ผํ•ฉ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ
- SFT โ†’ DPO โ†’ SLERP ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๋‹จ์ผ ํ”„๋กœ์ ํŠธ์—์„œ ์ง์ ‘ ๊ตฌํ˜„
- ์™ธ๋ถ€ ํ”„๋ ˆ์ž„์›Œํฌ(Transformers Trainer, TRL) ์—†์ด PyTorch ๋„ค์ดํ‹ฐ๋ธŒ๋กœ ๊ตฌํ˜„

### ์•„ํ‚คํ…์ฒ˜

```
Type:           Hybrid Mamba-2 + Transformer
Parameters:     2.94B (2,975,397,632)
Layers:         26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model:        3,072
Vocabulary:     64,000 (custom SentencePiece)
Max seq length: 4,096
```

Mamba-2 SSM ๋ธ”๋ก์ด ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ , 2๊ฐœ์˜ GQA Attention ๋ธ”๋ก์ด ์ „์—ญ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค.
ํ‘œ์ค€ Transformer ๋Œ€๋น„ ์ถ”๋ก  ์‹œ KV ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํฌ๊ฒŒ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค.

### ๊ฐœ๋ฐœ ๋ฐฐ๊ฒฝ ๋ฐ ํžˆ์Šคํ† ๋ฆฌ

EVAFRILL-Mo๋Š” 6๋‹จ๊ณ„์˜ ๋ฐ˜๋ณต์  ์„ค๊ณ„ ๊ณผ์ •์„ ๊ฑฐ์ณ ํƒ„์ƒํ–ˆ์Šต๋‹ˆ๋‹ค:

1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” ์ˆœ์ˆ˜ Transformer decoder-only LLM์œผ๋กœ ์‹œ์ž‘ํ•œ ์ „์‹  ํ”„๋กœ์ ํŠธ. ํ•œ๊ตญ์–ด+์˜์–ด+์ฝ”๋“œ+์ˆ˜ํ•™ ๋ฐ์ดํ„ฐ๋กœ ์ปค์Šคํ…€ SentencePiece ํ† ํฌ๋‚˜์ด์ €(64K ์–ดํœ˜)๋ฅผ ํ•™์Šตํ•˜๊ณ , DDP ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ–ˆ์Šต๋‹ˆ๋‹ค.
2. **Nemotron-H ์˜๊ฐ** โ€” NVIDIA์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer ์„ค๊ณ„๋ฅผ ํ•ต์‹ฌ ์›์น™๋งŒ ์ถ”์ถœํ•˜์—ฌ(fragmentation) ์ œํ•œ๋œ ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์ถ•์†Œยท์ ์šฉ.
3. **์ฒด๊ณ„์  ๊ทœ๋ชจ ํƒ์ƒ‰** โ€” 5๊ฐœ ๊ทœ๋ชจ(1B~3B) ๋ชจ๋ธ์„ 7ร—B200์—์„œ ๋ฒค์น˜๋งˆํฌํ•˜์—ฌ Chinchilla-optimal ์ตœ๋Œ€ ๊ทœ๋ชจ(3B, 93% ๋‹ฌ์„ฑ) ๊ฒฐ์ •.
4. **1B โ†’ 3B ์ „ํ™˜** โ€” tok/s๊ฐ€ per-GPU ๊ฐ’์ž„์„ ๋ฐœ๊ฒฌํ•˜์—ฌ, 1B ๊ณผ์ž‰ํ•™์Šต(681%)์„ 3B ์ ์ •ํ•™์Šต(93%)์œผ๋กœ ์ „ํ™˜.
5. **3B ์‚ฌ์ „ํ•™์Šต** โ€” 319,772 steps, 55B tokens, 7ร—B200 FP8๋กœ 60์‹œ๊ฐ„ ์™„๋ฃŒ.
6. **Post-training** โ€” H100 MIG ํ™˜๊ฒฝ์—์„œ SFT โ†’ DPO โ†’ SLERP โ†’ ORPO ์‹คํ—˜๊นŒ์ง€ ์™„์ˆ˜.

### ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ•˜์ด๋ผ์ดํŠธ

| ๊ธฐ์ˆ  | ํšจ๊ณผ |
|------|------|
| **Chunked Cross-Entropy** | 64K ์–ดํœ˜์—์„œ logits ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ 1/8๋กœ ์ ˆ๊ฐ |
| **Mamba Memory Cliff ๋ฐœ๊ฒฌ** | batch 6โ†’7์—์„œ 47GBโ†’183GB+ ํญ์ฆ โ€” selective scan์˜ ๊ตฌ์กฐ์  ์ œ์•ฝ ๊ทœ๋ช… |
| **FP8 ๋„ค์ดํ‹ฐ๋ธŒ ํ•™์Šต** | TransformerEngine MXFP8BlockScaling์œผ๋กœ B200์—์„œ BF16 ๋Œ€๋น„ ~2๋ฐฐ ์ฒ˜๋ฆฌ๋Ÿ‰ |
| **LoRA B-zeroing** | DPO reference model์„ ๋ชจ๋ธ ๋ณต์ œ ์—†์ด LoRA B๋ฅผ ์ž„์‹œ 0์œผ๋กœ ๋งŒ๋“ค์–ด ๊ณ„์‚ฐ โ€” VRAM 50% ์ ˆ์•ฝ |
| **SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ** | SFT ์ง€์‹ ๋ณด์กด + DPO ์ •๋ ฌ์„ ๊ตฌ๋ฉด ๋ณด๊ฐ„์œผ๋กœ ๊ท ํ˜• โ€” alignment tax ์™„ํ™” |
| **Native DPO/ORPO** | TRL ๋ฏธ์‚ฌ์šฉ, ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๋ฅผ ์œ„ํ•ด ์ฒ˜์Œ๋ถ€ํ„ฐ PyTorch๋กœ ๊ตฌํ˜„ |

> ๐Ÿ“– **์ „์ฒด ๊ฐœ๋ฐœ ๊ณผ์ •, ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„ ๊ทผ๊ฑฐ, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ์ƒ์„ธ๋Š” [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.**

### ๋ชจ๋ธ ๋ฒ„์ „

์ด ์ €์žฅ์†Œ์—๋Š” ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ ๊ฐ ๋‹จ๊ณ„์˜ ์ฒดํฌํฌ์ธํŠธ **7์ข…**์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

| ๋ฒ„์ „ | ๋””๋ ‰ํ† ๋ฆฌ | ํฌ๊ธฐ | ์„ค๋ช… | ๊ถŒ์žฅ |
|------|----------|------|------|:----:|
| **SLERP** | `slerp/` | 6.3 GB | SFT + DPO R2 ๊ตฌ๋ฉด ์„ ํ˜• ๋ณด๊ฐ„ (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (319K ์Šคํ…, 55B ํ† ํฐ) | |
| SFT v2 | `sft-v2/` | 6.3 GB | ๋ช…๋ น์–ด ํŒŒ์ธํŠœ๋‹ (65K ์Šคํ…) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | ์„ ํ˜ธ๋„ ์ •๋ ฌ 1๋ผ์šด๋“œ (3K ์Šคํ…) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | ๋ณด์ˆ˜์  ํŒŒ์ธํŠœ๋‹ 2๋ผ์šด๋“œ (2K ์Šคํ…) | |
| ORPO | `orpo/` | 6.3 GB | SFT+์ •๋ ฌ ๋™์‹œ ํ•™์Šต ์‹คํ—˜ (10K ์Šคํ…) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | ๋ฐ˜๋ณต ์–ต์ œ ํŠนํ™” ์‹คํ—˜ (1K ์Šคํ…) | |

### ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ

```
Pretrain (55B tokens, 7ร—B200, 60h)
  โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5์ผ)
        โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
        โ”‚     โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ ์ตœ์ข… ๊ถŒ์žฅ
        โ””โ”€โ–บ ORPO (10K steps, ์‹คํ—˜)
              โ””โ”€โ–บ DPO R3 (1K steps, ๋ฐ˜๋ณต ํŠนํ™” ์‹คํ—˜)
```

๊ฐ ํ™”์‚ดํ‘œ๋Š” ๋…๋ฆฝ๋œ ์ฒดํฌํฌ์ธํŠธ๋กœ ์ €์žฅ๋˜์–ด, ์ž„์˜์˜ ๋‹จ๊ณ„๋ถ€ํ„ฐ ์žฌํ˜„ยท๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

### ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ

**ํ‰๊ฐ€ ๋Œ€์ƒ: SLERP ๋ชจ๋ธ** (0-shot, limit=500)

| ๋ฒค์น˜๋งˆํฌ | ์ •ํ™•๋„ |
|----------|:------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele ํ•œ๊ตญ์–ด | 23.6% |
| Global MMLU ํ•œ๊ตญ์–ด | 23.7% |

**๋ฐ˜๋ณต ์ƒ์„ฑ ์–ต์ œ** (greedy decoding ๊ธฐ์ค€)

| ์„ค์ • | 3-gram ๋ฐ˜๋ณต๋ฅ  |
|------|:-------------:|
| rep_penalty ์—†์Œ | 74.5% |
| rep_penalty=1.2 | **5.5%** |

๊ถŒ์žฅ ์ถ”๋ก  ํŒŒ๋ผ๋ฏธํ„ฐ: `temperature=0.7, repetition_penalty=1.2`

### DPO vs ORPO ๋น„๊ต

| ์ง€ํ‘œ | SLERP (SFTโ†’DPO) | ORPO | ์šฐ์„ธ |
|------|:---------------:|:----:|:----:|
| Greedy ๋ฐ˜๋ณต๋ฅ  | 74.5% | 87.1% | SLERP |
| ๋Œ€ํ™” ํ’ˆ์งˆ | ์ž์—ฐ์Šค๋Ÿฌ์›€ | ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์›€ | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| ํ•™์Šต ์‹œ๊ฐ„ | 5์ผ+8์‹œ๊ฐ„ | **12.8์‹œ๊ฐ„** | ORPO |

ORPO์˜ ์•ฝ์ : SFT 65K ์Šคํ… ๋Œ€๋น„ 10K ์Šคํ…๋งŒ ํ•™์Šต๋˜์–ด ๊ธฐ๋ฐ˜ ๋ช…๋ น์–ด ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.

### ์‚ฌ์šฉ๋ฒ•

> **GGUF/Ollama ๋ฏธ์ง€์›**: ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์•„ํ‚คํ…์ฒ˜๋กœ llama.cpp/GGUF/Ollama์™€ ํ˜ธํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. PyTorch ์ง์ ‘ ์ถ”๋ก ๋งŒ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

**์‚ฌ์ „ ์ค€๋น„:**

```bash
# 1. ์†Œ์Šค ์ฝ”๋“œ ํด๋ก  (์ปค์Šคํ…€ ์•„ํ‚คํ…์ฒ˜ ๋ชจ๋“ˆ ํ•„์š”)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. ์˜์กด์„ฑ ์„ค์น˜
pip install torch safetensors tokenizers PyYAML
```

**๋ฐฉ๋ฒ• 1: safetensors ์ง์ ‘ ๋กœ๋”ฉ (๊ถŒ์žฅ)**

```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # ์ด ์ €์žฅ์†Œ์˜ slerp/ ๋””๋ ‰ํ† ๋ฆฌ

# Config & ๋ชจ๋ธ ๋กœ๋“œ
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# ์ƒ์„ฑ (๊ถŒ์žฅ: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\n์ธ๊ณต์ง€๋Šฅ์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # ์ด์ „์— ์ƒ์„ฑ๋œ ๋ชจ๋“  ํ† ํฐ์— ๋ฐ˜๋ณต ํŽ˜๋„ํ‹ฐ(1.2) ์ ์šฉ
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        # temperature 0.7๋กœ ์ƒ˜ํ”Œ๋ง, ์ข…๋ฃŒ ํ† ํฐ์—์„œ ์ค‘๋‹จ
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```

**๋ฐฉ๋ฒ• 2: ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ ๋Ÿฌ๋„ˆ ์‚ฌ์šฉ**

[frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test)์˜ `evafrill_runner.py`๊ฐ€ ์œ„ ๊ณผ์ •์„ ๋ž˜ํ•‘ํ•ฉ๋‹ˆ๋‹ค:

```python
from eval_framework.evafrill_runner import generate, unload_model

result = generate("ํ•œ๊ตญ์–ด๋กœ ์ธ์‚ฌํ•ด์ฃผ์„ธ์š”.")
print(result["response"])
print(f"์†๋„: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```

> ์„ค์ • ๋ฐฉ๋ฒ•: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก ) ์ฐธ์กฐ

**์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ**: GPU VRAM 8GB+ (BF16), CPU ์ถ”๋ก  ๊ฐ€๋Šฅํ•˜์ง€๋งŒ ๊ทนํžˆ ๋А๋ฆผ (~0.5 TPS)

### ์žฌํ˜„ ์ž๋ฃŒ

| ๊ฒฝ๋กœ | ๋‚ด์šฉ |
|------|------|
| `data/combined_preference.jsonl` | ์„ ํ˜ธ๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ (684K ์Œ, 2.6 GB) |
| `data/repetition_preference.jsonl` | ๋ฐ˜๋ณต ์–ต์ œ ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ (105 ์Œ, ์ž๋™ ์ƒ์„ฑ) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT H100 MIG ์„ค์ • |
| `configs/dpo_3b_1gpu.yaml` | DPO ํ•™์Šต ์„ค์ • |
| `configs/orpo_3b_1gpu.yaml` | ORPO ํ•™์Šต ์„ค์ • |
| `scripts/dpo.py` | DPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/orpo_native.py` | ORPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/sft.py` | SFT ํ•™์Šต ์ฝ”๋“œ |
| `scripts/evafrill_eval.py` | ๋ฒค์น˜๋งˆํฌ ํ‰๊ฐ€ ์ฝ”๋“œ |
| `scripts/merge_checkpoints.py` | SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ |

### ์ œํ•œ์‚ฌํ•ญ

- **3B ๊ทœ๋ชจ ํ•œ๊ณ„**: ์‚ฌ์‹ค ์ •ํ™•๋„ยท๋ณต์žกํ•œ ์ถ”๋ก ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ ๋Œ€๋น„ ์„ฑ๋Šฅ์ด ๋‚ฎ์Šต๋‹ˆ๋‹ค.
- **GGUF/Ollama ๋ถˆ๊ฐ€**: ์ปค์Šคํ…€ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 ์•„ํ‚คํ…์ฒ˜๋กœ ํ‘œ์ค€ ๋ณ€ํ™˜ ํˆด์„ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
- **vLLM ์ œํ•œ์ **: ์ด๋ก ์ƒ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ปค์Šคํ…€ weight key ๋งคํ•‘์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
- **๋ฐ˜๋ณต ์ƒ์„ฑ**: greedy decoding ์‹œ ๋ฐ˜๋ณต๋ฅ ์ด ๋†’์œผ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ `repetition_penalty=1.2` ์ด์ƒ์„ ์„ค์ •ํ•˜์„ธ์š”.
- **์–ธ์–ด ํŽธ์ค‘**: ํ•œ๊ตญ์–ดยท์˜์–ด ์™ธ ์–ธ์–ด๋Š” ์„ฑ๋Šฅ์ด ๋ณด์žฅ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

### ๋งํฌ

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **์ด์ „ ํ”„๋กœ์ ํŠธ**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) โ€” ์ˆœ์ˆ˜ Transformer ๊ธฐ๋ฐ˜ ์ „์‹  ํ”„๋กœ์ ํŠธ
- **์ฐธ์กฐ ๋…ผ๋ฌธ**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

### ๋ผ์ด์„ ์Šค

MIT License โ€” ์ƒ์—…์  ์ด์šฉยท์ˆ˜์ •ยท์žฌ๋ฐฐํฌ ๋ชจ๋‘ ์ž์œ ๋กญ์Šต๋‹ˆ๋‹ค.

---

# English

## EVAFRILL-Mo 3B โ€” Hybrid Mamba-2 + Transformer

### Introduction

EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.

- Pretrained on 55B tokens using 7ร— NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT โ†’ DPO โ†’ SLERP pipeline implemented in pure PyTorch โ€” no Transformers Trainer or TRL
- Designed as a Korean-first model, with English as its secondary language

### Architecture

```
Type:           Hybrid Mamba-2 + Transformer
Parameters:     2.94B (2,975,397,632)
Layers:         26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model:        3,072
Vocabulary:     64,000 (custom SentencePiece)
Max seq length: 4,096
```

Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context.
Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
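
The KV-cache saving is easy to quantify with a back-of-envelope calculation. The head counts below are illustrative assumptions (the card publishes only d_model=3,072 and the 24+2 layer split), so treat the absolute numbers as a sketch; the ratio between the two layouts is the point.

```python
# Back-of-envelope KV-cache comparison (BF16, batch 1, 4,096 tokens).
# n_kv_heads=8 and head_dim=128 are assumed values for illustration.
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per token per attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full_transformer = kv_cache_bytes(n_attn_layers=26, n_kv_heads=8, head_dim=128, seq_len=4096)
hybrid = kv_cache_bytes(n_attn_layers=2, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"26-layer attention: {full_transformer / 2**20:.0f} MiB")
print(f" 2-layer attention: {hybrid / 2**20:.0f} MiB ({full_transformer // hybrid}x smaller)")
```

With only 2 of 26 layers carrying a KV cache, the cache shrinks by the layer ratio (13×) regardless of the exact head configuration.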

### Development Background & History

EVAFRILL-Mo was built through 6 iterative design stages:

1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” Predecessor project starting as a pure Transformer decoder-only LLM. Built custom SentencePiece tokenizer (64K vocab) on Korean+English+code+math data and established DDP training pipeline.
2. **Nemotron-H Inspiration** โ€” Extracted core design principles from NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled down for constrained hardware.
3. **Systematic Scale Search** — Benchmarked 5 model sizes (1B–3B) on 7×B200 to find the largest size trainable near the Chinchilla-optimal token budget (3B, reaching 93% of it).
4. **1B → 3B Transition** — Discovered that the measured tok/s was a per-GPU figure, so aggregate throughput supported a larger model; redirected from over-training 1B (681% of its Chinchilla-optimal budget) to training 3B near-optimally (93%).
5. **3B Pretraining** โ€” 319,772 steps, 55B tokens, 60 hours on 7ร—B200 with FP8.
6. **Post-training** โ€” SFT โ†’ DPO โ†’ SLERP โ†’ ORPO experiments on H100 MIG.

### Key Technical Highlights

| Technique | Impact |
|-----------|--------|
| **Chunked Cross-Entropy** | Reduces logits memory by 8ร— for 64K vocabulary |
| **Mamba Memory Cliff Discovery** | Batch 6โ†’7 causes 47GBโ†’183GB+ explosion โ€” structural limitation of selective scan |
| **FP8 Native Training** | TransformerEngine MXFP8BlockScaling delivers ~2ร— throughput vs BF16 on B200 |
| **LoRA B-zeroing** | Computes DPO reference logprobs without model duplication โ€” 50% VRAM savings |
| **SLERP Checkpoint Merging** | Balances SFT knowledge + DPO alignment via spherical interpolation โ€” mitigates alignment tax |
| **Native DPO/ORPO** | No TRL dependency โ€” implemented from scratch in PyTorch for custom Mamba-2 hybrid |

> ๐Ÿ“– **For the complete development journey, architecture design rationale, and hardware optimization details, see the [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo).**

### Model Variants

This repository contains **7 checkpoints** representing each stage of the training pipeline.

| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |

### Training Pipeline

```
Pretrain (55B tokens, 7ร—B200, 60h)
  โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5 days)
        โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
        โ”‚     โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ Final Recommended
        โ””โ”€โ–บ ORPO (10K steps, experimental)
              โ””โ”€โ–บ DPO R3 (1K steps, repetition experiment)
```

Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
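
The SLERP merge step amounts to per-tensor spherical interpolation between the SFT and DPO R2 state dicts. The following is a minimal sketch; the repo's `scripts/merge_checkpoints.py` may differ in details such as key filtering and dtype handling.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, alpha: float = 0.5, eps: float = 1e-8):
    """Spherical linear interpolation between two weight tensors, each
    treated as one flat vector. Falls back to plain lerp when the
    vectors are nearly colinear (omega ~ 0 makes sin(omega) unstable)."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    cos = torch.dot(a_f, b_f) / (a_f.norm() * b_f.norm() + eps)
    omega = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    if omega.abs() < 1e-3:                       # nearly parallel: plain lerp
        merged = (1 - alpha) * a_f + alpha * b_f
    else:
        merged = (torch.sin((1 - alpha) * omega) * a_f
                  + torch.sin(alpha * omega) * b_f) / torch.sin(omega)
    return merged.reshape(a.shape).to(a.dtype)

def merge_state_dicts(sft_sd, dpo_sd, alpha=0.5):
    # Merge every shared tensor; keys are assumed identical across checkpoints.
    return {k: slerp(sft_sd[k], dpo_sd[k], alpha) for k in sft_sd}

# The midpoint of two orthogonal unit vectors stays on the unit sphere.
print(slerp(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])))
```

At α=0.5 the merged weights sit halfway along the great-circle arc between the SFT and DPO R2 checkpoints, which the pipeline diagram above marks as the final recommended model.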

### Benchmark Results

**Evaluated on: SLERP model** (0-shot, limit=500)

| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |

**Repetition suppression** (greedy decoding)

| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |

Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
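
The 3-gram repetition rate above can be measured as the fraction of trigrams in the output that duplicate an earlier trigram. The card does not specify its exact formula; this is one plausible definition.

```python
def trigram_repetition_rate(token_ids):
    """Fraction of 3-grams that repeat an earlier 3-gram in the same
    output. 0.0 means no trigram ever recurs; values near 1.0 indicate
    a degenerate repetition loop."""
    trigrams = [tuple(token_ids[i:i + 3]) for i in range(len(token_ids) - 2)]
    if not trigrams:
        return 0.0
    return 1.0 - len(set(trigrams)) / len(trigrams)

# A degenerate loop is almost all repeats; non-repeating text scores 0.
print(trigram_repetition_rate([1, 2, 3] * 20))   # ~0.95
print(trigram_repetition_rate(list(range(60))))  # 0.0
```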

### DPO vs ORPO Comparison

| Metric | SLERP (SFTโ†’DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |

ORPO's weakness: only 10K steps of training vs SFT's 65K โ€” insufficient base instruction-following before alignment kicks in.
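
For context, the two objectives being compared can be written compactly. This is a sketch following the original papers (DPO: Rafailov et al., 2023; ORPO: Hong et al., 2024), not the repository's native implementation; the `beta` and `lam` defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: the implicit reward is the policy/reference log-ratio; the loss
    pushes the chosen-minus-rejected reward margin to be positive."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def orpo_odds_ratio_loss(chosen_logp, rejected_logp, lam=0.1):
    """ORPO's odds-ratio penalty, added to the plain SFT loss on the chosen
    response. Inputs are length-averaged per-token log-probs (strictly < 0)."""
    log_odds = (chosen_logp - rejected_logp) - (
        torch.log1p(-torch.exp(chosen_logp)) - torch.log1p(-torch.exp(rejected_logp)))
    return -lam * F.logsigmoid(log_odds).mean()

# A zero preference margin yields the maximum-uncertainty DPO loss, log 2.
t = torch.tensor([-1.0])
print(dpo_loss(t, t, t, t))
```

DPO needs the four reference log-probs on the right; with LoRA adapters these can come from the same model by temporarily zeroing the LoRA B matrices, so the base weights act as the frozen reference policy. That is the 50% VRAM saving listed in the highlights table. ORPO needs no reference model at all, which explains its much shorter training time.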

### Usage

> **GGUF/Ollama not supported**: Custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. PyTorch direct inference only.

**Prerequisites:**

```bash
# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML
```

**Method 1: Direct safetensors loading (recommended)**

```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# Generate (recommended: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Apply repetition penalty (1.2) to every previously generated token
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        # Sample with temperature 0.7; stop at the end-of-sequence token
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```

**Method 2: Evaluation framework runner**

The `evafrill_runner.py` in [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps the above into a simple API:

```python
from eval_framework.evafrill_runner import generate, unload_model

result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```

> Setup instructions: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก )

**System requirements**: GPU VRAM 8GB+ (BF16), CPU inference possible but extremely slow (~0.5 TPS)

### Reproducibility

| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |

### Limitations

- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported โ€” custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible but requires custom weight key mapping.
- **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty` โ€” always use `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.

### Links

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Predecessor**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) | [๐Ÿค— HuggingFace](https://huggingface.co/pathcosmos/frankenstallm) โ€” Pure Transformer predecessor project
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

### Acknowledgment / ๊ฐ์‚ฌ์˜ ๊ธ€

์ด ํ”„๋กœ์ ํŠธ๋Š” **๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€**์˜ **ใ€Œ์ฒจ๋‹จ GPU ํ™œ์šฉ ์ง€์› ์‚ฌ์—…ใ€** (๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ๊ณต๊ณ  ์ œ2025-1068ํ˜ธ)์„ ํ†ตํ•ด ์ œ๊ณต๋œ GPU ์ปดํ“จํŒ… ์ž์›์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

> **๊ตญ๊ฐ€ AI์ปดํ“จํŒ…์ž์› ์ง€์›ํฌํ„ธ**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - ์ฃผ๊ด€: ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ (MSIT), ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์› (NIPA)
> - ์šด์˜: ํ•œ๊ตญ์ •๋ณดํ†ต์‹ ์ง„ํฅํ˜‘ํšŒ (KAIT)

๋Œ€ํ•œ๋ฏผ๊ตญ ์ •๋ถ€์˜ AI ์ธํ”„๋ผ ์ง€์› ์‚ฌ์—… ๋•๋ถ„์— 7ร— NVIDIA B200 GPU ํ™˜๊ฒฝ์—์„œ ํ•œ๊ตญ์–ด 3B ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-Transformer ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตญ๊ฐ€ ์ฐจ์›์˜ AI ์ปดํ“จํŒ… ์ž์› ์ง€์›์— ๊นŠ์ด ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

This project was conducted using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) by the **Ministry of Science and ICT (MSIT)** of the Republic of Korea.

> **National AI Computing Resource Support Portal**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)

We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7ร— NVIDIA B200 GPUs.

---

### License

MIT License โ€” free to use, modify, and distribute commercially.