---
language:
- ko
- en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba2
- hybrid
- transformer
- korean
- from-scratch
- dpo
- slerp
- orpo
- nemotron-h
datasets:
- heegyu/orca-math-korean-preference-cleaned
- nayohan/preference-collection-ko-full
- kuotient/orca-math-word-problems-193k-korean
- FreedomIntelligence/alpaca-gpt4-korean
- heegyu/orca_ko
- HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
- name: EVAFRILL-Mo-3B
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: hellaswag
name: HellaSwag (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 34.6
- task:
type: text-generation
dataset:
type: arc_easy
name: ARC-Easy (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 32.0
- task:
type: text-generation
dataset:
type: belebele
name: Belebele Korean (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 23.6
- task:
type: text-generation
dataset:
type: mmlu
name: Global MMLU Korean (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 23.7
---
> [ํ•œ๊ตญ์–ด](#ํ•œ๊ตญ์–ด) | [English](#english)
---
# ํ•œ๊ตญ์–ด
## EVAFRILL-Mo 3B โ€” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer
### ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ
EVAFRILL-Mo 3B๋Š” NVIDIA [Nemotron-H](https://arxiv.org/abs/2504.03624) ์•„ํ‚คํ…์ฒ˜์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ **๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ์ง์ ‘ ๊ตฌํ˜„ํ•œ** 30์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
- 7ร— NVIDIA B200 GPU๋กœ 55B ํ† ํฐ ์‚ฌ์ „ํ•™์Šต (์•ฝ 60์‹œ๊ฐ„)
- ํ•œ๊ตญ์–ดยท์˜์–ดยท์ฝ”๋“œยท์ˆ˜ํ•™ ํ˜ผํ•ฉ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ
- SFT โ†’ DPO โ†’ SLERP ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๋‹จ์ผ ํ”„๋กœ์ ํŠธ์—์„œ ์ง์ ‘ ๊ตฌํ˜„
- ์™ธ๋ถ€ ํ”„๋ ˆ์ž„์›Œํฌ(Transformers Trainer, TRL) ์—†์ด PyTorch ๋„ค์ดํ‹ฐ๋ธŒ๋กœ ๊ตฌํ˜„
### ์•„ํ‚คํ…์ฒ˜
```
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.98B (2,975,397,632)
Layers: 26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
```
Mamba-2 SSM ๋ธ”๋ก์ด ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ , 2๊ฐœ์˜ GQA Attention ๋ธ”๋ก์ด ์ „์—ญ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค.
ํ‘œ์ค€ Transformer ๋Œ€๋น„ ์ถ”๋ก  ์‹œ KV ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํฌ๊ฒŒ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค.
### ๊ฐœ๋ฐœ ๋ฐฐ๊ฒฝ ๋ฐ ํžˆ์Šคํ† ๋ฆฌ
EVAFRILL-Mo๋Š” 6๋‹จ๊ณ„์˜ ๋ฐ˜๋ณต์  ์„ค๊ณ„ ๊ณผ์ •์„ ๊ฑฐ์ณ ํƒ„์ƒํ–ˆ์Šต๋‹ˆ๋‹ค:
1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” ์ˆœ์ˆ˜ Transformer decoder-only LLM์œผ๋กœ ์‹œ์ž‘ํ•œ ์ „์‹  ํ”„๋กœ์ ํŠธ. ํ•œ๊ตญ์–ด+์˜์–ด+์ฝ”๋“œ+์ˆ˜ํ•™ ๋ฐ์ดํ„ฐ๋กœ ์ปค์Šคํ…€ SentencePiece ํ† ํฌ๋‚˜์ด์ €(64K ์–ดํœ˜)๋ฅผ ํ•™์Šตํ•˜๊ณ , DDP ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ–ˆ์Šต๋‹ˆ๋‹ค.
2. **Nemotron-H ์˜๊ฐ** โ€” NVIDIA์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer ์„ค๊ณ„๋ฅผ ํ•ต์‹ฌ ์›์น™๋งŒ ์ถ”์ถœํ•˜์—ฌ(fragmentation) ์ œํ•œ๋œ ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์ถ•์†Œยท์ ์šฉ.
3. **์ฒด๊ณ„์  ๊ทœ๋ชจ ํƒ์ƒ‰** โ€” 5๊ฐœ ๊ทœ๋ชจ(1B~3B) ๋ชจ๋ธ์„ 7ร—B200์—์„œ ๋ฒค์น˜๋งˆํฌํ•˜์—ฌ Chinchilla-optimal ์ตœ๋Œ€ ๊ทœ๋ชจ(3B, 93% ๋‹ฌ์„ฑ) ๊ฒฐ์ •.
4. **1B โ†’ 3B ์ „ํ™˜** โ€” tok/s๊ฐ€ per-GPU ๊ฐ’์ž„์„ ๋ฐœ๊ฒฌํ•˜์—ฌ, 1B ๊ณผ์ž‰ํ•™์Šต(681%)์„ 3B ์ ์ •ํ•™์Šต(93%)์œผ๋กœ ์ „ํ™˜.
5. **3B ์‚ฌ์ „ํ•™์Šต** โ€” 319,772 steps, 55B tokens, 7ร—B200 FP8๋กœ 60์‹œ๊ฐ„ ์™„๋ฃŒ.
6. **Post-training** โ€” H100 MIG ํ™˜๊ฒฝ์—์„œ SFT โ†’ DPO โ†’ SLERP โ†’ ORPO ์‹คํ—˜๊นŒ์ง€ ์™„์ˆ˜.
### ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ•˜์ด๋ผ์ดํŠธ
| ๊ธฐ์ˆ  | ํšจ๊ณผ |
|------|------|
| **Chunked Cross-Entropy** | 64K ์–ดํœ˜์—์„œ logits ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ 1/8๋กœ ์ ˆ๊ฐ |
| **Mamba Memory Cliff ๋ฐœ๊ฒฌ** | batch 6โ†’7์—์„œ 47GBโ†’183GB+ ํญ์ฆ โ€” selective scan์˜ ๊ตฌ์กฐ์  ์ œ์•ฝ ๊ทœ๋ช… |
| **FP8 ๋„ค์ดํ‹ฐ๋ธŒ ํ•™์Šต** | TransformerEngine MXFP8BlockScaling์œผ๋กœ B200์—์„œ BF16 ๋Œ€๋น„ ~2๋ฐฐ ์ฒ˜๋ฆฌ๋Ÿ‰ |
| **LoRA B-zeroing** | DPO reference model์„ ๋ชจ๋ธ ๋ณต์ œ ์—†์ด LoRA B๋ฅผ ์ž„์‹œ 0์œผ๋กœ ๋งŒ๋“ค์–ด ๊ณ„์‚ฐ โ€” VRAM 50% ์ ˆ์•ฝ |
| **SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ** | SFT ์ง€์‹ ๋ณด์กด + DPO ์ •๋ ฌ์„ ๊ตฌ๋ฉด ๋ณด๊ฐ„์œผ๋กœ ๊ท ํ˜• โ€” alignment tax ์™„ํ™” |
| **Native DPO/ORPO** | TRL ๋ฏธ์‚ฌ์šฉ, ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๋ฅผ ์œ„ํ•ด ์ฒ˜์Œ๋ถ€ํ„ฐ PyTorch๋กœ ๊ตฌํ˜„ |
> ๐Ÿ“– **์ „์ฒด ๊ฐœ๋ฐœ ๊ณผ์ •, ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„ ๊ทผ๊ฑฐ, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ์ƒ์„ธ๋Š” [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.**
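์œ„ ํ‘œ์˜ Chunked Cross-Entropy ์•„์ด๋””์–ด๋Š” ๋ช‡ ์ค„๋กœ ์Šค์ผ€์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ €์žฅ์†Œ์˜ ์‹ค์ œ ๊ตฌํ˜„์ด ์•„๋‹ˆ๋ผ ์›๋ฆฌ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ์„ค๋ช…์šฉ ์˜ˆ์‹œ์ด๋ฉฐ, ํ•จ์ˆ˜๋ช…๊ณผ chunk ํฌ๊ธฐ๋Š” ๊ฐ€์ •์ž…๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ (ํ† ํฐ ์ˆ˜ ร— 64K ์–ดํœ˜) ํฌ๊ธฐ์˜ logits ํ…์„œ๋ฅผ ํ•œ ๋ฒˆ์— ๋งŒ๋“ค์ง€ ์•Š๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """hidden(ํ† ํฐร—d_model)์„ chunk ๋‹จ์œ„๋กœ๋งŒ vocab logits๋กœ ์‚ฌ์˜ํ•˜์—ฌ
    ์ „์ฒด logits ํ…์„œ๋ฅผ ํ•œ ๋ฒˆ์— ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ฆฌ์ง€ ์•Š๋Š”๋‹ค."""
    losses = []
    for i in range(0, hidden.size(0), chunk_size):
        logits = hidden[i:i + chunk_size] @ weight.t()  # (chunk, vocab)
        losses.append(F.cross_entropy(
            logits.float(), targets[i:i + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / targets.numel()

# ์žฅ๋‚œ๊ฐ ํฌ๊ธฐ ๊ฒ€์ฆ: chunk ๋ถ„ํ•  ๊ฒฐ๊ณผ๊ฐ€ ์ผ๋ฐ˜ cross-entropy์™€ ์ผ์น˜
torch.manual_seed(0)
h = torch.randn(10, 8)           # 10 ํ† ํฐ, d_model=8
w = torch.randn(32, 8)           # vocab=32 ์ถœ๋ ฅ ์‚ฌ์˜ ํ–‰๋ ฌ
t = torch.randint(0, 32, (10,))
full = F.cross_entropy(h @ w.t(), t)
assert torch.allclose(chunked_cross_entropy(h, w, t, chunk_size=4), full, atol=1e-5)
```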
### ๋ชจ๋ธ ๋ฒ„์ „
์ด ์ €์žฅ์†Œ์—๋Š” ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ ๊ฐ ๋‹จ๊ณ„์˜ ์ฒดํฌํฌ์ธํŠธ **7์ข…**์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.
| ๋ฒ„์ „ | ๋””๋ ‰ํ† ๋ฆฌ | ํฌ๊ธฐ | ์„ค๋ช… | ๊ถŒ์žฅ |
|------|----------|------|------|:----:|
| **SLERP** | `slerp/` | 6.3 GB | SFT + DPO R2 ๊ตฌ๋ฉด ์„ ํ˜• ๋ณด๊ฐ„ (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (319K ์Šคํ…, 55B ํ† ํฐ) | |
| SFT v2 | `sft-v2/` | 6.3 GB | ๋ช…๋ น์–ด ํŒŒ์ธํŠœ๋‹ (65K ์Šคํ…) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | ์„ ํ˜ธ๋„ ์ •๋ ฌ 1๋ผ์šด๋“œ (3K ์Šคํ…) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | ๋ณด์ˆ˜์  ํŒŒ์ธํŠœ๋‹ 2๋ผ์šด๋“œ (2K ์Šคํ…) | |
| ORPO | `orpo/` | 6.3 GB | SFT+์ •๋ ฌ ๋™์‹œ ํ•™์Šต ์‹คํ—˜ (10K ์Šคํ…) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | ๋ฐ˜๋ณต ์–ต์ œ ํŠนํ™” ์‹คํ—˜ (1K ์Šคํ…) | |
### ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ
```
Pretrain (55B tokens, 7ร—B200, 60h)
โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5์ผ)
โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
โ”‚ โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ ์ตœ์ข… ๊ถŒ์žฅ
โ””โ”€โ–บ ORPO (10K steps, ์‹คํ—˜)
โ””โ”€โ–บ DPO R3 (1K steps, ๋ฐ˜๋ณต ํŠนํ™” ์‹คํ—˜)
```
๊ฐ ํ™”์‚ดํ‘œ๋Š” ๋…๋ฆฝ๋œ ์ฒดํฌํฌ์ธํŠธ๋กœ ์ €์žฅ๋˜์–ด, ์ž„์˜์˜ ๋‹จ๊ณ„๋ถ€ํ„ฐ ์žฌํ˜„ยท๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
### ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ
**ํ‰๊ฐ€ ๋Œ€์ƒ: SLERP ๋ชจ๋ธ** (0-shot, limit=500)
| ๋ฒค์น˜๋งˆํฌ | ์ •ํ™•๋„ |
|----------|:------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele ํ•œ๊ตญ์–ด | 23.6% |
| Global MMLU ํ•œ๊ตญ์–ด | 23.7% |
**๋ฐ˜๋ณต ์ƒ์„ฑ ์–ต์ œ** (greedy decoding ๊ธฐ์ค€)
| ์„ค์ • | 3-gram ๋ฐ˜๋ณต๋ฅ  |
|------|:-------------:|
| rep_penalty ์—†์Œ | 74.5% |
| rep_penalty=1.2 | **5.5%** |
๊ถŒ์žฅ ์ถ”๋ก  ํŒŒ๋ผ๋ฏธํ„ฐ: `temperature=0.7, repetition_penalty=1.2`
### DPO vs ORPO ๋น„๊ต
| ์ง€ํ‘œ | SLERP (SFTโ†’DPO) | ORPO | ์šฐ์„ธ |
|------|:---------------:|:----:|:----:|
| Greedy ๋ฐ˜๋ณต๋ฅ  | 74.5% | 87.1% | SLERP |
| ๋Œ€ํ™” ํ’ˆ์งˆ | ์ž์—ฐ์Šค๋Ÿฌ์›€ | ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์›€ | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| ํ•™์Šต ์‹œ๊ฐ„ | 5์ผ+8์‹œ๊ฐ„ | **12.8์‹œ๊ฐ„** | ORPO |
ORPO์˜ ์•ฝ์ : SFT 65K ์Šคํ… ๋Œ€๋น„ 10K ์Šคํ…๋งŒ ํ•™์Šต๋˜์–ด ๊ธฐ๋ฐ˜ ๋ช…๋ น์–ด ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.
### ์‚ฌ์šฉ๋ฒ•
> **GGUF/Ollama ๋ฏธ์ง€์›**: ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์•„ํ‚คํ…์ฒ˜๋กœ llama.cpp/GGUF/Ollama์™€ ํ˜ธํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. PyTorch ์ง์ ‘ ์ถ”๋ก ๋งŒ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
**์‚ฌ์ „ ์ค€๋น„:**
```bash
# 1. ์†Œ์Šค ์ฝ”๋“œ ํด๋ก  (์ปค์Šคํ…€ ์•„ํ‚คํ…์ฒ˜ ๋ชจ๋“ˆ ํ•„์š”)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo
# 2. ์˜์กด์„ฑ ์„ค์น˜
pip install torch safetensors tokenizers PyYAML
```
**๋ฐฉ๋ฒ• 1: safetensors ์ง์ ‘ ๋กœ๋”ฉ (๊ถŒ์žฅ)**
```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # ์ด ์ €์žฅ์†Œ์˜ slerp/ ๋””๋ ‰ํ† ๋ฆฌ

# Config & ๋ชจ๋ธ ๋กœ๋“œ
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# ์ƒ์„ฑ (๊ถŒ์žฅ: temperature=0.7, repetition_penalty=1.2)
prompt = "<|user|>\n์ธ๊ณต์ง€๋Šฅ์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # repetition penalty: ์ด๋ฏธ ๋“ฑ์žฅํ•œ ํ† ํฐ์˜ logit์„ ์•ฝํ™”
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```
**๋ฐฉ๋ฒ• 2: ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ ๋Ÿฌ๋„ˆ ์‚ฌ์šฉ**
[frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test)์˜ `evafrill_runner.py`๊ฐ€ ์œ„ ๊ณผ์ •์„ ๋ž˜ํ•‘ํ•ฉ๋‹ˆ๋‹ค:
```python
from eval_framework.evafrill_runner import generate, unload_model
result = generate("ํ•œ๊ตญ์–ด๋กœ ์ธ์‚ฌํ•ด์ฃผ์„ธ์š”.")
print(result["response"])
print(f"์†๋„: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```
> ์„ค์ • ๋ฐฉ๋ฒ•: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก ) ์ฐธ์กฐ
**์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ**: GPU VRAM 8GB+ (BF16), CPU ์ถ”๋ก  ๊ฐ€๋Šฅํ•˜์ง€๋งŒ ๊ทนํžˆ ๋А๋ฆผ (~0.5 TPS)
### ์žฌํ˜„ ์ž๋ฃŒ
| ๊ฒฝ๋กœ | ๋‚ด์šฉ |
|------|------|
| `data/combined_preference.jsonl` | ์„ ํ˜ธ๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ (684K ์Œ, 2.6 GB) |
| `data/repetition_preference.jsonl` | ๋ฐ˜๋ณต ์–ต์ œ ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ (105 ์Œ, ์ž๋™ ์ƒ์„ฑ) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT H100 MIG ์„ค์ • |
| `configs/dpo_3b_1gpu.yaml` | DPO ํ•™์Šต ์„ค์ • |
| `configs/orpo_3b_1gpu.yaml` | ORPO ํ•™์Šต ์„ค์ • |
| `scripts/dpo.py` | DPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/orpo_native.py` | ORPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/sft.py` | SFT ํ•™์Šต ์ฝ”๋“œ |
| `scripts/evafrill_eval.py` | ๋ฒค์น˜๋งˆํฌ ํ‰๊ฐ€ ์ฝ”๋“œ |
| `scripts/merge_checkpoints.py` | SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ |
### ์ œํ•œ์‚ฌํ•ญ
- **3B ๊ทœ๋ชจ ํ•œ๊ณ„**: ์‚ฌ์‹ค ์ •ํ™•๋„ยท๋ณต์žกํ•œ ์ถ”๋ก ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ ๋Œ€๋น„ ์„ฑ๋Šฅ์ด ๋‚ฎ์Šต๋‹ˆ๋‹ค.
- **GGUF/Ollama ๋ถˆ๊ฐ€**: ์ปค์Šคํ…€ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 ์•„ํ‚คํ…์ฒ˜๋กœ ํ‘œ์ค€ ๋ณ€ํ™˜ ํˆด์„ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
- **vLLM ์ œํ•œ์ **: ์ด๋ก ์ƒ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ปค์Šคํ…€ weight key ๋งคํ•‘์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
- **๋ฐ˜๋ณต ์ƒ์„ฑ**: greedy decoding ์‹œ ๋ฐ˜๋ณต๋ฅ ์ด ๋†’์œผ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ `repetition_penalty=1.2` ์ด์ƒ์„ ์„ค์ •ํ•˜์„ธ์š”.
- **์–ธ์–ด ํŽธ์ค‘**: ํ•œ๊ตญ์–ดยท์˜์–ด ์™ธ ์–ธ์–ด๋Š” ์„ฑ๋Šฅ์ด ๋ณด์žฅ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
### ๋งํฌ
- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **์ด์ „ ํ”„๋กœ์ ํŠธ**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) โ€” ์ˆœ์ˆ˜ Transformer ๊ธฐ๋ฐ˜ ์ „์‹  ํ”„๋กœ์ ํŠธ
- **์ฐธ์กฐ ๋…ผ๋ฌธ**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)
### ๋ผ์ด์„ ์Šค
MIT License โ€” ์ƒ์—…์  ์ด์šฉยท์ˆ˜์ •ยท์žฌ๋ฐฐํฌ ๋ชจ๋‘ ์ž์œ ๋กญ์Šต๋‹ˆ๋‹ค.
---
# English
## EVAFRILL-Mo 3B โ€” Hybrid Mamba-2 + Transformer
### Introduction
EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.
- Pretrained on 55B tokens using 7ร— NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT โ†’ DPO โ†’ SLERP pipeline implemented in pure PyTorch โ€” no Transformers Trainer or TRL
- Designed as a Korean-first model, with English as its secondary language
### Architecture
```
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.98B (2,975,397,632)
Layers: 26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
```
Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context.
Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
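As a rough illustration of the 24 + 2 split, such a hybrid stack can be described by a per-layer type list. The attention positions below are purely assumed for illustration; the real placement is defined in the repository's model code:

```python
# Hypothetical layout sketch of a 26-layer hybrid stack.
# ATTN_AT is an assumption, not EVAFRILL-Mo's actual attention positions.
N_LAYERS = 26
ATTN_AT = {8, 17}  # assumed positions of the two GQA attention blocks

layer_types = ["gqa_attention" if i in ATTN_AT else "mamba2_ssm"
               for i in range(N_LAYERS)]
assert layer_types.count("mamba2_ssm") == 24
assert layer_types.count("gqa_attention") == 2
```

Because only the two attention layers need a KV cache (the Mamba-2 layers carry a fixed-size recurrent state), cache memory grows with 2 layers instead of 26.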
### Development Background & History
EVAFRILL-Mo was built through 6 iterative design stages:
1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” Predecessor project starting as a pure Transformer decoder-only LLM. Built custom SentencePiece tokenizer (64K vocab) on Korean+English+code+math data and established DDP training pipeline.
2. **Nemotron-H Inspiration** โ€” Extracted core design principles from NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled down for constrained hardware.
3. **Systematic Scale Search** โ€” Benchmarked 5 model sizes (1Bโ€“3B) on 7ร—B200 and chose the largest size closest to Chinchilla-optimal (3B, reaching 93% of the optimal token budget).
4. **1B โ†’ 3B Transition** โ€” Discovered that the measured tok/s figure was per-GPU, which shifted the plan from over-training a 1B model (681% of the optimal token budget) to near-optimally training a 3B model (93%).
5. **3B Pretraining** โ€” 319,772 steps, 55B tokens, 60 hours on 7ร—B200 with FP8.
6. **Post-training** โ€” SFT โ†’ DPO โ†’ SLERP โ†’ ORPO experiments on H100 MIG.
### Key Technical Highlights
| Technique | Impact |
|-----------|--------|
| **Chunked Cross-Entropy** | Reduces logits memory by 8ร— for 64K vocabulary |
| **Mamba Memory Cliff Discovery** | Batch 6โ†’7 causes 47GBโ†’183GB+ explosion โ€” structural limitation of selective scan |
| **FP8 Native Training** | TransformerEngine MXFP8BlockScaling delivers ~2ร— throughput vs BF16 on B200 |
| **LoRA B-zeroing** | Computes DPO reference logprobs without model duplication โ€” 50% VRAM savings |
| **SLERP Checkpoint Merging** | Balances SFT knowledge + DPO alignment via spherical interpolation โ€” mitigates alignment tax |
| **Native DPO/ORPO** | No TRL dependency โ€” implemented from scratch in PyTorch for custom Mamba-2 hybrid |
> ๐Ÿ“– **For the complete development journey, architecture design rationale, and hardware optimization details, see the [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo).**
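The chunked cross-entropy trick from the table above can be sketched in a few lines. This is an illustrative reimplementation, not the repository's code; the function name and chunk size are assumptions. The point is that the full (tokens ร— 64K-vocab) logits tensor is never materialized at once, yet the result matches the unchunked loss:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """Project hidden states to vocab logits one chunk at a time so the
    full (tokens x vocab) logits tensor never exists in memory at once."""
    losses = []
    for i in range(0, hidden.size(0), chunk_size):
        logits = hidden[i:i + chunk_size] @ weight.t()  # (chunk, vocab)
        losses.append(F.cross_entropy(
            logits.float(), targets[i:i + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / targets.numel()

# Toy-sized check: chunked result equals the ordinary cross-entropy.
torch.manual_seed(0)
h = torch.randn(10, 8)           # 10 tokens, d_model=8
w = torch.randn(32, 8)           # vocab=32 output projection
t = torch.randint(0, 32, (10,))
full = F.cross_entropy(h @ w.t(), t)
assert torch.allclose(chunked_cross_entropy(h, w, t, chunk_size=4), full, atol=1e-5)
```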
### Model Variants
This repository contains **7 checkpoints** representing each stage of the training pipeline.
| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |
### Training Pipeline
```
Pretrain (55B tokens, 7ร—B200, 60h)
โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5 days)
โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
โ”‚ โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ Final Recommended
โ””โ”€โ–บ ORPO (10K steps, experimental)
โ””โ”€โ–บ DPO R3 (1K steps, repetition experiment)
```
Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
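The SLERP merge at the end of the pipeline can be sketched as follows. This is a minimal illustration of spherical interpolation over two state dicts with hypothetical function names; the repository's actual implementation is `scripts/merge_checkpoints.py` and may differ in detail:

```python
import torch

def slerp(a, b, alpha=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors,
    falling back to plain lerp when the vectors are near-parallel."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    cos = torch.dot(a_f, b_f) / (a_f.norm() * b_f.norm() + eps)
    omega = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    if omega.abs() < 1e-4:
        merged = (1 - alpha) * a_f + alpha * b_f
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1 - alpha) * omega) / so) * a_f \
               + (torch.sin(alpha * omega) / so) * b_f
    return merged.reshape(a.shape).to(a.dtype)

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    return {k: slerp(sd_a[k], sd_b[k], alpha) for k in sd_a}

# Toy-sized demo; alpha=0.5 mirrors the released SLERP checkpoint.
torch.manual_seed(0)
sft = {"w": torch.randn(4, 4)}
dpo = {"w": torch.randn(4, 4)}
merged = merge_state_dicts(sft, dpo, alpha=0.5)
```

Unlike plain weight averaging, interpolating along the sphere preserves the norm structure of the two checkpoints, which is why it tends to keep SFT knowledge while retaining DPO alignment.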
### Benchmark Results
**Evaluated on: SLERP model** (0-shot, limit=500)
| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |
**Repetition suppression** (greedy decoding)
| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |
Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
### DPO vs ORPO Comparison
| Metric | SLERP (SFTโ†’DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |
ORPO's weakness: only 10K steps of training vs SFT's 65K โ€” insufficient base instruction-following before alignment kicks in.
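For reference, the DPO objective optimized here (implemented natively, without TRL) reduces to a few lines once per-sequence log-probabilities are available. This is the textbook formulation, not the repository's exact code; `beta` and the toy numbers are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Textbook DPO loss over sequence log-probs:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Zero margin gives the chance-level loss log(2); a positive margin
# (policy prefers the chosen answer more than the reference) lowers it.
zero = dpo_loss(*[torch.zeros(1)] * 4)
assert abs(zero.item() - math.log(2)) < 1e-6
```

The "LoRA B-zeroing" trick from the highlights table feeds this loss: zeroing the LoRA B matrices temporarily turns the policy back into the reference model, so `ref_chosen`/`ref_rejected` can be computed without a second copy of the weights.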
### Usage
> **GGUF/Ollama not supported**: Custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. PyTorch direct inference only.
**Prerequisites:**
```bash
# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo
# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML
```
**Method 1: Direct safetensors loading (recommended)**
```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# Generate (recommended: temperature=0.7, repetition_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Repetition penalty: dampen logits of already-generated tokens
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```
**Method 2: Evaluation framework runner**
The `evafrill_runner.py` in [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps the above into a simple API:
```python
from eval_framework.evafrill_runner import generate, unload_model
result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```
> Setup instructions: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก )
**System requirements**: GPU VRAM 8GB+ (BF16), CPU inference possible but extremely slow (~0.5 TPS)
### Reproducibility
| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |
### Limitations
- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported โ€” custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible but requires custom weight key mapping.
- **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty` โ€” always use `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.
### Links
- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Predecessor**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) | [๐Ÿค— HuggingFace](https://huggingface.co/pathcosmos/frankenstallm) โ€” Pure Transformer predecessor project
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)
### Acknowledgment / ๊ฐ์‚ฌ์˜ ๊ธ€
์ด ํ”„๋กœ์ ํŠธ๋Š” **๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€**์˜ **ใ€Œ์ฒจ๋‹จ GPU ํ™œ์šฉ ์ง€์› ์‚ฌ์—…ใ€** (๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ๊ณต๊ณ  ์ œ2025-1068ํ˜ธ)์„ ํ†ตํ•ด ์ œ๊ณต๋œ GPU ์ปดํ“จํŒ… ์ž์›์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
> **๊ตญ๊ฐ€ AI์ปดํ“จํŒ…์ž์› ์ง€์›ํฌํ„ธ**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - ์ฃผ๊ด€: ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ (MSIT), ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์› (NIPA)
> - ์šด์˜: ํ•œ๊ตญ์ •๋ณดํ†ต์‹ ์ง„ํฅํ˜‘ํšŒ (KAIT)
๋Œ€ํ•œ๋ฏผ๊ตญ ์ •๋ถ€์˜ AI ์ธํ”„๋ผ ์ง€์› ์‚ฌ์—… ๋•๋ถ„์— 7ร— NVIDIA B200 GPU ํ™˜๊ฒฝ์—์„œ ํ•œ๊ตญ์–ด 3B ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-Transformer ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตญ๊ฐ€ ์ฐจ์›์˜ AI ์ปดํ“จํŒ… ์ž์› ์ง€์›์— ๊นŠ์ด ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.
This project was conducted using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) by the **Ministry of Science and ICT (MSIT)** of the Republic of Korea.
> **National AI Computing Resource Support Portal**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)
We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7ร— NVIDIA B200 GPUs.
---
### License
MIT License โ€” free to use, modify, and distribute commercially.