---
language:
- ko
- en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba2
- hybrid
- transformer
- korean
- from-scratch
- dpo
- slerp
- orpo
- nemotron-h
datasets:
- heegyu/orca-math-korean-preference-cleaned
- nayohan/preference-collection-ko-full
- kuotient/orca-math-word-problems-193k-korean
- FreedomIntelligence/alpaca-gpt4-korean
- heegyu/orca_ko
- HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
- name: EVAFRILL-Mo-3B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: hellaswag
      name: HellaSwag (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 34.6
  - task:
      type: text-generation
    dataset:
      type: arc_easy
      name: ARC-Easy (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 32.0
  - task:
      type: text-generation
    dataset:
      type: belebele
      name: Belebele Korean (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.6
  - task:
      type: text-generation
    dataset:
      type: mmlu
      name: Global MMLU Korean (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.7
---

# English

## EVAFRILL-Mo 3B: Hybrid Mamba-2 + Transformer

### Introduction

EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.

- Pretrained on 55B tokens using 7× NVIDIA B200 GPUs (~60 hours)
- Trained on a mix of Korean, English, code, and math datasets
- Full SFT → DPO → SLERP pipeline implemented in pure PyTorch, with no Transformers Trainer or TRL
- Designed as a Korean-first model with English as its second language

### Architecture

```
Type:            Hybrid Mamba-2 + Transformer
Parameters:      2.98B (2,975,397,632)
Layers:          26 (24× Mamba-2 SSM + 2× Attention GQA)
d_model:         3,072
Vocabulary:      64,000 (custom SentencePiece)
Max seq length:  4,096
```

The Mamba-2 SSM blocks handle long-range dependencies efficiently, while the two GQA attention blocks provide global context.
Because only 2 of the 26 layers keep a KV cache, inference needs far less KV-cache memory than a standard Transformer of the same depth.

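
To make the KV-cache saving concrete, the sketch below compares the cache size of a model with 2 attention layers against one with 26. The head layout (`N_KV_HEADS`, `HEAD_DIM`) is a hypothetical placeholder, since the model card does not list it, and the fixed-size Mamba-2 state is not counted. The ratio between the two results does not depend on those assumed values.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 covers the separate K and V tensors of each layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical head layout (an assumption, not taken from the model card).
N_KV_HEADS, HEAD_DIM = 8, 128
SEQ_LEN = 4096  # max sequence length from the architecture summary

hybrid = kv_cache_bytes(2, N_KV_HEADS, HEAD_DIM, SEQ_LEN)
all_attention = kv_cache_bytes(26, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"2 attention layers:  {hybrid / 2**20:.0f} MiB")
print(f"26 attention layers: {all_attention / 2**20:.0f} MiB")
print(f"ratio:               {all_attention / hybrid:.0f}x")
```

Whatever the real head configuration is, the cache scales linearly with the number of attention layers, so the hybrid layout needs 2/26 of the KV cache of an all-attention stack with the same heads.
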
### Development Background & History

EVAFRILL-Mo was built through six iterative design stages:

1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)**: The predecessor project, which started as a pure Transformer decoder-only LLM. It trained a custom SentencePiece tokenizer (64K vocab) on Korean, English, code, and math data and established the DDP training pipeline.
2. **Nemotron-H Inspiration**: Extracted the core design principles of NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled them down for constrained hardware.
3. **Systematic Scale Search**: Benchmarked 5 model sizes (1B to 3B) on 7×B200 to find the largest model that stays near Chinchilla-optimal (3B, at 93% of Chinchilla-optimal).
4. **1B → 3B Transition**: Found that the measured tok/s figure was per GPU, so the plan changed from over-training a 1B model (681% of Chinchilla-optimal) to training a 3B model near the optimum (93%).
5. **3B Pretraining**: 319,772 steps, 55B tokens, 60 hours on 7×B200 with FP8.
6. **Post-training**: SFT → DPO → SLERP → ORPO experiments on H100 MIG.

### Key Technical Highlights

| Technique | Impact |
|-----------|--------|
| **Chunked Cross-Entropy** | Cuts logits memory to 1/8 for the 64K vocabulary |
| **Mamba Memory Cliff Discovery** | Going from batch 6 to 7 makes memory jump from 47 GB to 183 GB+, a structural limitation of selective scan |
| **FP8 Native Training** | TransformerEngine MXFP8BlockScaling delivers ~2× throughput vs BF16 on B200 |
| **LoRA B-zeroing** | Computes DPO reference logprobs by temporarily zeroing the LoRA B matrices instead of duplicating the model, saving 50% VRAM |
| **SLERP Checkpoint Merging** | Balances SFT knowledge and DPO alignment via spherical interpolation, mitigating the alignment tax |
| **Native DPO/ORPO** | No TRL dependency; implemented from scratch in PyTorch for the custom Mamba-2 hybrid |

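
The idea behind chunked cross-entropy can be sketched in a few lines. This is a pure-Python illustration, not the project's PyTorch implementation: it assumes the chunking runs along the token dimension, and `project`, `W`, and the toy data are hypothetical stand-ins for the LM head and hidden states. The point is that the loss is identical while only one chunk of vocabulary-sized logits exists at a time.

```python
import math


def ce_sum(logit_rows, targets):
    """Sum of cross-entropy losses for rows of logits and their target ids."""
    total = 0.0
    for row, target in zip(logit_rows, targets):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total


def chunked_cross_entropy(hidden, targets, project, num_chunks=8):
    """Mean cross-entropy, projecting hidden states to logits chunk by chunk."""
    n = len(hidden)
    chunk = max(1, math.ceil(n / num_chunks))
    total = 0.0
    for start in range(0, n, chunk):
        # Only `chunk` rows of vocabulary-sized logits are alive at once.
        logits = [project(h) for h in hidden[start:start + chunk]]
        total += ce_sum(logits, targets[start:start + chunk])
    return total / n


# Toy LM head: 2-dim hidden state -> 4-token vocabulary (hypothetical weights).
W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8], [0.0, 0.0]]


def project(h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


hidden = [[0.1 * i, 1.0 - 0.05 * i] for i in range(16)]
targets = [i % 4 for i in range(16)]

full = ce_sum([project(h) for h in hidden], targets) / len(hidden)
chunked = chunked_cross_entropy(hidden, targets, project, num_chunks=8)
print(f"full={full:.6f} chunked={chunked:.6f}")
```

With 8 chunks, the peak number of logit rows held in memory is 1/8 of the full sequence, which matches the saving quoted in the table.
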
> **For the complete development journey, architecture design rationale, and hardware optimization details, see the [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo).**

### Model Variants

This repository contains **7 checkpoints**, one for each stage of the training pipeline.

| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (α=0.5) | ⭐ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |

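
For readers unfamiliar with SLERP merging, here is a minimal sketch of spherical interpolation between two checkpoints. It is an illustration under assumptions, not the code in `scripts/merge_checkpoints.py`: it treats each tensor as a flat list of floats, interpolates tensor by tensor, and takes `alpha=0` to mean the first checkpoint. `merge_state_dicts` and the toy state dicts are hypothetical.

```python
import math


def slerp(v0, v1, alpha, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1 + eps)
    dot = max(-1.0, min(1.0, dot))
    omega = math.acos(dot)  # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - alpha) * a + alpha * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - alpha) * omega) / sin_omega
    w1 = math.sin(alpha * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Apply SLERP to every tensor shared by the two (toy) state dicts."""
    return {name: slerp(sd_a[name], sd_b[name], alpha) for name in sd_a}


# Toy checkpoints standing in for the SFT and DPO R2 weights.
sft = {"layer.weight": [1.0, 0.0]}
dpo = {"layer.weight": [0.0, 1.0]}
merged = merge_state_dicts(sft, dpo, alpha=0.5)
print(merged["layer.weight"])
```

Unlike a plain average, the interpolated vector follows the arc between the two endpoints, so its norm does not shrink when the checkpoints point in different directions.
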
### Training Pipeline

```
Pretrain (55B tokens, 7×B200, 60h)
└─► SFT v2 (65K steps, H100 MIG, 5 days)
└─► DPO R1 (3K steps) ─► DPO R2 (2K steps)
│ └─► SLERP Merge (α=0.5) ⭐ Final Recommended
└─► ORPO (10K steps, experimental)
└─► DPO R3 (1K steps, repetition experiment)
```

Every arrow corresponds to a separately saved checkpoint, so training can be reproduced and compared from any stage.

### Benchmark Results

**Evaluated on: SLERP model** (0-shot, limit=500)

| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |

**Repetition suppression** (greedy decoding)

| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |

Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`

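
The model card does not spell out how the 3-gram repetition rate is computed. One common definition, assumed here, is the fraction of 3-grams in the generated token sequence that duplicate an earlier 3-gram; `ngram_repetition_rate` is a hypothetical helper, not a function from the evaluation scripts.

```python
def ngram_repetition_rate(token_ids, n=3):
    """Fraction of n-grams that duplicate an n-gram seen elsewhere in the sequence."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping = [5, 6, 7] * 4   # output stuck in a three-token loop
varied = list(range(12))  # output with no repeated 3-gram

print(f"looping: {ngram_repetition_rate(looping):.1%}")
print(f"varied:  {ngram_repetition_rate(varied):.1%}")
```

Under this definition a sequence with no repeated 3-grams scores 0%, and the score approaches 100% as the output collapses into a short loop.
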
### DPO vs ORPO Comparison

| Metric | SLERP (SFT→DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |

ORPO's weakness: it trained for only 10K steps versus 65K for SFT, so the model had not learned basic instruction-following well enough before the alignment term took effect.

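
As background for the comparison, the two objectives can be written for a single preference pair as below. This is a scalar sketch of the published DPO and ORPO formulas, not the code in `scripts/dpo.py` or `scripts/orpo_native.py`; the `beta` and `lam` values and the example log-probabilities are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """ORPO loss for one pair, given length-averaged log-probabilities (< 0)."""
    def log_odds(logp):
        # log(p / (1 - p)) with p = exp(logp)
        return logp - math.log1p(-math.exp(logp))

    sft_term = -avg_logp_chosen  # ordinary NLL on the chosen answer
    ratio_term = -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
    return sft_term + lam * ratio_term


print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy identical to the reference
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))   # policy prefers the chosen answer more
print(orpo_loss(-0.5, -2.0))
```

DPO needs a frozen reference model for `ref_chosen` and `ref_rejected`, while ORPO folds the SFT loss and the preference term into one pass without a reference. That is why ORPO trains faster, and also why it depends on learning instruction-following within the same short run.
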
### Usage

> **GGUF/Ollama not supported**: The custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama, so inference runs directly in PyTorch only.

**Prerequisites:**

```bash
# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML
```

**Method 1: Direct safetensors loading (recommended)**

```python
import json

import torch
from safetensors.torch import load_file as load_safetensors
from tokenizers import Tokenizer

# Custom architecture modules from the cloned EVAFRILL-Mo repository
from model.config import LMConfig
from model.transformer import LLM

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model, dropping the metadata keys before building LMConfig
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")
eos_id = tok.token_to_id("</s>")

# Generate (recommended: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Repetition penalty: make every token already in the context less likely
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        # Temperature sampling
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break

print(tok.decode(ids[0].tolist()))
```

**Method 2: Evaluation framework runner**

The `evafrill_runner.py` in [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps the steps above in a simple API:

```python
from eval_framework.evafrill_runner import generate, unload_model

result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```

> Setup instructions: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-모델-설정-pytorch-직접-추론)

**System requirements**: GPU with 8 GB+ VRAM (BF16). CPU inference is possible but extremely slow (~0.5 TPS).

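
As a rough sanity check on the VRAM figure, the memory taken by the weights alone can be estimated from the parameter count. This is back-of-the-envelope arithmetic, not a measurement: it ignores activations, the Mamba-2 state, the KV cache, and framework overhead, all of which need headroom on top.

```python
N_PARAMS = 2_975_397_632  # parameter count from the architecture summary

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gb(n_params, dtype):
    """Memory taken by the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gb(N_PARAMS, dtype):.2f} GB")
```

In BF16 the weights come to roughly 6 GB, which is consistent with recommending a GPU with at least 8 GB of VRAM.
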
### Reproducibility

| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |

### Limitations

- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported; the custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible, but requires a custom weight key mapping.
- **Greedy repetition**: The 3-gram repetition rate is ~74.5% without `repetition_penalty`, so always set `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.

### Links

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Predecessor**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) | [🤗 HuggingFace](https://huggingface.co/pathcosmos/frankenstallm) (pure Transformer predecessor project)
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

### Acknowledgment

This project was conducted using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) by the **Ministry of Science and ICT (MSIT)** of the Republic of Korea.

> **National AI Computing Resource Support Portal**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)

We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7× NVIDIA B200 GPUs.

---

### License

MIT License: free to use, modify, and redistribute, including commercially.