ํ•œ๊ตญ์–ด | English


ํ•œ๊ตญ์–ด

EVAFRILL-Mo 3B โ€” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer

ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ

EVAFRILL-Mo 3B๋Š” NVIDIA Nemotron-H ์•„ํ‚คํ…์ฒ˜์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ ๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ์ง์ ‘ ๊ตฌํ˜„ํ•œ 30์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

  • 7ร— NVIDIA B200 GPU๋กœ 55B ํ† ํฐ ์‚ฌ์ „ํ•™์Šต (์•ฝ 60์‹œ๊ฐ„)
  • ํ•œ๊ตญ์–ดยท์˜์–ดยท์ฝ”๋“œยท์ˆ˜ํ•™ ํ˜ผํ•ฉ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ
  • SFT โ†’ DPO โ†’ SLERP ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๋‹จ์ผ ํ”„๋กœ์ ํŠธ์—์„œ ์ง์ ‘ ๊ตฌํ˜„
  • ์™ธ๋ถ€ ํ”„๋ ˆ์ž„์›Œํฌ(Transformers Trainer, TRL) ์—†์ด PyTorch ๋„ค์ดํ‹ฐ๋ธŒ๋กœ ๊ตฌํ˜„

์•„ํ‚คํ…์ฒ˜

Type:           Hybrid Mamba-2 + Transformer
Parameters:     2.94B (2,975,397,632)
Layers:         26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model:        3,072
Vocabulary:     64,000 (custom SentencePiece)
Max seq length: 4,096

Mamba-2 SSM ๋ธ”๋ก์ด ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ , 2๊ฐœ์˜ GQA Attention ๋ธ”๋ก์ด ์ „์—ญ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค. ํ‘œ์ค€ Transformer ๋Œ€๋น„ ์ถ”๋ก  ์‹œ KV ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํฌ๊ฒŒ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค.

๊ฐœ๋ฐœ ๋ฐฐ๊ฒฝ ๋ฐ ํžˆ์Šคํ† ๋ฆฌ

EVAFRILL-Mo๋Š” 6๋‹จ๊ณ„์˜ ๋ฐ˜๋ณต์  ์„ค๊ณ„ ๊ณผ์ •์„ ๊ฑฐ์ณ ํƒ„์ƒํ–ˆ์Šต๋‹ˆ๋‹ค:

  1. FRANKENSTALLM โ€” ์ˆœ์ˆ˜ Transformer decoder-only LLM์œผ๋กœ ์‹œ์ž‘ํ•œ ์ „์‹  ํ”„๋กœ์ ํŠธ. ํ•œ๊ตญ์–ด+์˜์–ด+์ฝ”๋“œ+์ˆ˜ํ•™ ๋ฐ์ดํ„ฐ๋กœ ์ปค์Šคํ…€ SentencePiece ํ† ํฌ๋‚˜์ด์ €(64K ์–ดํœ˜)๋ฅผ ํ•™์Šตํ•˜๊ณ , DDP ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ–ˆ์Šต๋‹ˆ๋‹ค.
  2. Nemotron-H ์˜๊ฐ โ€” NVIDIA์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer ์„ค๊ณ„๋ฅผ ํ•ต์‹ฌ ์›์น™๋งŒ ์ถ”์ถœํ•˜์—ฌ(fragmentation) ์ œํ•œ๋œ ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์ถ•์†Œยท์ ์šฉ.
  3. ์ฒด๊ณ„์  ๊ทœ๋ชจ ํƒ์ƒ‰ โ€” 5๊ฐœ ๊ทœ๋ชจ(1B~3B) ๋ชจ๋ธ์„ 7ร—B200์—์„œ ๋ฒค์น˜๋งˆํฌํ•˜์—ฌ Chinchilla-optimal ์ตœ๋Œ€ ๊ทœ๋ชจ(3B, 93% ๋‹ฌ์„ฑ) ๊ฒฐ์ •.
  4. 1B โ†’ 3B ์ „ํ™˜ โ€” tok/s๊ฐ€ per-GPU ๊ฐ’์ž„์„ ๋ฐœ๊ฒฌํ•˜์—ฌ, 1B ๊ณผ์ž‰ํ•™์Šต(681%)์„ 3B ์ ์ •ํ•™์Šต(93%)์œผ๋กœ ์ „ํ™˜.
  5. 3B ์‚ฌ์ „ํ•™์Šต โ€” 319,772 steps, 55B tokens, 7ร—B200 FP8๋กœ 60์‹œ๊ฐ„ ์™„๋ฃŒ.
  6. Post-training โ€” H100 MIG ํ™˜๊ฒฝ์—์„œ SFT โ†’ DPO โ†’ SLERP โ†’ ORPO ์‹คํ—˜๊นŒ์ง€ ์™„์ˆ˜.

ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ•˜์ด๋ผ์ดํŠธ

๊ธฐ์ˆ  ํšจ๊ณผ
Chunked Cross-Entropy 64K ์–ดํœ˜์—์„œ logits ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ 1/8๋กœ ์ ˆ๊ฐ
Mamba Memory Cliff ๋ฐœ๊ฒฌ batch 6โ†’7์—์„œ 47GBโ†’183GB+ ํญ์ฆ โ€” selective scan์˜ ๊ตฌ์กฐ์  ์ œ์•ฝ ๊ทœ๋ช…
FP8 ๋„ค์ดํ‹ฐ๋ธŒ ํ•™์Šต TransformerEngine MXFP8BlockScaling์œผ๋กœ B200์—์„œ BF16 ๋Œ€๋น„ ~2๋ฐฐ ์ฒ˜๋ฆฌ๋Ÿ‰
LoRA B-zeroing DPO reference model์„ ๋ชจ๋ธ ๋ณต์ œ ์—†์ด LoRA B๋ฅผ ์ž„์‹œ 0์œผ๋กœ ๋งŒ๋“ค์–ด ๊ณ„์‚ฐ โ€” VRAM 50% ์ ˆ์•ฝ
SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ SFT ์ง€์‹ ๋ณด์กด + DPO ์ •๋ ฌ์„ ๊ตฌ๋ฉด ๋ณด๊ฐ„์œผ๋กœ ๊ท ํ˜• โ€” alignment tax ์™„ํ™”
Native DPO/ORPO TRL ๋ฏธ์‚ฌ์šฉ, ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๋ฅผ ์œ„ํ•ด ์ฒ˜์Œ๋ถ€ํ„ฐ PyTorch๋กœ ๊ตฌํ˜„

๐Ÿ“– ์ „์ฒด ๊ฐœ๋ฐœ ๊ณผ์ •, ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„ ๊ทผ๊ฑฐ, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ์ƒ์„ธ๋Š” GitHub README๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

๋ชจ๋ธ ๋ฒ„์ „

์ด ์ €์žฅ์†Œ์—๋Š” ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ ๊ฐ ๋‹จ๊ณ„์˜ ์ฒดํฌํฌ์ธํŠธ 7์ข…์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

๋ฒ„์ „ ๋””๋ ‰ํ† ๋ฆฌ ํฌ๊ธฐ ์„ค๋ช… ๊ถŒ์žฅ
SLERP slerp/ 6.3 GB SFT + DPO R2 ๊ตฌ๋ฉด ์„ ํ˜• ๋ณด๊ฐ„ (ฮฑ=0.5) โญ
Pretrain pretrain/ 12.6 GB ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (319K ์Šคํ…, 55B ํ† ํฐ)
SFT v2 sft-v2/ 6.3 GB ๋ช…๋ น์–ด ํŒŒ์ธํŠœ๋‹ (65K ์Šคํ…)
DPO R1 dpo-r1/ 6.3 GB ์„ ํ˜ธ๋„ ์ •๋ ฌ 1๋ผ์šด๋“œ (3K ์Šคํ…)
DPO R2 dpo-r2/ 6.3 GB ๋ณด์ˆ˜์  ํŒŒ์ธํŠœ๋‹ 2๋ผ์šด๋“œ (2K ์Šคํ…)
ORPO orpo/ 6.3 GB SFT+์ •๋ ฌ ๋™์‹œ ํ•™์Šต ์‹คํ—˜ (10K ์Šคํ…)
DPO R3 dpo-r3/ 6.3 GB ๋ฐ˜๋ณต ์–ต์ œ ํŠนํ™” ์‹คํ—˜ (1K ์Šคํ…)

ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ

Pretrain (55B tokens, 7ร—B200, 60h)
  โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5์ผ)
        โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
        โ”‚     โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ ์ตœ์ข… ๊ถŒ์žฅ
        โ””โ”€โ–บ ORPO (10K steps, ์‹คํ—˜)
              โ””โ”€โ–บ DPO R3 (1K steps, ๋ฐ˜๋ณต ํŠนํ™” ์‹คํ—˜)

๊ฐ ํ™”์‚ดํ‘œ๋Š” ๋…๋ฆฝ๋œ ์ฒดํฌํฌ์ธํŠธ๋กœ ์ €์žฅ๋˜์–ด, ์ž„์˜์˜ ๋‹จ๊ณ„๋ถ€ํ„ฐ ์žฌํ˜„ยท๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ

ํ‰๊ฐ€ ๋Œ€์ƒ: SLERP ๋ชจ๋ธ (0-shot, limit=500)

๋ฒค์น˜๋งˆํฌ ์ •ํ™•๋„
HellaSwag 34.6%
ARC-Easy 32.0%
Belebele ํ•œ๊ตญ์–ด 23.6%
Global MMLU ํ•œ๊ตญ์–ด 23.7%

๋ฐ˜๋ณต ์ƒ์„ฑ ์–ต์ œ (greedy decoding ๊ธฐ์ค€)

์„ค์ • 3-gram ๋ฐ˜๋ณต๋ฅ 
rep_penalty ์—†์Œ 74.5%
rep_penalty=1.2 5.5%

๊ถŒ์žฅ ์ถ”๋ก  ํŒŒ๋ผ๋ฏธํ„ฐ: temperature=0.7, repetition_penalty=1.2

DPO vs ORPO ๋น„๊ต

์ง€ํ‘œ SLERP (SFTโ†’DPO) ORPO ์šฐ์„ธ
Greedy ๋ฐ˜๋ณต๋ฅ  74.5% 87.1% SLERP
๋Œ€ํ™” ํ’ˆ์งˆ ์ž์—ฐ์Šค๋Ÿฌ์›€ ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์›€ SLERP
HellaSwag 39.0% 35.0% SLERP
ํ•™์Šต ์‹œ๊ฐ„ 5์ผ+8์‹œ๊ฐ„ 12.8์‹œ๊ฐ„ ORPO

ORPO์˜ ์•ฝ์ : SFT 65K ์Šคํ… ๋Œ€๋น„ 10K ์Šคํ…๋งŒ ํ•™์Šต๋˜์–ด ๊ธฐ๋ฐ˜ ๋ช…๋ น์–ด ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.

์‚ฌ์šฉ๋ฒ•

GGUF/Ollama ๋ฏธ์ง€์›: ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์•„ํ‚คํ…์ฒ˜๋กœ llama.cpp/GGUF/Ollama์™€ ํ˜ธํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. PyTorch ์ง์ ‘ ์ถ”๋ก ๋งŒ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„:

# 1. ์†Œ์Šค ์ฝ”๋“œ ํด๋ก  (์ปค์Šคํ…€ ์•„ํ‚คํ…์ฒ˜ ๋ชจ๋“ˆ ํ•„์š”)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. ์˜์กด์„ฑ ์„ค์น˜
pip install torch safetensors tokenizers PyYAML

๋ฐฉ๋ฒ• 1: safetensors ์ง์ ‘ ๋กœ๋”ฉ (๊ถŒ์žฅ)

import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # ์ด ์ €์žฅ์†Œ์˜ slerp/ ๋””๋ ‰ํ† ๋ฆฌ

# Config & ๋ชจ๋ธ ๋กœ๋“œ
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# ์ƒ์„ฑ (๊ถŒ์žฅ: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\n์ธ๊ณต์ง€๋Šฅ์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0: logits[0, prev_id] /= 1.2
            else: logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"): break

print(tok.decode(ids[0].tolist()))

๋ฐฉ๋ฒ• 2: ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ ๋Ÿฌ๋„ˆ ์‚ฌ์šฉ

frankenstallm_test์˜ evafrill_runner.py๊ฐ€ ์œ„ ๊ณผ์ •์„ ๋ž˜ํ•‘ํ•ฉ๋‹ˆ๋‹ค:

from eval_framework.evafrill_runner import generate, unload_model

result = generate("ํ•œ๊ตญ์–ด๋กœ ์ธ์‚ฌํ•ด์ฃผ์„ธ์š”.")
print(result["response"])
print(f"์†๋„: {result['tokens_per_sec']:.1f} TPS")
unload_model()

์„ค์ • ๋ฐฉ๋ฒ•: frankenstallm_test README ์ฐธ์กฐ

์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ: GPU VRAM 8GB+ (BF16), CPU ์ถ”๋ก  ๊ฐ€๋Šฅํ•˜์ง€๋งŒ ๊ทนํžˆ ๋А๋ฆผ (~0.5 TPS)

์žฌํ˜„ ์ž๋ฃŒ

๊ฒฝ๋กœ ๋‚ด์šฉ
data/combined_preference.jsonl ์„ ํ˜ธ๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ (684K ์Œ, 2.6 GB)
data/repetition_preference.jsonl ๋ฐ˜๋ณต ์–ต์ œ ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ (105 ์Œ, ์ž๋™ ์ƒ์„ฑ)
configs/korean_3b_sft_1gpu.yaml SFT H100 MIG ์„ค์ •
configs/dpo_3b_1gpu.yaml DPO ํ•™์Šต ์„ค์ •
configs/orpo_3b_1gpu.yaml ORPO ํ•™์Šต ์„ค์ •
scripts/dpo.py DPO ํ•™์Šต ์ฝ”๋“œ
scripts/orpo_native.py ORPO ํ•™์Šต ์ฝ”๋“œ
scripts/sft.py SFT ํ•™์Šต ์ฝ”๋“œ
scripts/evafrill_eval.py ๋ฒค์น˜๋งˆํฌ ํ‰๊ฐ€ ์ฝ”๋“œ
scripts/merge_checkpoints.py SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ

์ œํ•œ์‚ฌํ•ญ

  • 3B ๊ทœ๋ชจ ํ•œ๊ณ„: ์‚ฌ์‹ค ์ •ํ™•๋„ยท๋ณต์žกํ•œ ์ถ”๋ก ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ ๋Œ€๋น„ ์„ฑ๋Šฅ์ด ๋‚ฎ์Šต๋‹ˆ๋‹ค.
  • GGUF/Ollama ๋ถˆ๊ฐ€: ์ปค์Šคํ…€ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 ์•„ํ‚คํ…์ฒ˜๋กœ ํ‘œ์ค€ ๋ณ€ํ™˜ ํˆด์„ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
  • vLLM ์ œํ•œ์ : ์ด๋ก ์ƒ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ปค์Šคํ…€ weight key ๋งคํ•‘์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ˜๋ณต ์ƒ์„ฑ: greedy decoding ์‹œ ๋ฐ˜๋ณต๋ฅ ์ด ๋†’์œผ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ repetition_penalty=1.2 ์ด์ƒ์„ ์„ค์ •ํ•˜์„ธ์š”.
  • ์–ธ์–ด ํŽธ์ค‘: ํ•œ๊ตญ์–ดยท์˜์–ด ์™ธ ์–ธ์–ด๋Š” ์„ฑ๋Šฅ์ด ๋ณด์žฅ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

๋งํฌ

๋ผ์ด์„ ์Šค

MIT License โ€” ์ƒ์—…์  ์ด์šฉยท์ˆ˜์ •ยท์žฌ๋ฐฐํฌ ๋ชจ๋‘ ์ž์œ ๋กญ์Šต๋‹ˆ๋‹ค.


English

EVAFRILL-Mo 3B โ€” Hybrid Mamba-2 + Transformer

Introduction

EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built entirely from scratch, inspired by NVIDIA's Nemotron-H architecture.

  • Pretrained on 55B tokens using 7ร— NVIDIA B200 GPUs (~60 hours)
  • Mixed Korean, English, code, and math datasets
  • Full SFT โ†’ DPO โ†’ SLERP pipeline implemented in pure PyTorch โ€” no Transformers Trainer or TRL
  • Designed as a Korean-first model with strong multilingual capability

Architecture

Type:           Hybrid Mamba-2 + Transformer
Parameters:     2.94B (2,975,397,632)
Layers:         26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model:        3,072
Vocabulary:     64,000 (custom SentencePiece)
Max seq length: 4,096

Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context. Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
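
To make the KV-cache savings concrete, here is a back-of-envelope comparison; the GQA shapes below (8 KV heads, head dimension 128) are illustrative assumptions, not the repository's confirmed configuration:

# Assumed GQA shapes for illustration; the repo's exact config may differ.
seq_len      = 4096   # max sequence length
n_kv_heads   = 8      # assumed number of KV heads (GQA)
head_dim     = 128    # assumed head dimension
bytes_per_el = 2      # BF16

# K and V caches per attention layer, per sequence
per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_el

print(f"all-attention, 26 layers:   {26 * per_layer / 2**20:.0f} MiB per sequence")  # 416 MiB
print(f"hybrid, 2 attention layers: {2 * per_layer / 2**20:.0f} MiB per sequence")   # 32 MiB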

Development Background & History

EVAFRILL-Mo was built through 6 iterative design stages:

  1. FRANKENSTALLM โ€” Predecessor project starting as a pure Transformer decoder-only LLM. Built custom SentencePiece tokenizer (64K vocab) on Korean+English+code+math data and established DDP training pipeline.
  2. Nemotron-H Inspiration โ€” Extracted core design principles from NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled down for constrained hardware.
  3. Systematic Scale Search — Benchmarked 5 model sizes (1B–3B) on 7×B200 to find the largest size trainable near the Chinchilla-optimal token budget (3B, at 93% of the optimum; a quick sanity check follows this list).
  4. 1B → 3B Transition — Discovered that the measured tok/s was a per-GPU figure, so aggregate throughput was 7× higher than assumed; this redirected the plan from over-training a 1B model (681% of its optimal token budget) to training a 3B model near the optimum (93%).
  5. 3B Pretraining โ€” 319,772 steps, 55B tokens, 60 hours on 7ร—B200 with FP8.
  6. Post-training โ€” SFT โ†’ DPO โ†’ SLERP โ†’ ORPO experiments on H100 MIG.
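
A quick sanity check on the token-budget figures above, assuming the common rule of thumb of roughly 20 training tokens per parameter (the project's exact coefficient is not stated):

# Chinchilla rule of thumb: ~20 training tokens per parameter (an assumption;
# the project's exact coefficient is not stated).
params  = 2_975_397_632          # 2.94B parameters
tokens  = 55e9                   # pretraining token count
optimal = 20 * params            # ~59.5B tokens
print(f"{tokens / optimal:.1%}") # ~92.4%, in line with the ~93% reported above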

Key Technical Highlights

Technique                     Impact
Chunked Cross-Entropy         Reduces peak logits memory by 8× for the 64K vocabulary
Mamba Memory Cliff discovery  Batch size 6→7 explodes memory from 47 GB to 183 GB+; a structural limitation of selective scan
FP8 Native Training           TransformerEngine MXFP8BlockScaling delivers ~2× throughput vs BF16 on B200
LoRA B-zeroing                Computes DPO reference logprobs without duplicating the model; 50% VRAM savings
SLERP Checkpoint Merging      Balances SFT knowledge and DPO alignment via spherical interpolation; mitigates the alignment tax
Native DPO/ORPO               No TRL dependency; implemented from scratch in PyTorch for the custom Mamba-2 hybrid
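
A minimal sketch of the chunked cross-entropy idea (names and the exact chunking/checkpointing strategy are illustrative; the repository's implementation may differ):

import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def chunked_cross_entropy(hidden, lm_head_weight, targets, chunk_size=1024):
    """LM loss without materializing the full (N, 64000) logits tensor.

    Each chunk of positions is projected to the vocabulary and reduced to a
    summed loss; checkpointing recomputes the chunk's logits during backward,
    so peak logits memory shrinks by roughly N / chunk_size.
    """
    def chunk_loss(h, t):
        logits = (h @ lm_head_weight.T).float()   # (chunk, vocab)
        return F.cross_entropy(logits, t, reduction="sum")

    total = torch.zeros((), device=hidden.device)
    for i in range(0, hidden.size(0), chunk_size):
        total = total + checkpoint(chunk_loss, hidden[i:i + chunk_size],
                                   targets[i:i + chunk_size], use_reentrant=False)
    return total / targets.numel()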

๐Ÿ“– For the complete development journey, architecture design rationale, and hardware optimization details, see the GitHub README.

Model Variants

This repository contains 7 checkpoints representing each stage of the training pipeline.

Variant   Directory  Size     Description                                      Recommended
SLERP     slerp/     6.3 GB   Spherical interpolation of SFT + DPO R2 (α=0.5)  ⭐
Pretrain  pretrain/  12.6 GB  Base model (319K steps, 55B tokens)
SFT v2    sft-v2/    6.3 GB   Instruction-tuned (65K steps)
DPO R1    dpo-r1/    6.3 GB   Preference-aligned, round 1 (3K steps)
DPO R2    dpo-r2/    6.3 GB   Conservative fine-tuning, round 2 (2K steps)
ORPO      orpo/      6.3 GB   Combined SFT + alignment experiment (10K steps)
DPO R3    dpo-r3/    6.3 GB   Repetition-suppression experiment (1K steps)

Training Pipeline

Pretrain (55B tokens, 7ร—B200, 60h)
  โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5 days)
        โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
        โ”‚     โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ Final Recommended
        โ””โ”€โ–บ ORPO (10K steps, experimental)
              โ””โ”€โ–บ DPO R3 (1K steps, repetition experiment)

Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
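
The SLERP merge itself is compact. A per-tensor sketch of what scripts/merge_checkpoints.py does conceptually (the repository's handling of edge cases and dtypes may differ):

import torch

def slerp(w1: torch.Tensor, w2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between two checkpoints' tensors.

    Interpolates along the great circle between the flattened weight vectors;
    falls back to plain linear interpolation when they are nearly parallel.
    """
    v1, v2 = w1.flatten().float(), w2.flatten().float()
    cos = torch.dot(v1, v2) / (v1.norm() * v2.norm() + 1e-8)
    omega = torch.arccos(cos.clamp(-1.0, 1.0))
    if omega < 1e-4:                                 # nearly parallel: lerp
        merged = (1 - alpha) * v1 + alpha * v2
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1 - alpha) * omega) / so) * v1 \
               + (torch.sin(alpha * omega) / so) * v2
    return merged.reshape(w1.shape).to(w1.dtype)

# Applied tensor-by-tensor across the SFT and DPO R2 state dicts:
# merged = {k: slerp(sft_sd[k], dpo_sd[k], alpha=0.5) for k in sft_sd}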

Benchmark Results

Evaluated on: SLERP model (0-shot, limit=500)

Benchmark           Accuracy
HellaSwag           34.6%
ARC-Easy            32.0%
Belebele Korean     23.6%
Global MMLU Korean  23.7%

Repetition suppression (greedy decoding)

Setting          3-gram repetition rate
No rep_penalty   74.5%
rep_penalty=1.2  5.5%

Recommended inference parameters: temperature=0.7, repetition_penalty=1.2

DPO vs ORPO Comparison

Metric             SLERP (SFT→DPO)  ORPO    Winner
Greedy repetition  74.5%            87.1%   SLERP
Chat quality       Fluent           Broken  SLERP
HellaSwag          39.0%            35.0%   SLERP
Training time      5d+8h            12.8h   ORPO

ORPO's weakness: only 10K steps of training vs SFT's 65K โ€” insufficient base instruction-following before alignment kicks in.
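
For reference, the natively implemented DPO objective reduces to a few lines over summed per-response log-probabilities (a sketch: the beta shown is a common default, not necessarily the repo's, and scripts/dpo.py additionally handles batching, prompt masking, and the reference trick sketched earlier):

import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss (Rafailov et al., 2023).

    Pushes the policy's chosen-vs-rejected log-prob margin beyond the
    reference model's margin, scaled by beta.
    """
    margin = (pi_chosen_logp - ref_chosen_logp) \
           - (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()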

Usage

GGUF/Ollama not supported: Custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. PyTorch direct inference only.

Prerequisites:

# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML

Method 1: Direct safetensors loading (recommended)

import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# Generate (recommended: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
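        # CTRL-style repetition penalty (factor 1.2): dampen already-generated tokens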
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"): break

print(tok.decode(ids[0].tolist()))

Method 2: Evaluation framework runner

The evafrill_runner.py in frankenstallm_test wraps the above into a simple API:

from eval_framework.evafrill_runner import generate, unload_model

result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()

Setup instructions: frankenstallm_test README

System requirements: 8 GB+ of GPU VRAM (BF16); CPU inference is possible but extremely slow (~0.5 TPS)

Reproducibility

Path                              Contents
data/combined_preference.jsonl    Preference training data (684K pairs, 2.6 GB)
data/repetition_preference.jsonl  Repetition-suppression preference data (105 pairs, auto-generated)
configs/korean_3b_sft_1gpu.yaml   SFT config for H100 MIG
configs/dpo_3b_1gpu.yaml          DPO training config
configs/orpo_3b_1gpu.yaml         ORPO training config
scripts/dpo.py                    DPO training code
scripts/orpo_native.py            ORPO training code
scripts/sft.py                    SFT training code
scripts/evafrill_eval.py          Benchmark evaluation code
scripts/merge_checkpoints.py      SLERP checkpoint merging

Limitations

  • 3B scale: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
  • GGUF/Ollama: Not supported โ€” custom hybrid Mamba-2 architecture cannot be converted with standard tools.
  • vLLM: Theoretically possible but requires custom weight key mapping.
  • Greedy repetition: ~74.5% 3-gram repetition rate without repetition_penalty โ€” always use repetition_penalty >= 1.2.
  • Language coverage: Performance is not guaranteed for languages other than Korean and English.

Acknowledgment

์ด ํ”„๋กœ์ ํŠธ๋Š” ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€์˜ ใ€Œ์ฒจ๋‹จ GPU ํ™œ์šฉ ์ง€์› ์‚ฌ์—…ใ€ (๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ๊ณต๊ณ  ์ œ2025-1068ํ˜ธ)์„ ํ†ตํ•ด ์ œ๊ณต๋œ GPU ์ปดํ“จํŒ… ์ž์›์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๊ตญ๊ฐ€ AI์ปดํ“จํŒ…์ž์› ์ง€์›ํฌํ„ธ: https://aiinfrahub.kr

  • ์ฃผ๊ด€: ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ (MSIT), ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์› (NIPA)
  • ์šด์˜: ํ•œ๊ตญ์ •๋ณดํ†ต์‹ ์ง„ํฅํ˜‘ํšŒ (KAIT)

๋Œ€ํ•œ๋ฏผ๊ตญ ์ •๋ถ€์˜ AI ์ธํ”„๋ผ ์ง€์› ์‚ฌ์—… ๋•๋ถ„์— 7ร— NVIDIA B200 GPU ํ™˜๊ฒฝ์—์„œ ํ•œ๊ตญ์–ด 3B ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-Transformer ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตญ๊ฐ€ ์ฐจ์›์˜ AI ์ปดํ“จํŒ… ์ž์› ์ง€์›์— ๊นŠ์ด ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

This project was conducted using GPU computing resources provided through the "Advanced GPU Utilization Support Program" (MSIT Notice No. 2025-1068) by the Ministry of Science and ICT (MSIT) of the Republic of Korea.

National AI Computing Resource Support Portal: https://aiinfrahub.kr

  • Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
  • Operated by: Korea Association of Information & Telecommunication (KAIT)

We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7ร— NVIDIA B200 GPUs.


License

MIT License โ€” free to use, modify, and distribute commercially.
