EVAFRILL-Mo 3B: Hybrid Mamba-2 + Transformer
Introduction
EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built entirely from scratch, inspired by NVIDIA's Nemotron-H architecture.
- Pretrained on 55B tokens using 7× NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT → DPO → SLERP pipeline implemented in pure PyTorch, with no Transformers Trainer or TRL
- Designed as a Korean-first model with strong multilingual capability
Architecture
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.94B (2,975,397,632)
Layers: 26 (24× Mamba-2 SSM + 2× GQA attention)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context. Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
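To make the KV-cache claim concrete, here is a hedged back-of-the-envelope comparison. The GQA head configuration is not listed on this card, so the 8 KV heads and head dimension of 128 below are assumptions for illustration only:

```python
# Rough KV-cache estimate for one BF16 sequence at full context.
# Assumed (not from this card): 8 KV heads, head_dim=128.
BYTES_BF16 = 2
seq_len, n_kv_heads, head_dim = 4096, 8, 128

def kv_cache_bytes(n_attn_layers: int) -> int:
    # K and V tensors per attention layer: 2 * heads * head_dim * seq_len
    return 2 * n_kv_heads * head_dim * seq_len * BYTES_BF16 * n_attn_layers

hybrid = kv_cache_bytes(2)   # EVAFRILL-Mo: only 2 GQA layers keep a KV cache
dense = kv_cache_bytes(26)   # hypothetical all-attention model of equal depth
print(f"hybrid: {hybrid / 2**20:.0f} MiB vs dense: {dense / 2**20:.0f} MiB "
      f"({dense / hybrid:.0f}x smaller)")  # per sequence, batch size 1
```

Under these assumptions the hybrid keeps roughly 32 MiB of KV cache per 4,096-token sequence versus ~416 MiB for a same-depth all-attention model; the Mamba-2 layers carry only a fixed-size recurrent state instead.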
Development Background & History
EVAFRILL-Mo was built through 6 iterative design stages:
- FRANKENSTALLM: predecessor project that began as a pure Transformer decoder-only LLM. Built a custom SentencePiece tokenizer (64K vocab) on Korean + English + code + math data and established the DDP training pipeline.
- Nemotron-H inspiration: extracted the core design principles of NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled them down for constrained hardware.
- Systematic scale search: benchmarked 5 model sizes (1B-3B) on 7×B200 to find the largest configuration trainable near the Chinchilla-optimal token budget (3B, reaching 93% of that budget).
- 1B → 3B transition: discovered that the measured tok/s was a per-GPU figure, redirecting the compute budget from over-training a 1B model (681% of its optimal token count) to near-optimal training of a 3B model (93%).
- 3B pretraining: 319,772 steps, 55B tokens, 60 hours on 7×B200 with FP8.
- Post-training: SFT → DPO → SLERP, plus ORPO experiments, on an H100 MIG instance.
Key Technical Highlights
| Technique | Impact |
|---|---|
| Chunked Cross-Entropy | Reduces logits memory by 8× for the 64K vocabulary (see the sketch after this table) |
| Mamba Memory Cliff Discovery | Batch size 6 → 7 causes a 47 GB → 183 GB+ memory explosion, a structural limitation of the selective scan |
| FP8 Native Training | TransformerEngine MXFP8BlockScaling delivers ~2× throughput vs BF16 on B200 |
| LoRA B-zeroing | Computes DPO reference logprobs without duplicating the model → 50% VRAM savings |
| SLERP Checkpoint Merging | Balances SFT knowledge and DPO alignment via spherical interpolation → mitigates the alignment tax |
| Native DPO/ORPO | No TRL dependency; implemented from scratch in PyTorch for the custom Mamba-2 hybrid |
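To illustrate the Chunked Cross-Entropy row, here is a minimal sketch of the technique (not the project's exact implementation): the loss is computed over sequence chunks so that only a [chunk, 64K] logits slice is materialized at a time instead of the full [seq, 64K] tensor.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, targets, chunk=512):
    """Cross-entropy over a 64K vocab without materializing all logits.

    Minimal sketch under assumed shapes: hidden is [seq, d_model],
    lm_head_weight is [vocab, d_model], targets is [seq]. Memory saving
    is roughly seq/chunk, since only one [chunk, vocab] slice lives at once.
    """
    total, n = hidden.new_zeros(()), targets.numel()
    for i in range(0, n, chunk):
        logits = hidden[i:i + chunk] @ lm_head_weight.t()  # [chunk, vocab]
        total = total + F.cross_entropy(
            logits.float(), targets[i:i + chunk], reduction="sum")
    return total / n
```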
For the complete development journey, architecture design rationale, and hardware optimization details, see the GitHub README.
Model Variants
This repository contains 7 checkpoints representing each stage of the training pipeline.
| Variant | Directory | Size | Description | Recommended |
|---|---|---|---|---|
| SLERP | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (α=0.5) | ⭐ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned, round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning, round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT + alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |
Training Pipeline
Pretrain (55B tokens, 7×B200, 60h)
 └─► SFT v2 (65K steps, H100 MIG, 5 days)
      ├─► DPO R1 (3K steps) ─► DPO R2 (2K steps)
      │                            └─► SLERP Merge (α=0.5) ⭐ Final Recommended
      ├─► ORPO (10K steps, experimental)
      └─► DPO R3 (1K steps, repetition experiment)
Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
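The final SLERP merge is simple enough to sketch. Below is a minimal per-tensor spherical interpolation over two checkpoints; the repo's `scripts/merge_checkpoints.py` may differ in details (normalization, key handling), so treat this as an illustration of the idea, not the project's code.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, alpha: float = 0.5, eps: float = 1e-8):
    """Spherical interpolation between two weight tensors of the same shape."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    cos = torch.clamp(torch.dot(a_f, b_f) / (a_f.norm() * b_f.norm() + eps),
                      -1.0, 1.0)
    omega = torch.acos(cos)
    if omega.abs() < 1e-4:                 # near-parallel weights: plain lerp
        out = (1 - alpha) * a_f + alpha * b_f
    else:
        so = torch.sin(omega)
        out = (torch.sin((1 - alpha) * omega) / so) * a_f \
            + (torch.sin(alpha * omega) / so) * b_f
    return out.view_as(a).to(a.dtype)

# merged = {k: slerp(sft_state[k], dpo_state[k], 0.5) for k in sft_state}
```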
Benchmark Results
Evaluated on: SLERP model (0-shot, limit=500)
| Benchmark | Accuracy |
|---|---|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |
Repetition suppression (greedy decoding)
| Setting | 3-gram repetition rate |
|---|---|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | 5.5% |
Recommended inference parameters: temperature=0.7, repetition_penalty=1.2
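The 3-gram repetition rate reported above can be measured with a small helper; this is a minimal sketch of one reasonable definition (the fraction of 3-grams that already occurred earlier in the generation), and the project's `scripts/evafrill_eval.py` may compute it differently.

```python
def trigram_repetition_rate(token_ids: list[int]) -> float:
    """Fraction of 3-grams in a generation that repeat an earlier 3-gram."""
    if len(token_ids) < 3:
        return 0.0
    seen, repeats, total = set(), 0, 0
    for i in range(len(token_ids) - 2):
        gram = tuple(token_ids[i:i + 3])
        if gram in seen:
            repeats += 1
        seen.add(gram)
        total += 1
    return repeats / total
```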
DPO vs ORPO Comparison
| Metric | SLERP (SFT→DPO) | ORPO | Winner |
|---|---|---|---|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | 39.0% | 35.0% | SLERP |
| Training time | 5d+8h | 12.8h | ORPO |
ORPO's weakness: it received only 10K steps of training versus SFT's 65K, leaving insufficient base instruction-following ability before alignment kicks in.
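For reference, the core of the native DPO objective (as opposed to ORPO's reference-free formulation) fits in a few lines. This is a minimal sketch of the standard DPO loss on summed per-sequence log-probs, not the repo's `scripts/dpo.py`; the reference log-probs would come from the frozen (or LoRA-B-zeroed) reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp: torch.Tensor, policy_rejected_lp: torch.Tensor,
             ref_chosen_lp: torch.Tensor, ref_rejected_lp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective; each argument is a [batch] tensor of
    sequence log-likelihoods (sum of token log-probs)."""
    margins = beta * ((policy_chosen_lp - ref_chosen_lp)
                      - (policy_rejected_lp - ref_rejected_lp))
    return -F.logsigmoid(margins).mean()
```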
Usage
GGUF/Ollama not supported: the custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama; only direct PyTorch inference is supported.
Prerequisites:
# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo
# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML
Method 1: Direct safetensors loading (recommended)
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)  # drop card metadata keys that LMConfig does not accept
cfg = LMConfig(**data)
cfg.use_flash_attn = False
model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()
tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# Generate (recommended: temperature=0.7, repetition_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")
with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Repetition penalty: dampen every token id that already appeared
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)  # temperature 0.7
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break
print(tok.decode(ids[0].tolist()))
Method 2: Evaluation framework runner
The evafrill_runner.py in frankenstallm_test wraps the above into a simple API:
from eval_framework.evafrill_runner import generate, unload_model
result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()
Setup instructions: frankenstallm_test README
System requirements: GPU with 8 GB+ VRAM (BF16); CPU inference is possible but extremely slow (~0.5 TPS).
Reproducibility
| Path | Contents |
|---|---|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB; hypothetical example record below) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |
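Preference data of this kind is typically prompt/chosen/rejected JSONL. The card does not show the actual schema of `data/combined_preference.jsonl`, so the field names below are hypothetical, for illustration only:

```python
import json

# Hypothetical record layout; the real field names in
# data/combined_preference.jsonl are not documented on this card.
record = {
    "prompt": "<|user|>\nExplain SLERP in one sentence.\n<|assistant|>\n",
    "chosen": "SLERP interpolates between two points along a great circle ...",
    "rejected": "SLERP SLERP SLERP SLERP ...",
}
print(json.dumps(record, ensure_ascii=False))
```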
Limitations
- 3B scale: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- GGUF/Ollama: Not supported; the custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- vLLM: Theoretically possible, but requires custom weight-key mapping.
- Greedy repetition: ~74.5% 3-gram repetition rate without repetition_penalty; always use repetition_penalty >= 1.2.
- Language coverage: Performance is not guaranteed for languages other than Korean and English.
Links
- GitHub: pathcosmos/EVAFRILL-Mo
- Predecessor: FRANKENSTALLM | 🤗 Hugging Face: pure-Transformer predecessor project
- Reference paper: Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Acknowledgment
This project was conducted using GPU computing resources provided through the "Advanced GPU Utilization Support Program" (MSIT Notice No. 2025-1068) by the Ministry of Science and ICT (MSIT) of the Republic of Korea.
National AI Computing Resource Support Portal: https://aiinfrahub.kr
- Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
- Operated by: Korea Association of Information & Telecommunication (KAIT)
We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7× NVIDIA B200 GPUs.
License
MIT License: free to use, modify, and distribute commercially.