---
language:
  - ko
  - en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
  - mamba2
  - hybrid
  - transformer
  - korean
  - from-scratch
  - dpo
  - slerp
  - orpo
  - nemotron-h
datasets:
  - heegyu/orca-math-korean-preference-cleaned
  - nayohan/preference-collection-ko-full
  - kuotient/orca-math-word-problems-193k-korean
  - FreedomIntelligence/alpaca-gpt4-korean
  - heegyu/orca_ko
  - HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
  - name: EVAFRILL-Mo-3B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: hellaswag
          name: HellaSwag (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 34.6
      - task:
          type: text-generation
        dataset:
          type: arc_easy
          name: ARC-Easy (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 32.0
      - task:
          type: text-generation
        dataset:
          type: belebele
          name: Belebele Korean (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 23.6
      - task:
          type: text-generation
        dataset:
          type: mmlu
          name: Global MMLU Korean (0-shot, limit=500)
        metrics:
          - name: Accuracy
            type: accuracy
            value: 23.7
---

> [ํ•œ๊ตญ์–ด](#ํ•œ๊ตญ์–ด) | [English](#english)

---

# ํ•œ๊ตญ์–ด

## EVAFRILL-Mo 3B โ€” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer

### ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ

EVAFRILL-Mo 3B๋Š” NVIDIA [Nemotron-H](https://arxiv.org/abs/2504.03624) ์•„ํ‚คํ…์ฒ˜์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ **๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ์ง์ ‘ ๊ตฌํ˜„ํ•œ** 30์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

- 7ร— NVIDIA B200 GPU๋กœ 55B ํ† ํฐ ์‚ฌ์ „ํ•™์Šต (์•ฝ 60์‹œ๊ฐ„)
- ํ•œ๊ตญ์–ดยท์˜์–ดยท์ฝ”๋“œยท์ˆ˜ํ•™ ํ˜ผํ•ฉ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ
- SFT โ†’ DPO โ†’ SLERP ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๋‹จ์ผ ํ”„๋กœ์ ํŠธ์—์„œ ์ง์ ‘ ๊ตฌํ˜„
- ์™ธ๋ถ€ ํ”„๋ ˆ์ž„์›Œํฌ(Transformers Trainer, TRL) ์—†์ด PyTorch ๋„ค์ดํ‹ฐ๋ธŒ๋กœ ๊ตฌํ˜„

### ์•„ํ‚คํ…์ฒ˜

```
Type:           Hybrid Mamba-2 + Transformer
Parameters:     2.94B (2,975,397,632)
Layers:         26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model:        3,072
Vocabulary:     64,000 (custom SentencePiece)
Max seq length: 4,096
```

Mamba-2 SSM ๋ธ”๋ก์ด ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ , 2๊ฐœ์˜ GQA Attention ๋ธ”๋ก์ด ์ „์—ญ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค.
ํ‘œ์ค€ Transformer ๋Œ€๋น„ ์ถ”๋ก  ์‹œ KV ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํฌ๊ฒŒ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค.

### ๊ฐœ๋ฐœ ๋ฐฐ๊ฒฝ ๋ฐ ํžˆ์Šคํ† ๋ฆฌ

EVAFRILL-Mo๋Š” 6๋‹จ๊ณ„์˜ ๋ฐ˜๋ณต์  ์„ค๊ณ„ ๊ณผ์ •์„ ๊ฑฐ์ณ ํƒ„์ƒํ–ˆ์Šต๋‹ˆ๋‹ค:

1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” ์ˆœ์ˆ˜ Transformer decoder-only LLM์œผ๋กœ ์‹œ์ž‘ํ•œ ์ „์‹  ํ”„๋กœ์ ํŠธ. ํ•œ๊ตญ์–ด+์˜์–ด+์ฝ”๋“œ+์ˆ˜ํ•™ ๋ฐ์ดํ„ฐ๋กœ ์ปค์Šคํ…€ SentencePiece ํ† ํฌ๋‚˜์ด์ €(64K ์–ดํœ˜)๋ฅผ ํ•™์Šตํ•˜๊ณ , DDP ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ–ˆ์Šต๋‹ˆ๋‹ค.
2. **Nemotron-H ์˜๊ฐ** โ€” NVIDIA์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer ์„ค๊ณ„๋ฅผ ํ•ต์‹ฌ ์›์น™๋งŒ ์ถ”์ถœํ•˜์—ฌ(fragmentation) ์ œํ•œ๋œ ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์ถ•์†Œยท์ ์šฉ.
3. **์ฒด๊ณ„์  ๊ทœ๋ชจ ํƒ์ƒ‰** โ€” 5๊ฐœ ๊ทœ๋ชจ(1B~3B) ๋ชจ๋ธ์„ 7ร—B200์—์„œ ๋ฒค์น˜๋งˆํฌํ•˜์—ฌ Chinchilla-optimal ์ตœ๋Œ€ ๊ทœ๋ชจ(3B, 93% ๋‹ฌ์„ฑ) ๊ฒฐ์ •.
4. **1B โ†’ 3B ์ „ํ™˜** โ€” tok/s๊ฐ€ per-GPU ๊ฐ’์ž„์„ ๋ฐœ๊ฒฌํ•˜์—ฌ, 1B ๊ณผ์ž‰ํ•™์Šต(681%)์„ 3B ์ ์ •ํ•™์Šต(93%)์œผ๋กœ ์ „ํ™˜.
5. **3B ์‚ฌ์ „ํ•™์Šต** โ€” 319,772 steps, 55B tokens, 7ร—B200 FP8๋กœ 60์‹œ๊ฐ„ ์™„๋ฃŒ.
6. **Post-training** โ€” H100 MIG ํ™˜๊ฒฝ์—์„œ SFT โ†’ DPO โ†’ SLERP โ†’ ORPO ์‹คํ—˜๊นŒ์ง€ ์™„์ˆ˜.

### ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ•˜์ด๋ผ์ดํŠธ

| ๊ธฐ์ˆ  | ํšจ๊ณผ |
|------|------|
| **Chunked Cross-Entropy** | 64K ์–ดํœ˜์—์„œ logits ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ 1/8๋กœ ์ ˆ๊ฐ |
| **Mamba Memory Cliff ๋ฐœ๊ฒฌ** | batch 6โ†’7์—์„œ 47GBโ†’183GB+ ํญ์ฆ โ€” selective scan์˜ ๊ตฌ์กฐ์  ์ œ์•ฝ ๊ทœ๋ช… |
| **FP8 ๋„ค์ดํ‹ฐ๋ธŒ ํ•™์Šต** | TransformerEngine MXFP8BlockScaling์œผ๋กœ B200์—์„œ BF16 ๋Œ€๋น„ ~2๋ฐฐ ์ฒ˜๋ฆฌ๋Ÿ‰ |
| **LoRA B-zeroing** | DPO reference model์„ ๋ชจ๋ธ ๋ณต์ œ ์—†์ด LoRA B๋ฅผ ์ž„์‹œ 0์œผ๋กœ ๋งŒ๋“ค์–ด ๊ณ„์‚ฐ โ€” VRAM 50% ์ ˆ์•ฝ |
| **SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ** | SFT ์ง€์‹ ๋ณด์กด + DPO ์ •๋ ฌ์„ ๊ตฌ๋ฉด ๋ณด๊ฐ„์œผ๋กœ ๊ท ํ˜• โ€” alignment tax ์™„ํ™” |
| **Native DPO/ORPO** | TRL ๋ฏธ์‚ฌ์šฉ, ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๋ฅผ ์œ„ํ•ด ์ฒ˜์Œ๋ถ€ํ„ฐ PyTorch๋กœ ๊ตฌํ˜„ |

> ๐Ÿ“– **์ „์ฒด ๊ฐœ๋ฐœ ๊ณผ์ •, ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„ ๊ทผ๊ฑฐ, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ์ƒ์„ธ๋Š” [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.**

### ๋ชจ๋ธ ๋ฒ„์ „

์ด ์ €์žฅ์†Œ์—๋Š” ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ ๊ฐ ๋‹จ๊ณ„์˜ ์ฒดํฌํฌ์ธํŠธ **7์ข…**์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

| ๋ฒ„์ „ | ๋””๋ ‰ํ† ๋ฆฌ | ํฌ๊ธฐ | ์„ค๋ช… | ๊ถŒ์žฅ |
|------|----------|------|------|:----:|
| **SLERP** | `slerp/` | 6.3 GB | SFT + DPO R2 ๊ตฌ๋ฉด ์„ ํ˜• ๋ณด๊ฐ„ (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (319K ์Šคํ…, 55B ํ† ํฐ) | |
| SFT v2 | `sft-v2/` | 6.3 GB | ๋ช…๋ น์–ด ํŒŒ์ธํŠœ๋‹ (65K ์Šคํ…) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | ์„ ํ˜ธ๋„ ์ •๋ ฌ 1๋ผ์šด๋“œ (3K ์Šคํ…) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | ๋ณด์ˆ˜์  ํŒŒ์ธํŠœ๋‹ 2๋ผ์šด๋“œ (2K ์Šคํ…) | |
| ORPO | `orpo/` | 6.3 GB | SFT+์ •๋ ฌ ๋™์‹œ ํ•™์Šต ์‹คํ—˜ (10K ์Šคํ…) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | ๋ฐ˜๋ณต ์–ต์ œ ํŠนํ™” ์‹คํ—˜ (1K ์Šคํ…) | |

### ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ

```
Pretrain (55B tokens, 7ร—B200, 60h)
  โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5์ผ)
        โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
        โ”‚     โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ ์ตœ์ข… ๊ถŒ์žฅ
        โ””โ”€โ–บ ORPO (10K steps, ์‹คํ—˜)
              โ””โ”€โ–บ DPO R3 (1K steps, ๋ฐ˜๋ณต ํŠนํ™” ์‹คํ—˜)
```

๊ฐ ํ™”์‚ดํ‘œ๋Š” ๋…๋ฆฝ๋œ ์ฒดํฌํฌ์ธํŠธ๋กœ ์ €์žฅ๋˜์–ด, ์ž„์˜์˜ ๋‹จ๊ณ„๋ถ€ํ„ฐ ์žฌํ˜„ยท๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

### ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ

**ํ‰๊ฐ€ ๋Œ€์ƒ: SLERP ๋ชจ๋ธ** (0-shot, limit=500)

| ๋ฒค์น˜๋งˆํฌ | ์ •ํ™•๋„ |
|----------|:------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele ํ•œ๊ตญ์–ด | 23.6% |
| Global MMLU ํ•œ๊ตญ์–ด | 23.7% |

**๋ฐ˜๋ณต ์ƒ์„ฑ ์–ต์ œ** (greedy decoding ๊ธฐ์ค€)

| ์„ค์ • | 3-gram ๋ฐ˜๋ณต๋ฅ  |
|------|:-------------:|
| rep_penalty ์—†์Œ | 74.5% |
| rep_penalty=1.2 | **5.5%** |

๊ถŒ์žฅ ์ถ”๋ก  ํŒŒ๋ผ๋ฏธํ„ฐ: `temperature=0.7, repetition_penalty=1.2`

### DPO vs ORPO ๋น„๊ต

| ์ง€ํ‘œ | SLERP (SFTโ†’DPO) | ORPO | ์šฐ์„ธ |
|------|:---------------:|:----:|:----:|
| Greedy ๋ฐ˜๋ณต๋ฅ  | 74.5% | 87.1% | SLERP |
| ๋Œ€ํ™” ํ’ˆ์งˆ | ์ž์—ฐ์Šค๋Ÿฌ์›€ | ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์›€ | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| ํ•™์Šต ์‹œ๊ฐ„ | 5์ผ+8์‹œ๊ฐ„ | **12.8์‹œ๊ฐ„** | ORPO |

ORPO์˜ ์•ฝ์ : SFT 65K ์Šคํ… ๋Œ€๋น„ 10K ์Šคํ…๋งŒ ํ•™์Šต๋˜์–ด ๊ธฐ๋ฐ˜ ๋ช…๋ น์–ด ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.

### ์‚ฌ์šฉ๋ฒ•

> **GGUF/Ollama ๋ฏธ์ง€์›**: ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์•„ํ‚คํ…์ฒ˜๋กœ llama.cpp/GGUF/Ollama์™€ ํ˜ธํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. PyTorch ์ง์ ‘ ์ถ”๋ก ๋งŒ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

**์‚ฌ์ „ ์ค€๋น„:**

```bash
# 1. ์†Œ์Šค ์ฝ”๋“œ ํด๋ก  (์ปค์Šคํ…€ ์•„ํ‚คํ…์ฒ˜ ๋ชจ๋“ˆ ํ•„์š”)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. ์˜์กด์„ฑ ์„ค์น˜
pip install torch safetensors tokenizers PyYAML
```

**๋ฐฉ๋ฒ• 1: safetensors ์ง์ ‘ ๋กœ๋”ฉ (๊ถŒ์žฅ)**

```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # ์ด ์ €์žฅ์†Œ์˜ slerp/ ๋””๋ ‰ํ† ๋ฆฌ

# Config & ๋ชจ๋ธ ๋กœ๋“œ
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# ์ƒ์„ฑ (๊ถŒ์žฅ: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\n์ธ๊ณต์ง€๋Šฅ์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # ์ด์ „์— ์ƒ์„ฑ๋œ ๋ชจ๋“  ํ† ํฐ์— ๋ฐ˜๋ณต ํŽ˜๋„ํ‹ฐ(1.2) ์ ์šฉ
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        # temperature 0.7๋กœ ์ƒ˜ํ”Œ๋ง, ์ข…๋ฃŒ ํ† ํฐ์—์„œ ์ค‘๋‹จ
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```

**๋ฐฉ๋ฒ• 2: ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ ๋Ÿฌ๋„ˆ ์‚ฌ์šฉ**

[frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test)์˜ `evafrill_runner.py`๊ฐ€ ์œ„ ๊ณผ์ •์„ ๋ž˜ํ•‘ํ•ฉ๋‹ˆ๋‹ค:

```python
from eval_framework.evafrill_runner import generate, unload_model

result = generate("ํ•œ๊ตญ์–ด๋กœ ์ธ์‚ฌํ•ด์ฃผ์„ธ์š”.")
print(result["response"])
print(f"์†๋„: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```

> ์„ค์ • ๋ฐฉ๋ฒ•: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก ) ์ฐธ์กฐ

**์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ**: GPU VRAM 8GB+ (BF16), CPU ์ถ”๋ก  ๊ฐ€๋Šฅํ•˜์ง€๋งŒ ๊ทนํžˆ ๋А๋ฆผ (~0.5 TPS)

### ์žฌํ˜„ ์ž๋ฃŒ

| ๊ฒฝ๋กœ | ๋‚ด์šฉ |
|------|------|
| `data/combined_preference.jsonl` | ์„ ํ˜ธ๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ (684K ์Œ, 2.6 GB) |
| `data/repetition_preference.jsonl` | ๋ฐ˜๋ณต ์–ต์ œ ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ (105 ์Œ, ์ž๋™ ์ƒ์„ฑ) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT H100 MIG ์„ค์ • |
| `configs/dpo_3b_1gpu.yaml` | DPO ํ•™์Šต ์„ค์ • |
| `configs/orpo_3b_1gpu.yaml` | ORPO ํ•™์Šต ์„ค์ • |
| `scripts/dpo.py` | DPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/orpo_native.py` | ORPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/sft.py` | SFT ํ•™์Šต ์ฝ”๋“œ |
| `scripts/evafrill_eval.py` | ๋ฒค์น˜๋งˆํฌ ํ‰๊ฐ€ ์ฝ”๋“œ |
| `scripts/merge_checkpoints.py` | SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ |

### ์ œํ•œ์‚ฌํ•ญ

- **3B ๊ทœ๋ชจ ํ•œ๊ณ„**: ์‚ฌ์‹ค ์ •ํ™•๋„ยท๋ณต์žกํ•œ ์ถ”๋ก ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ ๋Œ€๋น„ ์„ฑ๋Šฅ์ด ๋‚ฎ์Šต๋‹ˆ๋‹ค.
- **GGUF/Ollama ๋ถˆ๊ฐ€**: ์ปค์Šคํ…€ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 ์•„ํ‚คํ…์ฒ˜๋กœ ํ‘œ์ค€ ๋ณ€ํ™˜ ํˆด์„ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
- **vLLM ์ œํ•œ์ **: ์ด๋ก ์ƒ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ปค์Šคํ…€ weight key ๋งคํ•‘์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
- **๋ฐ˜๋ณต ์ƒ์„ฑ**: greedy decoding ์‹œ ๋ฐ˜๋ณต๋ฅ ์ด ๋†’์œผ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ `repetition_penalty=1.2` ์ด์ƒ์„ ์„ค์ •ํ•˜์„ธ์š”.
- **์–ธ์–ด ํŽธ์ค‘**: ํ•œ๊ตญ์–ดยท์˜์–ด ์™ธ ์–ธ์–ด๋Š” ์„ฑ๋Šฅ์ด ๋ณด์žฅ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

### ๋งํฌ

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **์ด์ „ ํ”„๋กœ์ ํŠธ**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) โ€” ์ˆœ์ˆ˜ Transformer ๊ธฐ๋ฐ˜ ์ „์‹  ํ”„๋กœ์ ํŠธ
- **์ฐธ์กฐ ๋…ผ๋ฌธ**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

### ๋ผ์ด์„ ์Šค

MIT License โ€” ์ƒ์—…์  ์ด์šฉยท์ˆ˜์ •ยท์žฌ๋ฐฐํฌ ๋ชจ๋‘ ์ž์œ ๋กญ์Šต๋‹ˆ๋‹ค.

---

# English

## EVAFRILL-Mo 3B โ€” Hybrid Mamba-2 + Transformer

### Introduction

EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.

- Pretrained on 55B tokens using 7ร— NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT โ†’ DPO โ†’ SLERP pipeline implemented in pure PyTorch โ€” no Transformers Trainer or TRL
- Designed as a Korean-first model, with English as its secondary language

### Architecture

```
Type:           Hybrid Mamba-2 + Transformer
Parameters:     2.94B (2,975,397,632)
Layers:         26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model:        3,072
Vocabulary:     64,000 (custom SentencePiece)
Max seq length: 4,096
```

Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context.
Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
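
The KV-cache saving is easy to quantify with a back-of-envelope calculation. The head counts below are illustrative assumptions (the card publishes only d_model=3,072 and the 24+2 layer split), so treat the absolute numbers as a sketch; the ratio between the two layouts is the point.

```python
# Back-of-envelope KV-cache comparison (BF16, batch 1, 4,096 tokens).
# n_kv_heads=8 and head_dim=128 are assumed values for illustration.
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per token per attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full_transformer = kv_cache_bytes(n_attn_layers=26, n_kv_heads=8, head_dim=128, seq_len=4096)
hybrid = kv_cache_bytes(n_attn_layers=2, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"26-layer attention: {full_transformer / 2**20:.0f} MiB")
print(f" 2-layer attention: {hybrid / 2**20:.0f} MiB ({full_transformer // hybrid}x smaller)")
```

With only 2 of 26 layers carrying a KV cache, the cache shrinks by the layer ratio (13×) regardless of the exact head configuration.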

### Development Background & History

EVAFRILL-Mo was built through 6 iterative design stages:

1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” Predecessor project starting as a pure Transformer decoder-only LLM. Built custom SentencePiece tokenizer (64K vocab) on Korean+English+code+math data and established DDP training pipeline.
2. **Nemotron-H Inspiration** โ€” Extracted core design principles from NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled down for constrained hardware.
3. **Systematic Scale Search** — Benchmarked 5 model sizes (1B–3B) on 7×B200 to find the largest size trainable near the Chinchilla-optimal token budget (3B, reaching 93% of it).
4. **1B → 3B Transition** — Discovered that the measured tok/s was a per-GPU figure, so aggregate throughput supported a larger model; redirected from over-training 1B (681% of its Chinchilla-optimal budget) to training 3B near-optimally (93%).
5. **3B Pretraining** โ€” 319,772 steps, 55B tokens, 60 hours on 7ร—B200 with FP8.
6. **Post-training** โ€” SFT โ†’ DPO โ†’ SLERP โ†’ ORPO experiments on H100 MIG.

### Key Technical Highlights

| Technique | Impact |
|-----------|--------|
| **Chunked Cross-Entropy** | Reduces logits memory by 8ร— for 64K vocabulary |
| **Mamba Memory Cliff Discovery** | Batch 6โ†’7 causes 47GBโ†’183GB+ explosion โ€” structural limitation of selective scan |
| **FP8 Native Training** | TransformerEngine MXFP8BlockScaling delivers ~2ร— throughput vs BF16 on B200 |
| **LoRA B-zeroing** | Computes DPO reference logprobs without model duplication โ€” 50% VRAM savings |
| **SLERP Checkpoint Merging** | Balances SFT knowledge + DPO alignment via spherical interpolation โ€” mitigates alignment tax |
| **Native DPO/ORPO** | No TRL dependency โ€” implemented from scratch in PyTorch for custom Mamba-2 hybrid |

> ๐Ÿ“– **For the complete development journey, architecture design rationale, and hardware optimization details, see the [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo).**

### Model Variants

This repository contains **7 checkpoints** representing each stage of the training pipeline.

| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |

### Training Pipeline

```
Pretrain (55B tokens, 7ร—B200, 60h)
  โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5 days)
        โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
        โ”‚     โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ Final Recommended
        โ””โ”€โ–บ ORPO (10K steps, experimental)
              โ””โ”€โ–บ DPO R3 (1K steps, repetition experiment)
```

Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
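
The SLERP merge step amounts to per-tensor spherical interpolation between the SFT and DPO R2 state dicts. The following is a minimal sketch; the repo's `scripts/merge_checkpoints.py` may differ in details such as key filtering and dtype handling.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, alpha: float = 0.5, eps: float = 1e-8):
    """Spherical linear interpolation between two weight tensors, each
    treated as one flat vector. Falls back to plain lerp when the
    vectors are nearly colinear (omega ~ 0 makes sin(omega) unstable)."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    cos = torch.dot(a_f, b_f) / (a_f.norm() * b_f.norm() + eps)
    omega = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    if omega.abs() < 1e-3:                       # nearly parallel: plain lerp
        merged = (1 - alpha) * a_f + alpha * b_f
    else:
        merged = (torch.sin((1 - alpha) * omega) * a_f
                  + torch.sin(alpha * omega) * b_f) / torch.sin(omega)
    return merged.reshape(a.shape).to(a.dtype)

def merge_state_dicts(sft_sd, dpo_sd, alpha=0.5):
    # Merge every shared tensor; keys are assumed identical across checkpoints.
    return {k: slerp(sft_sd[k], dpo_sd[k], alpha) for k in sft_sd}

# The midpoint of two orthogonal unit vectors stays on the unit sphere.
print(slerp(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])))
```

At α=0.5 the merged weights sit halfway along the great-circle arc between the SFT and DPO R2 checkpoints, which the pipeline diagram above marks as the final recommended model.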

### Benchmark Results

**Evaluated on: SLERP model** (0-shot, limit=500)

| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |

**Repetition suppression** (greedy decoding)

| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |

Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
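
The 3-gram repetition rate above can be measured as the fraction of trigrams in the output that duplicate an earlier trigram. The card does not specify its exact formula; this is one plausible definition.

```python
def trigram_repetition_rate(token_ids):
    """Fraction of 3-grams that repeat an earlier 3-gram in the same
    output. 0.0 means no trigram ever recurs; values near 1.0 indicate
    a degenerate repetition loop."""
    trigrams = [tuple(token_ids[i:i + 3]) for i in range(len(token_ids) - 2)]
    if not trigrams:
        return 0.0
    return 1.0 - len(set(trigrams)) / len(trigrams)

# A degenerate loop is almost all repeats; non-repeating text scores 0.
print(trigram_repetition_rate([1, 2, 3] * 20))   # ~0.95
print(trigram_repetition_rate(list(range(60))))  # 0.0
```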

### DPO vs ORPO Comparison

| Metric | SLERP (SFTโ†’DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |

ORPO's weakness: only 10K steps of training vs SFT's 65K โ€” insufficient base instruction-following before alignment kicks in.
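
For context, the two objectives being compared can be written compactly. This is a sketch following the original papers (DPO: Rafailov et al., 2023; ORPO: Hong et al., 2024), not the repository's native implementation; the `beta` and `lam` defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: the implicit reward is the policy/reference log-ratio; the loss
    pushes the chosen-minus-rejected reward margin to be positive."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def orpo_odds_ratio_loss(chosen_logp, rejected_logp, lam=0.1):
    """ORPO's odds-ratio penalty, added to the plain SFT loss on the chosen
    response. Inputs are length-averaged per-token log-probs (strictly < 0)."""
    log_odds = (chosen_logp - rejected_logp) - (
        torch.log1p(-torch.exp(chosen_logp)) - torch.log1p(-torch.exp(rejected_logp)))
    return -lam * F.logsigmoid(log_odds).mean()

# A zero preference margin yields the maximum-uncertainty DPO loss, log 2.
t = torch.tensor([-1.0])
print(dpo_loss(t, t, t, t))
```

DPO needs the four reference log-probs on the right; with LoRA adapters these can come from the same model by temporarily zeroing the LoRA B matrices, so the base weights act as the frozen reference policy. That is the 50% VRAM saving listed in the highlights table. ORPO needs no reference model at all, which explains its much shorter training time.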

### Usage

> **GGUF/Ollama not supported**: Custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. PyTorch direct inference only.

**Prerequisites:**

```bash
# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo

# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML
```

**Method 1: Direct safetensors loading (recommended)**

```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# Generate (recommended: temp=0.7, rep_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Apply repetition penalty (1.2) to every previously generated token
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        # Sample with temperature 0.7; stop at the end-of-sequence token
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```

**Method 2: Evaluation framework runner**

The `evafrill_runner.py` in [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps the above into a simple API:

```python
from eval_framework.evafrill_runner import generate, unload_model

result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```

> Setup instructions: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก )

**System requirements**: GPU VRAM 8GB+ (BF16), CPU inference possible but extremely slow (~0.5 TPS)

### Reproducibility

| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |

### Limitations

- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported โ€” custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible but requires custom weight key mapping.
- **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty` โ€” always use `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.

### Links

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Predecessor**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) | [๐Ÿค— HuggingFace](https://huggingface.co/pathcosmos/frankenstallm) โ€” Pure Transformer predecessor project
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

### Acknowledgment / ๊ฐ์‚ฌ์˜ ๊ธ€

์ด ํ”„๋กœ์ ํŠธ๋Š” **๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€**์˜ **ใ€Œ์ฒจ๋‹จ GPU ํ™œ์šฉ ์ง€์› ์‚ฌ์—…ใ€** (๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ๊ณต๊ณ  ์ œ2025-1068ํ˜ธ)์„ ํ†ตํ•ด ์ œ๊ณต๋œ GPU ์ปดํ“จํŒ… ์ž์›์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

> **๊ตญ๊ฐ€ AI์ปดํ“จํŒ…์ž์› ์ง€์›ํฌํ„ธ**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - ์ฃผ๊ด€: ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ (MSIT), ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์› (NIPA)
> - ์šด์˜: ํ•œ๊ตญ์ •๋ณดํ†ต์‹ ์ง„ํฅํ˜‘ํšŒ (KAIT)

๋Œ€ํ•œ๋ฏผ๊ตญ ์ •๋ถ€์˜ AI ์ธํ”„๋ผ ์ง€์› ์‚ฌ์—… ๋•๋ถ„์— 7ร— NVIDIA B200 GPU ํ™˜๊ฒฝ์—์„œ ํ•œ๊ตญ์–ด 3B ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-Transformer ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตญ๊ฐ€ ์ฐจ์›์˜ AI ์ปดํ“จํŒ… ์ž์› ์ง€์›์— ๊นŠ์ด ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

This project was conducted using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) by the **Ministry of Science and ICT (MSIT)** of the Republic of Korea.

> **National AI Computing Resource Support Portal**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)

We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7ร— NVIDIA B200 GPUs.

---

### License

MIT License โ€” free to use, modify, and distribute commercially.