Upload README.md with huggingface_hub

README.md
---
language:
- ko
- en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba2
- hybrid
- dpo
- slerp
- orpo
- nemotron-h
datasets:
- heegyu/orca-math-korean-preference-cleaned
- nayohan/preference-collection-ko-full
- kuotient/orca-math-word-problems-193k-korean
- FreedomIntelligence/alpaca-gpt4-korean
- heegyu/orca_ko
- HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
- name: EVAFRILL-Mo-3B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: hellaswag
      name: HellaSwag (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 34.6
  - task:
      type: text-generation
    dataset:
      type: arc_easy
      name: ARC-Easy (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 32.0
  - task:
      type: text-generation
    dataset:
      type: belebele
      name: Belebele Korean (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.6
  - task:
      type: text-generation
    dataset:
      type: mmlu
      name: Global MMLU Korean (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.7
---

# EVAFRILL-Mo 3B — Hybrid Mamba-2 + Transformer

## Introduction

EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.

- Pretrained on 55B tokens using 7× NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT → DPO → SLERP pipeline implemented in pure PyTorch — no Transformers Trainer or TRL
- Designed as a Korean-first model with strong multilingual capability

## Architecture
```
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.94B (2,975,397,632)
Layers: 26 (24× Mamba-2 SSM + 2× Attention GQA)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
```

Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA attention blocks provide global context.
Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
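For intuition, the 24 + 2 interleaving can be pictured as a plain stack of shape-preserving blocks. This is a minimal sketch only: the real Mamba-2 and GQA block classes live in the repo's `model/transformer.py`, and the positions chosen for the two attention layers below are hypothetical, not taken from the released config.

```python
import torch.nn as nn

D_MODEL, N_LAYERS = 3072, 26
ATTN_POSITIONS = {8, 17}  # hypothetical placement of the 2 GQA attention blocks

class HybridStack(nn.Module):
    """Sketch of a 24x Mamba-2 + 2x attention layer stack."""

    def __init__(self, make_mamba2, make_gqa_attn):
        super().__init__()
        self.blocks = nn.ModuleList(
            make_gqa_attn() if i in ATTN_POSITIONS else make_mamba2()
            for i in range(N_LAYERS)
        )

    def forward(self, x):          # x: (batch, seq_len, D_MODEL)
        for block in self.blocks:  # every block preserves (B, T, D)
            x = block(x)
        return x

# Wiring demo with identity placeholders standing in for the real blocks:
stack = HybridStack(nn.Identity, nn.Identity)
print(len(stack.blocks))  # 26
```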
## Model Variants

This repository contains **7 checkpoints** representing each stage of the training pipeline.

| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (α=0.5) | ⭐ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |
## Training Pipeline

```
Pretrain (55B tokens, 7×B200, 60h)
└─► SFT v2 (65K steps, H100 MIG, 5 days)
    ├─► DPO R1 (3K steps) ─► DPO R2 (2K steps)
    │                        └─► SLERP Merge (α=0.5) ⭐ Final Recommended
    └─► ORPO (10K steps, experimental)
        └─► DPO R3 (1K steps, repetition experiment)
```

Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage. The recommended SLERP checkpoint is the spherical interpolation of the SFT v2 and DPO R2 weights, as sketched below.
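A per-tensor SLERP merge is only a few lines. A minimal sketch, assuming two flat `state_dict`s with identical keys; the actual merge used for this repo lives in `scripts/merge_checkpoints.py` and may differ in detail:

```python
import torch

def slerp(w1: torch.Tensor, w2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    v1, v2 = w1.flatten().float(), w2.flatten().float()
    cos = torch.dot(v1, v2) / (v1.norm() * v2.norm() + 1e-8)
    omega = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # angle between weights
    so = torch.sin(omega)
    if so.abs() < 1e-8:  # near-parallel weights: fall back to plain lerp
        merged = (1 - alpha) * v1 + alpha * v2
    else:
        merged = (torch.sin((1 - alpha) * omega) / so) * v1 \
               + (torch.sin(alpha * omega) / so) * v2
    return merged.view_as(w1).to(w1.dtype)

# Merge two checkpoints tensor by tensor (alpha=0.5 weights them equally):
def slerp_merge(sft_sd: dict, dpo_sd: dict, alpha: float = 0.5) -> dict:
    return {k: slerp(sft_sd[k], dpo_sd[k], alpha) for k in sft_sd}
```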
## Benchmark Results

**Evaluated on: SLERP model** (0-shot, limit=500)

| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |

**Repetition suppression** (greedy decoding)

| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |

Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
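The `repetition_penalty=1.2` setting is what collapses the 3-gram repetition rate from 74.5% to 5.5%. As a reference point, here is a sketch of the standard CTRL-style penalty (divide positive logits of already-generated tokens by the penalty, multiply negative ones); this is the common convention, and the model's own `generate` may implement it differently:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Penalize next-token logits of tokens that were already generated.

    logits: (batch, vocab), generated_ids: (batch, n_generated).
    """
    scores = logits.gather(-1, generated_ids)  # logits of seen tokens
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated_ids, scores)

# Toy usage: penalize token ids 3 and 7 in a 10-token vocabulary
logits = torch.randn(1, 10)
penalized = apply_repetition_penalty(logits, torch.tensor([[3, 7]]))
```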
## DPO vs ORPO Comparison

| Metric | SLERP (SFT→DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |

ORPO's weakness: only 10K steps of training vs SFT's 65K — insufficient base instruction-following before alignment kicks in.
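Because the pipeline avoids TRL, both preference objectives were implemented by hand. The core of DPO reduces to a few lines; a minimal sketch of the standard loss (Rafailov et al., 2023), where `beta=0.1` is an assumed value and the actual hyperparameters live in `configs/dpo_3b_1gpu.yaml`:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over summed response log-probs.

    Each input is a (batch,) tensor of log p(response | prompt) under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```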
## Usage

```python
import torch
from model.transformer import LLM
from tokenizers import Tokenizer

# Requires cloning the repository (custom architecture — not loadable via AutoModel)
# git clone https://github.com/pathcosmos/EVAFRILL-Mo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the recommended SLERP checkpoint in bfloat16
model = LLM.from_pretrained("hf_export/slerp")
model = model.to(device=device, dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")

prompt = "What is artificial intelligence?"
ids = tok.encode(prompt).ids
input_ids = torch.tensor([ids], device=device)

# Greedy decoding repeats heavily; keep repetition_penalty >= 1.2
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=256,
        temperature=0.7,
        repetition_penalty=1.2,
    )

print(tok.decode(output[0].tolist()))
```
## Reproducibility

| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |
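The preference files are JSON Lines. The exact field names are not documented in this card; the snippet below assumes the common `{prompt, chosen, rejected}` DPO convention, so treat it as a hypothetical schema and inspect a line of the file first:

```python
import json

def load_preference_pairs(path: str = "data/combined_preference.jsonl"):
    """Yield (prompt, chosen, rejected) triples from a JSONL preference file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            pair = json.loads(line)  # field names assumed, not documented here
            yield pair["prompt"], pair["chosen"], pair["rejected"]
```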
## Limitations

- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported — custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible but requires custom weight key mapping.
- **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty` — always use `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.
## Links

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

## License

MIT License — free to use, modify, and distribute commercially.