pathcosmos committed
Commit cceae9d · verified · 1 Parent(s): 29fc577

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +309 -84
README.md CHANGED
@@ -3,6 +3,8 @@ language:
  - ko
  - en
  license: mit
  tags:
  - mamba2
  - hybrid
@@ -12,49 +14,226 @@ tags:
  - dpo
  - slerp
  - orpo
- library_name: pytorch
- pipeline_tag: text-generation
  ---

- # EVAFRILL-Mo 3B — Hybrid Mamba-2 + Transformer

- **A 3-billion-parameter hybrid Mamba-2 + Transformer language model built from scratch.**

- Inspired by the NVIDIA [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture. Pretrained on 55B tokens across Korean, English, code, and math using 7× NVIDIA B200 GPUs.

- ## Model Variants

- This repository contains **7 model versions** representing each stage of the training pipeline, plus training data and scripts for full reproducibility.

- | Variant | Directory | Size | Description | Recommended |
- |---------|-----------|------|-------------|:-----------:|
- | **SLERP** | `slerp/` | 6.3GB | SFT + DPO merged (α=0.5) | ⭐ **Yes** |
- | Pretrain | `pretrain/` | 12.6GB | Base model (319K steps, 55B tokens) | |
- | SFT v2 | `sft-v2/` | 6.3GB | Instruction-tuned (65K steps) | |
- | DPO R1 | `dpo-r1/` | 6.3GB | Preference-aligned Round 1 | |
- | DPO R2 | `dpo-r2/` | 6.3GB | Conservative fine-tuning Round 2 | |
- | ORPO | `orpo/` | 6.3GB | SFT+alignment simultaneous (experimental) | |
- | DPO R3 | `dpo-r3/` | 6.3GB | Repetition-targeted (experimental) | |

- ## Training Pipeline

  ```
  Pretrain (55B tokens, 7×B200, 60h)
-
- SFT v2 (65K steps, H100 MIG, 5 days)
-
- DPO Round 1 (3K steps, LoRA, loss 0.693→0.565)
-
- DPO Round 2 (2K steps, conservative, loss 0.692→0.689)
-
- SLERP Merge (α=0.5, SFT 50% + DPO 50%) ← RECOMMENDED
-
- ORPO Experiment (10K steps, alternative approach)
-
- DPO Round 3 (1K steps, repetition-targeted experiment)
  ```

- ## Architecture

  ```
  Type: Hybrid Mamba-2 + Transformer
@@ -65,82 +244,128 @@ Vocabulary: 64,000 (custom SentencePiece)
  Max seq length: 4,096
  ```

- ## Benchmark Results (SLERP, recommended model)
-
- | Metric | Value |
- |--------|-------|
- | Greedy 3-gram repetition | 74.5% (→ 5.5% with rep_penalty=1.2) |
- | hellaswag (0-shot) | 34.6% |
- | arc_easy (0-shot) | 32.0% |
- | belebele_kor (0-shot) | 23.6% |
- | global_mmlu_ko (0-shot) | 23.7% |

- **Recommended inference**: `temperature=0.7, repetition_penalty=1.2`

- ## SFT→DPO→SLERP vs ORPO Comparison
-
- | Metric | SLERP | ORPO | Winner |
- |--------|:-----:|:----:|:------:|
- | Greedy repetition | **74.5%** | 87.1% | SLERP |
- | Chat quality | ✅ Fluent | ❌ Broken | SLERP |
- | hellaswag | **39.0%** | 35.0% | SLERP |
- | Training time | 5d+8h | **12.8h** | ORPO |

- ORPO's weakness: insufficient SFT learning at 10K steps (vs SFT's 65K).

- ## Repository Structure

  ```
- ├── slerp/        # Recommended final model
- ├── pretrain/     # Base pretrained model
- ├── sft-v2/       # SFT instruction-tuned
- ├── dpo-r1/       # DPO Round 1 + LoRA weights
- ├── dpo-r2/       # DPO Round 2 + LoRA weights
- ├── orpo/         # ORPO experiment + LoRA weights
- ├── dpo-r3/       # DPO Round 3
- ├── data/         # Preference datasets for reproducibility
- │   ├── combined_preference.jsonl    (684K pairs, 2.6GB)
- │   └── repetition_preference.jsonl  (105 pairs, self-generated)
- ├── configs/      # Training YAML configs
- │   ├── korean_3b_sft_1gpu.yaml
- │   ├── dpo_3b_1gpu.yaml
- │   └── orpo_3b_1gpu.yaml
- └── scripts/      # Training & evaluation code
-     ├── dpo.py, orpo_native.py, sft.py
-     ├── lora.py, merge_checkpoints.py
-     ├── evafrill_eval.py
-     └── generate_repetition_preference.py
  ```

- ## Usage

- ```python
- # This is a custom architecture — use the project's native loading code
- # Clone: https://github.com/pathcosmos/EVAFRILL-Mo

  import torch
  from model.transformer import LLM
  from tokenizers import Tokenizer

- model = LLM.from_pretrained("checkpoints/3b_dpo/checkpoint-slerp")
- model = model.to(device="cuda:0", dtype=torch.bfloat16)
  model.eval()

  tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
  ```

- ## Limitations

- - **3B scale**: Factual accuracy and complex reasoning are limited
- - **GGUF/Ollama**: Not possible due to custom hybrid Mamba-2 architecture
- - **vLLM**: Theoretically possible but requires custom weight key mapping
- - **Greedy repetition**: ~74.5% without rep_penalty (use rep_penalty=1.2)

- ## Links

  - **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- - **Paper reference**: [Nemotron-H](https://arxiv.org/abs/2504.03624)

- ## License

- MIT
  - ko
  - en
  license: mit
+ library_name: pytorch
+ pipeline_tag: text-generation
  tags:
  - mamba2
  - hybrid

  - dpo
  - slerp
  - orpo
+ - nemotron-h
+ datasets:
+ - heegyu/orca-math-korean-preference-cleaned
+ - nayohan/preference-collection-ko-full
+ - kuotient/orca-math-word-problems-193k-korean
+ - FreedomIntelligence/alpaca-gpt4-korean
+ - heegyu/orca_ko
+ - HAERAE-HUB/KOFFQA-GuardInstruct-v1
+ model-index:
+ - name: EVAFRILL-Mo-3B
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       type: hellaswag
+       name: HellaSwag (0-shot, limit=500)
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 34.6
+   - task:
+       type: text-generation
+     dataset:
+       type: arc_easy
+       name: ARC-Easy (0-shot, limit=500)
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 32.0
+   - task:
+       type: text-generation
+     dataset:
+       type: belebele
+       name: Belebele Korean (0-shot, limit=500)
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 23.6
+   - task:
+       type: text-generation
+     dataset:
+       type: mmlu
+       name: Global MMLU Korean (0-shot, limit=500)
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 23.7
  ---

+ > [한국어](#한국어) | [English](#english)
+
+ ---
+
+ # 한국어
+
+ ## EVAFRILL-Mo 3B — Hybrid Mamba-2 + Transformer
+
+ ### About the Project
+
+ EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model **implemented from scratch**, inspired by the NVIDIA [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.
+
+ - Pretrained on 55B tokens with 7× NVIDIA B200 GPUs (~60 hours)
+ - Trained on a mix of Korean, English, code, and math data
+ - Full SFT → DPO → SLERP pipeline implemented within a single project
+ - Native PyTorch implementation, with no external frameworks (Transformers Trainer, TRL)

+ ### Architecture
+
+ ```
+ Type: Hybrid Mamba-2 + Transformer
+ Parameters: 2.94B (2,975,397,632)
+ Layers: 26 (24× Mamba-2 SSM + 2× Attention GQA)
+ d_model: 3,072
+ Vocabulary: 64,000 (custom SentencePiece)
+ Max seq length: 4,096
+ ```
+
+ Mamba-2 SSM blocks handle long-range dependencies efficiently, and two GQA attention blocks supplement global context.
+ Compared to a standard Transformer, this greatly reduces KV-cache memory at inference time.
+
+ ### Model Variants
+
+ This repository includes **7 checkpoints**, one for each stage of the training pipeline.
+
+ | Variant | Directory | Size | Description | Recommended |
+ |---------|-----------|------|-------------|:----:|
+ | **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (α=0.5) | ⭐ |
+ | Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
+ | SFT v2 | `sft-v2/` | 6.3 GB | Instruction fine-tuned (65K steps) | |
+ | DPO R1 | `dpo-r1/` | 6.3 GB | Preference alignment round 1 (3K steps) | |
+ | DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning round 2 (2K steps) | |
+ | ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
+ | DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-suppression experiment (1K steps) | |
+
+ ### Training Pipeline

  ```
  Pretrain (55B tokens, 7×B200, 60h)
+  └─► SFT v2 (65K steps, H100 MIG, 5 days)
+       ├─► DPO R1 (3K steps) ─► DPO R2 (2K steps)
+       │                          └─► SLERP Merge (α=0.5) ⭐ final recommendation
+       └─► ORPO (10K steps, experiment)
+            └─► DPO R3 (1K steps, repetition-focused experiment)
  ```

+ Each arrow is saved as an independent checkpoint, so training can be reproduced and compared from any stage.
+
+ ### Benchmark Results
+
+ **Evaluated model: SLERP** (0-shot, limit=500)
+
+ | Benchmark | Accuracy |
+ |-----------|:--------:|
+ | HellaSwag | 34.6% |
+ | ARC-Easy | 32.0% |
+ | Belebele Korean | 23.6% |
+ | Global MMLU Korean | 23.7% |
+
+ **Repetition suppression** (greedy decoding)
+
+ | Setting | 3-gram repetition rate |
+ |---------|:----------------------:|
+ | No rep_penalty | 74.5% |
+ | rep_penalty=1.2 | **5.5%** |
+
+ Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
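The 3-gram repetition rate quoted above can be measured in a few lines. This is a sketch of the metric's usual definition; the exact counting in `scripts/evafrill_eval.py` may differ:

```python
from collections import Counter

def trigram_repetition_rate(token_ids):
    """Fraction of 3-grams in a generation that repeat an earlier 3-gram."""
    if len(token_ids) < 3:
        return 0.0
    trigrams = [tuple(token_ids[i:i + 3]) for i in range(len(token_ids) - 2)]
    counts = Counter(trigrams)
    # A 3-gram occurring n times contributes n - 1 repeats.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(trigrams)

# A degenerate loop ("A B C A B C ...") scores high:
looping = [1, 2, 3] * 5              # 13 trigrams, only 3 unique
print(round(trigram_repetition_rate(looping), 2))  # → 0.77
```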
+
+ ### DPO vs ORPO Comparison
+
+ | Metric | SLERP (SFT→DPO) | ORPO | Winner |
+ |--------|:---------------:|:----:|:----:|
+ | Greedy repetition | 74.5% | 87.1% | SLERP |
+ | Chat quality | Fluent | Broken | SLERP |
+ | HellaSwag | **39.0%** | 35.0% | SLERP |
+ | Training time | 5d 8h | **12.8h** | ORPO |
+
+ ORPO's weakness: it trains for only 10K steps versus SFT's 65K, so its base instruction-following remains underdeveloped.
+
+ ### Usage
+
+ ```python
+ import torch
+ from model.transformer import LLM
+ from tokenizers import Tokenizer
+
+ # Custom architecture: clone the repository before use
+ # git clone https://github.com/pathcosmos/EVAFRILL-Mo
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = LLM.from_pretrained("hf_export/slerp")
+ model = model.to(device=device, dtype=torch.bfloat16)
+ model.eval()
+
+ tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
+
+ prompt = "인공지능이란 무엇인가요?"
+ ids = tok.encode(prompt).ids
+ input_ids = torch.tensor([ids], device=device)
+
+ with torch.no_grad():
+     output = model.generate(
+         input_ids,
+         max_new_tokens=256,
+         temperature=0.7,
+         repetition_penalty=1.2,
+     )
+
+ print(tok.decode(output[0].tolist()))
+ ```
+
+ ### Reproducibility Assets
+
+ | Path | Contents |
+ |------|----------|
+ | `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
+ | `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
+ | `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
+ | `configs/dpo_3b_1gpu.yaml` | DPO training config |
+ | `configs/orpo_3b_1gpu.yaml` | ORPO training config |
+ | `scripts/dpo.py` | DPO training code |
+ | `scripts/orpo_native.py` | ORPO training code |
+ | `scripts/sft.py` | SFT training code |
+ | `scripts/evafrill_eval.py` | Benchmark evaluation code |
+ | `scripts/merge_checkpoints.py` | SLERP checkpoint merging |
+
+ ### Limitations
+
+ - **3B scale**: Factual accuracy and complex reasoning are limited, and performance is below that of larger models.
+ - **GGUF/Ollama**: Not possible; the custom hybrid Mamba-2 architecture is not supported by standard conversion tools.
+ - **vLLM**: Theoretically possible, but requires custom weight-key mapping.
+ - **Repetition**: The repetition rate is high under greedy decoding, so always set `repetition_penalty=1.2` or higher.
+ - **Language bias**: Performance is not guaranteed for languages other than Korean and English.
+
+ ### Links
+
+ - **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
+ - **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)
+
+ ### License
+
+ MIT License: commercial use, modification, and redistribution are all permitted.
+
+ ---
+
+ # English
+
+ ## EVAFRILL-Mo 3B — Hybrid Mamba-2 + Transformer
+
+ ### Introduction
+
+ EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.
+
+ - Pretrained on 55B tokens using 7× NVIDIA B200 GPUs (~60 hours)
+ - Mixed Korean, English, code, and math datasets
+ - Full SFT → DPO → SLERP pipeline implemented in pure PyTorch (no Transformers Trainer or TRL)
+ - Designed as a Korean-first model that also supports English
+
+ ### Architecture

  ```
  Type: Hybrid Mamba-2 + Transformer

  Max seq length: 4,096
  ```

+ Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context.
+ Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.

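The KV-cache saving can be put in rough numbers. The sketch below assumes each attention layer caches full `d_model`-wide K/V in bf16 (an assumption not stated in this README; GQA would shrink it further) and that Mamba-2 layers keep only a small constant-size state, so treat the ratio, not the absolute sizes, as the point:

```python
# Back-of-envelope KV-cache estimate at the model's max sequence length.
# Assumptions (hypothetical): K and V each d_model wide per attention
# layer, bf16 storage, negligible Mamba-2 recurrent state.
seq_len, d_model, bf16_bytes = 4096, 3072, 2
per_attn_layer = 2 * seq_len * d_model * bf16_bytes   # K + V per layer

full_transformer = 26 * per_attn_layer   # if all 26 layers used attention
hybrid = 2 * per_attn_layer              # only the 2 GQA layers cache K/V
print(full_transformer // 2**20, "MiB vs", hybrid // 2**20, "MiB")  # 1248 MiB vs 96 MiB
```

With 24 of 26 layers replaced by SSM blocks, the cache shrinks by the layer ratio (13×) under these assumptions.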
+ ### Model Variants

+ This repository contains **7 checkpoints** representing each stage of the training pipeline.

+ | Variant | Directory | Size | Description | Recommended |
+ |---------|-----------|------|-------------|:-----------:|
+ | **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (α=0.5) | ⭐ |
+ | Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
+ | SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
+ | DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
+ | DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
+ | ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
+ | DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |
 
264
+ ### Training Pipeline
265
 
266
  ```
267
+ Pretrain (55B tokens, 7×B200, 60h)
268
+ SFT v2 (65K steps, H100 MIG, 5 days)
269
+ ├─► DPO R1 (3K steps) DPO R2 (2K steps)
270
+ │ └ SLERP Merge (α=0.5) Final Recommended
271
+ ORPO (10K steps, experimental)
272
+ DPO R3 (1K steps, repetition experiment)
 
 
 
 
 
 
 
 
 
 
 
 
 
273
  ```
274
 
275
+ Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
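The SLERP merge step interpolates the SFT and DPO weights along the unit sphere rather than a straight line. `scripts/merge_checkpoints.py` is the authoritative implementation; the core per-tensor operation looks roughly like this (shown on plain Python lists for clarity):

```python
import math

def slerp(a, b, alpha=0.5):
    """Spherical linear interpolation between two flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if theta < 1e-4:                       # near-parallel tensors: plain lerp
        return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa = math.sin((1 - alpha) * theta) / s
    wb = math.sin(alpha * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# alpha = 0.5 weights the SFT and DPO checkpoints equally; a merge would
# apply this tensor-by-tensor: merged = {k: slerp(sft[k], dpo[k]) for k in sft}
print([round(v, 4) for v in slerp([1.0, 0.0], [0.0, 1.0])])  # → [0.7071, 0.7071]
```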

+ ### Benchmark Results

+ **Evaluated on: SLERP model** (0-shot, limit=500)

+ | Benchmark | Accuracy |
+ |-----------|:--------:|
+ | HellaSwag | 34.6% |
+ | ARC-Easy | 32.0% |
+ | Belebele Korean | 23.6% |
+ | Global MMLU Korean | 23.7% |

+ **Repetition suppression** (greedy decoding)

+ | Setting | 3-gram repetition rate |
+ |---------|:----------------------:|
+ | No rep_penalty | 74.5% |
+ | rep_penalty=1.2 | **5.5%** |

+ Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
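`repetition_penalty` follows the CTRL-style formulation: logits of tokens already present in the output are divided by the penalty when positive and multiplied when negative, so seen tokens always become less likely. A minimal sketch of the idea, independent of this repo's `generate` implementation:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discourage tokens that already appeared in the output.

    Positive logits are divided by `penalty`, negative ones multiplied,
    so for penalty > 1 every seen token loses probability mass.
    """
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched:
print([round(v, 3) for v in apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1])])  # → [1.667, -1.2, 0.5]
```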

+ ### DPO vs ORPO Comparison

+ | Metric | SLERP (SFT→DPO) | ORPO | Winner |
+ |--------|:---------------:|:----:|:------:|
+ | Greedy repetition | 74.5% | 87.1% | SLERP |
+ | Chat quality | Fluent | Broken | SLERP |
+ | HellaSwag | **39.0%** | 35.0% | SLERP |
+ | Training time | 5d 8h | **12.8h** | ORPO |

+ ORPO's weakness: only 10K steps of training versus SFT's 65K, leaving base instruction-following underdeveloped before alignment takes effect.
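For context, a DPO update scores one chosen/rejected pair through the policy and a frozen reference model; at initialization the two match, so the loss starts at ln 2 ≈ 0.693 and falls as the policy learns to prefer the chosen response. A minimal sketch with hypothetical log-probabilities (the `beta` value is illustrative, not this repo's setting):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed log-probs of the chosen/rejected responses under
    the policy and the frozen reference model.
    """
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Step 0: policy == reference, margin is 0, loss is ln(2):
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))  # → 0.693
# After learning, the policy widens the chosen/rejected gap and the loss drops:
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 3))
```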

+ ### Usage

+ ```python
  import torch
  from model.transformer import LLM
  from tokenizers import Tokenizer

+ # Requires cloning the repository (custom architecture, not loadable via AutoModel)
+ # git clone https://github.com/pathcosmos/EVAFRILL-Mo
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = LLM.from_pretrained("hf_export/slerp")
+ model = model.to(device=device, dtype=torch.bfloat16)
  model.eval()

  tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
+
+ prompt = "What is artificial intelligence?"
+ ids = tok.encode(prompt).ids
+ input_ids = torch.tensor([ids], device=device)
+
+ with torch.no_grad():
+     output = model.generate(
+         input_ids,
+         max_new_tokens=256,
+         temperature=0.7,
+         repetition_penalty=1.2,
+     )
+
+ print(tok.decode(output[0].tolist()))
  ```

+ ### Reproducibility

+ | Path | Contents |
+ |------|----------|
+ | `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
+ | `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
+ | `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
+ | `configs/dpo_3b_1gpu.yaml` | DPO training config |
+ | `configs/orpo_3b_1gpu.yaml` | ORPO training config |
+ | `scripts/dpo.py` | DPO training code |
+ | `scripts/orpo_native.py` | ORPO training code |
+ | `scripts/sft.py` | SFT training code |
+ | `scripts/evafrill_eval.py` | Benchmark evaluation code |
+ | `scripts/merge_checkpoints.py` | SLERP checkpoint merging |

+ ### Limitations

+ - **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
+ - **GGUF/Ollama**: Not supported; the custom hybrid Mamba-2 architecture cannot be converted with standard tools.
+ - **vLLM**: Theoretically possible but requires custom weight-key mapping.
+ - **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty`; always use `repetition_penalty >= 1.2`.
+ - **Language coverage**: Performance is not guaranteed for languages other than Korean and English.

+ ### Links

  - **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
+ - **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

+ ### License

+ MIT License: free to use, modify, and distribute commercially.