Upload README.md with huggingface_hub

README.md
---
language:
- ko
- en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba2
- hybrid
- dpo
- slerp
- orpo
- nemotron-h
datasets:
- heegyu/orca-math-korean-preference-cleaned
- nayohan/preference-collection-ko-full
- kuotient/orca-math-word-problems-193k-korean
- FreedomIntelligence/alpaca-gpt4-korean
- heegyu/orca_ko
- HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
- name: EVAFRILL-Mo-3B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: hellaswag
      name: HellaSwag (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 34.6
  - task:
      type: text-generation
    dataset:
      type: arc_easy
      name: ARC-Easy (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 32.0
  - task:
      type: text-generation
    dataset:
      type: belebele
      name: Belebele Korean (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.6
  - task:
      type: text-generation
    dataset:
      type: mmlu
      name: Global MMLU Korean (0-shot, limit=500)
    metrics:
    - name: Accuracy
      type: accuracy
      value: 23.7
---

# EVAFRILL-Mo 3B — Hybrid Mamba-2 + Transformer

## Introduction

EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.

- Pretrained on 55B tokens using 7× NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT → DPO → SLERP pipeline implemented in pure PyTorch — no Transformers Trainer or TRL
- Designed as a Korean-first model with strong multilingual capability

## Architecture
```
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.94B (2,975,397,632)
Layers: 26 (24× Mamba-2 SSM + 2× Attention GQA)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
```

Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA attention blocks provide global context.
Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
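For intuition, the 24 + 2 interleaving can be pictured as a plain stack of shape-preserving blocks. This is a minimal sketch only: the real Mamba-2 and GQA block classes live in the repo's `model/transformer.py`, and the positions chosen for the two attention layers below are hypothetical, not taken from the released config.

```python
import torch.nn as nn

D_MODEL, N_LAYERS = 3072, 26
ATTN_POSITIONS = {8, 17}  # hypothetical placement of the 2 GQA attention blocks

class HybridStack(nn.Module):
    """Sketch of a 24x Mamba-2 + 2x attention layer stack."""

    def __init__(self, make_mamba2, make_gqa_attn):
        super().__init__()
        self.blocks = nn.ModuleList(
            make_gqa_attn() if i in ATTN_POSITIONS else make_mamba2()
            for i in range(N_LAYERS)
        )

    def forward(self, x):          # x: (batch, seq_len, D_MODEL)
        for block in self.blocks:  # every block preserves (B, T, D)
            x = block(x)
        return x

# Wiring demo with identity placeholders standing in for the real blocks:
stack = HybridStack(nn.Identity, nn.Identity)
print(len(stack.blocks))  # 26
```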
## Model Variants

This repository contains **7 checkpoints** representing each stage of the training pipeline.

| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (α=0.5) | ⭐ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |
## Training Pipeline

```
Pretrain (55B tokens, 7×B200, 60h)
└─► SFT v2 (65K steps, H100 MIG, 5 days)
    ├─► DPO R1 (3K steps) ─► DPO R2 (2K steps)
    │                        └─► SLERP Merge (α=0.5) ⭐ Final Recommended
    └─► ORPO (10K steps, experimental)
        └─► DPO R3 (1K steps, repetition experiment)
```

Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage. The recommended SLERP checkpoint is the spherical interpolation of the SFT v2 and DPO R2 weights, as sketched below.
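A per-tensor SLERP merge is only a few lines. A minimal sketch, assuming two flat `state_dict`s with identical keys; the actual merge used for this repo lives in `scripts/merge_checkpoints.py` and may differ in detail:

```python
import torch

def slerp(w1: torch.Tensor, w2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    v1, v2 = w1.flatten().float(), w2.flatten().float()
    cos = torch.dot(v1, v2) / (v1.norm() * v2.norm() + 1e-8)
    omega = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))  # angle between weights
    so = torch.sin(omega)
    if so.abs() < 1e-8:  # near-parallel weights: fall back to plain lerp
        merged = (1 - alpha) * v1 + alpha * v2
    else:
        merged = (torch.sin((1 - alpha) * omega) / so) * v1 \
               + (torch.sin(alpha * omega) / so) * v2
    return merged.view_as(w1).to(w1.dtype)

# Merge two checkpoints tensor by tensor (alpha=0.5 weights them equally):
def slerp_merge(sft_sd: dict, dpo_sd: dict, alpha: float = 0.5) -> dict:
    return {k: slerp(sft_sd[k], dpo_sd[k], alpha) for k in sft_sd}
```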
## Benchmark Results

**Evaluated on: SLERP model** (0-shot, limit=500)

| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |

**Repetition suppression** (greedy decoding)

| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |

Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
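The `repetition_penalty=1.2` setting is what collapses the 3-gram repetition rate from 74.5% to 5.5%. As a reference point, here is a sketch of the standard CTRL-style penalty (divide positive logits of already-generated tokens by the penalty, multiply negative ones); this is the common convention, and the model's own `generate` may implement it differently:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Penalize next-token logits of tokens that were already generated.

    logits: (batch, vocab), generated_ids: (batch, n_generated).
    """
    scores = logits.gather(-1, generated_ids)  # logits of seen tokens
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated_ids, scores)

# Toy usage: penalize token ids 3 and 7 in a 10-token vocabulary
logits = torch.randn(1, 10)
penalized = apply_repetition_penalty(logits, torch.tensor([[3, 7]]))
```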
## DPO vs ORPO Comparison

| Metric | SLERP (SFT→DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |

ORPO's weakness: only 10K steps of training vs SFT's 65K — insufficient base instruction-following before alignment kicks in.
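Because the pipeline avoids TRL, both preference objectives were implemented by hand. The core of DPO reduces to a few lines; a minimal sketch of the standard loss (Rafailov et al., 2023), where `beta=0.1` is an assumed value and the actual hyperparameters live in `configs/dpo_3b_1gpu.yaml`:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over summed response log-probs.

    Each input is a (batch,) tensor of log p(response | prompt) under the
    trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```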
## Usage

```python
import torch
from model.transformer import LLM
from tokenizers import Tokenizer

# Requires cloning the repository (custom architecture — not loadable via AutoModel)
# git clone https://github.com/pathcosmos/EVAFRILL-Mo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the recommended SLERP checkpoint in bfloat16
model = LLM.from_pretrained("hf_export/slerp")
model = model.to(device=device, dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")

prompt = "What is artificial intelligence?"
ids = tok.encode(prompt).ids
input_ids = torch.tensor([ids], device=device)

# Greedy decoding repeats heavily; keep repetition_penalty >= 1.2
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=256,
        temperature=0.7,
        repetition_penalty=1.2,
    )

print(tok.decode(output[0].tolist()))
```
## Reproducibility

| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |
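The preference files are JSON Lines. The exact field names are not documented in this card; the snippet below assumes the common `{prompt, chosen, rejected}` DPO convention, so treat it as a hypothetical schema and inspect a line of the file first:

```python
import json

def load_preference_pairs(path: str = "data/combined_preference.jsonl"):
    """Yield (prompt, chosen, rejected) triples from a JSONL preference file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            pair = json.loads(line)  # field names assumed, not documented here
            yield pair["prompt"], pair["chosen"], pair["rejected"]
```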
## Limitations

- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported — custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible but requires custom weight key mapping.
- **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty` — always use `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.
## Links

- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)

## License

MIT License — free to use, modify, and distribute commercially.