---
language:
- ko
- en
license: mit
library_name: pytorch
pipeline_tag: text-generation
tags:
- mamba2
- hybrid
- transformer
- korean
- from-scratch
- dpo
- slerp
- orpo
- nemotron-h
datasets:
- heegyu/orca-math-korean-preference-cleaned
- nayohan/preference-collection-ko-full
- kuotient/orca-math-word-problems-193k-korean
- FreedomIntelligence/alpaca-gpt4-korean
- heegyu/orca_ko
- HAERAE-HUB/KOFFQA-GuardInstruct-v1
model-index:
- name: EVAFRILL-Mo-3B
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: hellaswag
name: HellaSwag (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 34.6
- task:
type: text-generation
dataset:
type: arc_easy
name: ARC-Easy (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 32.0
- task:
type: text-generation
dataset:
type: belebele
name: Belebele Korean (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 23.6
- task:
type: text-generation
dataset:
type: mmlu
name: Global MMLU Korean (0-shot, limit=500)
metrics:
- name: Accuracy
type: accuracy
value: 23.7
---
> [ํ•œ๊ตญ์–ด](#ํ•œ๊ตญ์–ด) | [English](#english)
---
# ํ•œ๊ตญ์–ด
## EVAFRILL-Mo 3B โ€” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer
### ํ”„๋กœ์ ํŠธ ์†Œ๊ฐœ
EVAFRILL-Mo 3B๋Š” NVIDIA [Nemotron-H](https://arxiv.org/abs/2504.03624) ์•„ํ‚คํ…์ฒ˜์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ **๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ์ง์ ‘ ๊ตฌํ˜„ํ•œ** 30์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
- 7ร— NVIDIA B200 GPU๋กœ 55B ํ† ํฐ ์‚ฌ์ „ํ•™์Šต (์•ฝ 60์‹œ๊ฐ„)
- ํ•œ๊ตญ์–ดยท์˜์–ดยท์ฝ”๋“œยท์ˆ˜ํ•™ ํ˜ผํ•ฉ ๋ฐ์ดํ„ฐ์…‹ ์‚ฌ์šฉ
- SFT โ†’ DPO โ†’ SLERP ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๋‹จ์ผ ํ”„๋กœ์ ํŠธ์—์„œ ์ง์ ‘ ๊ตฌํ˜„
- ์™ธ๋ถ€ ํ”„๋ ˆ์ž„์›Œํฌ(Transformers Trainer, TRL) ์—†์ด PyTorch ๋„ค์ดํ‹ฐ๋ธŒ๋กœ ๊ตฌํ˜„
### ์•„ํ‚คํ…์ฒ˜
```
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.98B (2,975,397,632)
Layers: 26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
```
Mamba-2 SSM ๋ธ”๋ก์ด ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ , 2๊ฐœ์˜ GQA Attention ๋ธ”๋ก์ด ์ „์—ญ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค.
ํ‘œ์ค€ Transformer ๋Œ€๋น„ ์ถ”๋ก  ์‹œ KV ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํฌ๊ฒŒ ์ ˆ๊ฐํ•ฉ๋‹ˆ๋‹ค.
### ๊ฐœ๋ฐœ ๋ฐฐ๊ฒฝ ๋ฐ ํžˆ์Šคํ† ๋ฆฌ
EVAFRILL-Mo๋Š” 6๋‹จ๊ณ„์˜ ๋ฐ˜๋ณต์  ์„ค๊ณ„ ๊ณผ์ •์„ ๊ฑฐ์ณ ํƒ„์ƒํ–ˆ์Šต๋‹ˆ๋‹ค:
1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” ์ˆœ์ˆ˜ Transformer decoder-only LLM์œผ๋กœ ์‹œ์ž‘ํ•œ ์ „์‹  ํ”„๋กœ์ ํŠธ. ํ•œ๊ตญ์–ด+์˜์–ด+์ฝ”๋“œ+์ˆ˜ํ•™ ๋ฐ์ดํ„ฐ๋กœ ์ปค์Šคํ…€ SentencePiece ํ† ํฌ๋‚˜์ด์ €(64K ์–ดํœ˜)๋ฅผ ํ•™์Šตํ•˜๊ณ , DDP ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ–ˆ์Šต๋‹ˆ๋‹ค.
2. **Nemotron-H ์˜๊ฐ** โ€” NVIDIA์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 + Transformer ์„ค๊ณ„๋ฅผ ํ•ต์‹ฌ ์›์น™๋งŒ ์ถ”์ถœํ•˜์—ฌ(fragmentation) ์ œํ•œ๋œ ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์ถ•์†Œยท์ ์šฉ.
3. **์ฒด๊ณ„์  ๊ทœ๋ชจ ํƒ์ƒ‰** โ€” 5๊ฐœ ๊ทœ๋ชจ(1B~3B) ๋ชจ๋ธ์„ 7ร—B200์—์„œ ๋ฒค์น˜๋งˆํฌํ•˜์—ฌ Chinchilla-optimal ์ตœ๋Œ€ ๊ทœ๋ชจ(3B, 93% ๋‹ฌ์„ฑ) ๊ฒฐ์ •.
4. **1B โ†’ 3B ์ „ํ™˜** โ€” tok/s๊ฐ€ per-GPU ๊ฐ’์ž„์„ ๋ฐœ๊ฒฌํ•˜์—ฌ, 1B ๊ณผ์ž‰ํ•™์Šต(681%)์„ 3B ์ ์ •ํ•™์Šต(93%)์œผ๋กœ ์ „ํ™˜.
5. **3B ์‚ฌ์ „ํ•™์Šต** โ€” 319,772 steps, 55B tokens, 7ร—B200 FP8๋กœ 60์‹œ๊ฐ„ ์™„๋ฃŒ.
6. **Post-training** โ€” H100 MIG ํ™˜๊ฒฝ์—์„œ SFT โ†’ DPO โ†’ SLERP โ†’ ORPO ์‹คํ—˜๊นŒ์ง€ ์™„์ˆ˜.
### ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ•˜์ด๋ผ์ดํŠธ
| ๊ธฐ์ˆ  | ํšจ๊ณผ |
|------|------|
| **Chunked Cross-Entropy** | 64K ์–ดํœ˜์—์„œ logits ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ 1/8๋กœ ์ ˆ๊ฐ |
| **Mamba Memory Cliff ๋ฐœ๊ฒฌ** | batch 6โ†’7์—์„œ 47GBโ†’183GB+ ํญ์ฆ โ€” selective scan์˜ ๊ตฌ์กฐ์  ์ œ์•ฝ ๊ทœ๋ช… |
| **FP8 ๋„ค์ดํ‹ฐ๋ธŒ ํ•™์Šต** | TransformerEngine MXFP8BlockScaling์œผ๋กœ B200์—์„œ BF16 ๋Œ€๋น„ ~2๋ฐฐ ์ฒ˜๋ฆฌ๋Ÿ‰ |
| **LoRA B-zeroing** | DPO reference model์„ ๋ชจ๋ธ ๋ณต์ œ ์—†์ด LoRA B๋ฅผ ์ž„์‹œ 0์œผ๋กœ ๋งŒ๋“ค์–ด ๊ณ„์‚ฐ โ€” VRAM 50% ์ ˆ์•ฝ |
| **SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ** | SFT ์ง€์‹ ๋ณด์กด + DPO ์ •๋ ฌ์„ ๊ตฌ๋ฉด ๋ณด๊ฐ„์œผ๋กœ ๊ท ํ˜• โ€” alignment tax ์™„ํ™” |
| **Native DPO/ORPO** | TRL ๋ฏธ์‚ฌ์šฉ, ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๋ฅผ ์œ„ํ•ด ์ฒ˜์Œ๋ถ€ํ„ฐ PyTorch๋กœ ๊ตฌํ˜„ |
> ๐Ÿ“– **์ „์ฒด ๊ฐœ๋ฐœ ๊ณผ์ •, ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„ ๊ทผ๊ฑฐ, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ์ƒ์„ธ๋Š” [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo)๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.**
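์œ„ ํ‘œ์˜ Chunked Cross-Entropy ์•„์ด๋””์–ด๋Š” ๋ช‡ ์ค„๋กœ ์Šค์ผ€์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ €์žฅ์†Œ์˜ ์‹ค์ œ ๊ตฌํ˜„์ด ์•„๋‹ˆ๋ผ ์›๋ฆฌ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ์„ค๋ช…์šฉ ์˜ˆ์‹œ์ด๋ฉฐ, ํ•จ์ˆ˜๋ช…๊ณผ chunk ํฌ๊ธฐ๋Š” ๊ฐ€์ •์ž…๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ (ํ† ํฐ ์ˆ˜ ร— 64K ์–ดํœ˜) ํฌ๊ธฐ์˜ logits ํ…์„œ๋ฅผ ํ•œ ๋ฒˆ์— ๋งŒ๋“ค์ง€ ์•Š๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """hidden(ํ† ํฐร—d_model)์„ chunk ๋‹จ์œ„๋กœ๋งŒ vocab logits๋กœ ์‚ฌ์˜ํ•˜์—ฌ
    ์ „์ฒด logits ํ…์„œ๋ฅผ ํ•œ ๋ฒˆ์— ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ฆฌ์ง€ ์•Š๋Š”๋‹ค."""
    losses = []
    for i in range(0, hidden.size(0), chunk_size):
        logits = hidden[i:i + chunk_size] @ weight.t()  # (chunk, vocab)
        losses.append(F.cross_entropy(
            logits.float(), targets[i:i + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / targets.numel()

# ์žฅ๋‚œ๊ฐ ํฌ๊ธฐ ๊ฒ€์ฆ: chunk ๋ถ„ํ•  ๊ฒฐ๊ณผ๊ฐ€ ์ผ๋ฐ˜ cross-entropy์™€ ์ผ์น˜
torch.manual_seed(0)
h = torch.randn(10, 8)           # 10 ํ† ํฐ, d_model=8
w = torch.randn(32, 8)           # vocab=32 ์ถœ๋ ฅ ์‚ฌ์˜ ํ–‰๋ ฌ
t = torch.randint(0, 32, (10,))
full = F.cross_entropy(h @ w.t(), t)
assert torch.allclose(chunked_cross_entropy(h, w, t, chunk_size=4), full, atol=1e-5)
```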
### ๋ชจ๋ธ ๋ฒ„์ „
์ด ์ €์žฅ์†Œ์—๋Š” ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ ๊ฐ ๋‹จ๊ณ„์˜ ์ฒดํฌํฌ์ธํŠธ **7์ข…**์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.
| ๋ฒ„์ „ | ๋””๋ ‰ํ† ๋ฆฌ | ํฌ๊ธฐ | ์„ค๋ช… | ๊ถŒ์žฅ |
|------|----------|------|------|:----:|
| **SLERP** | `slerp/` | 6.3 GB | SFT + DPO R2 ๊ตฌ๋ฉด ์„ ํ˜• ๋ณด๊ฐ„ (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (319K ์Šคํ…, 55B ํ† ํฐ) | |
| SFT v2 | `sft-v2/` | 6.3 GB | ๋ช…๋ น์–ด ํŒŒ์ธํŠœ๋‹ (65K ์Šคํ…) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | ์„ ํ˜ธ๋„ ์ •๋ ฌ 1๋ผ์šด๋“œ (3K ์Šคํ…) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | ๋ณด์ˆ˜์  ํŒŒ์ธํŠœ๋‹ 2๋ผ์šด๋“œ (2K ์Šคํ…) | |
| ORPO | `orpo/` | 6.3 GB | SFT+์ •๋ ฌ ๋™์‹œ ํ•™์Šต ์‹คํ—˜ (10K ์Šคํ…) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | ๋ฐ˜๋ณต ์–ต์ œ ํŠนํ™” ์‹คํ—˜ (1K ์Šคํ…) | |
### ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ
```
Pretrain (55B tokens, 7ร—B200, 60h)
โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5์ผ)
โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
โ”‚ โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ ์ตœ์ข… ๊ถŒ์žฅ
โ””โ”€โ–บ ORPO (10K steps, ์‹คํ—˜)
โ””โ”€โ–บ DPO R3 (1K steps, ๋ฐ˜๋ณต ํŠนํ™” ์‹คํ—˜)
```
๊ฐ ํ™”์‚ดํ‘œ๋Š” ๋…๋ฆฝ๋œ ์ฒดํฌํฌ์ธํŠธ๋กœ ์ €์žฅ๋˜์–ด, ์ž„์˜์˜ ๋‹จ๊ณ„๋ถ€ํ„ฐ ์žฌํ˜„ยท๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
### ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ
**ํ‰๊ฐ€ ๋Œ€์ƒ: SLERP ๋ชจ๋ธ** (0-shot, limit=500)
| ๋ฒค์น˜๋งˆํฌ | ์ •ํ™•๋„ |
|----------|:------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele ํ•œ๊ตญ์–ด | 23.6% |
| Global MMLU ํ•œ๊ตญ์–ด | 23.7% |
**๋ฐ˜๋ณต ์ƒ์„ฑ ์–ต์ œ** (greedy decoding ๊ธฐ์ค€)
| ์„ค์ • | 3-gram ๋ฐ˜๋ณต๋ฅ  |
|------|:-------------:|
| rep_penalty ์—†์Œ | 74.5% |
| rep_penalty=1.2 | **5.5%** |
๊ถŒ์žฅ ์ถ”๋ก  ํŒŒ๋ผ๋ฏธํ„ฐ: `temperature=0.7, repetition_penalty=1.2`
### DPO vs ORPO ๋น„๊ต
| ์ง€ํ‘œ | SLERP (SFTโ†’DPO) | ORPO | ์šฐ์„ธ |
|------|:---------------:|:----:|:----:|
| Greedy ๋ฐ˜๋ณต๋ฅ  | 74.5% | 87.1% | SLERP |
| ๋Œ€ํ™” ํ’ˆ์งˆ | ์ž์—ฐ์Šค๋Ÿฌ์›€ | ๋ถ€์ž์—ฐ์Šค๋Ÿฌ์›€ | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| ํ•™์Šต ์‹œ๊ฐ„ | 5์ผ+8์‹œ๊ฐ„ | **12.8์‹œ๊ฐ„** | ORPO |
ORPO์˜ ์•ฝ์ : SFT 65K ์Šคํ… ๋Œ€๋น„ 10K ์Šคํ…๋งŒ ํ•™์Šต๋˜์–ด ๊ธฐ๋ฐ˜ ๋ช…๋ น์–ด ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.
### ์‚ฌ์šฉ๋ฒ•
> **GGUF/Ollama ๋ฏธ์ง€์›**: ์ปค์Šคํ…€ Mamba-2 ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์•„ํ‚คํ…์ฒ˜๋กœ llama.cpp/GGUF/Ollama์™€ ํ˜ธํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. PyTorch ์ง์ ‘ ์ถ”๋ก ๋งŒ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
**์‚ฌ์ „ ์ค€๋น„:**
```bash
# 1. ์†Œ์Šค ์ฝ”๋“œ ํด๋ก  (์ปค์Šคํ…€ ์•„ํ‚คํ…์ฒ˜ ๋ชจ๋“ˆ ํ•„์š”)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo
# 2. ์˜์กด์„ฑ ์„ค์น˜
pip install torch safetensors tokenizers PyYAML
```
**๋ฐฉ๋ฒ• 1: safetensors ์ง์ ‘ ๋กœ๋”ฉ (๊ถŒ์žฅ)**
```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # ์ด ์ €์žฅ์†Œ์˜ slerp/ ๋””๋ ‰ํ† ๋ฆฌ

# Config & ๋ชจ๋ธ ๋กœ๋“œ
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# ์ƒ์„ฑ (๊ถŒ์žฅ: temperature=0.7, repetition_penalty=1.2)
prompt = "<|user|>\n์ธ๊ณต์ง€๋Šฅ์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€์š”?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # repetition penalty: ์ด๋ฏธ ๋“ฑ์žฅํ•œ ํ† ํฐ์˜ logit์„ ์•ฝํ™”
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```
**๋ฐฉ๋ฒ• 2: ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ ๋Ÿฌ๋„ˆ ์‚ฌ์šฉ**
[frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test)์˜ `evafrill_runner.py`๊ฐ€ ์œ„ ๊ณผ์ •์„ ๋ž˜ํ•‘ํ•ฉ๋‹ˆ๋‹ค:
```python
from eval_framework.evafrill_runner import generate, unload_model
result = generate("ํ•œ๊ตญ์–ด๋กœ ์ธ์‚ฌํ•ด์ฃผ์„ธ์š”.")
print(result["response"])
print(f"์†๋„: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```
> ์„ค์ • ๋ฐฉ๋ฒ•: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก ) ์ฐธ์กฐ
**์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ**: GPU VRAM 8GB+ (BF16), CPU ์ถ”๋ก  ๊ฐ€๋Šฅํ•˜์ง€๋งŒ ๊ทนํžˆ ๋А๋ฆผ (~0.5 TPS)
### ์žฌํ˜„ ์ž๋ฃŒ
| ๊ฒฝ๋กœ | ๋‚ด์šฉ |
|------|------|
| `data/combined_preference.jsonl` | ์„ ํ˜ธ๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ (684K ์Œ, 2.6 GB) |
| `data/repetition_preference.jsonl` | ๋ฐ˜๋ณต ์–ต์ œ ์„ ํ˜ธ๋„ ๋ฐ์ดํ„ฐ (105 ์Œ, ์ž๋™ ์ƒ์„ฑ) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT H100 MIG ์„ค์ • |
| `configs/dpo_3b_1gpu.yaml` | DPO ํ•™์Šต ์„ค์ • |
| `configs/orpo_3b_1gpu.yaml` | ORPO ํ•™์Šต ์„ค์ • |
| `scripts/dpo.py` | DPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/orpo_native.py` | ORPO ํ•™์Šต ์ฝ”๋“œ |
| `scripts/sft.py` | SFT ํ•™์Šต ์ฝ”๋“œ |
| `scripts/evafrill_eval.py` | ๋ฒค์น˜๋งˆํฌ ํ‰๊ฐ€ ์ฝ”๋“œ |
| `scripts/merge_checkpoints.py` | SLERP ์ฒดํฌํฌ์ธํŠธ ๋ณ‘ํ•ฉ |
### ์ œํ•œ์‚ฌํ•ญ
- **3B ๊ทœ๋ชจ ํ•œ๊ณ„**: ์‚ฌ์‹ค ์ •ํ™•๋„ยท๋ณต์žกํ•œ ์ถ”๋ก ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ๋Œ€ํ˜• ๋ชจ๋ธ ๋Œ€๋น„ ์„ฑ๋Šฅ์ด ๋‚ฎ์Šต๋‹ˆ๋‹ค.
- **GGUF/Ollama ๋ถˆ๊ฐ€**: ์ปค์Šคํ…€ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-2 ์•„ํ‚คํ…์ฒ˜๋กœ ํ‘œ์ค€ ๋ณ€ํ™˜ ํˆด์„ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
- **vLLM ์ œํ•œ์ **: ์ด๋ก ์ƒ ๊ฐ€๋Šฅํ•˜๋‚˜ ์ปค์Šคํ…€ weight key ๋งคํ•‘์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
- **๋ฐ˜๋ณต ์ƒ์„ฑ**: greedy decoding ์‹œ ๋ฐ˜๋ณต๋ฅ ์ด ๋†’์œผ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ `repetition_penalty=1.2` ์ด์ƒ์„ ์„ค์ •ํ•˜์„ธ์š”.
- **์–ธ์–ด ํŽธ์ค‘**: ํ•œ๊ตญ์–ดยท์˜์–ด ์™ธ ์–ธ์–ด๋Š” ์„ฑ๋Šฅ์ด ๋ณด์žฅ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
### ๋งํฌ
- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **์ด์ „ ํ”„๋กœ์ ํŠธ**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) โ€” ์ˆœ์ˆ˜ Transformer ๊ธฐ๋ฐ˜ ์ „์‹  ํ”„๋กœ์ ํŠธ
- **์ฐธ์กฐ ๋…ผ๋ฌธ**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)
### ๋ผ์ด์„ ์Šค
MIT License โ€” ์ƒ์—…์  ์ด์šฉยท์ˆ˜์ •ยท์žฌ๋ฐฐํฌ ๋ชจ๋‘ ์ž์œ ๋กญ์Šต๋‹ˆ๋‹ค.
---
# English
## EVAFRILL-Mo 3B โ€” Hybrid Mamba-2 + Transformer
### Introduction
EVAFRILL-Mo 3B is a 3-billion-parameter hybrid language model built **entirely from scratch**, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture.
- Pretrained on 55B tokens using 7ร— NVIDIA B200 GPUs (~60 hours)
- Mixed Korean, English, code, and math datasets
- Full SFT โ†’ DPO โ†’ SLERP pipeline implemented in pure PyTorch โ€” no Transformers Trainer or TRL
- Designed as a Korean-first model, with English as its secondary language
### Architecture
```
Type: Hybrid Mamba-2 + Transformer
Parameters: 2.98B (2,975,397,632)
Layers: 26 (24ร— Mamba-2 SSM + 2ร— Attention GQA)
d_model: 3,072
Vocabulary: 64,000 (custom SentencePiece)
Max seq length: 4,096
```
Mamba-2 SSM blocks handle long-range dependencies efficiently while two GQA Attention blocks provide global context.
Compared to standard Transformers, this architecture significantly reduces KV cache memory during inference.
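As a rough illustration of the 24 + 2 split, such a hybrid stack can be described by a per-layer type list. The attention positions below are purely assumed for illustration; the real placement is defined in the repository's model code:

```python
# Hypothetical layout sketch of a 26-layer hybrid stack.
# ATTN_AT is an assumption, not EVAFRILL-Mo's actual attention positions.
N_LAYERS = 26
ATTN_AT = {8, 17}  # assumed positions of the two GQA attention blocks

layer_types = ["gqa_attention" if i in ATTN_AT else "mamba2_ssm"
               for i in range(N_LAYERS)]
assert layer_types.count("mamba2_ssm") == 24
assert layer_types.count("gqa_attention") == 2
```

Because only the two attention layers need a KV cache (the Mamba-2 layers carry a fixed-size recurrent state), cache memory grows with 2 layers instead of 26.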
### Development Background & History
EVAFRILL-Mo was built through 6 iterative design stages:
1. **[FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM)** โ€” Predecessor project starting as a pure Transformer decoder-only LLM. Built custom SentencePiece tokenizer (64K vocab) on Korean+English+code+math data and established DDP training pipeline.
2. **Nemotron-H Inspiration** โ€” Extracted core design principles from NVIDIA's hybrid Mamba-2 + Transformer architecture and scaled down for constrained hardware.
3. **Systematic Scale Search** โ€” Benchmarked 5 model sizes (1Bโ€“3B) on 7ร—B200 and chose the largest size closest to Chinchilla-optimal (3B, reaching 93% of the optimal token budget).
4. **1B โ†’ 3B Transition** โ€” Discovered that the measured tok/s figure was per-GPU, which shifted the plan from over-training a 1B model (681% of the optimal token budget) to near-optimally training a 3B model (93%).
5. **3B Pretraining** โ€” 319,772 steps, 55B tokens, 60 hours on 7ร—B200 with FP8.
6. **Post-training** โ€” SFT โ†’ DPO โ†’ SLERP โ†’ ORPO experiments on H100 MIG.
### Key Technical Highlights
| Technique | Impact |
|-----------|--------|
| **Chunked Cross-Entropy** | Reduces logits memory by 8ร— for 64K vocabulary |
| **Mamba Memory Cliff Discovery** | Batch 6โ†’7 causes 47GBโ†’183GB+ explosion โ€” structural limitation of selective scan |
| **FP8 Native Training** | TransformerEngine MXFP8BlockScaling delivers ~2ร— throughput vs BF16 on B200 |
| **LoRA B-zeroing** | Computes DPO reference logprobs without model duplication โ€” 50% VRAM savings |
| **SLERP Checkpoint Merging** | Balances SFT knowledge + DPO alignment via spherical interpolation โ€” mitigates alignment tax |
| **Native DPO/ORPO** | No TRL dependency โ€” implemented from scratch in PyTorch for custom Mamba-2 hybrid |
> ๐Ÿ“– **For the complete development journey, architecture design rationale, and hardware optimization details, see the [GitHub README](https://github.com/pathcosmos/EVAFRILL-Mo).**
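The chunked cross-entropy trick from the table above can be sketched in a few lines. This is an illustrative reimplementation, not the repository's code; the function name and chunk size are assumptions. The point is that the full (tokens ร— 64K-vocab) logits tensor is never materialized at once, yet the result matches the unchunked loss:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """Project hidden states to vocab logits one chunk at a time so the
    full (tokens x vocab) logits tensor never exists in memory at once."""
    losses = []
    for i in range(0, hidden.size(0), chunk_size):
        logits = hidden[i:i + chunk_size] @ weight.t()  # (chunk, vocab)
        losses.append(F.cross_entropy(
            logits.float(), targets[i:i + chunk_size], reduction="sum"))
    return torch.stack(losses).sum() / targets.numel()

# Toy-sized check: chunked result equals the ordinary cross-entropy.
torch.manual_seed(0)
h = torch.randn(10, 8)           # 10 tokens, d_model=8
w = torch.randn(32, 8)           # vocab=32 output projection
t = torch.randint(0, 32, (10,))
full = F.cross_entropy(h @ w.t(), t)
assert torch.allclose(chunked_cross_entropy(h, w, t, chunk_size=4), full, atol=1e-5)
```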
### Model Variants
This repository contains **7 checkpoints** representing each stage of the training pipeline.
| Variant | Directory | Size | Description | Recommended |
|---------|-----------|------|-------------|:-----------:|
| **SLERP** | `slerp/` | 6.3 GB | Spherical interpolation of SFT + DPO R2 (ฮฑ=0.5) | โญ |
| Pretrain | `pretrain/` | 12.6 GB | Base model (319K steps, 55B tokens) | |
| SFT v2 | `sft-v2/` | 6.3 GB | Instruction-tuned (65K steps) | |
| DPO R1 | `dpo-r1/` | 6.3 GB | Preference-aligned Round 1 (3K steps) | |
| DPO R2 | `dpo-r2/` | 6.3 GB | Conservative fine-tuning Round 2 (2K steps) | |
| ORPO | `orpo/` | 6.3 GB | Simultaneous SFT+alignment experiment (10K steps) | |
| DPO R3 | `dpo-r3/` | 6.3 GB | Repetition-targeted experiment (1K steps) | |
### Training Pipeline
```
Pretrain (55B tokens, 7ร—B200, 60h)
โ””โ”€โ–บ SFT v2 (65K steps, H100 MIG, 5 days)
โ”œโ”€โ–บ DPO R1 (3K steps) โ”€โ–บ DPO R2 (2K steps)
โ”‚ โ””โ”€โ–บ SLERP Merge (ฮฑ=0.5) โญ Final Recommended
โ””โ”€โ–บ ORPO (10K steps, experimental)
โ””โ”€โ–บ DPO R3 (1K steps, repetition experiment)
```
Every arrow corresponds to a separate saved checkpoint, enabling reproduction and comparison from any stage.
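The SLERP merge at the end of the pipeline can be sketched as follows. This is a minimal illustration of spherical interpolation over two state dicts with hypothetical function names; the repository's actual implementation is `scripts/merge_checkpoints.py` and may differ in detail:

```python
import torch

def slerp(a, b, alpha=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors,
    falling back to plain lerp when the vectors are near-parallel."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    cos = torch.dot(a_f, b_f) / (a_f.norm() * b_f.norm() + eps)
    omega = torch.arccos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    if omega.abs() < 1e-4:
        merged = (1 - alpha) * a_f + alpha * b_f
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1 - alpha) * omega) / so) * a_f \
               + (torch.sin(alpha * omega) / so) * b_f
    return merged.reshape(a.shape).to(a.dtype)

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    return {k: slerp(sd_a[k], sd_b[k], alpha) for k in sd_a}

# Toy-sized demo; alpha=0.5 mirrors the released SLERP checkpoint.
torch.manual_seed(0)
sft = {"w": torch.randn(4, 4)}
dpo = {"w": torch.randn(4, 4)}
merged = merge_state_dicts(sft, dpo, alpha=0.5)
```

Unlike plain weight averaging, interpolating along the sphere preserves the norm structure of the two checkpoints, which is why it tends to keep SFT knowledge while retaining DPO alignment.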
### Benchmark Results
**Evaluated on: SLERP model** (0-shot, limit=500)
| Benchmark | Accuracy |
|-----------|:--------:|
| HellaSwag | 34.6% |
| ARC-Easy | 32.0% |
| Belebele Korean | 23.6% |
| Global MMLU Korean | 23.7% |
**Repetition suppression** (greedy decoding)
| Setting | 3-gram repetition rate |
|---------|:----------------------:|
| No rep_penalty | 74.5% |
| rep_penalty=1.2 | **5.5%** |
Recommended inference parameters: `temperature=0.7, repetition_penalty=1.2`
### DPO vs ORPO Comparison
| Metric | SLERP (SFTโ†’DPO) | ORPO | Winner |
|--------|:---------------:|:----:|:------:|
| Greedy repetition | 74.5% | 87.1% | SLERP |
| Chat quality | Fluent | Broken | SLERP |
| HellaSwag | **39.0%** | 35.0% | SLERP |
| Training time | 5d+8h | **12.8h** | ORPO |
ORPO's weakness: only 10K steps of training vs SFT's 65K โ€” insufficient base instruction-following before alignment kicks in.
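For reference, the DPO objective optimized here (implemented natively, without TRL) reduces to a few lines once per-sequence log-probabilities are available. This is the textbook formulation, not the repository's exact code; `beta` and the toy numbers are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Textbook DPO loss over sequence log-probs:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Zero margin gives the chance-level loss log(2); a positive margin
# (policy prefers the chosen answer more than the reference) lowers it.
zero = dpo_loss(*[torch.zeros(1)] * 4)
assert abs(zero.item() - math.log(2)) < 1e-6
```

The "LoRA B-zeroing" trick from the highlights table feeds this loss: zeroing the LoRA B matrices temporarily turns the policy back into the reference model, so `ref_chosen`/`ref_rejected` can be computed without a second copy of the weights.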
### Usage
> **GGUF/Ollama not supported**: Custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. PyTorch direct inference only.
**Prerequisites:**
```bash
# 1. Clone source code (custom architecture modules required)
git clone https://github.com/pathcosmos/EVAFRILL-Mo
cd EVAFRILL-Mo
# 2. Install dependencies
pip install torch safetensors tokenizers PyYAML
```
**Method 1: Direct safetensors loading (recommended)**
```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo

# Load config & model
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16)
model.eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

# Generate (recommended: temperature=0.7, repetition_penalty=1.2)
prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(256):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Repetition penalty: dampen logits of already-generated tokens
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):
            break

print(tok.decode(ids[0].tolist()))
```
**Method 2: Evaluation framework runner**
The `evafrill_runner.py` in [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps the above into a simple API:
```python
from eval_framework.evafrill_runner import generate, unload_model
result = generate("Hello, please introduce yourself.")
print(result["response"])
print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
unload_model()
```
> Setup instructions: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-๋ชจ๋ธ-์„ค์ •-pytorch-์ง์ ‘-์ถ”๋ก )
**System requirements**: GPU VRAM 8GB+ (BF16), CPU inference possible but extremely slow (~0.5 TPS)
### Reproducibility
| Path | Contents |
|------|----------|
| `data/combined_preference.jsonl` | Preference training data (684K pairs, 2.6 GB) |
| `data/repetition_preference.jsonl` | Repetition-suppression preference data (105 pairs, auto-generated) |
| `configs/korean_3b_sft_1gpu.yaml` | SFT config for H100 MIG |
| `configs/dpo_3b_1gpu.yaml` | DPO training config |
| `configs/orpo_3b_1gpu.yaml` | ORPO training config |
| `scripts/dpo.py` | DPO training code |
| `scripts/orpo_native.py` | ORPO training code |
| `scripts/sft.py` | SFT training code |
| `scripts/evafrill_eval.py` | Benchmark evaluation code |
| `scripts/merge_checkpoints.py` | SLERP checkpoint merging |
### Limitations
- **3B scale**: Factual accuracy and complex multi-step reasoning are limited compared to larger models.
- **GGUF/Ollama**: Not supported โ€” custom hybrid Mamba-2 architecture cannot be converted with standard tools.
- **vLLM**: Theoretically possible but requires custom weight key mapping.
- **Greedy repetition**: ~74.5% 3-gram repetition rate without `repetition_penalty` โ€” always use `repetition_penalty >= 1.2`.
- **Language coverage**: Performance is not guaranteed for languages other than Korean and English.
### Links
- **GitHub**: [pathcosmos/EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
- **Predecessor**: [FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) | [๐Ÿค— HuggingFace](https://huggingface.co/pathcosmos/frankenstallm) โ€” Pure Transformer predecessor project
- **Reference paper**: [Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models](https://arxiv.org/abs/2504.03624)
### Acknowledgment / ๊ฐ์‚ฌ์˜ ๊ธ€
์ด ํ”„๋กœ์ ํŠธ๋Š” **๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€**์˜ **ใ€Œ์ฒจ๋‹จ GPU ํ™œ์šฉ ์ง€์› ์‚ฌ์—…ใ€** (๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ ๊ณต๊ณ  ์ œ2025-1068ํ˜ธ)์„ ํ†ตํ•ด ์ œ๊ณต๋œ GPU ์ปดํ“จํŒ… ์ž์›์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
> **๊ตญ๊ฐ€ AI์ปดํ“จํŒ…์ž์› ์ง€์›ํฌํ„ธ**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - ์ฃผ๊ด€: ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ (MSIT), ์ •๋ณดํ†ต์‹ ์‚ฐ์—…์ง„ํฅ์› (NIPA)
> - ์šด์˜: ํ•œ๊ตญ์ •๋ณดํ†ต์‹ ์ง„ํฅํ˜‘ํšŒ (KAIT)
๋Œ€ํ•œ๋ฏผ๊ตญ ์ •๋ถ€์˜ AI ์ธํ”„๋ผ ์ง€์› ์‚ฌ์—… ๋•๋ถ„์— 7ร— NVIDIA B200 GPU ํ™˜๊ฒฝ์—์„œ ํ•œ๊ตญ์–ด 3B ํ•˜์ด๋ธŒ๋ฆฌ๋“œ Mamba-Transformer ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตญ๊ฐ€ ์ฐจ์›์˜ AI ์ปดํ“จํŒ… ์ž์› ์ง€์›์— ๊นŠ์ด ๊ฐ์‚ฌ๋“œ๋ฆฝ๋‹ˆ๋‹ค.
This project was conducted using GPU computing resources provided through the **"Advanced GPU Utilization Support Program"** (MSIT Notice No. 2025-1068) by the **Ministry of Science and ICT (MSIT)** of the Republic of Korea.
> **National AI Computing Resource Support Portal**: [https://aiinfrahub.kr](https://aiinfrahub.kr)
>
> - Organized by: Ministry of Science and ICT (MSIT), National IT Industry Promotion Agency (NIPA)
> - Operated by: Korea Association of Information & Telecommunication (KAIT)
We are deeply grateful for the national-level AI computing infrastructure support from the Korean government, which made it possible to train a Korean 3B hybrid Mamba-Transformer model from scratch on 7ร— NVIDIA B200 GPUs.
---
### License
MIT License โ€” free to use, modify, and distribute commercially.