docs: add comprehensive HuggingFace model card with benchmarks and usage guide
Browse files
README.md
CHANGED
|
@@ -1,172 +1,372 @@
|
|
| 1 |
---
|
|
|
|
|
|
|
| 2 |
language:
|
| 3 |
-
- ko
|
| 4 |
-
|
|
|
|
| 5 |
tags:
|
| 6 |
-
-
|
| 7 |
-
- korean
|
| 8 |
-
-
|
| 9 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
-
# FRANKENSTALLM 3B
|
| 13 |
-
|
| 14 |
-
한국어 중심 **FRANKENSTALLM 3B** ORPO 파인튜닝 체크포인트에 **byte-fallback 토큰 256개**를 추가한 버전입니다.
|
| 15 |
-
llama.cpp/GGUF 추론 시 줄바꿈(`\n`) 등 미등록 문자로 인한 크래시를 방지하기 위해 사용합니다.
|
| 16 |
-
|
| 17 |
-
## 모델 상세
|
| 18 |
-
|
| 19 |
-
| 항목 | 값 |
|
| 20 |
-
|------|-----|
|
| 21 |
-
| **Architecture** | LlamaForCausalLM |
|
| 22 |
-
| **Params** | ~3B |
|
| 23 |
-
| **Hidden size** | 2048 |
|
| 24 |
-
| **Layers** | 24 |
|
| 25 |
-
| **Attention heads** | 16 |
|
| 26 |
-
| **KV heads** | 4 |
|
| 27 |
-
| **Max position** | 4096 |
|
| 28 |
-
| **Vocab size** | **64,256** (64,000 + 256 byte-fallback) |
|
| 29 |
-
| **Training** | ORPO (SFT → ORPO) |
|
| 30 |
-
|
| 31 |
-
## 변경 사항 (v2)
|
| 32 |
-
|
| 33 |
-
- 토크나이저: `byte_fallback=True`, `<0x00>`~`<0xFF>` 256개 토큰 추가
|
| 34 |
-
- 임베딩: 64,000 → 64,256 리사이즈, 새 토큰 초기화
|
| 35 |
-
- GGUF 변환·Ollama 배포 시 뉴라인 포함 입력 정상 처리 확인
|
| 36 |
-
|
| 37 |
-
## 학습 환경 (Training Hardware)
|
| 38 |
-
|
| 39 |
-
### GPU
|
| 40 |
-
|
| 41 |
-
| 항목 | 사양 |
|
| 42 |
-
|------|------|
|
| 43 |
-
| **GPU** | 8× NVIDIA B200 |
|
| 44 |
-
| **VRAM (per GPU)** | 183 GB HBM3e |
|
| 45 |
-
| **Total VRAM** | ~1,466 GB (~1.47 TB) |
|
| 46 |
-
| **FP8 Tensor Core** | 2,250 TFLOPS/GPU (총 18,000 TFLOPS) |
|
| 47 |
-
| **BF16 Tensor Core** | 1,125 TFLOPS/GPU |
|
| 48 |
-
| **HBM3e Bandwidth** | ~7.67 TB/s per GPU |
|
| 49 |
-
| **Interconnect** | NVLink 5.0 (NV18, 900 GB/s bidirectional) |
|
| 50 |
-
| **Topology** | NVSwitch — 모든 GPU↔GPU 단일 홉 all-to-all mesh |
|
| 51 |
-
| **SMs per GPU** | 148 |
|
| 52 |
-
| **L2 Cache per GPU** | 126.5 MB |
|
| 53 |
-
|
| 54 |
-
### CPU & Memory
|
| 55 |
-
|
| 56 |
-
| 항목 | 사양 |
|
| 57 |
-
|------|------|
|
| 58 |
-
| **CPU** | 2× AMD EPYC 9365 (Turin / Zen 5) |
|
| 59 |
-
| **Physical Cores** | 72 (36코어 × 2소켓) |
|
| 60 |
-
| **L3 Cache** | 384 MB (12 CCX × 32 MB) |
|
| 61 |
-
| **System RAM** | 2.21 TB DDR5 (NUMA 2노드 × ~1.1 TB) |
|
| 62 |
-
| **GPU↔NUMA 매핑** | GPU 0–3 → NUMA node 0 / GPU 4–7 → NUMA node 1 |
|
| 63 |
-
|
| 64 |
-
### 소프트웨어 스택
|
| 65 |
-
|
| 66 |
-
| 패키지 | 버전 |
|
| 67 |
-
|--------|------|
|
| 68 |
-
| **CUDA** | 13.1 |
|
| 69 |
-
| **Driver** | 580.95.05 |
|
| 70 |
-
| **PyTorch** | 2.10.0a0+b4e4ee81d3 (NVIDIA nv25.12 커스텀 빌드, B200 최적화) |
|
| 71 |
-
| **Transformer Engine** | 2.10.0 |
|
| 72 |
-
| **FlashAttention** | 2.7.4.post1+25.12 |
|
| 73 |
-
| **NCCL** | 2.28.9 |
|
| 74 |
-
| **Triton** | 3.5.1 |
|
| 75 |
-
| **TRL** | ORPO fine-tuning |
|
| 76 |
-
|
| 77 |
-
> **참고**: PyTorch는 NVIDIA B200 최적화 커스텀 빌드(`nv25.12`)를 사용합니다. FP8 네이티브 연산(`torch.float8_e4m3fn`) 지원 환경입니다.
|
| 78 |
-
|
| 79 |
-
## ORPO 평가 요약 (동일 체크포인트 기준)
|
| 80 |
-
|
| 81 |
-
- **평가 일시**: 2026-03-09
|
| 82 |
-
- **Preference Accuracy**: 76.02%
|
| 83 |
-
- **Reward Margin**: 0.6100
|
| 84 |
-
- **Eval Loss**: 1.7910 → 1.6250
|
| 85 |
-
- **KoBEST (0-shot) 평균**: 52.75%
|
| 86 |
-
- **생성 품질**: Greedy 3-gram 반복률 30.89%, EOS 종료율 66.67%
|
| 87 |
-
- **PPL Forgetting**: 최대 4.1% (기준 <15%)
|
| 88 |
-
- **종합**: 7/10 차원 통과, 정량 스코어 63.7/100
|
| 89 |
-
|
| 90 |
-
상세: 프로젝트 내 `reports/2026-03-09_ORPO_EVALUATION_REPORT.md` 참고.
|
| 91 |
-
|
| 92 |
-
## Ollama 배포 벤치마크 (Q4_K_M, 2026-03-09)
|
| 93 |
-
|
| 94 |
-
- **모델명**: `frankenstallm-3b-v2`
|
| 95 |
-
- **테스트 수**: 35 (자동 20 + 수동 15)
|
| 96 |
-
- **자동 채점 평균**: 46.7
|
| 97 |
-
- **카테고리**: korean_nlu 100.0, reasoning 50.0, knowledge 75.0, instruction_following 66.7, code 0.0, safety 10.0, repetition_resistance 2.2 등
|
| 98 |
-
- **지���**: Avg TTFT 16.7 ms, Avg TPS 142.5
|
| 99 |
-
|
| 100 |
-
상세: `reports/2026-03-09_GGUF_DEPLOYMENT_AND_EVAL_REPORT.md`, `eval/results/frankenstallm-3b-v2/ollama_benchmark_summary.md`
|
| 101 |
-
|
| 102 |
-
## 샘플링 파라미터 (Sampling Config)
|
| 103 |
-
|
| 104 |
-
ORPO 평가 그리드 실측 최적값 (`t0.7_rep1.2`) 기준입니다.
|
| 105 |
-
|
| 106 |
-
### 권장 파라미터
|
| 107 |
-
|
| 108 |
-
| 파라미터 | 값 | 비고 |
|
| 109 |
-
|---------|-----|------|
|
| 110 |
-
| `temperature` | **0.7** | 창의성/일관성 균형점 |
|
| 111 |
-
| `repetition_penalty` (PyTorch) | **1.2** | 반복 억제 |
|
| 112 |
-
| `repeat_penalty` (Ollama) | **1.2** | 동일 값 |
|
| 113 |
-
| `top_p` | **0.9** | nucleus sampling |
|
| 114 |
-
| `top_k` | **50** | |
|
| 115 |
-
| `max_new_tokens` | **512** | |
|
| 116 |
-
| `num_ctx` | **4096** | context window |
|
| 117 |
-
|
| 118 |
-
### 평가 결과 (ORPO eval grid)
|
| 119 |
-
|
| 120 |
-
| 설정 | 3-gram 반복률 | 4-gram 반복률 | EOS 종료율 | 평균 토큰 수 |
|
| 121 |
-
|------|-------------|-------------|----------|------------|
|
| 122 |
-
| **t0.7 / rep1.2 (권장)** | **0.0%** | **0.0%** | **100%** | 189.2 |
|
| 123 |
-
| t0.8 / rep1.05 (기본) | 4.7% | 2.3% | 100% | 221.4 |
|
| 124 |
-
| greedy (temp=0) | 30.89% | — | 66.67% | — |
|
| 125 |
-
|
| 126 |
-
> Ollama Q4_K_M 실측: 3-gram 반복 1.8% (자연 어절 반복), EOS 100% 종료 확인.
|
| 127 |
-
> greedy 대비 반복률 **30.89% → 0%** 해소.
|
| 128 |
|
| 129 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 130 |
|
| 131 |
```python
|
| 132 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 133 |
import torch
|
| 134 |
|
| 135 |
-
|
| 136 |
-
tokenizer = AutoTokenizer.from_pretrained(
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
outputs = model.generate(
|
| 140 |
-
**inputs,
|
| 141 |
-
temperature=0.7,
|
| 142 |
-
repetition_penalty=1.2,
|
| 143 |
-
top_p=0.9,
|
| 144 |
-
top_k=50,
|
| 145 |
-
max_new_tokens=512,
|
| 146 |
-
do_sample=True,
|
| 147 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 148 |
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 149 |
```
|
| 150 |
|
| 151 |
-
##
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 152 |
|
| 153 |
-
|
| 154 |
-
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
ollama run frankenstallm-3b-v2:Q4_K_M
|
| 159 |
|
| 160 |
-
|
| 161 |
-
ollama create frankenstallm-3b-v2:Q8_0 -f gguf/Modelfile.3b-v2-Q8_0
|
| 162 |
-
ollama run frankenstallm-3b-v2:Q8_0
|
| 163 |
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
|
| 170 |
-
##
|
| 171 |
|
| 172 |
-
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
library_name: transformers
|
| 3 |
+
license: apache-2.0
|
| 4 |
language:
|
| 5 |
+
- ko
|
| 6 |
+
- en
|
| 7 |
+
model_type: llama
|
| 8 |
tags:
|
| 9 |
+
- 3b
|
| 10 |
+
- korean
|
| 11 |
+
- from-scratch
|
| 12 |
+
- orpo
|
| 13 |
+
- instruction-tuned
|
| 14 |
+
- preference-aligned
|
| 15 |
+
- fp8
|
| 16 |
+
- b200
|
| 17 |
+
- gguf
|
| 18 |
+
datasets:
|
| 19 |
+
- cc100
|
| 20 |
+
- allenai/c4
|
| 21 |
+
- heegyu/orca-math-korean-preference-cleaned
|
| 22 |
+
- nayohan/preference-collection-ko-full
|
| 23 |
+
- maywell/ko_Ultrafeedback_binarized
|
| 24 |
+
- HuggingFaceTB/cosmopedia
|
| 25 |
+
- wikimedia/wikipedia
|
| 26 |
+
pipeline_tag: text-generation
|
| 27 |
+
model-index:
|
| 28 |
+
- name: FRANKENSTALLM-3B
|
| 29 |
+
results:
|
| 30 |
+
- task:
|
| 31 |
+
type: text-generation
|
| 32 |
+
dataset:
|
| 33 |
+
type: kobest
|
| 34 |
+
name: KoBEST (0-shot)
|
| 35 |
+
metrics:
|
| 36 |
+
- name: Average
|
| 37 |
+
type: accuracy
|
| 38 |
+
value: 52.75
|
| 39 |
+
- name: COPA
|
| 40 |
+
type: accuracy
|
| 41 |
+
value: 63.9
|
| 42 |
+
- name: HellaSwag-KO
|
| 43 |
+
type: accuracy
|
| 44 |
+
value: 38.0
|
| 45 |
+
- name: SentiNeg
|
| 46 |
+
type: accuracy
|
| 47 |
+
value: 62.5
|
| 48 |
+
- name: BoolQ
|
| 49 |
+
type: accuracy
|
| 50 |
+
value: 50.6
|
| 51 |
+
- name: WiC
|
| 52 |
+
type: accuracy
|
| 53 |
+
value: 48.8
|
| 54 |
+
- task:
|
| 55 |
+
type: text-generation
|
| 56 |
+
dataset:
|
| 57 |
+
type: haerae
|
| 58 |
+
name: HAE-RAE (0-shot)
|
| 59 |
+
metrics:
|
| 60 |
+
- name: Average
|
| 61 |
+
type: accuracy
|
| 62 |
+
value: 21.81
|
| 63 |
+
- task:
|
| 64 |
+
type: text-generation
|
| 65 |
+
dataset:
|
| 66 |
+
type: piqa
|
| 67 |
+
name: PIQA (0-shot)
|
| 68 |
+
metrics:
|
| 69 |
+
- name: Accuracy
|
| 70 |
+
type: accuracy
|
| 71 |
+
value: 59.9
|
| 72 |
+
- task:
|
| 73 |
+
type: text-generation
|
| 74 |
+
dataset:
|
| 75 |
+
type: ai2_arc
|
| 76 |
+
name: ARC-Easy (0-shot)
|
| 77 |
+
metrics:
|
| 78 |
+
- name: Accuracy
|
| 79 |
+
type: accuracy
|
| 80 |
+
value: 36.0
|
| 81 |
---
|
| 82 |
|
| 83 |
+
# FRANKENSTALLM 3B
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
|
| 85 |
+
> **A Korean 3B LLM built entirely from scratch — tokenizer, pretraining, SFT, and ORPO — on 8× NVIDIA B200 GPUs.**
|
| 86 |
+
|
| 87 |
+
| | |
|
| 88 |
+
|---|---|
|
| 89 |
+
| **Developer** | [pathcosmos](https://huggingface.co/pathcosmos) |
|
| 90 |
+
| **Parameters** | ~2.4B (3B-class with weight tying) |
|
| 91 |
+
| **Languages** | Korean (primary), English (secondary) |
|
| 92 |
+
| **License** | Apache 2.0 |
|
| 93 |
+
| **Training** | 3-phase: Pretrain → SFT → ORPO |
|
| 94 |
+
| **Hardware** | 8× NVIDIA B200 (FP8), ~86 hours total |
|
| 95 |
+
|
| 96 |
+
---
|
| 97 |
+
|
| 98 |
+
## Quick Start
|
| 99 |
+
|
| 100 |
+
### Transformers
|
| 101 |
|
| 102 |
```python
|
| 103 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 104 |
import torch
|
| 105 |
|
| 106 |
+
model_id = "pathcosmos/frankenstallm"
|
| 107 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
| 108 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 109 |
+
model_id, torch_dtype=torch.bfloat16, device_map="auto"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 110 |
)
|
| 111 |
+
|
| 112 |
+
inputs = tokenizer(
|
| 113 |
+
"한국의 전통 음식 중 김치에 대해 설명해주세요.",
|
| 114 |
+
return_tensors="pt"
|
| 115 |
+
).to(model.device)
|
| 116 |
+
|
| 117 |
+
with torch.no_grad():
|
| 118 |
+
outputs = model.generate(
|
| 119 |
+
**inputs,
|
| 120 |
+
do_sample=True,
|
| 121 |
+
temperature=0.7,
|
| 122 |
+
repetition_penalty=1.2, # recommended
|
| 123 |
+
top_p=0.9,
|
| 124 |
+
max_new_tokens=512,
|
| 125 |
+
pad_token_id=tokenizer.eos_token_id,
|
| 126 |
+
)
|
| 127 |
+
|
| 128 |
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 129 |
```
|
| 130 |
|
| 131 |
+
### Ollama (GGUF)
|
| 132 |
+
|
| 133 |
+
```bash
|
| 134 |
+
# Download GGUF + Modelfile
|
| 135 |
+
huggingface-cli download pathcosmos/frankenstallm \
|
| 136 |
+
gguf/frankenstallm-3b-v2-Q4_K_M.gguf \
|
| 137 |
+
gguf/Modelfile.3b-v2-Q4_K_M \
|
| 138 |
+
--local-dir ./frankenstallm
|
| 139 |
+
|
| 140 |
+
# Fix FROM path in Modelfile, then create
|
| 141 |
+
ollama create frankenstallm -f ./frankenstallm/gguf/Modelfile.3b-v2-Q4_K_M
|
| 142 |
+
|
| 143 |
+
# Run
|
| 144 |
+
ollama run frankenstallm
|
| 145 |
+
```
|
| 146 |
+
|
| 147 |
+
---
|
| 148 |
+
|
| 149 |
+
## Model Highlights
|
| 150 |
+
|
| 151 |
+
- **From-scratch Korean tokenizer**: SentencePiece Unigram, 64K vocab, 99.95% Korean character coverage
|
| 152 |
+
- **3-phase training pipeline**: Pretrain (57K steps, ~60B tokens) → SFT (25.5K steps, 2.4M samples) → ORPO (10K steps, 630K preference pairs)
|
| 153 |
+
- **B200 FP8 native training**: TransformerEngine MXFP8 on NVIDIA B200 — 2× theoretical throughput vs BF16
|
| 154 |
+
- **GGUF deployment ready**: Q4_K_M (757MB), Q8_0 (1.2GB), F16 (2.3GB) with optimized Ollama Modelfiles
|
| 155 |
+
|
| 156 |
+
---
|
| 157 |
+
|
| 158 |
+
## Architecture
|
| 159 |
+
|
| 160 |
+
| Component | Value |
|
| 161 |
+
|-----------|-------|
|
| 162 |
+
| Type | Decoder-only Transformer (LLaMA-style) |
|
| 163 |
+
| Hidden size | 3,072 |
|
| 164 |
+
| Layers | 28 |
|
| 165 |
+
| Attention heads | 24 |
|
| 166 |
+
| KV heads | 8 (GQA 3:1) |
|
| 167 |
+
| FFN dim | 8,192 (SwiGLU) |
|
| 168 |
+
| Vocab size | 64,000 |
|
| 169 |
+
| Context length | 4,096 (trained at 2,048) |
|
| 170 |
+
| Position encoding | RoPE (θ=500,000) |
|
| 171 |
+
| Normalization | Pre-norm RMSNorm |
|
| 172 |
+
| Attention impl | FlashAttention-2 |
|
| 173 |
+
| Precision | FP8 (MXFP8 via TransformerEngine) |
|
| 174 |
+
| Weight tying | Yes (embedding ↔ lm_head) |
|
| 175 |
+
|
| 176 |
+
---
|
| 177 |
+
|
| 178 |
+
## Training Pipeline
|
| 179 |
+
|
| 180 |
+
### Phase 1: Pretraining
|
| 181 |
+
|
| 182 |
+
| Detail | Value |
|
| 183 |
+
|--------|-------|
|
| 184 |
+
| Steps | 57,000 |
|
| 185 |
+
| Final loss | 1.466 |
|
| 186 |
+
| Tokens seen | ~60B (38.5B unique × ~1.5 epochs) |
|
| 187 |
+
| Duration | ~63 hours |
|
| 188 |
+
| Data | CC-100 KO, HPLT KO, C4 KO, NamuWiki, Wikipedia KO, Cosmopedia (EN) |
|
| 189 |
+
| Batch size | 5 × 8 GPU × 8 accum × 2,048 seq = ~655K tok/step |
|
| 190 |
+
|
| 191 |
+
### Phase 2: Supervised Fine-Tuning (SFT)
|
| 192 |
+
|
| 193 |
+
| Detail | Value |
|
| 194 |
+
|--------|-------|
|
| 195 |
+
| Steps | 25,500 (early stop at 77.3%) |
|
| 196 |
+
| Best val_loss | 1.8851 (step 23,000) |
|
| 197 |
+
| Duration | ~15.5 hours |
|
| 198 |
+
| Data | 2,439,397 samples from 24 sources (7.48 GB) |
|
| 199 |
+
| Mix | 70% SFT + 30% pretrain replay (catastrophic forgetting prevention) |
|
| 200 |
+
| Knowledge forgetting | 0.9% (19 datasets) |
|
| 201 |
+
|
| 202 |
+
### Phase 3: ORPO (Odds Ratio Preference Optimization)
|
| 203 |
+
|
| 204 |
+
| Detail | Value |
|
| 205 |
+
|--------|-------|
|
| 206 |
+
| Steps | 9,997 (early convergence) |
|
| 207 |
+
| Best eval_loss | 1.625 |
|
| 208 |
+
| Preference accuracy | 76.02% |
|
| 209 |
+
| Reward margin | 0.6100 |
|
| 210 |
+
| Duration | ~7 hours |
|
| 211 |
+
| Data | ~630K preference pairs from 7 Korean HF datasets |
|
| 212 |
+
| Hyperparams | beta=0.25, lr=1.2e-5, eff_batch=128 |
|
| 213 |
+
|
| 214 |
+
**Total training time: ~86 hours on 8× B200**
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## Benchmarks
|
| 219 |
+
|
| 220 |
+
### Training Phase Progression (Base → SFT → ORPO)
|
| 221 |
+
|
| 222 |
+
| Benchmark | Base | SFT | ORPO | Δ (Base→ORPO) |
|
| 223 |
+
|-----------|:----:|:---:|:----:|:---:|
|
| 224 |
+
| **KoBEST Avg (0-shot)** | 43.7% | 43.3% | **52.8%** | **+9.1pp** |
|
| 225 |
+
| KoBEST COPA | 49.3% | 48.6% | **63.9%** | +14.6pp |
|
| 226 |
+
| KoBEST HellaSwag-KO | 21.6% | 19.8% | **38.0%** | +16.4pp |
|
| 227 |
+
| KoBEST SentiNeg | 48.6% | 49.1% | **62.5%** | +13.9pp |
|
| 228 |
+
| KoBEST BoolQ | 50.3% | 50.1% | 50.6% | +0.3pp |
|
| 229 |
+
| PIQA | 52.5% | 52.6% | **59.9%** | +7.3pp |
|
| 230 |
+
| ARC-Easy | 25.6% | 25.9% | **36.0%** | +10.4pp |
|
| 231 |
+
| HAE-RAE | 19.7% | 19.9% | 21.8% | +2.1pp |
|
| 232 |
+
| HellaSwag EN | 26.2% | 26.1% | 29.2% | +3.0pp |
|
| 233 |
+
| Greedy 3-gram repetition | 61.0% | 73.0% | **30.9%** | -30.1pp |
|
| 234 |
+
| EOS termination rate | 0% | 60% | **67%** | +67pp |
|
| 235 |
+
| PPL forgetting | — | 0.9% | 4.1% | within 15% ✅ |
|
| 236 |
+
|
| 237 |
+
### 3B-class Model Comparison (Ollama, 35 tests)
|
| 238 |
|
| 239 |
+
| Model | Params | Korean NLU | Knowledge | Instruction | Reasoning | Avg Score |
|
| 240 |
+
|-------|:------:|:----------:|:---------:|:-----------:|:---------:|:---------:|
|
| 241 |
+
| Qwen 2.5 3B | 3B | 100.0 | 20.8 | 55.6 | 62.5 | **63.4** |
|
| 242 |
+
| Phi-4 Mini | 3.8B | 66.7 | 29.2 | 33.3 | **87.5** | 60.6 |
|
| 243 |
+
| **FRANKENSTALLM 3B** | **3B** | **100.0** | **75.0** | **66.7** | 50.0 | 46.7 |
|
|
|
|
| 244 |
|
| 245 |
+
> FRANKENSTALLM leads in **Korean NLU** (tied with Qwen), **Korean Knowledge** (75 vs 20.8/29.2), and **Instruction Following** (66.7 vs 55.6/33.3).
|
|
|
|
|
|
|
| 246 |
|
| 247 |
+
### Inference Speed (Ollama, Q4_K_M)
|
| 248 |
+
|
| 249 |
+
| Model | Avg TTFT | TPS | Note |
|
| 250 |
+
|-------|:--------:|:---:|------|
|
| 251 |
+
| **FRANKENSTALLM 3B** | **16.7ms** | **142.5** | Fastest |
|
| 252 |
+
| Phi-4 Mini 3.8B | 25.6ms | 100.4 | |
|
| 253 |
+
| Qwen 2.5 3B | 28.2ms | 93.8 | |
|
| 254 |
+
|
| 255 |
+
### Perplexity Preservation (ORPO Knowledge Retention)
|
| 256 |
+
|
| 257 |
+
| Dataset | Base PPL | ORPO PPL | Forgetting |
|
| 258 |
+
|---------|:--------:|:--------:|:----------:|
|
| 259 |
+
| Korean C4 | 5.72 | 5.87 | +2.7% |
|
| 260 |
+
| Korean Wiki | 11.84 | 12.21 | +3.2% |
|
| 261 |
+
| Max forgetting | — | — | 4.1% ✅ |
|
| 262 |
+
|
| 263 |
+
---
|
| 264 |
+
|
| 265 |
+
## Training Data
|
| 266 |
+
|
| 267 |
+
### Pretraining (~38.5B tokens)
|
| 268 |
+
|
| 269 |
+
| Category | Sources | Est. Tokens |
|
| 270 |
+
|----------|---------|:-----------:|
|
| 271 |
+
| Korean Web Crawl | C4 KO, CC-100 KO, HPLT KO | ~17.2B |
|
| 272 |
+
| Korean Encyclopedia | Wikipedia KO, NamuWiki (2 versions) | ~2.8B |
|
| 273 |
+
| English Educational | Cosmopedia (Stories, Web, Stanford, WikiHow, OpenStax, Khan) | ~5.7B |
|
| 274 |
+
| English Math/Science | AutoMathText, OpenWebMath, Proof-Pile-2 | ~8.5B |
|
| 275 |
+
| Code | StarCoder (filtered) | ~4.3B |
|
| 276 |
+
|
| 277 |
+
### SFT (2.4M samples, 24 sources)
|
| 278 |
+
|
| 279 |
+
| Domain | Share | Key Datasets |
|
| 280 |
+
|--------|:-----:|-------------|
|
| 281 |
+
| Reasoning/CoT | 38% | reasoning_r1_1.4m, magpie_reasoning |
|
| 282 |
+
| Korean Instructions | 23% | korean_instruction_mix, open_korean_instructions, kullm_v2 |
|
| 283 |
+
| English General | 16% | openhermes_2.5, ultrachat_200k |
|
| 284 |
+
| Math | 12% | NuminaMath-CoT, orca-math-ko |
|
| 285 |
+
| Dialog/Code/Other | 11% | smol-koreantalk, Evol-Instruct-Code-80k-ko |
|
| 286 |
+
|
| 287 |
+
### ORPO (~630K preference pairs, 7 sources)
|
| 288 |
+
|
| 289 |
+
| Dataset | Size | Domain |
|
| 290 |
+
|---------|:----:|--------|
|
| 291 |
+
| nayohan/preference-collection-ko-full | 4.9GB | General preference |
|
| 292 |
+
| heegyu/orca-math-korean-preference-cleaned | 1.6GB | Math reasoning |
|
| 293 |
+
| kuotient/orca-math-korean-dpo-pairs | 750MB | Math DPO |
|
| 294 |
+
| maywell/ko_Ultrafeedback_binarized | 394MB | Feedback alignment |
|
| 295 |
+
| tellang/yeji-preference-ko-v1 | 171MB | General preference |
|
| 296 |
+
| jojo0217/korean_rlhf_dataset | 137MB | RLHF pairs |
|
| 297 |
+
| lemon-mint/korean-realqa-reasoning-v01-preference | 58MB | QA reasoning |
|
| 298 |
+
|
| 299 |
+
---
|
| 300 |
+
|
| 301 |
+
## GGUF & Ollama
|
| 302 |
+
|
| 303 |
+
### Available Quantizations
|
| 304 |
+
|
| 305 |
+
| File | Size | Description |
|
| 306 |
+
|------|:----:|-------------|
|
| 307 |
+
| `gguf/frankenstallm-3b-v2-Q4_K_M.gguf` | 757MB | **Recommended** — best size/quality balance |
|
| 308 |
+
| `gguf/frankenstallm-3b-v2-Q8_0.gguf` | 1.2GB | Higher quality |
|
| 309 |
+
| `gguf/frankenstallm-3b-v2-f16.gguf` | 2.3GB | Full precision |
|
| 310 |
+
| `model.safetensors` | 4.76GB | Transformers native (ORPO best, byte-fallback fixed) |
|
| 311 |
+
|
| 312 |
+
### Recommended Sampling Parameters
|
| 313 |
+
|
| 314 |
+
| Parameter | Value | Notes |
|
| 315 |
+
|-----------|:-----:|-------|
|
| 316 |
+
| `temperature` | 0.7 | Optimal for Korean generation quality |
|
| 317 |
+
| `repeat_penalty` | 1.2 | **Required** — without it, greedy repetition is 30.9% |
|
| 318 |
+
| `top_p` | 0.9 | Nucleus sampling |
|
| 319 |
+
| `top_k` | 50 | Top-k candidates |
|
| 320 |
+
| `max_tokens` | 512 | Max generation length |
|
| 321 |
+
| `num_ctx` | 4096 | Context window (do not exceed) |
|
| 322 |
+
|
| 323 |
+
> ⚠️ Always use `repeat_penalty >= 1.2`. With it, repetition drops to **0%**. Without it, greedy decoding produces ~31% 3-gram repetition.
|
| 324 |
+
|
| 325 |
+
---
|
| 326 |
+
|
| 327 |
+
## Limitations
|
| 328 |
+
|
| 329 |
+
- **English performance is limited**: MMLU-EN ~23%, HellaSwag-EN ~29% — this is a Korean-focused model
|
| 330 |
+
- **Code generation**: Near zero capability (limited code in training data)
|
| 331 |
+
- **Greedy repetition**: 30.9% 3-gram repetition without `repeat_penalty` — always use sampling with `repeat_penalty >= 1.2`
|
| 332 |
+
- **Safety**: Safety alignment data was not included in training; use with appropriate guardrails
|
| 333 |
+
- **Scale gap**: Compared to commercial 3B models trained on trillions of tokens, this model was trained on ~60B tokens — expect lower overall benchmark scores
|
| 334 |
+
|
| 335 |
+
---
|
| 336 |
+
|
| 337 |
+
## Hardware & Training Environment
|
| 338 |
+
|
| 339 |
+
| Component | Specification |
|
| 340 |
+
|-----------|---------------|
|
| 341 |
+
| GPU | 8× NVIDIA B200 (183GB HBM3e each, ~1.47TB total) |
|
| 342 |
+
| FP8 Compute | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
|
| 343 |
+
| Interconnect | NVLink 5.0, NVSwitch all-to-all mesh |
|
| 344 |
+
| CPU | 2× AMD EPYC 9365 (72 cores, Zen 5) |
|
| 345 |
+
| RAM | 2.21 TB DDR5 |
|
| 346 |
+
| PyTorch | 2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA custom) |
|
| 347 |
+
| TransformerEngine | 2.10.0 |
|
| 348 |
+
| FlashAttention | 2.7.4 |
|
| 349 |
+
| NCCL | 2.28.9 |
|
| 350 |
+
| CUDA | 13.1 |
|
| 351 |
+
| Total training | ~86 hours (Pretrain 63h + SFT 15.5h + ORPO 7h) |
|
| 352 |
+
|
| 353 |
+
---
|
| 354 |
+
|
| 355 |
+
## Citation
|
| 356 |
+
|
| 357 |
+
```bibtex
|
| 358 |
+
@misc{frankenstallm2026,
|
| 359 |
+
title={FRANKENSTALLM: A Korean 3B LLM Built From Scratch on B200 GPUs},
|
| 360 |
+
author={pathcosmos},
|
| 361 |
+
year={2026},
|
| 362 |
+
url={https://huggingface.co/pathcosmos/frankenstallm},
|
| 363 |
+
note={3-phase training (Pretrain, SFT, ORPO) with FP8 on 8x NVIDIA B200}
|
| 364 |
+
}
|
| 365 |
+
```
|
| 366 |
+
|
| 367 |
+
---
|
| 368 |
|
| 369 |
+
## Links
|
| 370 |
|
| 371 |
+
- **GitHub**: [pathcosmos/FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) — Full source code, training scripts, and builder's log
|
| 372 |
+
- **HuggingFace**: [pathcosmos/frankenstallm](https://huggingface.co/pathcosmos/frankenstallm)
|