somebody-to-love committed · verified
Commit 9eb04a9 · 1 parent: dc788f2

docs: add comprehensive HuggingFace model card with benchmarks and usage guide

Files changed (1): README.md (+352, -152)
README.md CHANGED
@@ -1,172 +1,372 @@
  ---
  language:
- - ko
- license: other
  tags:
- - llm
- - korean
- - orpo
- - gguf
  ---

- # FRANKENSTALLM 3B v2 (Byte-Fallback Fixed)
-
- A version of the Korean-focused **FRANKENSTALLM 3B** ORPO fine-tuned checkpoint with **256 byte-fallback tokens** added.
- They prevent crashes in llama.cpp/GGUF inference caused by characters missing from the vocabulary, such as newlines (`\n`).
-
- ## Model Details
-
- | Item | Value |
- |------|-------|
- | **Architecture** | LlamaForCausalLM |
- | **Params** | ~3B |
- | **Hidden size** | 2048 |
- | **Layers** | 24 |
- | **Attention heads** | 16 |
- | **KV heads** | 4 |
- | **Max position** | 4096 |
- | **Vocab size** | **64,256** (64,000 + 256 byte-fallback) |
- | **Training** | ORPO (SFT → ORPO) |
-
- ## Changes in v2
-
- - Tokenizer: `byte_fallback=True`, 256 tokens `<0x00>`–`<0xFF>` added
- - Embeddings: resized from 64,000 to 64,256, new tokens initialized
- - Verified that inputs containing newlines are handled correctly after GGUF conversion and Ollama deployment
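The `<0x00>`–`<0xFF>` naming follows the SentencePiece byte-fallback convention: one token per byte value, so any character missing from the base vocabulary still tokenizes through its UTF-8 bytes. A minimal sketch of the mapping (illustrative helpers, not the project's actual code):

```python
# Byte-fallback tokens follow the SentencePiece convention <0xNN>,
# one per byte value 0x00..0xFF (256 tokens total).
byte_fallback_tokens = [f"<0x{b:02X}>" for b in range(256)]

def fallback_tokens(text: str) -> list[str]:
    """Map a string to byte-fallback token names via its UTF-8 bytes."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

print(len(byte_fallback_tokens))          # 256
print(fallback_tokens("\n"))              # ['<0x0A>'] (newline now has a token)
print(64000 + len(byte_fallback_tokens))  # 64256, the resized vocab
```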
-
- ## Training Hardware
-
- ### GPU
-
- | Item | Spec |
- |------|------|
- | **GPU** | 8× NVIDIA B200 |
- | **VRAM (per GPU)** | 183 GB HBM3e |
- | **Total VRAM** | ~1,466 GB (~1.47 TB) |
- | **FP8 Tensor Core** | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
- | **BF16 Tensor Core** | 1,125 TFLOPS/GPU |
- | **HBM3e Bandwidth** | ~7.67 TB/s per GPU |
- | **Interconnect** | NVLink 5.0 (NV18, 900 GB/s bidirectional) |
- | **Topology** | NVSwitch — single-hop all-to-all mesh between all GPUs |
- | **SMs per GPU** | 148 |
- | **L2 Cache per GPU** | 126.5 MB |
-
- ### CPU & Memory
-
- | Item | Spec |
- |------|------|
- | **CPU** | 2× AMD EPYC 9365 (Turin / Zen 5) |
- | **Physical Cores** | 72 (36 cores × 2 sockets) |
- | **L3 Cache** | 384 MB (12 CCX × 32 MB) |
- | **System RAM** | 2.21 TB DDR5 (2 NUMA nodes × ~1.1 TB) |
- | **GPU↔NUMA mapping** | GPU 0–3 → NUMA node 0 / GPU 4–7 → NUMA node 1 |
-
- ### Software Stack
-
- | Package | Version |
- |---------|---------|
- | **CUDA** | 13.1 |
- | **Driver** | 580.95.05 |
- | **PyTorch** | 2.10.0a0+b4e4ee81d3 (NVIDIA nv25.12 custom build, B200-optimized) |
- | **Transformer Engine** | 2.10.0 |
- | **FlashAttention** | 2.7.4.post1+25.12 |
- | **NCCL** | 2.28.9 |
- | **Triton** | 3.5.1 |
- | **TRL** | ORPO fine-tuning |
-
- > **Note**: PyTorch is an NVIDIA custom build (`nv25.12`) optimized for B200, with native FP8 support (`torch.float8_e4m3fn`).
-
- ## ORPO Evaluation Summary (same checkpoint)
-
- - **Evaluation date**: 2026-03-09
- - **Preference Accuracy**: 76.02%
- - **Reward Margin**: 0.6100
- - **Eval Loss**: 1.7910 → 1.6250
- - **KoBEST (0-shot) average**: 52.75%
- - **Generation quality**: greedy 3-gram repetition 30.89%, EOS termination 66.67%
- - **PPL forgetting**: at most 4.1% (threshold <15%)
- - **Overall**: passed 7/10 dimensions, quantitative score 63.7/100
-
- Details: see `reports/2026-03-09_ORPO_EVALUATION_REPORT.md` in the project.
-
- ## Ollama Deployment Benchmark (Q4_K_M, 2026-03-09)
-
- - **Model name**: `frankenstallm-3b-v2`
- - **Test count**: 35 (20 automated + 15 manual)
- - **Automated scoring average**: 46.7
- - **Categories**: korean_nlu 100.0, reasoning 50.0, knowledge 75.0, instruction_following 66.7, code 0.0, safety 10.0, repetition_resistance 2.2, etc.
- - **Metrics**: avg TTFT 16.7 ms, avg TPS 142.5
-
- Details: `reports/2026-03-09_GGUF_DEPLOYMENT_AND_EVAL_REPORT.md`, `eval/results/frankenstallm-3b-v2/ollama_benchmark_summary.md`
-
- ## Sampling Config
-
- Based on the measured optimum (`t0.7_rep1.2`) from the ORPO evaluation grid.
-
- ### Recommended Parameters
-
- | Parameter | Value | Notes |
- |-----------|-------|-------|
- | `temperature` | **0.7** | balances creativity and coherence |
- | `repetition_penalty` (PyTorch) | **1.2** | suppresses repetition |
- | `repeat_penalty` (Ollama) | **1.2** | same value |
- | `top_p` | **0.9** | nucleus sampling |
- | `top_k` | **50** | |
- | `max_new_tokens` | **512** | |
- | `num_ctx` | **4096** | context window |
-
- ### Evaluation Results (ORPO eval grid)
-
- | Setting | 3-gram rep. | 4-gram rep. | EOS termination | Avg tokens |
- |---------|-------------|-------------|-----------------|------------|
- | **t0.7 / rep1.2 (recommended)** | **0.0%** | **0.0%** | **100%** | 189.2 |
- | t0.8 / rep1.05 (default) | 4.7% | 2.3% | 100% | 221.4 |
- | greedy (temp=0) | 30.89% | — | 66.67% | — |
-
- > Measured on Ollama Q4_K_M: 3-gram repetition 1.8% (natural word-level repeats), 100% EOS termination.
- > Versus greedy, repetition drops from **30.89% to 0%**.

- ### Transformers Usage Example

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

- model = AutoModelForCausalLM.from_pretrained("pathcosmos/frankenstallm")
- tokenizer = AutoTokenizer.from_pretrained("pathcosmos/frankenstallm")
-
- inputs = tokenizer("안녕하세요, 오늘 날씨가", return_tensors="pt")
- outputs = model.generate(
-     **inputs,
-     temperature=0.7,
-     repetition_penalty=1.2,
-     top_p=0.9,
-     top_k=50,
-     max_new_tokens=512,
-     do_sample=True,
  )
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- ## Usage
-
- - **Transformers**: the checkpoint loads as-is with `from_pretrained(...)`.
- - **GGUF / Ollama**:
-   ```bash
-   # Q4_K_M (recommended, 757MB)
-   ollama create frankenstallm-3b-v2:Q4_K_M -f gguf/Modelfile.3b-v2-Q4_K_M
-   ollama run frankenstallm-3b-v2:Q4_K_M
-
-   # Q8_0 (higher quality, 1.2GB)
-   ollama create frankenstallm-3b-v2:Q8_0 -f gguf/Modelfile.3b-v2-Q8_0
-   ollama run frankenstallm-3b-v2:Q8_0
-
-   # f16 (highest quality, 2.3GB)
-   ollama create frankenstallm-3b-v2:f16 -f gguf/Modelfile.3b-v2-f16
-   ollama run frankenstallm-3b-v2:f16
-   ```
-   The Modelfiles ship with the validated sampling parameters (`temperature=0.7`, `repeat_penalty=1.2`).
-
- ## License
-
- Distributed under the FRANKENSTALLM project license.

  ---
+ library_name: transformers
+ license: apache-2.0
  language:
+ - ko
+ - en
+ model_type: llama
  tags:
+ - 3b
+ - korean
+ - from-scratch
+ - orpo
+ - instruction-tuned
+ - preference-aligned
+ - fp8
+ - b200
+ - gguf
+ datasets:
+ - cc100
+ - allenai/c4
+ - heegyu/orca-math-korean-preference-cleaned
+ - nayohan/preference-collection-ko-full
+ - maywell/ko_Ultrafeedback_binarized
+ - HuggingFaceTB/cosmopedia
+ - wikimedia/wikipedia
+ pipeline_tag: text-generation
+ model-index:
+ - name: FRANKENSTALLM-3B
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       type: kobest
+       name: KoBEST (0-shot)
+     metrics:
+     - name: Average
+       type: accuracy
+       value: 52.75
+     - name: COPA
+       type: accuracy
+       value: 63.9
+     - name: HellaSwag-KO
+       type: accuracy
+       value: 38.0
+     - name: SentiNeg
+       type: accuracy
+       value: 62.5
+     - name: BoolQ
+       type: accuracy
+       value: 50.6
+     - name: WiC
+       type: accuracy
+       value: 48.8
+   - task:
+       type: text-generation
+     dataset:
+       type: haerae
+       name: HAE-RAE (0-shot)
+     metrics:
+     - name: Average
+       type: accuracy
+       value: 21.81
+   - task:
+       type: text-generation
+     dataset:
+       type: piqa
+       name: PIQA (0-shot)
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 59.9
+   - task:
+       type: text-generation
+     dataset:
+       type: ai2_arc
+       name: ARC-Easy (0-shot)
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 36.0
  ---

+ # FRANKENSTALLM 3B

+ > **A Korean 3B LLM built entirely from scratch — tokenizer, pretraining, SFT, and ORPO — on 8× NVIDIA B200 GPUs.**
+
+ | | |
+ |---|---|
+ | **Developer** | [pathcosmos](https://huggingface.co/pathcosmos) |
+ | **Parameters** | ~2.4B (3B-class with weight tying) |
+ | **Languages** | Korean (primary), English (secondary) |
+ | **License** | Apache 2.0 |
+ | **Training** | 3-phase: Pretrain → SFT → ORPO |
+ | **Hardware** | 8× NVIDIA B200 (FP8), ~86 hours total |
+
+ ---
+
+ ## Quick Start
+
+ ### Transformers

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

+ model_id = "pathcosmos/frankenstallm"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
  )
+
+ inputs = tokenizer(
+     "한국의 전통 음식 중 김치에 대해 설명해주세요.",
+     return_tensors="pt"
+ ).to(model.device)
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         do_sample=True,
+         temperature=0.7,
+         repetition_penalty=1.2,  # recommended
+         top_p=0.9,
+         max_new_tokens=512,
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

+ ### Ollama (GGUF)
+
+ ```bash
+ # Download the GGUF and its Modelfile
+ huggingface-cli download pathcosmos/frankenstallm \
+   gguf/frankenstallm-3b-v2-Q4_K_M.gguf \
+   gguf/Modelfile.3b-v2-Q4_K_M \
+   --local-dir ./frankenstallm
+
+ # Point the Modelfile's FROM line at the downloaded GGUF, then create the model
+ ollama create frankenstallm -f ./frankenstallm/gguf/Modelfile.3b-v2-Q4_K_M
+
+ # Run
+ ollama run frankenstallm
+ ```
+
+ ---
+
+ ## Model Highlights
150
+
151
+ - **From-scratch Korean tokenizer**: SentencePiece Unigram, 64K vocab, 99.95% Korean character coverage
152
+ - **3-phase training pipeline**: Pretrain (57K steps, ~60B tokens) → SFT (25.5K steps, 2.4M samples) → ORPO (10K steps, 630K preference pairs)
153
+ - **B200 FP8 native training**: TransformerEngine MXFP8 on NVIDIA B200 — 2× theoretical throughput vs BF16
154
+ - **GGUF deployment ready**: Q4_K_M (757MB), Q8_0 (1.2GB), F16 (2.3GB) with optimized Ollama Modelfiles
155
+
156
+ ---
157
+
158
+ ## Architecture
159
+
160
+ | Component | Value |
161
+ |-----------|-------|
162
+ | Type | Decoder-only Transformer (LLaMA-style) |
163
+ | Hidden size | 3,072 |
164
+ | Layers | 28 |
165
+ | Attention heads | 24 |
166
+ | KV heads | 8 (GQA 3:1) |
167
+ | FFN dim | 8,192 (SwiGLU) |
168
+ | Vocab size | 64,000 |
169
+ | Context length | 4,096 (trained at 2,048) |
170
+ | Position encoding | RoPE (θ=500,000) |
171
+ | Normalization | Pre-norm RMSNorm |
172
+ | Attention impl | FlashAttention-2 |
173
+ | Precision | FP8 (MXFP8 via TransformerEngine) |
174
+ | Weight tying | Yes (embedding ↔ lm_head) |
175
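From the table, head dim = 3,072 / 24 = 128, and GQA caches only the 8 KV heads; a back-of-envelope KV-cache estimate (my arithmetic, assuming 2-byte FP16/BF16 cache entries, not a measured figure):

```python
# KV-cache size estimate for the architecture above.
hidden, n_heads, n_kv_heads, n_layers, ctx = 3072, 24, 8, 28, 4096

head_dim = hidden // n_heads                                 # 128
kv_values_per_token = 2 * n_layers * n_kv_heads * head_dim   # K and V across all layers
bytes_per_token = kv_values_per_token * 2                    # 2 bytes per value (FP16/BF16)

print(head_dim)                        # 128
print(bytes_per_token)                 # 114688 (112 KiB per cached token)
print(bytes_per_token * ctx / 2**20)   # 448.0 MiB for a full 4096-token context
# Caching all 24 heads instead of the 8 KV heads would triple this: the GQA 3:1 saving.
```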
+
+ ---
+
+ ## Training Pipeline
+
+ ### Phase 1: Pretraining
+
+ | Detail | Value |
+ |--------|-------|
+ | Steps | 57,000 |
+ | Final loss | 1.466 |
+ | Tokens seen | ~60B (38.5B unique × ~1.5 epochs) |
+ | Duration | ~63 hours |
+ | Data | CC-100 KO, HPLT KO, C4 KO, NamuWiki, Wikipedia KO, Cosmopedia (EN) |
+ | Batch size | 5 × 8 GPU × 8 accum × 2,048 seq = ~655K tok/step |
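The batch-size row is a straight product of its four factors; a quick check of the arithmetic:

```python
micro_batch = 5   # sequences per GPU per micro-step
gpus = 8
grad_accum = 8
seq_len = 2048

tokens_per_step = micro_batch * gpus * grad_accum * seq_len
print(tokens_per_step)  # 655360, the "~655K tok/step" in the table
```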
+
+ ### Phase 2: Supervised Fine-Tuning (SFT)
+
+ | Detail | Value |
+ |--------|-------|
+ | Steps | 25,500 (early stop at 77.3%) |
+ | Best val_loss | 1.8851 (step 23,000) |
+ | Duration | ~15.5 hours |
+ | Data | 2,439,397 samples from 24 sources (7.48 GB) |
+ | Mix | 70% SFT + 30% pretrain replay (catastrophic-forgetting prevention) |
+ | Knowledge forgetting | 0.9% (19 datasets) |
+
+ ### Phase 3: ORPO (Odds Ratio Preference Optimization)
+
+ | Detail | Value |
+ |--------|-------|
+ | Steps | 9,997 (early convergence) |
+ | Best eval_loss | 1.625 |
+ | Preference accuracy | 76.02% |
+ | Reward margin | 0.6100 |
+ | Duration | ~7 hours |
+ | Data | ~630K preference pairs from 7 Korean HF datasets |
+ | Hyperparams | beta=0.25, lr=1.2e-5, eff_batch=128 |
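Since the reported losses are (approximately) mean token cross-entropies in nats, they convert to perplexities via `exp(loss)`; a quick sketch (note the ORPO objective adds an odds-ratio term on top of the NLL, so the Phase 3 conversion is only approximate):

```python
import math

# Cross-entropy (nats/token) -> perplexity.
def ppl(loss_nats: float) -> float:
    return math.exp(loss_nats)

print(round(ppl(1.8851), 2))  # 6.59  (SFT best val_loss)
print(round(ppl(1.625), 2))   # 5.08  (ORPO best eval_loss, approximate)
```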
+
+ **Total training time: ~86 hours on 8× B200**
+
+ ---
+
+ ## Benchmarks
+
+ ### Training Phase Progression (Base → SFT → ORPO)
+
+ | Benchmark | Base | SFT | ORPO | Δ (Base→ORPO) |
+ |-----------|:----:|:---:|:----:|:---:|
+ | **KoBEST Avg (0-shot)** | 43.7% | 43.3% | **52.8%** | **+9.1pp** |
+ | KoBEST COPA | 49.3% | 48.6% | **63.9%** | +14.6pp |
+ | KoBEST HellaSwag-KO | 21.6% | 19.8% | **38.0%** | +16.4pp |
+ | KoBEST SentiNeg | 48.6% | 49.1% | **62.5%** | +13.9pp |
+ | KoBEST BoolQ | 50.3% | 50.1% | 50.6% | +0.3pp |
+ | PIQA | 52.5% | 52.6% | **59.9%** | +7.4pp |
+ | ARC-Easy | 25.6% | 25.9% | **36.0%** | +10.4pp |
+ | HAE-RAE | 19.7% | 19.9% | 21.8% | +2.1pp |
+ | HellaSwag EN | 26.2% | 26.1% | 29.2% | +3.0pp |
+ | Greedy 3-gram repetition | 61.0% | 73.0% | **30.9%** | -30.1pp |
+ | EOS termination rate | 0% | 60% | **67%** | +67pp |
+ | PPL forgetting | — | 0.9% | 4.1% | within 15% ✅ |
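The KoBEST average can be sanity-checked from the five subtask scores (WiC 48.8 appears in the model-index front matter but not in this table); the recomputed mean of 52.76 matches the reported 52.75/52.8 up to rounding:

```python
kobest_orpo = {
    "COPA": 63.9,
    "HellaSwag-KO": 38.0,
    "SentiNeg": 62.5,
    "BoolQ": 50.6,
    "WiC": 48.8,  # from the model-index front matter
}
avg = sum(kobest_orpo.values()) / len(kobest_orpo)
print(round(avg, 2))  # 52.76
```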
+
+ ### 3B-class Model Comparison (Ollama, 35 tests)
+
+ | Model | Params | Korean NLU | Knowledge | Instruction | Reasoning | Avg Score |
+ |-------|:------:|:----------:|:---------:|:-----------:|:---------:|:---------:|
+ | Qwen 2.5 3B | 3B | 100.0 | 20.8 | 55.6 | 62.5 | **63.4** |
+ | Phi-4 Mini | 3.8B | 66.7 | 29.2 | 33.3 | **87.5** | 60.6 |
+ | **FRANKENSTALLM 3B** | **3B** | **100.0** | **75.0** | **66.7** | 50.0 | 46.7 |
+
+ > FRANKENSTALLM leads in **Korean NLU** (tied with Qwen), **Korean Knowledge** (75 vs 20.8/29.2), and **Instruction Following** (66.7 vs 55.6/33.3).
+
+ ### Inference Speed (Ollama, Q4_K_M)
+
+ | Model | Avg TTFT | TPS | Note |
+ |-------|:--------:|:---:|------|
+ | **FRANKENSTALLM 3B** | **16.7ms** | **142.5** | Fastest |
+ | Phi-4 Mini 3.8B | 25.6ms | 100.4 | |
+ | Qwen 2.5 3B | 28.2ms | 93.8 | |
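TTFT and TPS combine into a rough end-to-end latency estimate, latency ≈ TTFT + tokens / TPS; a sketch using the FRANKENSTALLM row (a first-order estimate that ignores prompt-length effects):

```python
ttft_s = 0.0167   # 16.7 ms time-to-first-token
tps = 142.5       # decode tokens per second

def est_latency(n_tokens: int) -> float:
    """Rough wall-clock estimate for generating n_tokens."""
    return ttft_s + n_tokens / tps

print(round(est_latency(512), 2))  # 3.61 seconds for a full 512-token completion
```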
+
+ ### Perplexity Preservation (ORPO Knowledge Retention)
+
+ | Dataset | Base PPL | ORPO PPL | Forgetting |
+ |---------|:--------:|:--------:|:----------:|
+ | Korean C4 | 5.72 | 5.87 | +2.7% |
+ | Korean Wiki | 11.84 | 12.21 | +3.2% |
+ | Max forgetting | — | — | 4.1% ✅ |
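Forgetting here is the relative perplexity increase, `(PPL_orpo - PPL_base) / PPL_base`; recomputing from the rounded PPLs shown lands within a tenth of a point of the table (which was presumably derived from unrounded values):

```python
def forgetting_pct(base_ppl: float, new_ppl: float) -> float:
    """Relative PPL increase, in percent."""
    return (new_ppl - base_ppl) / base_ppl * 100

print(round(forgetting_pct(5.72, 5.87), 1))    # 2.6  (table: +2.7%)
print(round(forgetting_pct(11.84, 12.21), 1))  # roughly 3.1 (table: +3.2%)
```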
+
+ ---
+
+ ## Training Data
+
+ ### Pretraining (~38.5B tokens)
+
+ | Category | Sources | Est. Tokens |
+ |----------|---------|:-----------:|
+ | Korean Web Crawl | C4 KO, CC-100 KO, HPLT KO | ~17.2B |
+ | Korean Encyclopedia | Wikipedia KO, NamuWiki (2 versions) | ~2.8B |
+ | English Educational | Cosmopedia (Stories, Web, Stanford, WikiHow, OpenStax, Khan) | ~5.7B |
+ | English Math/Science | AutoMathText, OpenWebMath, Proof-Pile-2 | ~8.5B |
+ | Code | StarCoder (filtered) | ~4.3B |
+
+ ### SFT (2.4M samples, 24 sources)
+
+ | Domain | Share | Key Datasets |
+ |--------|:-----:|--------------|
+ | Reasoning/CoT | 38% | reasoning_r1_1.4m, magpie_reasoning |
+ | Korean Instructions | 23% | korean_instruction_mix, open_korean_instructions, kullm_v2 |
+ | English General | 16% | openhermes_2.5, ultrachat_200k |
+ | Math | 12% | NuminaMath-CoT, orca-math-ko |
+ | Dialog/Code/Other | 11% | smol-koreantalk, Evol-Instruct-Code-80k-ko |
+
+ ### ORPO (~630K preference pairs, 7 sources)
+
+ | Dataset | Size | Domain |
+ |---------|:----:|--------|
+ | nayohan/preference-collection-ko-full | 4.9GB | General preference |
+ | heegyu/orca-math-korean-preference-cleaned | 1.6GB | Math reasoning |
+ | kuotient/orca-math-korean-dpo-pairs | 750MB | Math DPO |
+ | maywell/ko_Ultrafeedback_binarized | 394MB | Feedback alignment |
+ | tellang/yeji-preference-ko-v1 | 171MB | General preference |
+ | jojo0217/korean_rlhf_dataset | 137MB | RLHF pairs |
+ | lemon-mint/korean-realqa-reasoning-v01-preference | 58MB | QA reasoning |
+
+ ---
+
+ ## GGUF & Ollama
+
+ ### Available Quantizations
+
+ | File | Size | Description |
+ |------|:----:|-------------|
+ | `gguf/frankenstallm-3b-v2-Q4_K_M.gguf` | 757MB | **Recommended** — best size/quality balance |
+ | `gguf/frankenstallm-3b-v2-Q8_0.gguf` | 1.2GB | Higher quality |
+ | `gguf/frankenstallm-3b-v2-f16.gguf` | 2.3GB | Full precision |
+ | `model.safetensors` | 4.76GB | Transformers native (ORPO best, byte-fallback fixed) |
+
+ ### Recommended Sampling Parameters
+
+ | Parameter | Value | Notes |
+ |-----------|:-----:|-------|
+ | `temperature` | 0.7 | Optimal for Korean generation quality |
+ | `repeat_penalty` | 1.2 | **Required** — without it, greedy 3-gram repetition is 30.9% |
+ | `top_p` | 0.9 | Nucleus sampling |
+ | `top_k` | 50 | Top-k candidates |
+ | `max_tokens` | 512 | Max generation length |
+ | `num_ctx` | 4096 | Context window (do not exceed) |
+
+ > ⚠️ Always use `repeat_penalty >= 1.2`. With it, repetition drops to **0%**; without it, greedy decoding produces ~31% 3-gram repetition.
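These parameters map one-to-one onto Ollama Modelfile `PARAMETER` directives; a sketch of what a Modelfile with these settings looks like (the `FROM` path is a placeholder for a local GGUF, and `num_predict` is Ollama's name for the max-tokens limit; the shipped `Modelfile.3b-v2-Q4_K_M` may differ in detail):

```
FROM ./frankenstallm-3b-v2-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.2
PARAMETER top_p 0.9
PARAMETER top_k 50
PARAMETER num_predict 512
PARAMETER num_ctx 4096
```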
+
+ ---
+
+ ## Limitations
+
+ - **English performance is limited**: MMLU-EN ~23%, HellaSwag-EN ~29% — this is a Korean-focused model
+ - **Code generation**: near-zero capability (limited code in the training data)
+ - **Greedy repetition**: 30.9% 3-gram repetition without `repeat_penalty` — always sample with `repeat_penalty >= 1.2`
+ - **Safety**: safety-alignment data was not included in training; use with appropriate guardrails
+ - **Scale gap**: commercial 3B models are trained on trillions of tokens; this model saw ~60B, so expect lower overall benchmark scores
+
+ ---
+
+ ## Hardware & Training Environment
+
+ | Component | Specification |
+ |-----------|---------------|
+ | GPU | 8× NVIDIA B200 (183GB HBM3e each, ~1.47TB total) |
+ | FP8 Compute | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
+ | Interconnect | NVLink 5.0, NVSwitch all-to-all mesh |
+ | CPU | 2× AMD EPYC 9365 (72 cores, Zen 5) |
+ | RAM | 2.21 TB DDR5 |
+ | PyTorch | 2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA custom) |
+ | TransformerEngine | 2.10.0 |
+ | FlashAttention | 2.7.4 |
+ | NCCL | 2.28.9 |
+ | CUDA | 13.1 |
+ | Total training | ~86 hours (Pretrain 63h + SFT 15.5h + ORPO 7h) |
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{frankenstallm2026,
+   title={FRANKENSTALLM: A Korean 3B LLM Built From Scratch on B200 GPUs},
+   author={pathcosmos},
+   year={2026},
+   url={https://huggingface.co/pathcosmos/frankenstallm},
+   note={3-phase training (Pretrain, SFT, ORPO) with FP8 on 8x NVIDIA B200}
+ }
+ ```
+
+ ---

+ ## Links
+
+ - **GitHub**: [pathcosmos/FRANKENSTALLM](https://github.com/pathcosmos/FRANKENSTALLM) (full source code, training scripts, and builder's log)
+ - **HuggingFace**: [pathcosmos/frankenstallm](https://huggingface.co/pathcosmos/frankenstallm)