
korean_1b_fp8_run1 Comprehensive Evaluation Report

Evaluation date: 2026-02-26. Evaluation environment: NVIDIA B200 ×1 (inference), BF16; the evaluation took about 15 minutes.


Model Information

| Item | Value |
|---|---|
| Model name | korean_1b_fp8_run1 |
| Parameters | 1,189.7M (1.19B) |
| Architecture | Decoder-only Transformer, LLaMA-style |
| vocab_size | 64,000 |
| d_model | 2,048 |
| n_layers | 24 |
| n_heads | 16 |
| n_kv_heads (GQA) | 4 |
| d_ffn | 5,472 |
| Positional encoding | RoPE (theta=500,000) |
| Normalization | RMSNorm |
| Activation function | SwiGLU |
| Other | Weight Tying, FlashAttention-2, TransformerEngine FP8 (MXFP8BlockScaling) |

Training Settings

| Item | Value |
|---|---|
| Training steps | 34,000 steps |
| GPU environment | 8× NVIDIA B200 |
| Training precision | FP8 + BF16 mixed |
| Learning rate | 2.0e-4 |
| Batch size | 1.05M tok/step (8 GPU × 8 batch × 4 accum × 4096 seq) |
| Warmup | 2,000 steps |
| Training data | Korean Wikipedia + C4 Korean + Namuwiki (~8.91B tokens total, ~4 epochs) |
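
The batch size, step count, and epoch figures in the table are mutually consistent, which the short calculation below illustrates. It is a sketch using only the numbers reported above; the variable names are illustrative. The tokens-per-parameter ratio is included because about 20 tokens per parameter is a commonly quoted rule of thumb for compute-optimal training, which is an outside reference point and not a claim made by this report.

```python
# Sketch: check the training token budget against the reported settings.
gpus, micro_batch, grad_accum, seq_len = 8, 8, 4, 4096
steps = 34_000
dataset_tokens = 8.91e9      # reported corpus size (approximate)
n_params = 1_189.7e6         # reported parameter count

tokens_per_step = gpus * micro_batch * grad_accum * seq_len
total_tokens = tokens_per_step * steps
epochs = total_tokens / dataset_tokens
tokens_per_param = total_tokens / n_params

print(f"tokens/step      : {tokens_per_step:,}")
print(f"total tokens     : {total_tokens / 1e9:.2f}B")
print(f"epochs over data : {epochs:.2f}")
print(f"tokens per param : {tokens_per_param:.1f}")
```

The result is 1,048,576 tokens per step, about 35.65B tokens seen in total, roughly 4.0 passes over the corpus, and close to 30 tokens per parameter.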

Key Evaluation Results Summary

| Evaluation area | Key metric | Verdict |
|---|---|---|
| Perplexity (combined) | PPL=6.95, bits/tok=2.80 | Good (for the 1B scale) |
| Perplexity (C4) | PPL=5.67, bits/tok=2.50 | Excellent |
| Perplexity (Wiki) | PPL=11.66, bits/tok=3.54 | Acceptable |
| Perplexity (Namuwiki) | PPL=25.34, bits/tok=4.66 | Needs improvement |
| Top-1 Accuracy | 56.18% | Good |
| Top-5 Accuracy | 72.35% | Good |
| Top-10 Accuracy | 77.75% | Good |
| Mean Entropy | 2.24 nats (3.23 bits) | Healthy |
| Generation quality | Korean grammar is good; facts are only partially correct | Expected for 1B |
| Repetition degeneration | 3/10 degenerate (30%) | Needs mitigation |
| Code/math | Very limited | Expected |

Detailed Reports

| File | Contents |
|---|---|
| 01_perplexity_report.md | Detailed perplexity analysis by data source |
| 02_generation_report.md | Detailed generation-quality analysis for 10 prompts |
| 03_repetition_calibration_report.md | Repetition analysis + calibration check |
| 04_token_analysis_comparison_report.md | Token-level NLL analysis + comparison across temperatures |

Perplexity Analysis Summary

PPL by Data Source

C4 Korean (general web text):  PPL =  5.67  bits/tok = 2.50  ← Excellent
Wikipedia:                     PPL = 11.66  bits/tok = 3.54  ← Acceptable
Namuwiki:                      PPL = 25.34  bits/tok = 4.66  ← Needs improvement
Combined (weighted average):   PPL =  6.95  bits/tok = 2.80  ← Good

The low PPL on C4 indicates that the model has adapted well to the distribution of everyday web text. Wikipedia PPL is lower than Namuwiki PPL, which is interpreted as a consequence of Wikipedia's formal, standardized style being better represented in the training data. The high PPL on Namuwiki reflects its large share of unstructured text (colloquial language, slang, neologisms, table markup) and is a typical result for a 1B-scale model.
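
The bits/tok column is just the base-2 logarithm of PPL, so the two columns can be cross-checked. The sketch below does that for the reported values. It also shows one common way to form a combined perplexity, exponentiating the token-weighted mean NLL; the report does not say how its weighted average was computed, so `combined_ppl` and its toy inputs are assumptions for illustration only.

```python
import math

def bits_per_token(ppl):
    """Convert perplexity to bits per token: log2(PPL)."""
    return math.log2(ppl)

def combined_ppl(mean_nll_nats, token_counts):
    """Perplexity of the pooled data: exp of the token-weighted mean NLL.

    Assumed aggregation method; the report only says "weighted average".
    """
    total = sum(token_counts)
    weighted = sum(n * c for n, c in zip(mean_nll_nats, token_counts)) / total
    return math.exp(weighted)

# Values reported in this summary.
reported = {"C4": 5.67, "Wikipedia": 11.66, "Namuwiki": 25.34, "Combined": 6.95}
for name, ppl in reported.items():
    print(f"{name:10s} PPL={ppl:6.2f}  bits/tok={bits_per_token(ppl):.2f}")

# Toy example with made-up inputs: two equally sized sources with PPL 4 and 16
# pool to the geometric mean, PPL 8.
print(combined_ppl([math.log(4), math.log(16)], [1, 1]))
```

The loop reproduces the reported 2.50, 3.54, 4.66, and 2.80 bits/tok.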

Comparison Baselines (for reference)

| Model | Scale | Korean PPL (reference value) |
|---|---|---|
| GPT-2 Small | 125M | ~30–40 (English) |
| small_fp8_run1 (this project) | 125M | ~18–22 (estimated) |
| korean_1b_fp8_run1 (this model) | 1.19B | 6.95 (combined) |
| LLaMA-2 7B (no Korean adaptation) | 7B | - |

Scaling from 125M to 1.19B improved PPL by a factor of roughly 2.5 or more (from an estimated 18–22 to 6.95), which is consistent with scaling laws (Chinchilla).


Generation Quality Analysis Summary

Based on greedy decoding results for 10 prompts:

Successful cases (7/10)

  • Everyday conversation / expository text: natural Korean sentence construction, with stable handling of particles and endings
  • Dictionary-style definition requests: tends to follow the word-explanation format well
  • Simple list generation: picks up the item-enumeration pattern

Problem cases (3/10)

  • Repetition degeneration: the same phrase repeats and the sentence never converges. This pattern is especially common under greedy decoding and can be mitigated with temperature sampling or a repetition penalty.
  • Factual errors: inaccurate content in, for example, dates related to King Sejong and ingredient ratios in a kimchi-jjigae recipe. A 1B-parameter model has limited capacity for fine-grained factual recall, so this is expected.
  • Code/math: very limited performance on Python code generation and arithmetic. The pretraining data contained no code or math data, so this is expected.
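
The two mitigations named for repetition degeneration, temperature sampling and a repetition penalty, can be sketched in plain Python over a list of logits. This is a minimal illustration, not the project's decoding code: the function names, the penalty value 1.2, the temperature 0.8, and the n-gram check used to flag looping output are all assumptions. The penalty follows the common convention of dividing positive logits and multiplying negative ones.

```python
import math
import random

def repeated_ngram_ratio(tokens, n=4):
    """Fraction of n-grams that duplicate an earlier one (0.0 means no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the score of every token id that has already been generated."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Sample a token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                               # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy token sequences: a looping output versus a varied one.
looping = [5, 6, 7, 8] * 6
varied = list(range(24))
print(repeated_ngram_ratio(looping), repeated_ngram_ratio(varied))

# One decoding step: penalize already-generated ids, then sample.
step_logits = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1])
next_id = sample_with_temperature(step_logits)
```

A high repeated n-gram ratio is one simple signal for flagging degenerate samples; the threshold at which a sample counts as degenerate is a judgment call and is not specified in this summary.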

์บ˜๋ฆฌ๋ธŒ๋ ˆ์ด์…˜ ๋ถ„์„ ์š”์•ฝ

Top-K Accuracy

K Accuracy
Top-1 56.18%
Top-5 72.35%
Top-10 77.75%

A Top-1 accuracy of 56% is a healthy level for a language model. That the model ranks the correct next token first 56% of the time is read here as a sign of a balanced predictive distribution, without marked overconfidence or underconfidence. Accuracy alone does not measure calibration, so this reading should be taken together with the entropy figures.
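
For reference, Top-K accuracy is the fraction of positions where the true next token is among the K highest-scoring candidates. The sketch below shows that definition on toy data; the function name and the example logits are illustrative and are not taken from `eval/comprehensive_eval.py`.

```python
def top_k_accuracy(logits_per_position, targets, k):
    """Fraction of positions whose target id is among the k largest logits."""
    hits = 0
    for logits, target in zip(logits_per_position, targets):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)

# Toy example: 4 positions over a 3-token vocabulary (made-up numbers).
toy_logits = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.9],
    [0.0, 0.1, 3.0],
    [0.4, 0.5, 0.6],
]
toy_targets = [1, 2, 2, 0]

for k in (1, 2, 3):
    print(f"Top-{k}: {top_k_accuracy(toy_logits, toy_targets, k):.2f}")
# prints Top-1: 0.50, Top-2: 0.75, Top-3: 1.00
```

Top-K accuracy is non-decreasing in K by construction, which matches the 56.18% → 72.35% → 77.75% progression in the table.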

Entropy Analysis

Mean Entropy: 2.24 nats (3.23 bits)

An entropy of 2.24 nats means that, on average, the model spreads its probability mass over the equivalent of roughly 9.4 tokens when predicting (e^2.24 ≈ 9.4). This indicates a healthy distribution that is neither too peaked (risk of greedy collapse) nor too flat (risk of random output).
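
The unit conversion and the "effective number of tokens" reading can be checked directly. The sketch below uses only the reported mean entropy; `entropy_nats` is an illustrative helper showing the definition on a toy distribution.

```python
import math

def entropy_nats(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

mean_entropy_nats = 2.24                        # reported value
bits = mean_entropy_nats / math.log(2)          # nats -> bits
effective_tokens = math.exp(mean_entropy_nats)  # size of an equally uncertain uniform choice

print(f"{bits:.2f} bits, about {effective_tokens:.1f} effective tokens")
# prints "3.23 bits, about 9.4 effective tokens"

# Toy check: a uniform distribution over 4 tokens has entropy ln(4).
print(entropy_nats([0.25] * 4), math.log(4))
```

The exponential of the entropy is the size of a uniform distribution with the same uncertainty, which is why it can be read as an effective token count.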


Conclusion

Overall Assessment

Overall, good performance for a 1B Korean pretrained model.

This is a 1.19B-parameter decoder-only model trained on about 8.91B tokens from Korean Wikipedia, C4 Korean, and Namuwiki. It was trained for 34,000 steps on 8× B200 GPUs with mixed FP8 + BF16 precision, a setup that approximates the Chinchilla-optimal compute budget.


Strengths

  1. C4 PPL=5.67: strong language-modeling performance on general web text. The model has learned the patterns of everyday Korean text well.
  2. Top-1 Accuracy 56%: maintains a healthy calibration state without overconfidence.
  3. Korean grammar handling: particles (은/는/이/가/을/를) and endings (했다/합니다/~이다) are handled stably, producing grammatically natural sentences.
  4. Everyday prompt handling: picks up basic text-generation patterns such as explanations, definitions, and lists.

Weaknesses

  1. Namuwiki PPL=25.34: relatively weak on unstructured text (colloquial language, slang, neologisms, table markup). This stems from domain imbalance.
  2. Repetition degeneration 30%: repetition degeneration occurred in 3 of 10 generations. It is especially pronounced under greedy decoding and is expected to improve at the SFT or RLHF stage.
  3. Limited factual accuracy: recall of specific facts, such as dates related to King Sejong or food recipes, is poor. This is an inherent limitation of a 1B-parameter model and is expected to improve at 7B scale and above.
  4. Code/math nearly unusable: expected, because the pretraining data contained no code or math. Specialized fine-tuning is required.

Recommended Next Steps

| Priority | Task | Expected effect |
|---|---|---|
| 1 | Instruction Tuning (SFT) | Mitigate repetition degeneration; add instruction-following ability |
| 2 | DPO/RLHF | Improve generation quality and factual accuracy |
| 3 | Domain adaptation | Improve PPL with additional Namuwiki / specialized-domain data |
| 4 | Scale-up (7B) | Expected to improve factual recall and repetition at the same time |
| 5 | Quantization + deployment | GGUF Q4_K_M + Ollama serving (the Phase B pipeline can be reused) |

Evaluation Environment

| Item | Value |
|---|---|
| GPU | NVIDIA B200 ×1 (inference) |
| Inference dtype | BF16 |
| Evaluation time | About 15 minutes (all 6 sections) |
| Evaluation date | 2026-02-26 |
| Evaluation script | eval/comprehensive_eval.py |