Vjeong Claude Opus 4.6 committed on
Commit
8a58ffe
·
1 Parent(s): f494c9e

Initial commit: LLM-1B-Lab project setup


LLaMA-style 1.1B parameter Decoder-Only Transformer for educational purposes.
Includes modularized llm_lab package, notebooks, and configuration files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (50)
  1. .gitignore +47 -0
  2. CLAUDE.md +131 -0
  3. LLM_Foundation_Model.code-workspace +8 -0
  4. _archive/llm-1b-data-pipeline.py +906 -0
  5. _archive/llm-1b-evaluation.py +1455 -0
  6. _archive/llm-1b-model.py +791 -0
  7. _archive/llm-1b-trainer.py +1108 -0
  8. llm_lab/__init__.py +30 -0
  9. llm_lab/config/__init__.py +7 -0
  10. llm_lab/config/data_config.py +41 -0
  11. llm_lab/config/eval_config.py +20 -0
  12. llm_lab/config/model_config.py +53 -0
  13. llm_lab/config/train_config.py +114 -0
  14. llm_lab/data/__init__.py +11 -0
  15. llm_lab/data/dataset.py +218 -0
  16. llm_lab/data/diagnostics.py +153 -0
  17. llm_lab/data/pipeline.py +156 -0
  18. llm_lab/data/tokenizer.py +196 -0
  19. llm_lab/evaluation/__init__.py +21 -0
  20. llm_lab/evaluation/attention_viz.py +176 -0
  21. llm_lab/evaluation/checklist.py +99 -0
  22. llm_lab/evaluation/dynamics.py +242 -0
  23. llm_lab/evaluation/full_evaluator.py +222 -0
  24. llm_lab/evaluation/generation.py +200 -0
  25. llm_lab/evaluation/perplexity.py +172 -0
  26. llm_lab/evaluation/runner.py +56 -0
  27. llm_lab/evaluation/scaling.py +153 -0
  28. llm_lab/model/__init__.py +14 -0
  29. llm_lab/model/attention.py +134 -0
  30. llm_lab/model/feedforward.py +48 -0
  31. llm_lab/model/llm_model.py +200 -0
  32. llm_lab/model/norm.py +40 -0
  33. llm_lab/model/rope.py +103 -0
  34. llm_lab/model/transformer_block.py +65 -0
  35. llm_lab/model/utils.py +85 -0
  36. llm_lab/training/__init__.py +12 -0
  37. llm_lab/training/checkpoint.py +159 -0
  38. llm_lab/training/metrics.py +112 -0
  39. llm_lab/training/optimizer.py +54 -0
  40. llm_lab/training/runner.py +68 -0
  41. llm_lab/training/scheduler.py +68 -0
  42. llm_lab/training/trainer.py +351 -0
  43. llm_lab/utils/__init__.py +5 -0
  44. llm_lab/utils/device.py +94 -0
  45. llm_lab/utils/seed.py +9 -0
  46. notebooks/01_data_pipeline.ipynb +169 -0
  47. notebooks/02_model.ipynb +212 -0
  48. notebooks/03_training.ipynb +211 -0
  49. notebooks/04_evaluation.ipynb +188 -0
  50. requirements.txt +8 -0
.gitignore ADDED
@@ -0,0 +1,47 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.egg-info/
+ *.egg
+ dist/
+ build/
+ *.so
+
+ # Virtual environments
+ venv/
+ .venv/
+ env/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # Jupyter Notebook
+ .ipynb_checkpoints/
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # ML / Training artifacts
+ *.pt
+ *.pth
+ *.bin
+ *.ckpt
+ checkpoints/
+ wandb/
+ runs/
+
+ # Data
+ *.log
+ *.csv
+ *.tsv
+ data/
+
+ # Secrets
+ .env
+ *.key
CLAUDE.md ADDED
@@ -0,0 +1,131 @@
+ # LLM-1B-Lab
+
+ Educational implementation of a 1.1B-parameter LLaMA-style decoder-only Transformer.
+ Designed so that a deep learning beginner can experience the full process of training and evaluating an LLM from start to finish.
+
+ ## Project Structure
+
+ ```
+ LLM_Foundation_Model/
+ ├── CLAUDE.md
+ ├── requirements.txt
+ ├── llm_lab/                      # Python package (core code)
+ │   ├── __init__.py
+ │   ├── config/                   # Config dataclasses
+ │   │   ├── model_config.py       # ModelConfig (debug_10m / small_100m / base_1b presets)
+ │   │   ├── data_config.py        # DataConfig (dataset, tokenizer, batch settings)
+ │   │   ├── train_config.py       # TrainConfig (LR, scheduler, checkpoints, wandb)
+ │   │   └── eval_config.py        # EvalConfig (evaluation parameters)
+ │   ├── model/                    # Model architecture
+ │   │   ├── norm.py               # RMSNorm
+ │   │   ├── rope.py               # RotaryPositionalEmbedding (RoPE)
+ │   │   ├── attention.py          # GroupedQueryAttention (GQA)
+ │   │   ├── feedforward.py        # SwiGLUFeedForward
+ │   │   ├── transformer_block.py  # TransformerBlock (Pre-LN)
+ │   │   ├── llm_model.py          # LLMModel (full model + generate)
+ │   │   └── utils.py              # count_parameters_detailed, estimate_memory_gb
+ │   ├── data/                     # Data pipeline
+ │   │   ├── tokenizer.py          # Tokenizer (SentencePiece / BPE / HuggingFace)
+ │   │   ├── dataset.py            # PackedStreamingDataset, ValidationDataset, _collate_fn
+ │   │   ├── pipeline.py           # create_train_dataloader, setup_data_pipeline
+ │   │   └── diagnostics.py        # DataPipelineDiagnostics
+ │   ├── training/                 # Training loop
+ │   │   ├── scheduler.py          # CosineWarmupScheduler
+ │   │   ├── checkpoint.py         # CheckpointManager (Google Drive support)
+ │   │   ├── metrics.py            # MetricsTracker (wandb integration)
+ │   │   ├── optimizer.py          # create_optimizer (weight-decay split)
+ │   │   ├── trainer.py            # Trainer (gradient accumulation, mixed precision)
+ │   │   └── runner.py             # start_training (one-line run helper)
+ │   ├── evaluation/               # Evaluation & analysis
+ │   │   ├── perplexity.py         # PerplexityEvaluator (incl. per-position loss)
+ │   │   ├── generation.py         # GenerationEvaluator (varied prompts)
+ │   │   ├── scaling.py            # ScalingAnalyzer (Chinchilla scaling law)
+ │   │   ├── dynamics.py           # TrainingDynamicsAnalyzer (loss/LR/grad visualization)
+ │   │   ├── attention_viz.py      # AttentionVisualizer (per-head heatmaps)
+ │   │   ├── full_evaluator.py     # FullEvaluator (comprehensive evaluation + report)
+ │   │   ├── checklist.py          # InsightChecklist (training-insight checklist)
+ │   │   └── runner.py             # run_evaluation (one-line run helper)
+ │   └── utils/                    # Shared utilities
+ │       ├── device.py             # auto_configure, get_device, detect_gpu_info
+ │       └── seed.py               # set_seed
+ ├── notebooks/                    # Jupyter notebooks (setup + execution)
+ │   ├── 01_data_pipeline.ipynb
+ │   ├── 02_model.ipynb
+ │   ├── 03_training.ipynb
+ │   └── 04_evaluation.ipynb
+ └── _archive/                     # Original single-file backups
+     ├── llm-1b-model.py
+     ├── llm-1b-data-pipeline.py
+     ├── llm-1b-trainer.py
+     └── llm-1b-evaluation.py
+ ```
+
+ ## Tech Stack
+
+ - **Model**: LLaMA-style decoder-only Transformer (RMSNorm, RoPE, GQA, SwiGLU, weight tying)
+ - **Training**: gradient accumulation, mixed precision (bf16/fp16), cosine LR + warmup, activation checkpointing (loop sketched below)
+ - **Data**: HuggingFace streaming (FineWeb-Edu), BPE tokenizer, sequence packing
+ - **Checkpointing**: automatic save/restore to Google Drive (Colab Pro+ environment)
+ - **Evaluation**: perplexity, text generation, scaling law, attention visualization
+ - **Target environment**: Google Colab Pro+ (A100 40GB)
+
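The training loop itself lives in `llm_lab/training/trainer.py` and is not shown on this page. The sketch below only illustrates the gradient-accumulation step named above, with a tiny stand-in model and random data so it runs on CPU; the real Trainer additionally applies bf16 autocast, warmup scheduling, and checkpointing.

```python
import torch
import torch.nn as nn

# Minimal gradient-accumulation sketch: average gradients over
# `accum_steps` micro-batches before each optimizer step.
model = nn.Linear(16, 16)                       # stand-in for LLMModel
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
micro_batches = [torch.randn(4, 16) for _ in range(8)]
accum_steps = 4                                 # effective batch = 4 micro-batches

opt.zero_grad(set_to_none=True)
for step, x in enumerate(micro_batches):
    loss = model(x).pow(2).mean()
    (loss / accum_steps).backward()             # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        opt.zero_grad(set_to_none=True)
```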
+ ## Dependency Graph (no cycles)
+
+ ```
+ config       (no dependencies)
+   ↓
+ utils      → config
+   ↓
+ model      → config
+   ↓
+ data       → config
+   ↓
+ training   → config, utils
+   ↓
+ evaluation → config
+ ```
+
+ ## Model Presets
+
+ | Preset | Params | dim | layers | heads | kv_heads | Purpose |
+ |--------|--------|-----|--------|-------|----------|---------|
+ | `debug_10m` | ~10M | 256 | 6 | 8 | 4 | Quick checks / debugging |
+ | `small_100m` | ~100M | 768 | 12 | 12 | 4 | Intermediate experiments |
+ | `base_1b` | ~1.1B | 2048 | 22 | 32 | 8 | Full-scale training |
+
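As a sanity check on the "~1.1B" figure, the arithmetic below reproduces it from the `base_1b` row, assuming weight tying (listed in the tech stack) and a SwiGLU hidden size of roughly 8/3·dim rounded up to a multiple of 256 — the exact hidden size lives in `model_config.py` and may differ.

```python
vocab, dim, layers, heads, kv_heads = 32_000, 2048, 22, 32, 8
head_dim = dim // heads                            # 64
kv_dim = kv_heads * head_dim                       # 512 (GQA: only 8 K/V heads)
ffn_hidden = ((8 * dim // 3 + 255) // 256) * 256   # 5632 (assumed rounding rule)

embed = vocab * dim                                # counted once (weight tying)
attn = 2 * dim * dim + 2 * dim * kv_dim            # Q and O full, K and V reduced
ffn = 3 * dim * ffn_hidden                         # gate, up, down projections
total = embed + layers * (attn + ffn)              # RMSNorm weights are negligible
print(f"~{total / 1e9:.2f}B")                      # ≈ 1.06B, i.e. "~1.1B"
```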
+ ## Quick Start
+
+ ```python
+ from llm_lab.config import ModelConfig, DataConfig, TrainConfig
+ from llm_lab.model import LLMModel
+ from llm_lab.data import setup_data_pipeline
+ from llm_lab.training import start_training
+ from llm_lab.evaluation import run_evaluation
+
+ # 1. Model
+ model = LLMModel(ModelConfig.base_1b())
+
+ # 2. Data
+ tok, train_dl, val_dl = setup_data_pipeline("pretrained")
+
+ # 3. Training
+ trainer = start_training(model, train_dl, val_dl)
+
+ # 4. Evaluation
+ report = run_evaluation(model, tok, val_dl,
+                         metrics_history=trainer.metrics.history)
+ ```
+
+ ## Code Conventions
+
+ - **Language**: code in English, comments/docstrings in Korean (with educational explanations)
+ - **Type hints**: typing annotations on every function
+ - **Import order**: stdlib → torch → llm_lab (absolute paths) → local (relative paths)
+ - **Dataclasses**: every config is defined as a `@dataclass` with defaults
+ - **Error handling**: external dependencies (matplotlib, wandb, etc.) are optional via `try/except ImportError` (see the sketch after this list)
+
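The optional-dependency convention in the last bullet looks like this in practice; the `try/except` blocks at the top of `_archive/llm-1b-evaluation.py` follow the same shape. `log_metrics` here is a hypothetical helper for illustration only.

```python
try:
    import wandb
    HAS_WANDB = True
except ImportError:
    HAS_WANDB = False  # the package still works; logging degrades to stdout

def log_metrics(step: int, metrics: dict) -> None:
    """Hypothetical helper: log to wandb when available, else print."""
    if HAS_WANDB:
        wandb.log(metrics, step=step)
    else:
        print(f"step {step}: {metrics}")
```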
+ ## Caveats
+
+ - `torch` may not be installed in the local environment (the project assumes it runs on Colab Pro+)
+ - `pip install torch datasets tokenizers sentencepiece transformers wandb matplotlib numpy`
+ - The four original files (`_archive/`) and the modularized `llm_lab/` package contain identical logic (only the import paths changed)
LLM_Foundation_Model.code-workspace ADDED
@@ -0,0 +1,8 @@
+ {
+     "folders": [
+         {
+             "path": "."
+         }
+     ],
+     "settings": {}
+ }
_archive/llm-1b-data-pipeline.py ADDED
@@ -0,0 +1,906 @@
+ """
+ LLM-1B-Lab: Data Pipeline
+ ==============================
+ Tokenizer preparation → data streaming → sequence packing → batch construction
+
+ Overall flow:
+     FineWeb-Edu (HuggingFace)
+     → load via streaming (nothing stored on disk)
+     → tokenization (BPE, vocab=32K)
+     → sequence packing (concatenate documents up to max_seq_len)
+     → batch construction (input_ids, targets)
+     → transfer to GPU
+
+ Required packages:
+     pip install datasets tokenizers sentencepiece wandb
+ """
+
+ import os
+ import time
+ import json
+ from pathlib import Path
+ from dataclasses import dataclass, field
+ from typing import Optional, Iterator, List, Dict, Any
+
+ import torch
+ from torch.utils.data import IterableDataset, DataLoader
+
+ # ============================================================================
+ # 1. Data configuration
+ # ============================================================================
+
+ @dataclass
+ class DataConfig:
+     """Data pipeline settings.
+
+     Defaults chosen for the constraints of a Colab Pro+ environment:
+     - streaming mode to minimize disk usage
+     - sequence packing to maximize GPU utilization without padding
+     - on-the-fly preprocessing to save memory
+     """
+     # ── Dataset ──
+     dataset_name: str = "HuggingFaceFW/fineweb-edu"
+     dataset_subset: str = "sample-10BT"  # 10B-token sample
+     dataset_split: str = "train"
+     text_column: str = "text"  # name of the column holding the text
+
+     # ── Tokenizer ──
+     tokenizer_type: str = "sentencepiece"  # "sentencepiece" or "hf"
+     # path to a previously trained tokenizer (train a new one if unset)
+     tokenizer_path: Optional[str] = None
+     vocab_size: int = 32_000
+
+     # ── Sequences ──
+     max_seq_len: int = 2048
+     # whether to insert a document-separator token (marks boundaries when packing)
+     use_eos_separator: bool = True
+
+     # ── Batching ──
+     batch_size: int = 4       # micro-batch (per GPU)
+     num_workers: int = 2      # number of DataLoader workers
+     prefetch_factor: int = 4  # number of batches to prepare in advance
+
+     # ── Tokenizer training (when training a new one) ──
+     tokenizer_train_samples: int = 50_000  # documents used for training
+     tokenizer_save_dir: str = "./tokenizer"
+
+     # ── Validation data ──
+     val_ratio: float = 0.001  # hold out 0.1% of the data for validation
+
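Since every field above carries a default, a smoke-test configuration is a one-liner. This is only a usage sketch of the dataclass defined above, not a recommended setting.

```python
# Small overrides for a quick local smoke test; every other field keeps
# its default from the DataConfig dataclass above.
config = DataConfig(max_seq_len=512, batch_size=2, num_workers=0)
print(config.dataset_name, config.vocab_size)  # HuggingFaceFW/fineweb-edu 32000
```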
+ # ============================================================================
+ # 2. Tokenizer wrapper
+ # ============================================================================
+
+ class Tokenizer:
+     """Unified tokenizer wrapper.
+
+     Three supported paths:
+       1) load an existing SentencePiece model
+       2) train a new tokenizer with the HuggingFace tokenizers library
+       3) load a pretrained HF tokenizer (e.g., the LLaMA tokenizer)
+
+     Why not implement one from scratch?
+     - Training a BPE tokenizer is large-scale text statistics, with little
+       direct bearing on understanding the model architecture.
+     - You should still understand how the tokenizer works (BPE merge rules).
+
+     BPE (Byte Pair Encoding) in a nutshell:
+       1) split the text into bytes/characters
+       2) repeatedly merge the most frequent adjacent pair
+       3) repeat until vocab_size is reached
+       → frequent words become a single token; rare words split into several
+     """
+
+     def __init__(self, config: DataConfig):
+         self.config = config
+         self._tokenizer = None
+         self.vocab_size = config.vocab_size
+
+         # special token IDs (overwritten once a tokenizer is loaded)
+         self.bos_id: int = 1  # beginning of sequence
+         self.eos_id: int = 2  # end of sequence
+         self.pad_id: int = 0  # padding
+
+     # ──────────────────────────────────────────────
+     # Method 1: load a SentencePiece model
+     # ──────────────────────────────────────────────
+
+     def load_sentencepiece(self, model_path: str):
+         """Load an existing SentencePiece model."""
+         import sentencepiece as spm
+
+         self._tokenizer = spm.SentencePieceProcessor()
+         self._tokenizer.Load(model_path)
+
+         self.vocab_size = self._tokenizer.GetPieceSize()
+         self.bos_id = self._tokenizer.bos_id()
+         self.eos_id = self._tokenizer.eos_id()
+         self.pad_id = self._tokenizer.pad_id()
+         self._encode_fn = self._tokenizer.Encode
+         self._decode_fn = self._tokenizer.Decode
+
+         print(f"[Tokenizer] SentencePiece loaded: vocab_size={self.vocab_size}")
+
+     # ──────────────────────────────────────────────
+     # Method 2: train BPE with HuggingFace tokenizers
+     # ──────────────────────────────────────────────
+
+     def train_bpe(self, text_iterator: Iterator[str], save_dir: Optional[str] = None):
+         """Train a BPE tokenizer from scratch.
+
+         Args:
+             text_iterator: iterator yielding the training text
+             save_dir: where to save the result
+
+         Key trade-offs:
+         - larger vocab_size: common expressions become one token → shorter sequences
+         - smaller vocab_size: fewer embedding parameters, but longer sequences
+         - 32K is a good balance for English
+         """
+         from tokenizers import Tokenizer as HFTokenizer
+         from tokenizers.models import BPE
+         from tokenizers.trainers import BpeTrainer
+         from tokenizers.pre_tokenizers import ByteLevel
+         from tokenizers.processors import TemplateProcessing
+
+         print("[Tokenizer] starting BPE tokenizer training...")
+
+         # create the BPE model
+         tokenizer = HFTokenizer(BPE(unk_token="<unk>"))
+         tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
+
+         # define special tokens
+         special_tokens = ["<pad>", "<s>", "</s>", "<unk>"]
+
+         # configure the trainer
+         trainer = BpeTrainer(
+             vocab_size=self.config.vocab_size,
+             special_tokens=special_tokens,
+             min_frequency=2,  # merge only pairs seen at least twice
+             show_progress=True,
+         )
+
+         # run training
+         tokenizer.train_from_iterator(text_iterator, trainer=trainer)
+
+         # post-processing: add BOS/EOS automatically
+         tokenizer.post_processor = TemplateProcessing(
+             single="<s> $A </s>",
+             special_tokens=[("<s>", 1), ("</s>", 2)],
+         )
+
+         self._tokenizer = tokenizer
+         self.vocab_size = tokenizer.get_vocab_size()
+         self.pad_id = 0
+         self.bos_id = 1
+         self.eos_id = 2
+
+         self._encode_fn = lambda text: tokenizer.encode(text).ids
+         self._decode_fn = lambda ids: tokenizer.decode(ids)
+
+         # save
+         save_dir = save_dir or self.config.tokenizer_save_dir
+         os.makedirs(save_dir, exist_ok=True)
+         tokenizer.save(os.path.join(save_dir, "tokenizer.json"))
+         # save metadata
+         meta = {
+             "vocab_size": self.vocab_size,
+             "bos_id": self.bos_id,
+             "eos_id": self.eos_id,
+             "pad_id": self.pad_id,
+         }
+         with open(os.path.join(save_dir, "tokenizer_meta.json"), "w") as f:
+             json.dump(meta, f, indent=2)
+
+         print(f"[Tokenizer] training complete: vocab_size={self.vocab_size}")
+         print(f"[Tokenizer] saved to: {save_dir}")
+
+     # ──────────────────────────────────────────────
+     # Method 3: load a pretrained HF tokenizer
+     # ──────────────────────────────────────────────
+
+     def load_pretrained_hf(self, name_or_path: str = "meta-llama/Llama-2-7b-hf"):
+         """Load a pretrained tokenizer from HuggingFace.
+
+         The most convenient option. The LLaMA tokenizer is BPE-based with a 32K vocab.
+         Note: meta-llama models may require HF access approval.
+         Alternative: mistralai/Mistral-7B-v0.1 (no approval required).
+         """
+         from transformers import AutoTokenizer
+
+         print(f"[Tokenizer] loading HF tokenizer: {name_or_path}")
+         tokenizer = AutoTokenizer.from_pretrained(name_or_path)
+
+         self._tokenizer = tokenizer
+         self.vocab_size = tokenizer.vocab_size
+         self.bos_id = tokenizer.bos_token_id or 1
+         self.eos_id = tokenizer.eos_token_id or 2
+         self.pad_id = tokenizer.pad_token_id or 0
+
+         self._encode_fn = lambda text: tokenizer.encode(text, add_special_tokens=False)
+         self._decode_fn = lambda ids: tokenizer.decode(ids)
+
+         print(f"[Tokenizer] loaded: vocab_size={self.vocab_size}")
+
+     def load_trained_hf(self, path: str):
+         """Reload a tokenizer previously trained with train_bpe()."""
+         from tokenizers import Tokenizer as HFTokenizer
+
+         tokenizer = HFTokenizer.from_file(os.path.join(path, "tokenizer.json"))
+         with open(os.path.join(path, "tokenizer_meta.json"), "r") as f:
+             meta = json.load(f)
+
+         self._tokenizer = tokenizer
+         self.vocab_size = meta["vocab_size"]
+         self.bos_id = meta["bos_id"]
+         self.eos_id = meta["eos_id"]
+         self.pad_id = meta["pad_id"]
+
+         self._encode_fn = lambda text: tokenizer.encode(text).ids
+         self._decode_fn = lambda ids: tokenizer.decode(ids)
+
+         print(f"[Tokenizer] loaded: vocab_size={self.vocab_size}")
+
+     # ──────────────────────────────────────────────
+     # Common interface
+     # ──────────────────────────────────────────────
+
+     def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
+         """Text → list of token IDs."""
+         ids = self._encode_fn(text)
+         if add_special_tokens:
+             ids = [self.bos_id] + ids + [self.eos_id]
+         return ids
+
+     def decode(self, ids: List[int]) -> str:
+         """List of token IDs → text."""
+         return self._decode_fn(ids)
+
+     def __len__(self) -> int:
+         return self.vocab_size
+
+
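Before moving on, here is the merge loop from the docstring's "BPE in a nutshell" made concrete — a toy sketch of the principle, not the tokenizers library's actual (much faster, byte-level) implementation.

```python
from collections import Counter

def apply_merge(symbols, pair, merged):
    """Replace every occurrence of `pair` in a symbol list with `merged`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged); i += 2
        else:
            out.append(symbols[i]); i += 1
    return out

def toy_bpe(words, num_merges=5):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(w) for w in words]             # 1) split into characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # 2) most frequent pair
        merges.append(best)
        corpus = [apply_merge(s, best, best[0] + best[1]) for s in corpus]
    return merges                                 # 3) stop at the merge budget

print(toy_bpe(["low", "lower", "lowest", "low"]))
# [('l', 'o'), ('lo', 'w'), ...] — the frequent word "low" quickly becomes one token
```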
+ # ============================================================================
+ # 3. Packed streaming dataset
+ # ============================================================================
+
+ class PackedStreamingDataset(IterableDataset):
+     """Streaming dataset with sequence packing.
+
+     Why sequence packing?
+     - The usual approach truncates each document to max_seq_len and pads
+       → wasted GPU.
+     - Sequence packing concatenates documents until max_seq_len is exactly
+       full → 100% utilization.
+
+     How it works:
+         doc1 (300 tokens) + doc2 (1500 tokens) + doc3 (248 tokens) = 2048 tokens
+         → [doc1][EOS][doc2][EOS][doc3][EOS][...an exact fit, no padding]
+
+     Why streaming?
+     - The FineWeb-Edu 10B-token sample is tens of GB even compressed.
+     - It cannot be fully downloaded within Colab's disk limit (~200GB).
+     - Streaming reads only as much as needed over the network.
+
+     Training caveats:
+     - EOS tokens at document boundaries let the model recognize where
+       documents end.
+     - Even without a cross-document attention mask, EOS acts as a natural
+       boundary.
+     """
+
+     def __init__(
+         self,
+         tokenizer: Tokenizer,
+         config: DataConfig,
+         split: str = "train",
+         seed: int = 42,
+     ):
+         super().__init__()
+         self.tokenizer = tokenizer
+         self.config = config
+         self.split = split
+         self.seed = seed
+         self.max_seq_len = config.max_seq_len
+
+     def _load_dataset(self):
+         """Load the HuggingFace dataset in streaming mode."""
+         from datasets import load_dataset
+
+         ds = load_dataset(
+             self.config.dataset_name,
+             name=self.config.dataset_subset,
+             split=self.config.dataset_split,
+             streaming=True,  # the key: streaming mode
+             trust_remote_code=True,
+         )
+
+         # shuffle (buffer-based approximate shuffle when streaming)
+         ds = ds.shuffle(seed=self.seed, buffer_size=10_000)
+
+         return ds
+
+     def _tokenize_and_pack(self, dataset) -> Iterator[Dict[str, torch.Tensor]]:
+         """Tokenize documents and pack them into sequences.
+
+         Yields:
+             {"input_ids": (max_seq_len,), "targets": (max_seq_len,)}
+
+         targets are input_ids shifted by one position:
+             input_ids: [A, B, C, D, E]
+             targets:   [B, C, D, E, F]
+         → the model sees A and predicts B, sees B and predicts C, ...
+         """
+         buffer: List[int] = []  # token buffer
+
+         for example in dataset:
+             text = example[self.config.text_column]
+             if not text or not text.strip():
+                 continue
+
+             # tokenize (without special tokens)
+             token_ids = self.tokenizer.encode(text, add_special_tokens=False)
+
+             if not token_ids:
+                 continue
+
+             # append EOS (marks the document boundary)
+             if self.config.use_eos_separator:
+                 token_ids.append(self.tokenizer.eos_id)
+
+             # add to the buffer
+             buffer.extend(token_ids)
+
+             # once the buffer is full enough, emit sequences
+             # the +1 is for building targets (input + the next token)
+             while len(buffer) >= self.max_seq_len + 1:
+                 # take max_seq_len + 1 tokens
+                 chunk = buffer[: self.max_seq_len + 1]
+                 buffer = buffer[self.max_seq_len + 1 :]
+
+                 # input_ids: first through second-to-last
+                 input_ids = torch.tensor(chunk[:-1], dtype=torch.long)
+                 # targets: second through last (shifted by one)
+                 targets = torch.tensor(chunk[1:], dtype=torch.long)
+
+                 yield {"input_ids": input_ids, "targets": targets}
+
+     def __iter__(self) -> Iterator[Dict[str, torch.Tensor]]:
+         """The iterator the DataLoader invokes.
+
+         Multi-worker support:
+         - each worker processes a stream shuffled with a different seed
+         - this minimizes data duplication across workers
+         """
+         worker_info = torch.utils.data.get_worker_info()
+
+         if worker_info is not None:
+             # multi-worker: a different seed per worker
+             worker_seed = self.seed + worker_info.id
+         else:
+             worker_seed = self.seed
+
+         # load the dataset with the per-worker seed
+         self.seed = worker_seed
+         dataset = self._load_dataset()
+
+         return self._tokenize_and_pack(dataset)
+
+
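The `max_seq_len + 1` bookkeeping in `_tokenize_and_pack`, shown on a five-token chunk (so max_seq_len would be 4):

```python
chunk = [10, 11, 12, 13, 14]   # max_seq_len + 1 tokens pulled from the buffer
input_ids = chunk[:-1]         # [10, 11, 12, 13]
targets = chunk[1:]            # [11, 12, 13, 14]
for x, y in zip(input_ids, targets):
    print(f"see {x} -> predict {y}")  # every position gets a training signal
```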
+ # ============================================================================
+ # 4. Validation dataset (fixed size)
+ # ============================================================================
+
+ class ValidationDataset:
+     """Validation dataset.
+
+     Pulls a fixed amount of data from the streaming dataset up front and
+     keeps it in memory. Comparisons are only meaningful when every
+     evaluation runs on identical data.
+     """
+
+     def __init__(
+         self,
+         tokenizer: Tokenizer,
+         config: DataConfig,
+         num_samples: int = 100,
+         seed: int = 9999,
+     ):
+         self.tokenizer = tokenizer
+         self.config = config
+         self.num_samples = num_samples
+         self.samples: List[Dict[str, torch.Tensor]] = []
+
+         self._prepare(seed)
+
+     def _prepare(self, seed: int):
+         """Pre-extract validation samples from the dataset."""
+         from datasets import load_dataset
+
+         print(f"[Validation] preparing {self.num_samples} validation samples...")
+
+         ds = load_dataset(
+             self.config.dataset_name,
+             name=self.config.dataset_subset,
+             split=self.config.dataset_split,
+             streaming=True,
+             trust_remote_code=True,
+         )
+         # different seed so we do not overlap the training data; skips past the start
+         ds = ds.shuffle(seed=seed, buffer_size=5_000)
+
+         buffer: List[int] = []
+         count = 0
+
+         for example in ds:
+             if count >= self.num_samples:
+                 break
+
+             text = example[self.config.text_column]
+             if not text or not text.strip():
+                 continue
+
+             token_ids = self.tokenizer.encode(text, add_special_tokens=False)
+             if not token_ids:
+                 continue
+
+             token_ids.append(self.tokenizer.eos_id)
+             buffer.extend(token_ids)
+
+             while len(buffer) >= self.config.max_seq_len + 1 and count < self.num_samples:
+                 chunk = buffer[: self.config.max_seq_len + 1]
+                 buffer = buffer[self.config.max_seq_len + 1 :]
+
+                 self.samples.append({
+                     "input_ids": torch.tensor(chunk[:-1], dtype=torch.long),
+                     "targets": torch.tensor(chunk[1:], dtype=torch.long),
+                 })
+                 count += 1
+
+         print(f"[Validation] {len(self.samples)} samples ready")
+
+     def get_dataloader(self, batch_size: int) -> DataLoader:
+         """Return the validation DataLoader."""
+         return DataLoader(
+             self.samples,
+             batch_size=batch_size,
+             shuffle=False,
+             num_workers=0,
+             collate_fn=_collate_fn,
+         )
+
+
+ # ============================================================================
+ # 5. DataLoader creation utilities
+ # ============================================================================
+
+ def _collate_fn(batch: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
+     """Stack the samples of a batch into single tensors.
+
+     Thanks to sequence packing every sample already has the same length
+     (max_seq_len), so no extra padding is needed.
+     """
+     return {
+         "input_ids": torch.stack([s["input_ids"] for s in batch]),
+         "targets": torch.stack([s["targets"] for s in batch]),
+     }
+
+
+ def create_train_dataloader(
+     tokenizer: Tokenizer,
+     config: DataConfig,
+     seed: int = 42,
+ ) -> DataLoader:
+     """Create the training DataLoader.
+
+     Returns:
+         an effectively endless streaming DataLoader
+
+     Usage:
+         dataloader = create_train_dataloader(tokenizer, config)
+         for step, batch in enumerate(dataloader):
+             input_ids = batch["input_ids"].to(device)  # (B, seq_len)
+             targets = batch["targets"].to(device)      # (B, seq_len)
+             logits, loss = model(input_ids, targets)
+             ...
+     """
+     dataset = PackedStreamingDataset(
+         tokenizer=tokenizer,
+         config=config,
+         split="train",
+         seed=seed,
+     )
+
+     dataloader = DataLoader(
+         dataset,
+         batch_size=config.batch_size,
+         num_workers=config.num_workers,
+         prefetch_factor=config.prefetch_factor if config.num_workers > 0 else None,
+         pin_memory=True,  # speeds up host-to-GPU transfer
+         collate_fn=_collate_fn,
+     )
+
+     return dataloader
+
+
+ # ============================================================================
+ # 6. Tokenizer training helper
+ # ============================================================================
+
+ def train_tokenizer_from_dataset(config: DataConfig) -> Tokenizer:
+     """Train a BPE tokenizer from the dataset.
+
+     There is no need to use all the data; 50K documents are plenty, since
+     the tokenizer vocab only has to reflect the statistics of the corpus.
+     """
+     from datasets import load_dataset
+
+     print(f"[Train Tokenizer] training a tokenizer on {config.dataset_name}")
+     print(f"[Train Tokenizer] training documents: {config.tokenizer_train_samples:,}")
+
+     # build the text iterator
+     ds = load_dataset(
+         config.dataset_name,
+         name=config.dataset_subset,
+         split=config.dataset_split,
+         streaming=True,
+         trust_remote_code=True,
+     )
+
+     def text_iterator():
+         count = 0
+         for example in ds:
+             if count >= config.tokenizer_train_samples:
+                 break
+             text = example[config.text_column]
+             if text and text.strip():
+                 yield text
+                 count += 1
+                 if count % 10_000 == 0:
+                     print(f"  ... {count:,} documents processed")
+
+     # train the tokenizer
+     tokenizer = Tokenizer(config)
+     tokenizer.train_bpe(text_iterator(), save_dir=config.tokenizer_save_dir)
+
+     return tokenizer
+
+
+ # ============================================================================
+ # 7. Data pipeline statistics / diagnostics
+ # ============================================================================
+
+ class DataPipelineDiagnostics:
+     """Diagnoses the performance and quality of the data pipeline.
+
+     Things to verify before training:
+       1) tokenizer quality: average tokens/document, unknown-token ratio
+       2) packing efficiency: real-token ratio vs. padding ratio
+       3) throughput: tokens/sec (checks for a data-loading bottleneck)
+       4) batch layout: shape and dtype correctness
+     """
+
+     @staticmethod
+     def check_tokenizer_quality(
+         tokenizer: Tokenizer,
+         config: DataConfig,
+         num_samples: int = 1000,
+     ):
+         """Diagnose tokenizer quality."""
+         from datasets import load_dataset
+
+         print("\n" + "=" * 60)
+         print("📊 Tokenizer quality diagnostics")
+         print("=" * 60)
+
+         ds = load_dataset(
+             config.dataset_name,
+             name=config.dataset_subset,
+             split=config.dataset_split,
+             streaming=True,
+             trust_remote_code=True,
+         )
+
+         token_counts = []
+         char_counts = []
+         sample_count = 0
+
+         for example in ds:
+             if sample_count >= num_samples:
+                 break
+             text = example[config.text_column]
+             if not text or not text.strip():
+                 continue
+
+             tokens = tokenizer.encode(text)
+             token_counts.append(len(tokens))
+             char_counts.append(len(text))
+             sample_count += 1
+
+         avg_tokens = sum(token_counts) / len(token_counts)
+         avg_chars = sum(char_counts) / len(char_counts)
+         compression_ratio = avg_chars / avg_tokens  # characters per token
+
+         print(f"  documents analyzed: {len(token_counts):,}")
+         print(f"  avg tokens/document: {avg_tokens:.1f}")
+         print(f"  avg chars/document: {avg_chars:.1f}")
+         print(f"  compression ratio (chars/token): {compression_ratio:.2f}")
+         print("  → 3.5-4.5 is normal for English")
+         print(f"  min tokens: {min(token_counts)}, max: {max(token_counts)}")
+
+         # decode round-trip test
+         test_text = "The quick brown fox jumps over the lazy dog."
+         encoded = tokenizer.encode(test_text)
+         decoded = tokenizer.decode(encoded)
+         roundtrip_ok = test_text.strip() in decoded.strip()
+         print(f"\n  round-trip test: {'✅ passed' if roundtrip_ok else '❌ failed'}")
+         print(f"  original: {test_text}")
+         print(f"  encoded: {encoded[:20]}{'...' if len(encoded) > 20 else ''}")
+         print(f"  decoded: {decoded}")
+
+     @staticmethod
+     def benchmark_throughput(
+         dataloader: DataLoader,
+         num_batches: int = 50,
+         seq_len: int = 2048,
+     ):
+         """Measure data-loading throughput.
+
+         The key diagnostic for whether data loading bottlenecks GPU training.
+         Goal: data loading must be faster than the GPU compute
+         (data loading ≠ bottleneck).
+         """
+         print("\n" + "=" * 60)
+         print("⚡ Data-loading throughput benchmark")
+         print("=" * 60)
+
+         total_tokens = 0
+         start_time = time.time()
+
+         for i, batch in enumerate(dataloader):
+             if i >= num_batches:
+                 break
+             batch_tokens = batch["input_ids"].numel()
+             total_tokens += batch_tokens
+
+             if (i + 1) % 10 == 0:
+                 elapsed = time.time() - start_time
+                 tps = total_tokens / elapsed
+                 print(f"  Batch {i+1}: {tps:,.0f} tokens/sec")
+
+         elapsed = time.time() - start_time
+         tps = total_tokens / elapsed
+
+         print(f"\n  total batches: {num_batches}")
+         print(f"  total tokens: {total_tokens:,}")
+         print(f"  elapsed: {elapsed:.2f}s")
+         print(f"  average throughput: {tps:,.0f} tokens/sec")
+         print("\n  💡 against an A100 training throughput of ~50-80K tokens/sec:")
+         if tps > 80_000:
+             print("  ✅ data loading is not the bottleneck")
+         elif tps > 30_000:
+             print("  ⚠️ borderline - consider increasing num_workers")
+         else:
+             print("  ❌ data loading is the bottleneck! tune num_workers/prefetch")
+
+     @staticmethod
+     def inspect_batch(batch: Dict[str, torch.Tensor], tokenizer: Tokenizer):
+         """Inspect a single batch in detail."""
+         print("\n" + "=" * 60)
+         print("🔍 Batch inspection")
+         print("=" * 60)
+
+         input_ids = batch["input_ids"]
+         targets = batch["targets"]
+
+         print(f"  input_ids shape: {input_ids.shape}")
+         print(f"  targets shape: {targets.shape}")
+         print(f"  dtype: {input_ids.dtype}")
+         print(f"  value range: [{input_ids.min().item()}, {input_ids.max().item()}]")
+
+         # verify the shift relation: targets[i] == input_ids[i+1]
+         shift_correct = (input_ids[:, 1:] == targets[:, :-1]).float().mean().item()
+         print(f"  shift consistency: {shift_correct*100:.1f}% (should be 100%)")
+
+         # EOS token distribution (document boundaries)
+         eos_count = (input_ids == tokenizer.eos_id).sum().item()
+         total_tokens = input_ids.numel()
+         print(f"  EOS tokens: {eos_count} / {total_tokens} ({eos_count/total_tokens*100:.2f}%)")
+
+         # decoded preview of the first sample
+         first_sample = input_ids[0][:100].tolist()
+         decoded_preview = tokenizer.decode(first_sample)
+         print(f"\n  first sample decoded (first 100 tokens):")
+         print(f"  {decoded_preview[:300]}...")
+
+
+ # ============================================================================
+ # 8. Full pipeline integration (Quick Start)
+ # ============================================================================
+
+ def setup_data_pipeline(
+     tokenizer_mode: str = "train_new",
+     tokenizer_path: Optional[str] = None,
+     config: Optional[DataConfig] = None,
+ ) -> tuple:
+     """Set up the whole data pipeline in one call.
+
+     Args:
+         tokenizer_mode:
+             "train_new"    - train a new BPE tokenizer
+             "load_trained" - load a previously trained tokenizer
+             "pretrained"   - use a pretrained HuggingFace tokenizer
+         tokenizer_path:
+             "train_new"    → save path (default: ./tokenizer)
+             "load_trained" → path to the saved tokenizer
+             "pretrained"   → HF model name (default: mistralai/Mistral-7B-v0.1)
+
+     Returns:
+         (tokenizer, train_dataloader, val_dataloader)
+
+     Usage (Colab):
+         # Option 1: train a new tokenizer
+         tok, train_dl, val_dl = setup_data_pipeline("train_new")
+
+         # Option 2: load an existing tokenizer
+         tok, train_dl, val_dl = setup_data_pipeline("load_trained", "./tokenizer")
+
+         # Option 3: pretrained tokenizer (simplest)
+         tok, train_dl, val_dl = setup_data_pipeline("pretrained")
+     """
+     config = config or DataConfig()
+
+     print("=" * 60)
+     print("🚀 Data pipeline setup")
+     print("=" * 60)
+
+     # ── Step 1: tokenizer ──
+     tokenizer = Tokenizer(config)
+
+     if tokenizer_mode == "train_new":
+         tokenizer = train_tokenizer_from_dataset(config)
+     elif tokenizer_mode == "load_trained":
+         path = tokenizer_path or config.tokenizer_save_dir
+         tokenizer.load_trained_hf(path)
+     elif tokenizer_mode == "pretrained":
+         name = tokenizer_path or "mistralai/Mistral-7B-v0.1"
+         tokenizer.load_pretrained_hf(name)
+     else:
+         raise ValueError(f"Unknown tokenizer_mode: {tokenizer_mode}")
+
+     # ── Step 2: training DataLoader ──
+     print("\n[DataLoader] creating the training DataLoader...")
+     train_dataloader = create_train_dataloader(tokenizer, config)
+
+     # ── Step 3: validation DataLoader ──
+     print("\n[DataLoader] creating the validation DataLoader...")
+     val_dataset = ValidationDataset(tokenizer, config, num_samples=100)
+     val_dataloader = val_dataset.get_dataloader(batch_size=config.batch_size)
+
+     print("\n" + "=" * 60)
+     print("✅ Data pipeline ready!")
+     print(f"  tokenizer vocab: {tokenizer.vocab_size:,}")
+     print(f"  sequence length: {config.max_seq_len}")
+     print(f"  batch size: {config.batch_size}")
+     print(f"  tokens/batch: {config.batch_size * config.max_seq_len:,}")
+     print("=" * 60)
+
+     return tokenizer, train_dataloader, val_dataloader
+
+
+ # ============================================================================
+ # 9. Verification script
+ # ============================================================================
+
+ if __name__ == "__main__":
+     """
+     Run locally or on Colab to verify the pipeline.
+
+     How to run:
+         python data_pipeline.py
+
+     or on Colab:
+         !pip install datasets tokenizers sentencepiece
+         %run data_pipeline.py
+     """
+     print("=" * 70)
+     print("LLM-1B-Lab: data pipeline verification")
+     print("=" * 70)
+
+     # ── quick check: exercise the pipeline with a dummy tokenizer ──
+     print("\n[Test 1] verify the pipeline structure with a dummy tokenizer")
+
+     # dummy tokenizer (tests without the real dataset)
+     class DummyTokenizer:
+         """Simple character-level tokenizer for testing."""
+         def __init__(self, vocab_size=256):
+             self.vocab_size = vocab_size
+             self.eos_id = 2
+             self.bos_id = 1
+             self.pad_id = 0
+
+         def encode(self, text, add_special_tokens=False):
+             # map each character to its ASCII value (test-only shortcut)
+             ids = [min(ord(c), self.vocab_size - 1) for c in text]
+             if add_special_tokens:
+                 ids = [self.bos_id] + ids + [self.eos_id]
+             return ids
+
+         def decode(self, ids):
+             return "".join(chr(min(i, 127)) for i in ids if i > 2)
+
+         def __len__(self):
+             return self.vocab_size
+
+     config = DataConfig(max_seq_len=64, batch_size=2)  # small settings
+     dummy_tok = DummyTokenizer()
+
+     # packing test on dummy data
+     print("\n[Test 2] verify the sequence-packing logic")
+
+     buffer = []
+     test_docs = [
+         "Hello world! This is document one. " * 5,
+         "Second document here with different content. " * 8,
+         "Third doc. " * 20,
+         "A " * 200,
+     ]
+
+     for doc in test_docs:
+         tokens = dummy_tok.encode(doc)
+         tokens.append(dummy_tok.eos_id)
+         buffer.extend(tokens)
+
+     seq_len = config.max_seq_len
+     packed_count = 0
+     while len(buffer) >= seq_len + 1:
+         chunk = buffer[: seq_len + 1]
+         buffer = buffer[seq_len + 1 :]
+         input_ids = torch.tensor(chunk[:-1], dtype=torch.long)
+         targets = torch.tensor(chunk[1:], dtype=torch.long)
+
+         # verify the shift relation
+         assert (input_ids[1:] == targets[:-1]).all(), "Shift relation broken!"
+         packed_count += 1
+
+     print(f"  documents: {len(test_docs)}")
+     print(f"  total tokens: {sum(len(dummy_tok.encode(d)) + 1 for d in test_docs)}")
+     print(f"  packed sequences: {packed_count}")
+     print(f"  sequence length: {seq_len}")
+     print(f"  leftover buffer: {len(buffer)} tokens")
+     print("  ✅ shift-relation check passed")
+
+     # batch construction test
+     print("\n[Test 3] verify batch construction")
+
+     samples = []
+     buffer2 = []
+     for doc in test_docs * 10:  # generate enough data
+         tokens = dummy_tok.encode(doc)
+         tokens.append(dummy_tok.eos_id)
+         buffer2.extend(tokens)
+
+     while len(buffer2) >= seq_len + 1 and len(samples) < 10:
+         chunk = buffer2[: seq_len + 1]
+         buffer2 = buffer2[seq_len + 1 :]
+         samples.append({
+             "input_ids": torch.tensor(chunk[:-1], dtype=torch.long),
+             "targets": torch.tensor(chunk[1:], dtype=torch.long),
+         })
+
+     batch = _collate_fn(samples[:config.batch_size])
+     print(f"  input_ids shape: {batch['input_ids'].shape}")
+     print(f"  targets shape: {batch['targets'].shape}")
+     print(f"  dtype: {batch['input_ids'].dtype}")
+
+     expected_shape = (config.batch_size, seq_len)
+     assert batch["input_ids"].shape == expected_shape, f"Shape mismatch: {batch['input_ids'].shape} != {expected_shape}"
+     print(f"  ✅ batch shape check passed: {expected_shape}")
+
+     # verify EOS tokens are present
+     eos_found = (batch["input_ids"] == dummy_tok.eos_id).any().item()
+     print(f"  ✅ EOS tokens present: {eos_found}")
+
+     print("\n" + "=" * 70)
+     print("✅ Data pipeline structure verified!")
+     print()
+     print("Next step: test against the real dataset")
+     print("  tokenizer, train_dl, val_dl = setup_data_pipeline('pretrained')")
+     print("  DataPipelineDiagnostics.check_tokenizer_quality(tokenizer, DataConfig())")
+     print("  DataPipelineDiagnostics.benchmark_throughput(train_dl)")
+     print("=" * 70)
_archive/llm-1b-evaluation.py ADDED
@@ -0,0 +1,1455 @@
+ """
+ LLM-1B-Lab: Evaluation module
+ =====================================
+ Evaluates the quality of the trained model from multiple angles and
+ analyzes the insights gathered during training.
+
+ Evaluation areas:
+   1. Perplexity measurement     — the standard quantitative metric for language models
+   2. Text generation quality    — qualitative evaluation (varied prompts)
+   3. Scaling-law analysis       — comparing 10M → 100M → 1B
+   4. Training-dynamics analysis — loss curves, LR, gradient patterns
+   5. Attention visualization    — analyzing where the model "looks"
+   6. Comprehensive report       — summarizing training insights
+
+ Required:
+     pip install matplotlib numpy
+ """
+
+ import math
+ import time
+ import json
+ from pathlib import Path
+ from dataclasses import dataclass, field
+ from typing import Optional, List, Dict, Any, Tuple
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.utils.data import DataLoader
+
+ try:
+     import matplotlib
+     matplotlib.use("Agg")  # Colab/server friendly (headless backend)
+     import matplotlib.pyplot as plt
+     import matplotlib.ticker as ticker
+     HAS_MATPLOTLIB = True
+ except ImportError:
+     HAS_MATPLOTLIB = False
+
+ try:
+     import numpy as np
+     HAS_NUMPY = True
+ except ImportError:
+     HAS_NUMPY = False
+
+
+ # ============================================================================
+ # 1. Evaluation configuration
+ # ============================================================================
+
+ @dataclass
+ class EvalConfig:
+     """Evaluation parameters."""
+     # ── Perplexity ──
+     eval_batch_size: int = 4
+     max_eval_batches: int = 100  # cap on evaluation batches
+
+     # ── Generation ──
+     max_new_tokens: int = 200
+     temperature: float = 0.8
+     top_k: int = 50
+     top_p: float = 0.9
+     num_samples: int = 3  # generations per prompt
+
+     # ── Output ──
+     save_dir: str = "./eval_results"
+     plot_dpi: int = 150
+
+
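The `temperature` / `top_k` / `top_p` fields above are consumed by `LLMModel.generate` (defined in the model file, not shown on this page). As a reference for what those knobs conventionally do, here is a standalone sketch of the usual filtering order — the common recipe, not necessarily the project's exact code.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8,
                      top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """logits: (vocab,). Apply temperature, then top-k, then nucleus (top-p)."""
    logits = logits / temperature                      # <1 sharpens, >1 flattens
    # top-k: mask everything below the k-th largest logit
    kth = torch.topk(logits, top_k).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # top-p (nucleus): drop the tail once cumulative probability exceeds p
    sorted_logits, idx = torch.sort(logits, descending=True)
    cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum > top_p
    remove[1:] = remove[:-1].clone()   # shift right: always keep the top token
    remove[0] = False
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(0, idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

print(sample_next_token(torch.randn(32_000)))  # one sampled token id
```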
+ # ============================================================================
+ # 2. Perplexity evaluator
+ # ============================================================================
+
+ class PerplexityEvaluator:
+     """Measures perplexity (PPL).
+
+     What is perplexity?
+         PPL = exp(average cross-entropy loss)
+
+     Intuition:
+     - PPL = 1: perfect prediction (impossible)
+     - PPL = 10: like picking among 10 candidates at every step
+     - PPL = 100: like picking among 100 candidates (close to random)
+     - PPL = 32000: random choice over the whole vocab (an untrained model)
+
+     Rough targets for a good 1B model (English web text):
+     - trained on 5B tokens:  PPL ~30-40
+     - trained on 10B tokens: PPL ~20-30
+     - trained on 20B tokens: PPL ~15-25
+
+     Measurement:
+     - compute cross-entropy over every token of the validation set
+     - average per token, then apply exp()
+     - padding tokens are excluded (ignore_index=-100)
+     """
+
+     def __init__(self, config: EvalConfig):
+         self.config = config
+
+     @torch.no_grad()
+     def evaluate(
+         self,
+         model: nn.Module,
+         dataloader: DataLoader,
+         device: torch.device,
+         dtype: torch.dtype = torch.bfloat16,
+         desc: str = "Evaluation",
+     ) -> Dict[str, float]:
+         """Measure perplexity.
+
+         Returns:
+             {
+                 "loss": mean cross-entropy loss,
+                 "perplexity": exp(loss),
+                 "num_tokens": total tokens evaluated,
+                 "num_batches": batches evaluated,
+             }
+         """
+         model.eval()
+
+         total_loss = 0.0
+         total_tokens = 0
+         num_batches = 0
+
+         print(f"\n📊 {desc}")
+         start_time = time.time()
+
+         for i, batch in enumerate(dataloader):
+             if i >= self.config.max_eval_batches:
+                 break
+
+             input_ids = batch["input_ids"].to(device)
+             targets = batch["targets"].to(device)
+
+             with torch.amp.autocast(device_type="cuda", dtype=dtype, enabled=(dtype != torch.float32)):
+                 logits, _ = model(input_ids)
+
+             # per-token cross-entropy (reduction='none')
+             # logits: (B, S, V) → (B*S, V)
+             # targets: (B, S) → (B*S,)
+             loss_per_token = F.cross_entropy(
+                 logits.view(-1, logits.size(-1)),
+                 targets.view(-1),
+                 ignore_index=-100,
+                 reduction="none",
+             )
+
+             # count only valid tokens (those not equal to -100)
+             valid_mask = (targets.view(-1) != -100)
+             valid_tokens = valid_mask.sum().item()
+
+             total_loss += loss_per_token[valid_mask].sum().item()
+             total_tokens += valid_tokens
+             num_batches += 1
+
+             if (i + 1) % 20 == 0:
+                 running_ppl = math.exp(min(total_loss / max(total_tokens, 1), 20))
+                 print(f"  Batch {i+1}/{self.config.max_eval_batches}: running PPL = {running_ppl:.2f}")
+
+         elapsed = time.time() - start_time
+         avg_loss = total_loss / max(total_tokens, 1)
+         perplexity = math.exp(min(avg_loss, 100))  # guard against overflow
+
+         results = {
+             "loss": round(avg_loss, 4),
+             "perplexity": round(perplexity, 2),
+             "num_tokens": total_tokens,
+             "num_batches": num_batches,
+             "eval_time_sec": round(elapsed, 1),
+         }
+
+         print("  ────────────────────────────────")
+         print(f"  Loss:       {results['loss']:.4f}")
+         print(f"  Perplexity: {results['perplexity']:.2f}")
+         print(f"  tokens evaluated: {total_tokens:,}")
+         print(f"  elapsed: {elapsed:.1f}s")
+
+         return results
+
+     @torch.no_grad()
+     def evaluate_per_position(
+         self,
+         model: nn.Module,
+         dataloader: DataLoader,
+         device: torch.device,
+         dtype: torch.dtype = torch.bfloat16,
+         max_batches: int = 50,
+     ) -> List[float]:
+         """Measure loss by position within the sequence.
+
+         Key points:
+         - positions 0-10: loss is high (not enough context yet)
+         - positions 100+: loss settles lower (context is being used)
+         - this pattern illustrates the Transformer's in-context learning ability
+         """
+         model.eval()
+         seq_len = None
+         position_loss_sum = None
+         position_count = None
+
+         for i, batch in enumerate(dataloader):
+             if i >= max_batches:
+                 break
+
+             input_ids = batch["input_ids"].to(device)
+             targets = batch["targets"].to(device)
+             B, S = targets.shape
+
+             if seq_len is None:
+                 seq_len = S
+                 position_loss_sum = torch.zeros(S, device=device)
+                 position_count = torch.zeros(S, device=device)
+
+             with torch.amp.autocast(device_type="cuda", dtype=dtype, enabled=(dtype != torch.float32)):
+                 logits, _ = model(input_ids)
+
+             # per-token loss of shape (B, S)
+             loss_per_token = F.cross_entropy(
+                 logits.view(-1, logits.size(-1)),
+                 targets.view(-1),
+                 ignore_index=-100,
+                 reduction="none",
+             ).view(B, S)
+
+             valid_mask = (targets != -100).float()
+             position_loss_sum += (loss_per_token * valid_mask).sum(dim=0)
+             position_count += valid_mask.sum(dim=0)
+
+         # mean loss per position
+         position_avg_loss = (position_loss_sum / position_count.clamp(min=1)).cpu().tolist()
+         return position_avg_loss
+
+
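To make the docstring's intuition table concrete: since PPL = exp(mean token-level loss), each loss band maps directly to a perplexity.

```python
import math

for loss in (10.37, 4.6, 3.4):
    print(f"loss {loss:5.2f} -> PPL {math.exp(loss):10.1f}")
# loss 10.37 -> ~32000  (ln 32000 ≈ 10.37: a random model over a 32K vocab)
# loss  4.60 ->   ~100  (picking among ~100 candidates)
# loss  3.40 ->    ~30  (the "good 1B model on 5-10B tokens" band)
```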
+ # ============================================================================
+ # 3. Text generation evaluation
+ # ============================================================================
+
+ class GenerationEvaluator:
+     """Generates text from varied prompts to assess quality.
+
+     Evaluation criteria:
+       1) grammaticality: does it produce grammatical English sentences?
+       2) coherence: does it continue while keeping track of the context?
+       3) diversity: does the same prompt yield different outputs?
+       4) repetition avoidance: does it avoid repeating the same phrases?
+       5) knowledge: does knowledge from the training data show through?
+
+     Realistic expectations for a 1B model:
+     - grammatically correct English sentences ✅
+     - coherence within short paragraphs ✅
+     - complex reasoning or long logical chains ❌ (needs a larger model)
+     - factual accuracy is not guaranteed ⚠️
+     """
+
+     # test prompts spanning several domains
+     DEFAULT_PROMPTS = [
+         # ── General knowledge ──
+         "The theory of relativity states that",
+         "In the history of computer science,",
+         "The human brain is remarkable because",
+
+         # ── Explanation / education ──
+         "To understand machine learning, one must first",
+         "The water cycle begins when",
+         "Photosynthesis is the process by which",
+
+         # ── Narrative / story ──
+         "Once upon a time, in a small village near the mountains,",
+         "The detective looked at the evidence and realized that",
+
+         # ── Code / technical ──
+         "def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number.\"\"\"\n",
+         "The most important data structures in programming are",
+
+         # ── Short completion ──
+         "The capital of France is",
+         "Water boils at a temperature of",
+
+         # ── Long context ──
+         ("Artificial intelligence has transformed many industries. "
+          "In healthcare, AI is used for diagnosis and drug discovery. "
+          "In finance, it powers algorithmic trading and fraud detection. "
+          "Looking ahead, the most promising application of AI is"),
+     ]
+
+     def __init__(self, config: EvalConfig):
+         self.config = config
+
+     @torch.no_grad()
+     def generate_samples(
+         self,
+         model: nn.Module,
+         tokenizer: Any,
+         device: torch.device,
+         prompts: Optional[List[str]] = None,
+         verbose: bool = True,
+     ) -> List[Dict[str, Any]]:
+         """Generate text for each prompt.
+
+         Returns:
+             [{"prompt": str, "generations": [str, ...], "metrics": {...}}, ...]
+         """
+         model.eval()
+         prompts = prompts or self.DEFAULT_PROMPTS
+         results = []
+
+         if verbose:
+             print("\n" + "=" * 70)
+             print("📝 Text generation evaluation")
+             print("=" * 70)
+
+         for idx, prompt in enumerate(prompts):
+             prompt_results = {
+                 "prompt": prompt,
+                 "generations": [],
+                 "metrics": {},
+             }
+
+             if verbose:
+                 print(f"\n{'─'*60}")
+                 print(f"Prompt [{idx+1}/{len(prompts)}]:")
+                 print(f"  \"{prompt[:80]}{'...' if len(prompt) > 80 else ''}\"")
+                 print(f"{'─'*60}")
+
+             # encode the prompt
+             prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
+             input_tensor = torch.tensor([prompt_ids], dtype=torch.long, device=device)
+
+             all_texts = []
+             for sample_idx in range(self.config.num_samples):
+                 # generate
+                 generated_ids = model.generate(
+                     input_tensor,
+                     max_new_tokens=self.config.max_new_tokens,
+                     temperature=self.config.temperature,
+                     top_k=self.config.top_k,
+                     top_p=self.config.top_p,
+                 )
+
+                 # decode (only the part after the prompt)
+                 new_ids = generated_ids[0][len(prompt_ids):].tolist()
+                 generated_text = tokenizer.decode(new_ids)
+                 all_texts.append(generated_text)
+
+                 prompt_results["generations"].append(generated_text)
+
+                 if verbose:
+                     print(f"\n  ✍️ Generation #{sample_idx+1}:")
+                     # clean display (including line breaks)
+                     display_text = generated_text[:500]
351
+ for line in display_text.split("\n"):
352
+ print(f" {line}")
353
+ if len(generated_text) > 500:
354
+ print(f" ... (์ด {len(generated_text)} ๋ฌธ์ž)")
355
+
356
+ # ์ƒ์„ฑ ํ’ˆ์งˆ ๋ฉ”ํŠธ๋ฆญ
357
+ prompt_results["metrics"] = self._compute_generation_metrics(all_texts)
358
+
359
+ if verbose and prompt_results["metrics"]:
360
+ m = prompt_results["metrics"]
361
+ print(f"\n ๐Ÿ“Š ๋ฉ”ํŠธ๋ฆญ: "
362
+ f"ํ‰๊ท  ๊ธธ์ด={m['avg_length']:.0f}์ž, "
363
+ f"๋ฐ˜๋ณต๋ฅ ={m['repetition_rate']:.1%}, "
364
+ f"์–ดํœ˜ ๋‹ค์–‘์„ฑ={m['lexical_diversity']:.2f}")
365
+
366
+ results.append(prompt_results)
367
+
368
+ return results
369
+
370
+ @staticmethod
371
+ def _compute_generation_metrics(texts: List[str]) -> Dict[str, float]:
372
+ """์ƒ์„ฑ ํ…์ŠคํŠธ์˜ ํ’ˆ์งˆ ๋ฉ”ํŠธ๋ฆญ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
373
+
374
+ ๋ฉ”ํŠธ๋ฆญ:
375
+ - avg_length: ํ‰๊ท  ์ƒ์„ฑ ๊ธธ์ด (๋ฌธ์ž)
376
+ - avg_word_count: ํ‰๊ท  ๋‹จ์–ด ์ˆ˜
377
+ - repetition_rate: n-gram ๋ฐ˜๋ณต๋ฅ  (๋‚ฎ์„์ˆ˜๋ก ์ข‹์Œ)
378
+ - lexical_diversity: ๊ณ ์œ  ๋‹จ์–ด ๋น„์œจ (๋†’์„์ˆ˜๋ก ๋‹ค์–‘)
379
+ - sample_diversity: ์ƒ˜ํ”Œ ๊ฐ„ ๋‹ค์–‘์„ฑ (๋‹ค๋ฅธ ์ƒ์„ฑ๋ผ๋ฆฌ ์–ผ๋งˆ๋‚˜ ๋‹ค๋ฅธ๊ฐ€)
380
+ """
381
+ if not texts:
382
+ return {}
383
+
384
+ # ๊ธธ์ด
385
+ lengths = [len(t) for t in texts]
386
+ word_counts = [len(t.split()) for t in texts]
387
+
388
+ # ๋ฐ˜๋ณต๋ฅ  (4-gram ๊ธฐ์ค€)
389
+ rep_rates = []
390
+ for text in texts:
391
+ words = text.lower().split()
392
+ if len(words) < 4:
393
+ rep_rates.append(0.0)
394
+ continue
395
+ ngrams = [tuple(words[i:i+4]) for i in range(len(words)-3)]
396
+ unique_ratio = len(set(ngrams)) / len(ngrams) if ngrams else 1.0
397
+ rep_rates.append(1.0 - unique_ratio) # ๋ฐ˜๋ณต๋ฅ  = 1 - ๊ณ ์œ ๋น„์œจ
398
+
399
+ # ์–ดํœ˜ ๋‹ค์–‘์„ฑ (Type-Token Ratio)
400
+ diversities = []
401
+ for text in texts:
402
+ words = text.lower().split()
403
+ if words:
404
+ diversities.append(len(set(words)) / len(words))
405
+ else:
406
+ diversities.append(0.0)
407
+
408
+ # ์ƒ˜ํ”Œ ๊ฐ„ ๋‹ค์–‘์„ฑ (์ž์นด๋“œ ์œ ์‚ฌ๋„์˜ ์—ญ)
409
+ sample_div = 0.0
410
+ if len(texts) > 1:
411
+ word_sets = [set(t.lower().split()) for t in texts]
412
+ similarities = []
413
+ for i in range(len(word_sets)):
414
+ for j in range(i+1, len(word_sets)):
415
+ inter = len(word_sets[i] & word_sets[j])
416
+ union = len(word_sets[i] | word_sets[j])
417
+ if union > 0:
418
+ similarities.append(inter / union)
419
+ sample_div = 1.0 - (sum(similarities) / max(len(similarities), 1))
420
+
421
+ return {
422
+ "avg_length": sum(lengths) / len(lengths),
423
+ "avg_word_count": sum(word_counts) / len(word_counts),
424
+ "repetition_rate": sum(rep_rates) / len(rep_rates),
425
+ "lexical_diversity": sum(diversities) / len(diversities),
426
+ "sample_diversity": round(sample_div, 3),
427
+ }
428
+
429
+
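+ # Illustrative usage sketch for _compute_generation_metrics. The two strings
+ # are invented toy inputs: the second repeats the same 4-grams, so it should
+ # score a high repetition_rate and a low lexical_diversity.
+ def _generation_metrics_demo() -> Dict[str, float]:
+     texts = [
+         "the cat sat on the mat and watched the quiet street below",
+         "the cat sat on the mat the cat sat on the mat the cat sat",
+     ]
+     return GenerationEvaluator._compute_generation_metrics(texts)
+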
430
+ # ============================================================================
431
+ # 4. Scaling Law ๋ถ„์„๊ธฐ
432
+ # ============================================================================
433
+
434
+ class ScalingAnalyzer:
435
+ """10M โ†’ 100M โ†’ 1B ๋ชจ๋ธ์˜ Scaling Law๋ฅผ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.
436
+
437
+ Chinchilla Scaling Law (2022):
438
+ - ์ตœ์  ํ•™์Šต: ํ† ํฐ ์ˆ˜ โ‰ˆ 20 ร— ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜
439
+ - Loss(N, D) = E + A/N^ฮฑ + B/D^ฮฒ (N=ํŒŒ๋ผ๋ฏธํ„ฐ, D=๋ฐ์ดํ„ฐ)
+ - ฮฑ โ‰ˆ 0.34, ฮฒ โ‰ˆ 0.28 (Chinchilla ๊ธฐ์ค€; ฮฑ โ‰ˆ 0.076, ฮฒ โ‰ˆ 0.095๋Š” Kaplan et al. 2020)
441
+
442
+ ์ด ๋ถ„์„์˜ ๋ชฉ์ :
443
+ - ์šฐ๋ฆฌ ๋ชจ๋ธ์ด Scaling Law๋ฅผ ๋”ฐ๋ฅด๋Š”์ง€ ํ™•์ธ
444
+ - ๋” ํฐ ๋ชจ๋ธ/๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ์˜ ํšจ๊ณผ๋ฅผ ์˜ˆ์ธก
445
+ - ์ปดํ“จํŒ… ์ž์› ๋ฐฐ๋ถ„์˜ ์ตœ์ ์  ์ดํ•ด
446
+ """
447
+
448
+ def __init__(self, save_dir: str = "./eval_results"):
449
+ self.save_dir = Path(save_dir)
450
+ self.save_dir.mkdir(parents=True, exist_ok=True)
451
+
452
+ def analyze(
453
+ self,
454
+ model_results: List[Dict[str, Any]],
455
+ ) -> Dict[str, Any]:
456
+ """์—ฌ๋Ÿฌ ๋ชจ๋ธ ํฌ๊ธฐ์˜ ๊ฒฐ๊ณผ๋ฅผ ๋น„๊ต ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.
457
+
458
+ Args:
459
+ model_results: [
460
+ {"name": "10M", "params": 10e6, "tokens": 1e9, "loss": 4.2, "ppl": 66.7},
461
+ {"name": "100M", "params": 100e6, "tokens": 5e9, "loss": 3.5, "ppl": 33.1},
462
+ {"name": "1B", "params": 1.1e9, "tokens": 10e9,"loss": 3.0, "ppl": 20.1},
463
+ ]
464
+
465
+ Returns:
466
+ ๋ถ„์„ ๊ฒฐ๊ณผ ๋”•์…”๋„ˆ๋ฆฌ
467
+ """
468
+ if len(model_results) < 2:
469
+ print("โš ๏ธ Scaling ๋ถ„์„์—๋Š” ์ตœ์†Œ 2๊ฐœ ๋ชจ๋ธ ๊ฒฐ๊ณผ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.")
470
+ return {}
471
+
472
+ print("\n" + "=" * 70)
473
+ print("๐Ÿ“ˆ Scaling Law ๋ถ„์„")
474
+ print("=" * 70)
475
+
476
+ # โ”€โ”€ ๊ฒฐ๊ณผ ํ…Œ์ด๋ธ” โ”€โ”€
477
+ print(f"\n {'๋ชจ๋ธ':<8} {'ํŒŒ๋ผ๋ฏธํ„ฐ':>12} {'ํ† ํฐ':>10} {'Loss':>8} {'PPL':>8}")
478
+ print(f" {'โ”€'*52}")
479
+ for r in model_results:
480
+ params_str = f"{r['params']/1e6:.0f}M" if r["params"] < 1e9 else f"{r['params']/1e9:.1f}B"
481
+ tokens_str = f"{r['tokens']/1e9:.1f}B"
482
+ print(f" {r['name']:<8} {params_str:>12} {tokens_str:>10} {r['loss']:>8.4f} {r['ppl']:>8.2f}")
483
+
484
+ # โ”€โ”€ Scaling ํšจ์œจ ๊ณ„์‚ฐ โ”€โ”€
485
+ analysis = {"models": model_results, "scaling_efficiency": []}
486
+
487
+ for i in range(1, len(model_results)):
488
+ prev = model_results[i-1]
489
+ curr = model_results[i]
490
+
491
+ param_ratio = curr["params"] / prev["params"]
492
+ loss_reduction = prev["loss"] - curr["loss"]
493
+ ppl_reduction = (prev["ppl"] - curr["ppl"]) / prev["ppl"]
494
+
495
+ efficiency = {
496
+ "from": prev["name"],
497
+ "to": curr["name"],
498
+ "param_multiplier": round(param_ratio, 1),
499
+ "loss_reduction": round(loss_reduction, 4),
500
+ "ppl_reduction_pct": round(ppl_reduction * 100, 1),
501
+ }
502
+ analysis["scaling_efficiency"].append(efficiency)
503
+
504
+ print(f"\n {prev['name']} โ†’ {curr['name']}:")
505
+ print(f" ํŒŒ๋ผ๋ฏธํ„ฐ ร—{param_ratio:.1f}")
506
+ print(f" Loss ๊ฐ์†Œ: {loss_reduction:.4f}")
507
+ print(f" PPL ๊ฐ์†Œ: {ppl_reduction*100:.1f}%")
508
+
509
+ # โ”€โ”€ Chinchilla ์ตœ์ ์„ฑ ์ฒดํฌ โ”€โ”€
510
+ print(f"\n Chinchilla ์ตœ์ ์„ฑ ์ฒดํฌ (ํ† ํฐ โ‰ˆ 20 ร— ํŒŒ๋ผ๋ฏธํ„ฐ):")
511
+ for r in model_results:
513
+ actual_ratio = r["tokens"] / r["params"]
514
+ status = "โœ… ์ตœ์  ๋ฒ”์œ„" if 15 <= actual_ratio <= 25 else "โš ๏ธ ๋ฒ”์œ„ ๋ฐ–"
515
+ print(f" {r['name']}: ํ† ํฐ/ํŒŒ๋ผ๋ฏธํ„ฐ = {actual_ratio:.1f}x "
516
+ f"(์ตœ์ : 20x) {status}")
517
+
518
+ analysis["chinchilla_ratios"] = [
519
+ {"name": r["name"], "ratio": round(r["tokens"] / r["params"], 1)}
520
+ for r in model_results
521
+ ]
522
+
523
+ return analysis
524
+
525
+ def plot_scaling_curves(
526
+ self,
527
+ model_results: List[Dict[str, Any]],
528
+ save_path: Optional[str] = None,
529
+ ):
530
+ """Scaling ๊ณก์„ ์„ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค."""
531
+ if not HAS_MATPLOTLIB or not HAS_NUMPY:
532
+ print("โš ๏ธ matplotlib/numpy๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค: pip install matplotlib numpy")
533
+ return
534
+
535
+ fig, axes = plt.subplots(1, 2, figsize=(14, 5))
536
+
537
+ params = [r["params"] for r in model_results]
538
+ losses = [r["loss"] for r in model_results]
539
+ ppls = [r["ppl"] for r in model_results]
540
+ names = [r["name"] for r in model_results]
541
+
542
+ # โ”€โ”€ Loss vs Parameters (log-log) โ”€โ”€
543
+ ax = axes[0]
544
+ ax.loglog(params, losses, "o-", color="#2563eb", linewidth=2, markersize=10)
545
+ for p, l, n in zip(params, losses, names):
546
+ ax.annotate(f" {n}\n Loss={l:.2f}", (p, l), fontsize=9)
547
+ ax.set_xlabel("Parameters", fontsize=12)
548
+ ax.set_ylabel("Validation Loss", fontsize=12)
549
+ ax.set_title("Loss vs Model Size (log-log)", fontsize=13, fontweight="bold")
550
+ ax.grid(True, alpha=0.3)
551
+
552
+ # โ”€โ”€ PPL vs Parameters (log-log) โ”€โ”€
553
+ ax = axes[1]
554
+ ax.loglog(params, ppls, "s-", color="#dc2626", linewidth=2, markersize=10)
555
+ for p, pp, n in zip(params, ppls, names):
556
+ ax.annotate(f" {n}\n PPL={pp:.1f}", (p, pp), fontsize=9)
557
+ ax.set_xlabel("Parameters", fontsize=12)
558
+ ax.set_ylabel("Perplexity", fontsize=12)
559
+ ax.set_title("Perplexity vs Model Size (log-log)", fontsize=13, fontweight="bold")
560
+ ax.grid(True, alpha=0.3)
561
+
562
+ plt.tight_layout()
563
+
564
+ save_path = save_path or str(self.save_dir / "scaling_curves.png")
565
+ fig.savefig(save_path, dpi=150, bbox_inches="tight")
566
+ print(f"\n ๐Ÿ“Š Scaling ๊ณก์„  ์ €์žฅ: {save_path}")
567
+ plt.close(fig)
568
+
569
+
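+ # Illustrative helper (our own sketch, not code from the Chinchilla paper):
+ # the "tokens ~ 20 x params" rule of thumb behind the optimality check above.
+ def chinchilla_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
+     """Rough compute-optimal token budget for a given parameter count."""
+     return num_params * tokens_per_param
+
+ # Example: chinchilla_optimal_tokens(1.1e9) -> 2.2e10, i.e. ~22B tokens for the
+ # 1.1B model, which is the center of the 15x-25x band checked in analyze().
+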
570
+ # ============================================================================
571
+ # 5. ํ•™์Šต ์—ญํ•™ ๋ถ„์„๊ธฐ
572
+ # ============================================================================
573
+
574
+ class TrainingDynamicsAnalyzer:
575
+ """ํ•™์Šต ๊ณผ์ •์˜ ๋ฉ”ํŠธ๋ฆญ์„ ๋ถ„์„ํ•˜๊ณ  ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
576
+
577
+ ๋ถ„์„ ํ•ญ๋ชฉ:
578
+ - Loss ๊ณก์„ : ์ˆ˜๋ ด ํŒจํ„ด, ์ŠคํŒŒ์ดํฌ ๊ฐ์ง€
579
+ - LR ์Šค์ผ€์ค„: Warmup + Cosine decay ํ™•์ธ
580
+ - Gradient Norm: ํ•™์Šต ์•ˆ์ •์„ฑ, ํญ๋ฐœ/์†Œ๋ฉธ ๊ฐ์ง€
581
+ - ์ฒ˜๋ฆฌ๋Ÿ‰: tokens/sec ์•ˆ์ •์„ฑ, ๋ณ‘๋ชฉ ๊ฐ์ง€
582
+ """
583
+
584
+ def __init__(self, save_dir: str = "./eval_results"):
585
+ self.save_dir = Path(save_dir)
586
+ self.save_dir.mkdir(parents=True, exist_ok=True)
587
+
588
+ def analyze_metrics(self, metrics_history: Dict[str, list]) -> Dict[str, Any]:
589
+ """ํ•™์Šต ๋ฉ”ํŠธ๋ฆญ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.
590
+
591
+ Args:
592
+ metrics_history: Trainer.metrics.history ๋”•์…”๋„ˆ๋ฆฌ
593
+
594
+ Returns:
595
+ ๋ถ„์„ ๊ฒฐ๊ณผ
596
+ """
597
+ print("\n" + "=" * 70)
598
+ print("๐Ÿ”ฌ ํ•™์Šต ์—ญํ•™ ๋ถ„์„")
599
+ print("=" * 70)
600
+
601
+ analysis = {}
602
+
603
+ # โ”€โ”€ Loss ๋ถ„์„ โ”€โ”€
604
+ if metrics_history.get("train_loss"):
605
+ losses = metrics_history["train_loss"]
606
+ analysis["loss"] = {
607
+ "initial": round(losses[0], 4),
608
+ "final": round(losses[-1], 4),
609
+ "minimum": round(min(losses), 4),
610
+ "total_reduction": round(losses[0] - losses[-1], 4),
611
+ }
612
+
613
+ # ์ŠคํŒŒ์ดํฌ ๊ฐ์ง€ (์ด์ „ ๊ฐ’ ๋Œ€๋น„ 50% ์ด์ƒ ๊ธ‰์ฆ)
614
+ spikes = []
615
+ for i in range(1, len(losses)):
616
+ if losses[i] > losses[i-1] * 1.5:
617
+ step = metrics_history["step"][i] if "step" in metrics_history else i
618
+ spikes.append({"step": step, "loss": round(losses[i], 4)})
619
+
620
+ analysis["loss"]["spikes"] = spikes
621
+
622
+ print(f"\n ๐Ÿ“‰ Loss ๋ถ„์„:")
623
+ print(f" ์ดˆ๊ธฐ: {analysis['loss']['initial']:.4f}")
624
+ print(f" ์ตœ์ข…: {analysis['loss']['final']:.4f}")
625
+ print(f" ์ตœ์†Œ: {analysis['loss']['minimum']:.4f}")
626
+ print(f" ๊ฐ์†Œ: {analysis['loss']['total_reduction']:.4f}")
627
+ print(f" ์ŠคํŒŒ์ดํฌ: {len(spikes)}ํšŒ")
628
+ if spikes:
629
+ for s in spikes[:5]:
630
+ print(f" Step {s['step']}: Loss = {s['loss']}")
631
+
632
+ # โ”€โ”€ Gradient Norm ๋ถ„์„ โ”€โ”€
633
+ if metrics_history.get("grad_norm"):
634
+ gnorms = metrics_history["grad_norm"]
635
+ analysis["grad_norm"] = {
636
+ "mean": round(sum(gnorms) / len(gnorms), 4),
637
+ "max": round(max(gnorms), 4),
638
+ "min": round(min(gnorms), 4),
639
+ "clipped_pct": round(sum(1 for g in gnorms if g >= 0.99) / len(gnorms) * 100, 1),
640
+ }
641
+
642
+ print(f"\n ๐Ÿ“ Gradient Norm ๋ถ„์„:")
643
+ print(f" ํ‰๊ท : {analysis['grad_norm']['mean']:.4f}")
644
+ print(f" ์ตœ๋Œ€: {analysis['grad_norm']['max']:.4f}")
645
+ print(f" ํด๋ฆฌํ•‘ ๋น„์œจ: {analysis['grad_norm']['clipped_pct']:.1f}%")
646
+ if analysis["grad_norm"]["clipped_pct"] > 30:
647
+ print(f" โš ๏ธ ํด๋ฆฌํ•‘์ด ์žฆ์Œ โ†’ LR ํ•˜ํ–ฅ ๋˜๋Š” warmup ์—ฐ์žฅ ๊ณ ๋ ค")
648
+
649
+ # โ”€โ”€ ์ฒ˜๋ฆฌ๋Ÿ‰ ๋ถ„์„ โ”€โ”€
650
+ if metrics_history.get("tokens_per_sec"):
651
+ tps = metrics_history["tokens_per_sec"]
652
+ tps_valid = [t for t in tps if t > 0]
+ if tps_valid:
+ mean_tps = sum(tps_valid) / len(tps_valid)  # compute the mean once instead of per-term
+ analysis["throughput"] = {
+ "mean": round(mean_tps),
+ "std": round((sum((t - mean_tps) ** 2 for t in tps_valid) / len(tps_valid)) ** 0.5),
+ "min": round(min(tps_valid)),
+ "max": round(max(tps_valid)),
+ }
660
+
661
+ print(f"\n โšก ์ฒ˜๋ฆฌ๋Ÿ‰ ๋ถ„์„:")
662
+ print(f" ํ‰๊ท : {analysis['throughput']['mean']:,} tokens/sec")
663
+ print(f" ํ‘œ์ค€ํŽธ์ฐจ: {analysis['throughput']['std']:,}")
664
+ print(f" ๋ฒ”์œ„: [{analysis['throughput']['min']:,}, {analysis['throughput']['max']:,}]")
665
+
666
+ return analysis
667
+
668
+ def plot_training_curves(
669
+ self,
670
+ metrics_history: Dict[str, list],
671
+ save_path: Optional[str] = None,
672
+ ):
673
+ """ํ•™์Šต ๊ณก์„ ์„ 4-panel ์ฐจํŠธ๋กœ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค."""
674
+ if not HAS_MATPLOTLIB:
675
+ print("โš ๏ธ matplotlib๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค: pip install matplotlib")
676
+ return
677
+
678
+ fig, axes = plt.subplots(2, 2, figsize=(16, 10))
679
+ fig.suptitle("Training Dynamics", fontsize=16, fontweight="bold")
680
+
681
+ steps = metrics_history.get("step", list(range(len(metrics_history.get("train_loss", [])))))
682
+
683
+ # โ”€โ”€ (1) Loss โ”€โ”€
684
+ ax = axes[0, 0]
685
+ if metrics_history.get("train_loss"):
686
+ ax.plot(steps[:len(metrics_history["train_loss"])],
687
+ metrics_history["train_loss"],
688
+ color="#2563eb", alpha=0.6, linewidth=0.8, label="Train Loss")
689
+
690
+ # ์ด๋™ ํ‰๊ท  (์Šค๋ฌด๋”ฉ)
691
+ if len(metrics_history["train_loss"]) > 20:
692
+ window = min(50, len(metrics_history["train_loss"]) // 5)
693
+ smoothed = self._moving_average(metrics_history["train_loss"], window)
694
+ ax.plot(steps[window-1:len(smoothed)+window-1],
695
+ smoothed, color="#1d4ed8", linewidth=2, label=f"Smoothed (window={window})")
696
+
697
+ if metrics_history.get("val_loss"):
698
+ val_steps = [steps[i] for i in range(0, len(steps),
699
+ max(1, len(steps)//len(metrics_history["val_loss"])))][:len(metrics_history["val_loss"])]
700
+ ax.plot(val_steps, metrics_history["val_loss"],
701
+ "o-", color="#dc2626", linewidth=2, markersize=5, label="Val Loss")
702
+
703
+ ax.set_xlabel("Step")
704
+ ax.set_ylabel("Loss")
705
+ ax.set_title("Training & Validation Loss")
706
+ ax.legend()
707
+ ax.grid(True, alpha=0.3)
708
+
709
+ # โ”€โ”€ (2) Learning Rate โ”€โ”€
710
+ ax = axes[0, 1]
711
+ if metrics_history.get("learning_rate"):
712
+ ax.plot(steps[:len(metrics_history["learning_rate"])],
713
+ metrics_history["learning_rate"],
714
+ color="#059669", linewidth=2)
715
+ ax.set_xlabel("Step")
716
+ ax.set_ylabel("Learning Rate")
717
+ ax.set_title("Learning Rate Schedule")
718
+ ax.ticklabel_format(style="scientific", axis="y", scilimits=(0,0))
719
+ ax.grid(True, alpha=0.3)
720
+
721
+ # โ”€โ”€ (3) Gradient Norm โ”€โ”€
722
+ ax = axes[1, 0]
723
+ if metrics_history.get("grad_norm"):
724
+ ax.plot(steps[:len(metrics_history["grad_norm"])],
725
+ metrics_history["grad_norm"],
726
+ color="#d97706", alpha=0.6, linewidth=0.8)
727
+ ax.axhline(y=1.0, color="red", linestyle="--", alpha=0.5, label="Clip threshold")
728
+ ax.legend()
729
+ ax.set_xlabel("Step")
730
+ ax.set_ylabel("Gradient Norm")
731
+ ax.set_title("Gradient Norm (clipped at 1.0)")
732
+ ax.grid(True, alpha=0.3)
733
+
734
+ # โ”€โ”€ (4) Throughput โ”€โ”€
735
+ ax = axes[1, 1]
736
+ if metrics_history.get("tokens_per_sec"):
737
+ tps = metrics_history["tokens_per_sec"]
738
+ ax.plot(steps[:len(tps)], tps, color="#7c3aed", alpha=0.6, linewidth=0.8)
739
+ if tps:
740
+ avg_tps = sum(tps) / len(tps)
741
+ ax.axhline(y=avg_tps, color="#7c3aed", linestyle="--", alpha=0.5,
742
+ label=f"Avg: {avg_tps:,.0f}")
743
+ ax.legend()
744
+ ax.set_xlabel("Step")
745
+ ax.set_ylabel("Tokens/sec")
746
+ ax.set_title("Training Throughput")
747
+ ax.grid(True, alpha=0.3)
748
+
749
+ plt.tight_layout()
750
+
751
+ save_path = save_path or str(self.save_dir / "training_curves.png")
752
+ fig.savefig(save_path, dpi=150, bbox_inches="tight")
753
+ print(f"\n ๐Ÿ“Š ํ•™์Šต ๊ณก์„  ์ €์žฅ: {save_path}")
754
+ plt.close(fig)
755
+
756
+ def plot_position_loss(
757
+ self,
758
+ position_losses: List[float],
759
+ save_path: Optional[str] = None,
760
+ ):
761
+ """์œ„์น˜๋ณ„ Loss ๋ถ„ํฌ๋ฅผ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค."""
762
+ if not HAS_MATPLOTLIB:
763
+ return
764
+
765
+ fig, ax = plt.subplots(figsize=(12, 5))
766
+
767
+ positions = list(range(len(position_losses)))
768
+ ax.plot(positions, position_losses, color="#2563eb", linewidth=1.5)
769
+ ax.fill_between(positions, position_losses, alpha=0.1, color="#2563eb")
770
+
771
+ ax.set_xlabel("Position in Sequence", fontsize=12)
772
+ ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
773
+ ax.set_title("Loss by Position (earlier positions have less context)", fontsize=13, fontweight="bold")
774
+ ax.grid(True, alpha=0.3)
775
+
776
+ # ์ฃผ์š” ๊ตฌ๊ฐ„ ํ‘œ์‹œ
777
+ if len(position_losses) > 100:
778
+ early_avg = sum(position_losses[:50]) / 50
+ late = position_losses[-200:]
+ late_avg = sum(late) / len(late)  # safe even if the sequence is shorter than 200
780
+ ax.axhline(y=early_avg, color="red", linestyle="--", alpha=0.4,
781
+ label=f"Early avg (0-50): {early_avg:.2f}")
782
+ ax.axhline(y=late_avg, color="green", linestyle="--", alpha=0.4,
783
+ label=f"Late avg (-200): {late_avg:.2f}")
784
+ ax.legend()
785
+
786
+ plt.tight_layout()
787
+
788
+ save_path = save_path or str(self.save_dir / "position_loss.png")
789
+ fig.savefig(save_path, dpi=150, bbox_inches="tight")
790
+ print(f" ๐Ÿ“Š ์œ„์น˜๋ณ„ Loss ์ €์žฅ: {save_path}")
791
+ plt.close(fig)
792
+
793
+ @staticmethod
794
+ def _moving_average(data: list, window: int) -> list:
795
+ """์ด๋™ ํ‰๊ท  ๊ณ„์‚ฐ."""
796
+ result = []
797
+ for i in range(window - 1, len(data)):
798
+ avg = sum(data[i - window + 1 : i + 1]) / window
799
+ result.append(avg)
800
+ return result
801
+
802
+
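+ # Illustrative sketch of the Warmup + Cosine schedule this analyzer expects to
+ # see in the LR curve. This is our own minimal version with assumed default
+ # values; the actual Trainer scheduler may differ in details.
+ def _warmup_cosine_lr_demo(step: int, max_lr: float = 3e-4,
+                            warmup_steps: int = 2000,
+                            total_steps: int = 100_000,
+                            min_lr: float = 3e-5) -> float:
+     if step < warmup_steps:                           # linear warmup from ~0
+         return max_lr * (step + 1) / warmup_steps
+     progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
+     cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
+     return min_lr + (max_lr - min_lr) * cosine        # decays toward min_lr
+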
803
+ # ============================================================================
804
+ # 6. Attention ์‹œ๊ฐํ™”
805
+ # ============================================================================
806
+
807
+ class AttentionVisualizer:
808
+ """Attention ํŒจํ„ด์„ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
809
+
810
+ ํ•™์Šต ํฌ์ธํŠธ:
811
+ - Causal Mask: ํ•˜์‚ผ๊ฐ ํŒจํ„ด (๋ฏธ๋ž˜ ํ† ํฐ์€ ๋ณผ ์ˆ˜ ์—†์Œ)
812
+ - ํ—ค๋“œ๋ณ„ ์—ญํ•  ๋ถ„ํ™”: ์ผ๋ถ€๋Š” ๋กœ์ปฌ(์ธ์ ‘), ์ผ๋ถ€๋Š” ๊ธ€๋กœ๋ฒŒ(๋จผ ํ† ํฐ) ์ฃผ๋ชฉ
813
+ - ๊ตฌ๋ฌธ๋ก ์  ํŒจํ„ด: ๋™์‚ฌโ†’์ฃผ์–ด, ๋Œ€๋ช…์‚ฌโ†’์„ ํ–‰์‚ฌ ๋“ฑ์— ๋†’์€ attention
814
+
815
+ ์ฃผ์˜: 1B ๋ชจ๋ธ์˜ ์ „์ฒด attention์„ ์ €์žฅํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ!
816
+ โ†’ ํŠน์ • ๋ ˆ์ด์–ด/ํ—ค๋“œ๋งŒ ์„ ํƒ์ ์œผ๋กœ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
817
+ """
818
+
819
+ def __init__(self, save_dir: str = "./eval_results"):
820
+ self.save_dir = Path(save_dir)
821
+ self.save_dir.mkdir(parents=True, exist_ok=True)
822
+
823
+ @torch.no_grad()
824
+ def extract_attention(
825
+ self,
826
+ model: nn.Module,
827
+ input_ids: torch.Tensor,
828
+ layer_idx: int = 0,
829
+ device: torch.device = torch.device("cpu"),
830
+ ) -> torch.Tensor:
831
+ """ํŠน์ • ๋ ˆ์ด์–ด์˜ attention weight๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
832
+
833
+ ๋ชจ๋ธ์˜ attention ๋ชจ๋“ˆ์„ ์ผ์‹œ์ ์œผ๋กœ ์ˆ˜์ •ํ•˜์—ฌ
834
+ attention weight๋ฅผ ์บก์ฒ˜ํ•ฉ๋‹ˆ๋‹ค.
835
+
836
+ Returns:
837
+ attention_weights: (num_heads, seq_len, seq_len)
838
+ """
839
+ model.eval()
840
+ captured_attn = {}
841
+
842
+ # Temporarily swap out the layer's forward (a monkey-patch, not a torch hook) to capture weights
843
+ target_layer = model.layers[layer_idx].attention
844
+
845
+ # scaled_dot_product_attention์„ ์ˆ˜๋™ ๊ตฌํ˜„์œผ๋กœ ๋Œ€์ฒด
846
+ original_forward = target_layer.forward
847
+
848
+ def hooked_forward(x, mask=None, position_offset=0):
849
+ B, S, _ = x.shape
850
+ hd = target_layer.head_dim
851
+
852
+ q = target_layer.q_proj(x).view(B, S, target_layer.num_heads, hd).transpose(1, 2)
853
+ k = target_layer.k_proj(x).view(B, S, target_layer.num_kv_heads, hd).transpose(1, 2)
854
+ v = target_layer.v_proj(x).view(B, S, target_layer.num_kv_heads, hd).transpose(1, 2)
855
+
856
+ q, k = target_layer.rope(q, k, position_offset)
857
+
858
+ if target_layer.num_kv_groups > 1:
859
+ k = target_layer._repeat_kv(k)
860
+ v = target_layer._repeat_kv(v)
861
+
862
+ # ์ˆ˜๋™ attention ๊ณ„์‚ฐ (weight ์ถ”์ถœ์šฉ)
863
+ scale = 1.0 / math.sqrt(hd)
864
+ scores = torch.matmul(q, k.transpose(-2, -1)) * scale
865
+
866
+ # Causal mask
867
+ causal = torch.triu(torch.ones(S, S, device=x.device, dtype=torch.bool), diagonal=1)
868
+ scores.masked_fill_(causal.unsqueeze(0).unsqueeze(0), float("-inf"))
869
+
870
+ attn_weights = F.softmax(scores, dim=-1)
871
+ captured_attn["weights"] = attn_weights[0].cpu() # ์ฒซ ๋ฐฐ์น˜๋งŒ
872
+
873
+ out = torch.matmul(attn_weights, v)
874
+ out = out.transpose(1, 2).contiguous().view(B, S, -1)
875
+ return target_layer.o_proj(out)
876
+
877
+ # Install the patched forward
878
+ target_layer.forward = hooked_forward
879
+
880
+ try:
881
+ model(input_ids.to(device))
882
+ finally:
883
+ target_layer.forward = original_forward
884
+
885
+ return captured_attn.get("weights") # (num_heads, S, S)
886
+
887
+ def plot_attention_heatmap(
888
+ self,
889
+ attn_weights: torch.Tensor,
890
+ tokens: List[str],
891
+ head_idx: int = 0,
892
+ save_path: Optional[str] = None,
893
+ title: str = "Attention Weights",
894
+ ):
895
+ """Attention heatmap์„ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค."""
896
+ if not HAS_MATPLOTLIB:
897
+ print("โš ๏ธ matplotlib๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค")
898
+ return
899
+
900
+ weights = attn_weights[head_idx].numpy()
901
+ max_len = min(len(tokens), 50) # ์ตœ๋Œ€ 50 ํ† ํฐ๋งŒ ํ‘œ์‹œ
902
+ weights = weights[:max_len, :max_len]
903
+ display_tokens = tokens[:max_len]
904
+
905
+ fig, ax = plt.subplots(figsize=(12, 10))
906
+ im = ax.imshow(weights, cmap="Blues", aspect="auto")
907
+
908
+ ax.set_xticks(range(max_len))
909
+ ax.set_yticks(range(max_len))
910
+ ax.set_xticklabels(display_tokens, rotation=90, fontsize=7)
911
+ ax.set_yticklabels(display_tokens, fontsize=7)
912
+
913
+ ax.set_xlabel("Key (attended to)", fontsize=11)
914
+ ax.set_ylabel("Query (attending from)", fontsize=11)
915
+ ax.set_title(f"{title} โ€” Head {head_idx}", fontsize=13, fontweight="bold")
916
+
917
+ fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
918
+ plt.tight_layout()
919
+
920
+ save_path = save_path or str(self.save_dir / f"attention_head{head_idx}.png")
921
+ fig.savefig(save_path, dpi=150, bbox_inches="tight")
922
+ print(f" ๐Ÿ“Š Attention ์‹œ๊ฐํ™” ์ €์žฅ: {save_path}")
923
+ plt.close(fig)
924
+
925
+ def plot_multi_head_summary(
926
+ self,
927
+ attn_weights: torch.Tensor,
928
+ num_heads_to_show: int = 8,
929
+ save_path: Optional[str] = None,
930
+ ):
931
+ """์—ฌ๋Ÿฌ ํ—ค๋“œ์˜ attention ํŒจํ„ด์„ ์š”์•ฝ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค."""
932
+ if not HAS_MATPLOTLIB:
933
+ return
934
+
935
+ n_heads = min(attn_weights.shape[0], num_heads_to_show)
936
+ cols = 4
937
+ rows = math.ceil(n_heads / cols)
938
+
939
+ fig, axes = plt.subplots(rows, cols, figsize=(16, 4 * rows))
940
+ if rows == 1:
941
+ axes = axes.reshape(1, -1)
942
+
943
+ for idx in range(n_heads):
944
+ r, c = idx // cols, idx % cols
945
+ ax = axes[r, c]
946
+ w = attn_weights[idx].numpy()
947
+ ax.imshow(w, cmap="Blues", aspect="auto")
948
+ ax.set_title(f"Head {idx}", fontsize=10)
949
+ ax.set_xticks([])
950
+ ax.set_yticks([])
951
+
952
+ # ๋นˆ subplot ์ˆจ๊ธฐ๊ธฐ
953
+ for idx in range(n_heads, rows * cols):
954
+ r, c = idx // cols, idx % cols
955
+ axes[r, c].axis("off")
956
+
957
+ fig.suptitle("Attention Patterns by Head", fontsize=14, fontweight="bold")
958
+ plt.tight_layout()
959
+
960
+ save_path = save_path or str(self.save_dir / "attention_multi_head.png")
961
+ fig.savefig(save_path, dpi=150, bbox_inches="tight")
962
+ print(f" ๐Ÿ“Š ๋ฉ€ํ‹ฐ ํ—ค๋“œ ์š”์•ฝ ์ €์žฅ: {save_path}")
963
+ plt.close(fig)
964
+
965
+
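+ # Illustrative sketch (toy tensors only) of why the heatmaps above are lower
+ # triangular: masking future positions with -inf before softmax zeroes them out.
+ def _causal_mask_demo(seq_len: int = 4) -> torch.Tensor:
+     scores = torch.zeros(seq_len, seq_len)            # identical raw scores
+     future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
+     scores = scores.masked_fill(future, float("-inf"))
+     attn = F.softmax(scores, dim=-1)
+     # Row i spreads its weight uniformly over positions 0..i; the upper
+     # triangle is exactly zero and every row still sums to 1.
+     return attn
+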
966
+ # ============================================================================
967
+ # 7. ์ข…ํ•ฉ ํ‰๊ฐ€ ์‹คํ–‰๊ธฐ
968
+ # ============================================================================
969
+
970
+ class FullEvaluator:
971
+ """๋ชจ๋“  ํ‰๊ฐ€๋ฅผ ํ•œ ๋ฒˆ์— ์‹คํ–‰ํ•˜๊ณ  ๋ฆฌํฌํŠธ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
972
+
973
+ ์‚ฌ์šฉ๋ฒ•:
974
+ ```python
975
+ evaluator = FullEvaluator(model, tokenizer, val_dataloader, device)
976
+ report = evaluator.run_full_evaluation()
977
+ ```
978
+ """
979
+
980
+ def __init__(
981
+ self,
982
+ model: nn.Module,
983
+ tokenizer: Any,
984
+ val_dataloader: DataLoader,
985
+ device: torch.device,
986
+ config: Optional[EvalConfig] = None,
987
+ dtype: torch.dtype = torch.bfloat16,
988
+ metrics_history: Optional[Dict[str, list]] = None,
989
+ ):
990
+ self.model = model
991
+ self.tokenizer = tokenizer
992
+ self.val_dataloader = val_dataloader
993
+ self.device = device
994
+ self.config = config or EvalConfig()
995
+ self.dtype = dtype
996
+ self.metrics_history = metrics_history
997
+
998
+ self.save_dir = Path(self.config.save_dir)
999
+ self.save_dir.mkdir(parents=True, exist_ok=True)
1000
+
1001
+ def run_full_evaluation(self) -> Dict[str, Any]:
1002
+ """์ „์ฒด ํ‰๊ฐ€๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค."""
1003
+ report = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}
1004
+
1005
+ print("\n" + "=" * 70)
1006
+ print("๐Ÿ” ์ข…ํ•ฉ ํ‰๊ฐ€ ์‹œ์ž‘")
1007
+ print("=" * 70)
1008
+
1009
+ # โ”€โ”€ 1. Perplexity โ”€โ”€
1010
+ print("\n" + "โ”" * 40)
1011
+ print("Phase 1/4: Perplexity ์ธก์ •")
1012
+ print("โ”" * 40)
1013
+ ppl_evaluator = PerplexityEvaluator(self.config)
1014
+ report["perplexity"] = ppl_evaluator.evaluate(
1015
+ self.model, self.val_dataloader, self.device, self.dtype
1016
+ )
1017
+
1018
+ # ์œ„์น˜๋ณ„ Loss
1019
+ print("\n ์œ„์น˜๋ณ„ Loss ์ธก์ • ์ค‘...")
1020
+ position_losses = ppl_evaluator.evaluate_per_position(
1021
+ self.model, self.val_dataloader, self.device, self.dtype
1022
+ )
1023
+ report["position_losses"] = {
1024
+ "early_avg": round(sum(position_losses[:50]) / max(len(position_losses[:50]), 1), 4),
1025
+ "late_avg": round(sum(position_losses[-200:]) / max(len(position_losses[-200:]), 1), 4),
1026
+ }
1027
+
1028
+ # ์œ„์น˜๋ณ„ Loss ์‹œ๊ฐํ™”
1029
+ dynamics = TrainingDynamicsAnalyzer(str(self.save_dir))
1030
+ dynamics.plot_position_loss(position_losses, str(self.save_dir / "position_loss.png"))
1031
+
1032
+ # โ”€โ”€ 2. ํ…์ŠคํŠธ ์ƒ์„ฑ โ”€โ”€
1033
+ print("\n" + "โ”" * 40)
1034
+ print("Phase 2/4: ํ…์ŠคํŠธ ์ƒ์„ฑ")
1035
+ print("โ”" * 40)
1036
+ gen_evaluator = GenerationEvaluator(self.config)
1037
+ gen_results = gen_evaluator.generate_samples(
1038
+ self.model, self.tokenizer, self.device
1039
+ )
1040
+ report["generation"] = {
1041
+ "num_prompts": len(gen_results),
1042
+ "avg_metrics": self._average_gen_metrics(gen_results),
1043
+ }
1044
+
1045
+ # โ”€โ”€ 3. ํ•™์Šต ์—ญํ•™ ๋ถ„์„ โ”€โ”€
1046
+ if self.metrics_history:
1047
+ print("\n" + "โ”" * 40)
1048
+ print("Phase 3/4: ํ•™์Šต ์—ญํ•™ ๋ถ„์„")
1049
+ print("โ”" * 40)
1050
+ report["training_dynamics"] = dynamics.analyze_metrics(self.metrics_history)
1051
+ dynamics.plot_training_curves(self.metrics_history,
1052
+ str(self.save_dir / "training_curves.png"))
1053
+ else:
1054
+ print("\n Phase 3/4: ๊ฑด๋„ˆ๋œ€ (metrics_history ์—†์Œ)")
1055
+
1056
+ # โ”€โ”€ 4. Attention ์‹œ๊ฐํ™” (์ƒ˜ํ”Œ) โ”€โ”€
1057
+ print("\n" + "โ”" * 40)
1058
+ print("Phase 4/4: Attention ์‹œ๊ฐํ™”")
1059
+ print("โ”" * 40)
1060
+ try:
1061
+ self._visualize_attention_sample()
1062
+ except Exception as e:
1063
+ print(f" โš ๏ธ Attention ์‹œ๊ฐํ™” ์‹คํŒจ: {e}")
1064
+
1065
+ # โ”€โ”€ ๋ฆฌํฌํŠธ ์ €์žฅ โ”€โ”€
1066
+ report_path = self.save_dir / "eval_report.json"
1067
+ with open(report_path, "w") as f:
1068
+ json.dump(report, f, indent=2, default=str)
1069
+ print(f"\n๐Ÿ“‹ ๋ฆฌํฌํŠธ ์ €์žฅ: {report_path}")
1070
+
1071
+ # โ”€โ”€ ์š”์•ฝ ์ถœ๋ ฅ โ”€โ”€
1072
+ self._print_summary(report)
1073
+
1074
+ return report
1075
+
1076
+ def _visualize_attention_sample(self):
1077
+ """์ƒ˜ํ”Œ ํ…์ŠคํŠธ๋กœ attention์„ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค."""
1078
+ viz = AttentionVisualizer(str(self.save_dir))
1079
+
1080
+ sample_text = "The cat sat on the mat and looked at the bird."
1081
+ token_ids = self.tokenizer.encode(sample_text, add_special_tokens=False)
1082
+ input_tensor = torch.tensor([token_ids], dtype=torch.long)
1083
+
1084
+ # ํ† ํฐ ๋ฌธ์ž์—ด (์‹œ๊ฐํ™” ๋ผ๋ฒจ์šฉ)
1085
+ tokens_str = []
1086
+ for tid in token_ids:
1087
+ decoded = self.tokenizer.decode([tid])
1088
+ tokens_str.append(decoded.replace("\n", "\\n"))
1089
+
1090
+ # Layer 0 attention ์ถ”์ถœ
1091
+ attn_weights = viz.extract_attention(
1092
+ self.model, input_tensor, layer_idx=0, device=self.device
1093
+ )
1094
+
1095
+ if attn_weights is not None:
1096
+ viz.plot_attention_heatmap(
1097
+ attn_weights, tokens_str, head_idx=0,
1098
+ title="Layer 0 Attention"
1099
+ )
1100
+ viz.plot_multi_head_summary(attn_weights)
1101
+
1102
+ @staticmethod
1103
+ def _average_gen_metrics(gen_results: List[Dict]) -> Dict[str, float]:
1104
+ """๋ชจ๋“  ํ”„๋กฌํ”„ํŠธ์˜ ์ƒ์„ฑ ๋ฉ”ํŠธ๋ฆญ ํ‰๊ท ."""
1105
+ if not gen_results:
1106
+ return {}
1107
+
1108
+ all_metrics = [r["metrics"] for r in gen_results if r.get("metrics")]
1109
+ if not all_metrics:
1110
+ return {}
1111
+
1112
+ keys = all_metrics[0].keys()
1113
+ return {
1114
+ k: round(sum(m.get(k, 0) for m in all_metrics) / len(all_metrics), 3)
1115
+ for k in keys
1116
+ }
1117
+
1118
+ def _print_summary(self, report: Dict[str, Any]):
1119
+ """์ตœ์ข… ์š”์•ฝ์„ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค."""
1120
+ print("\n" + "=" * 70)
1121
+ print("๐Ÿ“‹ ํ‰๊ฐ€ ์š”์•ฝ ๋ฆฌํฌํŠธ")
1122
+ print("=" * 70)
1123
+
1124
+ # Perplexity
1125
+ if "perplexity" in report:
1126
+ ppl = report["perplexity"]
1127
+ print(f"\n ๐ŸŽฏ Perplexity:")
1128
+ print(f" Loss: {ppl['loss']:.4f}")
1129
+ print(f" PPL: {ppl['perplexity']:.2f}")
1130
+
1131
+ # ๋“ฑ๊ธ‰ ํŒ์ •
1132
+ ppl_val = ppl["perplexity"]
1133
+ if ppl_val < 20:
1134
+ grade = "๐ŸŒŸ ์šฐ์ˆ˜ (Strong)"
1135
+ elif ppl_val < 35:
1136
+ grade = "โœ… ์–‘ํ˜ธ (Good)"
1137
+ elif ppl_val < 60:
1138
+ grade = "โš ๏ธ ๋ณดํ†ต (Fair)"
1139
+ else:
1140
+ grade = "โŒ ๋ฏธํก (ํ•™์Šต ์ถ”๊ฐ€ ํ•„์š”)"
1141
+ print(f" ๋“ฑ๊ธ‰: {grade}")
1142
+
1143
+ # ์œ„์น˜๋ณ„ Loss
1144
+ if "position_losses" in report:
1145
+ pl = report["position_losses"]
1146
+ print(f"\n ๐Ÿ“ ์œ„์น˜๋ณ„ Loss:")
1147
+ print(f" ์ดˆ๋ฐ˜ (0-50): {pl['early_avg']:.4f}")
1148
+ print(f" ํ›„๋ฐ˜ (-200): {pl['late_avg']:.4f}")
1149
+ print(f" ์ปจํ…์ŠคํŠธ ํšจ๊ณผ: {pl['early_avg'] - pl['late_avg']:.4f} ๊ฐ์†Œ")
1150
+
1151
+ # ์ƒ์„ฑ ํ’ˆ์งˆ
1152
+ if "generation" in report and report["generation"].get("avg_metrics"):
1153
+ gm = report["generation"]["avg_metrics"]
1154
+ print(f"\n โœ๏ธ ์ƒ์„ฑ ํ’ˆ์งˆ:")
1155
+ print(f" ํ‰๊ท  ๊ธธ์ด: {gm.get('avg_length', 0):.0f} ์ž")
1156
+ print(f" ๋ฐ˜๋ณต๋ฅ : {gm.get('repetition_rate', 0):.1%}")
1157
+ print(f" ์–ดํœ˜ ๋‹ค์–‘์„ฑ: {gm.get('lexical_diversity', 0):.3f}")
1158
+
1159
+ # ํ•™์Šต ์—ญํ•™
1160
+ if "training_dynamics" in report:
1161
+ td = report["training_dynamics"]
1162
+ if "loss" in td:
1163
+ print(f"\n ๐Ÿ“‰ ํ•™์Šต ์—ญํ•™:")
1164
+ print(f" Loss ๊ฐ์†Œ: {td['loss']['initial']:.4f} โ†’ {td['loss']['final']:.4f}")
1165
+ print(f" ์ŠคํŒŒ์ดํฌ: {len(td['loss']['spikes'])}ํšŒ")
1166
+
1167
+ # ์ƒ์„ฑ๋œ ํŒŒ์ผ
1168
+ print(f"\n ๐Ÿ“‚ ๊ฒฐ๊ณผ ํŒŒ์ผ:")
1169
+ for f in sorted(self.save_dir.glob("*")):
1170
+ size = f.stat().st_size / 1024
1171
+ print(f" {f.name} ({size:.1f} KB)")
1172
+
1173
+ print("\n" + "=" * 70)
1174
+
1175
+
1176
+ # ============================================================================
1177
+ # 8. ํ•™์Šต ์ธ์‚ฌ์ดํŠธ ์ฒดํฌ๋ฆฌ์ŠคํŠธ ๊ฒ€์ฆ๊ธฐ
1178
+ # ============================================================================
1179
+
1180
+ class InsightChecklist:
1181
+ """PRD์— ์ •์˜๋œ ํ•™์Šต ์ธ์‚ฌ์ดํŠธ ์ฒดํฌ๋ฆฌ์ŠคํŠธ๋ฅผ ์ž๋™/์ˆ˜๋™์œผ๋กœ ๊ฒ€์ฆํ•ฉ๋‹ˆ๋‹ค.
1182
+
1183
+ ์ž๋™ ๊ฒ€์ฆ ๊ฐ€๋Šฅ ํ•ญ๋ชฉ์€ ๋ฉ”ํŠธ๋ฆญ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒ์ •ํ•˜๊ณ ,
1184
+ ์ˆ˜๋™ ํ•ญ๋ชฉ์€ ์งˆ๋ฌธ์œผ๋กœ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
1185
+ """
1186
+
1187
+ @staticmethod
1188
+ def run_checklist(
1189
+ report: Dict[str, Any],
1190
+ metrics_history: Optional[Dict[str, list]] = None,
1191
+ ):
1192
+ """์ฒดํฌ๋ฆฌ์ŠคํŠธ๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค."""
1193
+ print("\n" + "=" * 70)
1194
+ print("โœ… ํ•™์Šต ์ธ์‚ฌ์ดํŠธ ์ฒดํฌ๋ฆฌ์ŠคํŠธ")
1195
+ print("=" * 70)
1196
+
1197
+ checks = {
1198
+ "passed": [],
1199
+ "failed": [],
1200
+ "manual": [],
1201
+ }
1202
+
1203
+ # โ”€โ”€ ์ž๋™ ๊ฒ€์ฆ โ”€โ”€
1204
+
1205
+ # 1. Loss ์ˆ˜๋ ด
1206
+ if report.get("perplexity", {}).get("loss", 99) < 4.0:
1207
+ checks["passed"].append("๋ชจ๋ธ Loss๊ฐ€ 4.0 ์ดํ•˜๋กœ ์ˆ˜๋ ด")
1208
+ else:
1209
+ checks["failed"].append("๋ชจ๋ธ Loss๊ฐ€ 4.0 ์ดํ•˜๋กœ ๋ฏธ์ˆ˜๋ ด")
1210
+
1211
+ # 2. Loss ์ŠคํŒŒ์ดํฌ
1212
+ spikes = report.get("training_dynamics", {}).get("loss", {}).get("spikes", [])
1213
+ if len(spikes) < 5:
1214
+ checks["passed"].append(f"Loss ์ŠคํŒŒ์ดํฌ {len(spikes)}ํšŒ (< 5ํšŒ)")
1215
+ else:
1216
+ checks["failed"].append(f"Loss ์ŠคํŒŒ์ดํฌ {len(spikes)}ํšŒ (โ‰ฅ 5ํšŒ, ์•ˆ์ •์„ฑ ๊ฐœ์„  ํ•„์š”)")
1217
+
1218
+ # 3. ์œ„์น˜๋ณ„ Loss ํŒจํ„ด
1219
+ if report.get("position_losses"):
1220
+ early = report["position_losses"]["early_avg"]
1221
+ late = report["position_losses"]["late_avg"]
1222
+ if early > late:
1223
+ checks["passed"].append("์œ„์น˜๋ณ„ Loss ๊ฐ์†Œ ํŒจํ„ด ํ™•์ธ (์ปจํ…์ŠคํŠธ ํ™œ์šฉ)")
1224
+ else:
1225
+ checks["failed"].append("์œ„์น˜๋ณ„ Loss ํŒจํ„ด ์ด์ƒ (์ปจํ…์ŠคํŠธ ๋ฏธํ™œ์šฉ?)")
1226
+
1227
+ # 4. ์ƒ์„ฑ ๋ฐ˜๋ณต๋ฅ 
1228
+ rep = report.get("generation", {}).get("avg_metrics", {}).get("repetition_rate", 1.0)
1229
+ if rep < 0.3:
1230
+ checks["passed"].append(f"์ƒ์„ฑ ๋ฐ˜๋ณต๋ฅ  {rep:.1%} (< 30%)")
1231
+ else:
1232
+ checks["failed"].append(f"์ƒ์„ฑ ๋ฐ˜๋ณต๋ฅ  {rep:.1%} (โ‰ฅ 30%, temperature/top_p ์กฐ์ •)")
1233
+
1234
+ # 5. Gradient ํด๋ฆฌํ•‘ ๋น„์œจ
1235
+ if metrics_history and metrics_history.get("grad_norm"):
1236
+ gnorms = metrics_history["grad_norm"]
1237
+ clip_rate = sum(1 for g in gnorms if g >= 0.99) / max(len(gnorms), 1)
1238
+ if clip_rate < 0.3:
1239
+ checks["passed"].append(f"Gradient ํด๋ฆฌํ•‘ ๋น„์œจ {clip_rate:.1%} (๊ฑด๊ฐ•)")
1240
+ else:
1241
+ checks["failed"].append(f"Gradient ํด๋ฆฌํ•‘ ๋น„์œจ {clip_rate:.1%} (๋„ˆ๋ฌด ์žฆ์Œ)")
1242
+
1243
+ # โ”€โ”€ ์ˆ˜๋™ ํ™•์ธ ํ•ญ๋ชฉ โ”€โ”€
1244
+ manual_items = [
1245
+ "Self-Attention์—์„œ Q, K, V ๊ฐ๊ฐ์˜ ์—ญํ• ์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?",
1246
+ "RoPE๊ฐ€ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๋Š” ์ˆ˜ํ•™์  ์›๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๋Š”๊ฐ€?",
1247
+ "GQA๊ฐ€ MHA ๋Œ€๋น„ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ ˆ์•ฝํ•˜๋Š” ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?",
1248
+ "SwiGLU์˜ ๊ฒŒ์ดํŒ… ๋ฉ”์ปค๋‹ˆ์ฆ˜์ด ReLU FFN๊ณผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ์ง€ ์ดํ•ดํ•˜๋Š”๊ฐ€?",
1249
+ "Learning Rate Warmup์ด ์™œ ํ•„์š”ํ•œ์ง€ ์ฒด๊ฐํ–ˆ๋Š”๊ฐ€?",
1250
+ "Gradient Accumulation์ด ํฐ ๋ฐฐ์น˜๋ฅผ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•˜๋Š” ์›๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๋Š”๊ฐ€?",
1251
+ "Mixed Precision(bf16)์˜ ๋ฉ”๋ชจ๋ฆฌ-์†๋„ ํšจ๊ณผ๋ฅผ ์ธก์ •ํ–ˆ๋Š”๊ฐ€?",
1252
+ "Activation Checkpointing์˜ ๋ฉ”๋ชจ๋ฆฌ-์—ฐ์‚ฐ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„๋ฅผ ์ดํ•ดํ•˜๋Š”๊ฐ€?",
1253
+ ]
1254
+ checks["manual"] = manual_items
1255
+
1256
+ # โ”€โ”€ ์ถœ๋ ฅ โ”€โ”€
1257
+ total_auto = len(checks["passed"]) + len(checks["failed"])
1258
+ passed_auto = len(checks["passed"])
1259
+
1260
+ print(f"\n ์ž๋™ ๊ฒ€์ฆ: {passed_auto}/{total_auto} ํ†ต๊ณผ")
1261
+ for item in checks["passed"]:
1262
+ print(f" โœ… {item}")
1263
+ for item in checks["failed"]:
1264
+ print(f" โŒ {item}")
1265
+
1266
+ print(f"\n ์ˆ˜๋™ ํ™•์ธ ({len(manual_items)} ํ•ญ๋ชฉ):")
1267
+ for i, item in enumerate(manual_items, 1):
1268
+ print(f" {i}. [ ] {item}")
1269
+
1270
+ print(f"\n ์ด ์ง„ํ–‰๋ฅ : {passed_auto}/{total_auto + len(manual_items)} "
1271
+ f"(์ˆ˜๋™ ํ•ญ๋ชฉ ํฌํ•จ ์‹œ)")
1272
+
1273
+ return checks
1274
+
1275
+
1276
+ # ============================================================================
1277
+ # 9. Quick Start
1278
+ # ============================================================================
1279
+
1280
+ def run_evaluation(
1281
+ model: nn.Module,
1282
+ tokenizer: Any,
1283
+ val_dataloader: DataLoader,
1284
+ device: Optional[torch.device] = None,
1285
+ dtype: torch.dtype = torch.bfloat16,
1286
+ metrics_history: Optional[Dict[str, list]] = None,
1287
+ config: Optional[EvalConfig] = None,
1288
+ ) -> Dict[str, Any]:
1289
+ """ํ‰๊ฐ€๋ฅผ ํ•œ ๋ฒˆ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.
1290
+
1291
+ ์‚ฌ์šฉ๋ฒ• (Colab):
1292
+ ```python
1293
+ from evaluation import run_evaluation
1294
+
1295
+ # ํ•™์Šต ์™„๋ฃŒ ํ›„
1296
+ report = run_evaluation(
1297
+ model=trainer.model,
1298
+ tokenizer=tokenizer,
1299
+ val_dataloader=val_dl,
1300
+ metrics_history=trainer.metrics.history,
1301
+ )
1302
+ ```
1303
+ """
1304
+ if device is None:
1305
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
1306
+
1307
+ evaluator = FullEvaluator(
1308
+ model=model,
1309
+ tokenizer=tokenizer,
1310
+ val_dataloader=val_dataloader,
1311
+ device=device,
1312
+ config=config,
1313
+ dtype=dtype,
1314
+ metrics_history=metrics_history,
1315
+ )
1316
+
1317
+ report = evaluator.run_full_evaluation()
1318
+
1319
+ # ์ธ์‚ฌ์ดํŠธ ์ฒดํฌ๋ฆฌ์ŠคํŠธ
1320
+ InsightChecklist.run_checklist(report, metrics_history)
1321
+
1322
+ return report
1323
+
1324
+
1325
+ # ============================================================================
1326
+ # 10. ๊ฒ€์ฆ ์Šคํฌ๋ฆฝํŠธ
1327
+ # ============================================================================
1328
+
1329
+ if __name__ == "__main__":
1330
+ print("=" * 70)
1331
+ print("LLM-1B-Lab: ํ‰๊ฐ€ ๋ชจ๋“ˆ ๊ฒ€์ฆ")
1332
+ print("=" * 70)
1333
+
1334
+ # โ”€โ”€ ๋”๋ฏธ ๋ชจ๋ธ๋กœ ๊ตฌ์กฐ ๊ฒ€์ฆ โ”€โ”€
1335
+ class TinyModel(nn.Module):
1336
+ def __init__(self, vocab=100, dim=64):
1337
+ super().__init__()
1338
+ self.emb = nn.Embedding(vocab, dim)
1339
+ self.linear = nn.Linear(dim, vocab)
1340
+ self.linear.weight = self.emb.weight
1341
+ self.layers = nn.ModuleList() # attention ์‹œ๊ฐํ™” ํ˜ธํ™˜์šฉ
1342
+
1343
+ def forward(self, input_ids, targets=None):
1344
+ h = self.emb(input_ids)
1345
+ logits = self.linear(h)
1346
+ loss = None
1347
+ if targets is not None:
1348
+ loss = F.cross_entropy(logits.view(-1, 100), targets.view(-1))
1349
+ return logits, loss
1350
+
1351
+ def generate(self, input_ids, max_new_tokens=20, temperature=1.0, top_k=50, top_p=0.9):
1352
+ generated = input_ids
1353
+ for _ in range(max_new_tokens):
1354
+ logits, _ = self(generated[:, -64:])
1355
+ next_logits = logits[:, -1, :] / temperature
1356
+ probs = F.softmax(next_logits, dim=-1)
1357
+ nxt = torch.multinomial(probs, 1)
1358
+ generated = torch.cat([generated, nxt], dim=1)
1359
+ return generated
1360
+
1361
+ model = TinyModel()
1362
+ device = torch.device("cpu")
1363
+
1364
+ # ๋”๋ฏธ ํ† ํฌ๋‚˜์ด์ €
1365
+ class DummyTok:
1366
+ eos_id = 2
1367
+ vocab_size = 100
1368
+ def encode(self, t, add_special_tokens=False):
1369
+ return [min(ord(c), 99) for c in t]
1370
+ def decode(self, ids):
1371
+ return "".join(chr(max(min(i, 122), 32)) for i in ids if i > 2)
1372
+
1373
+ tok = DummyTok()
1374
+
1375
+ # ๋”๋ฏธ ๋ฐ์ดํ„ฐ
1376
+ val_data = []
1377
+ for _ in range(30):
1378
+ ids = torch.randint(3, 100, (65,))
1379
+ val_data.append({"input_ids": ids[:64], "targets": ids[1:65]})
1380
+
1381
+ def collate(batch):
1382
+ return {
1383
+ "input_ids": torch.stack([b["input_ids"] for b in batch]),
1384
+ "targets": torch.stack([b["targets"] for b in batch]),
1385
+ }
1386
+
1387
+ val_dl = DataLoader(val_data, batch_size=4, collate_fn=collate)
1388
+
1389
+ # โ”€โ”€ 1. Perplexity ํ…Œ์ŠคํŠธ โ”€โ”€
1390
+ print("\n[ํ…Œ์ŠคํŠธ 1] Perplexity ์ธก์ •")
1391
+ ppl_eval = PerplexityEvaluator(EvalConfig(max_eval_batches=5))
1392
+ result = ppl_eval.evaluate(model, val_dl, device, torch.float32, desc="Test Eval")
1393
+ print(f" โ†’ Loss={result['loss']:.4f}, PPL={result['perplexity']:.2f}")
1394
+ expected_ppl = math.exp(math.log(100)) # vocab=100 โ†’ ์ดˆ๊ธฐ PPL โ‰ˆ 100
1395
+ print(f" โ†’ ์˜ˆ์ƒ ์ดˆ๊ธฐ PPL โ‰ˆ {expected_ppl:.0f} (vocab=100 ๋žœ๋ค)")
1396
+
1397
+ # โ”€โ”€ 2. ์ƒ์„ฑ ํ…Œ์ŠคํŠธ โ”€โ”€
1398
+ print("\n[ํ…Œ์ŠคํŠธ 2] ํ…์ŠคํŠธ ์ƒ์„ฑ")
1399
+ gen_eval = GenerationEvaluator(EvalConfig(max_new_tokens=30, num_samples=1))
1400
+ gen_results = gen_eval.generate_samples(
1401
+ model, tok, device, prompts=["Hello world"], verbose=True
1402
+ )
1403
+
1404
+ # โ”€โ”€ 3. Scaling ๋ถ„์„ ํ…Œ์ŠคํŠธ โ”€โ”€
1405
+ print("\n[ํ…Œ์ŠคํŠธ 3] Scaling Law ๋ถ„์„")
1406
+ analyzer = ScalingAnalyzer("./test_eval")
1407
+ dummy_scaling = [
1408
+ {"name": "10M", "params": 10e6, "tokens": 1e9, "loss": 4.2, "ppl": 66.7},
1409
+ {"name": "100M", "params": 100e6, "tokens": 5e9, "loss": 3.5, "ppl": 33.1},
1410
+ {"name": "1B", "params": 1.1e9, "tokens": 10e9, "loss": 3.0, "ppl": 20.1},
1411
+ ]
1412
+ scaling_result = analyzer.analyze(dummy_scaling)
1413
+
1414
+ # โ”€โ”€ 4. ํ•™์Šต ์—ญํ•™ ๋ถ„์„ ํ…Œ์ŠคํŠธ โ”€โ”€
1415
+ print("\n[ํ…Œ์ŠคํŠธ 4] ํ•™์Šต ์—ญํ•™ ๋ถ„์„")
1416
+ import random
1417
+ random.seed(42)
1418
+
1419
+ dummy_history = {
1420
+ "step": list(range(0, 1000, 10)),
1421
+ "train_loss": [10.0 * (0.995 ** i) + random.gauss(0, 0.1) for i in range(100)],
1422
+ "learning_rate": [min(3e-4 * i / 20, 3e-4) * (0.5 + 0.5 * math.cos(math.pi * max(0, i-20)/80))
1423
+ for i in range(100)],
1424
+ "grad_norm": [min(random.gauss(0.5, 0.3), 1.0) for _ in range(100)],
1425
+ "tokens_per_sec": [50000 + random.gauss(0, 3000) for _ in range(100)],
1426
+ "val_loss": [8.0, 6.0, 4.5, 3.8, 3.5],
1427
+ "val_ppl": [2981, 403, 90, 44, 33],
1428
+ }
1429
+
1430
+ dynamics = TrainingDynamicsAnalyzer("./test_eval")
1431
+ dynamics.analyze_metrics(dummy_history)
1432
+
1433
+ # โ”€โ”€ 5. ์ฒดํฌ๋ฆฌ์ŠคํŠธ ํ…Œ์ŠคํŠธ โ”€โ”€
1434
+ print("\n[ํ…Œ์ŠคํŠธ 5] ์ธ์‚ฌ์ดํŠธ ์ฒดํฌ๋ฆฌ์ŠคํŠธ")
1435
+ dummy_report = {
1436
+ "perplexity": {"loss": 3.5, "perplexity": 33.1},
1437
+ "position_losses": {"early_avg": 4.5, "late_avg": 3.2},
1438
+ "generation": {"avg_metrics": {"repetition_rate": 0.15}},
1439
+ "training_dynamics": {"loss": {"initial": 10.0, "final": 3.5, "spikes": []}},
1440
+ }
1441
+ InsightChecklist.run_checklist(dummy_report, dummy_history)
1442
+
1443
+ # ์ •๋ฆฌ
1444
+ import shutil
1445
+ if os.path.exists("./test_eval"):
1446
+ shutil.rmtree("./test_eval")
1447
+
1448
+ print("\n" + "=" * 70)
1449
+ print("โœ… ํ‰๊ฐ€ ๋ชจ๋“ˆ ๊ฒ€์ฆ ์™„๋ฃŒ!")
1450
+ print()
1451
+ print("์‹ค์ œ ์‚ฌ์šฉ๋ฒ•:")
1452
+ print(" from evaluation import run_evaluation")
1453
+ print(" report = run_evaluation(model, tokenizer, val_dl,")
1454
+ print(" metrics_history=trainer.metrics.history)")
1455
+ print("=" * 70)
_archive/llm-1b-model.py ADDED
@@ -0,0 +1,791 @@
1
+ """
2
+ LLM-1B-Lab: 1B Parameter LLaMA-style Transformer (from scratch)
3
+ ================================================================
4
+ ๋”ฅ๋Ÿฌ๋‹ ์ดˆ๋ณด์ž๋ฅผ ์œ„ํ•œ ํ•™์Šต์šฉ ๊ตฌํ˜„.
5
+ ๊ฐ ์ปดํฌ๋„ŒํŠธ์— ์ƒ์„ธ ์ฃผ์„์„ ๋‹ฌ์•„ "์™œ ์ด๋ ‡๊ฒŒ ํ•˜๋Š”์ง€"๋ฅผ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.
6
+
7
+ ์•„ํ‚คํ…์ฒ˜ ์š”์•ฝ:
8
+ - Decoder-Only Transformer (Causal LM)
9
+ - RMSNorm (Pre-Normalization)
10
+ - Rotary Positional Embedding (RoPE)
11
+ - Grouped Query Attention (GQA)
12
+ - SwiGLU Feed-Forward Network
13
+ - Weight Tying (Embedding โ†” Output Head)
14
+ """
15
+
16
+ import math
17
+ from dataclasses import dataclass
18
+ from typing import Optional, Tuple
19
+
20
+ import torch
21
+ import torch.nn as nn
22
+ import torch.nn.functional as F
23
+
24
+
25
+ # ============================================================================
26
+ # 1. ๋ชจ๋ธ ์„ค์ • (Config)
27
+ # ============================================================================
28
+
29
+ @dataclass
30
+ class ModelConfig:
31
+ """๋ชจ๋ธ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐํด๋ž˜์Šค๋กœ ๊ด€๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
32
+
33
+ ๊ทœ๋ชจ๋ณ„ ํ”„๋ฆฌ์…‹:
34
+ - debug: ~10M (ํŒŒ์ดํ”„๋ผ์ธ ๊ฒ€์ฆ์šฉ)
35
+ - small: ~100M (์ค‘๊ฐ„ ๊ฒ€์ฆ์šฉ)
36
+ - base: ~1.1B (์ตœ์ข… ๋ชฉํ‘œ)
37
+ """
38
+ vocab_size: int = 32_000
39
+ hidden_dim: int = 2048 # d_model: ๋ชจ๋ธ์˜ ๊ธฐ๋ณธ ์ฐจ์›
40
+ num_layers: int = 22 # Transformer ๋ธ”๋ก ์ˆ˜
41
+ num_heads: int = 16 # Query ํ—ค๋“œ ์ˆ˜
42
+ num_kv_heads: int = 4 # Key/Value ํ—ค๋“œ ์ˆ˜ (GQA)
43
+ intermediate_dim: int = 5632 # FFN ์ค‘๊ฐ„ ์ฐจ์› (โ‰ˆ 2.75 ร— hidden_dim)
44
+ max_seq_len: int = 2048 # ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด
45
+ dropout: float = 0.0 # Pretraining์—์„œ๋Š” ๋ณดํ†ต 0 ์‚ฌ์šฉ
46
+ rope_theta: float = 10000.0 # RoPE ์ฃผํŒŒ์ˆ˜ ๋ฒ ์ด์Šค
47
+ norm_eps: float = 1e-6 # RMSNorm epsilon
48
+
49
+ @property
50
+ def head_dim(self) -> int:
51
+ """๊ฐ ์–ดํ…์…˜ ํ—ค๋“œ์˜ ์ฐจ์›."""
52
+ return self.hidden_dim // self.num_heads
53
+
54
+ @property
55
+ def num_kv_groups(self) -> int:
56
+ """GQA์—์„œ ํ•˜๋‚˜์˜ KV ํ—ค๋“œ๊ฐ€ ๋‹ด๋‹นํ•˜๋Š” Q ํ—ค๋“œ ์ˆ˜."""
57
+ return self.num_heads // self.num_kv_heads
58
+
59
+ @classmethod
60
+ def debug_10m(cls) -> "ModelConfig":
61
+ """~10M ํŒŒ๋ผ๋ฏธํ„ฐ - ๋น ๋ฅธ ๋””๋ฒ„๊น…์šฉ."""
62
+ return cls(
63
+ hidden_dim=256, num_layers=6, num_heads=8,
64
+ num_kv_heads=4, intermediate_dim=704, max_seq_len=512,
65
+ )
66
+
67
+ @classmethod
68
+ def small_100m(cls) -> "ModelConfig":
69
+ """~100M ํŒŒ๋ผ๋ฏธํ„ฐ - ์ค‘๊ฐ„ ๊ฒ€์ฆ์šฉ."""
70
+ return cls(
71
+ hidden_dim=768, num_layers=12, num_heads=12,
72
+ num_kv_heads=4, intermediate_dim=2048, max_seq_len=1024,
73
+ )
74
+
75
+ @classmethod
76
+ def base_1b(cls) -> "ModelConfig":
77
+ """~1.1B ํŒŒ๋ผ๋ฏธํ„ฐ - ์ตœ์ข… ํ•™์Šต ๋ชฉํ‘œ."""
78
+ return cls() # ๊ธฐ๋ณธ๊ฐ’์ด 1B ์„ค์ •
79
+
80
+
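+ # Illustrative parameter-count estimate (our own back-of-the-envelope sketch;
+ # it may differ slightly from the real model's count). Assumes weight tying,
+ # a 3-matrix SwiGLU FFN, and bias-free linears, as in this architecture.
+ def estimate_params(cfg: "ModelConfig") -> int:
+     d = cfg.hidden_dim
+     kv_dim = cfg.num_kv_heads * cfg.head_dim
+     attn = d * d + 2 * d * kv_dim + d * d          # q_proj + k/v_proj + o_proj
+     ffn = 3 * d * cfg.intermediate_dim             # SwiGLU: gate, up, down
+     per_layer = attn + ffn + 2 * d                 # + two RMSNorm weights
+     embed = cfg.vocab_size * d                     # output head shares this (tying)
+     return cfg.num_layers * per_layer + embed + d  # + final norm
+
+ # estimate_params(ModelConfig.base_1b()) comes out to ~1.06e9, i.e. ~1.1B.
+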
81
+ # ============================================================================
82
+ # 2. RMSNorm (Root Mean Square Layer Normalization)
83
+ # ============================================================================
84
+
85
+ class RMSNorm(nn.Module):
86
+ """RMSNorm: LayerNorm์˜ ๊ฒฝ๋Ÿ‰ํ™” ๋ฒ„์ „.
87
+
88
+ ์ผ๋ฐ˜ LayerNorm๊ณผ์˜ ์ฐจ์ด:
89
+ - ํ‰๊ท (mean)์„ ๋นผ์ง€ ์•Š์Œ โ†’ ์—ฐ์‚ฐ ์ ˆ์•ฝ
90
+ - ๋ถ„์‚ฐ ๋Œ€์‹  RMS(Root Mean Square)๋กœ ์ •๊ทœํ™”
91
+ - bias ํŒŒ๋ผ๋ฏธํ„ฐ ์—†์Œ
92
+
93
+ ์ˆ˜์‹:
94
+ RMSNorm(x) = (x / RMS(x)) * ฮณ
95
+ RMS(x) = sqrt(mean(xยฒ) + ฮต)
96
+
97
+ ์™œ ์ •๊ทœํ™”๊ฐ€ ํ•„์š”ํ•œ๊ฐ€?
98
+ โ†’ ๋ ˆ์ด์–ด๋ฅผ ๊นŠ๊ฒŒ ์Œ“์œผ๋ฉด ํ™œ์„ฑํ™” ๊ฐ’์˜ ์Šค์ผ€์ผ์ด ํญ๋ฐœํ•˜๊ฑฐ๋‚˜ ์†Œ๋ฉธํ•ฉ๋‹ˆ๋‹ค.
99
+ โ†’ ์ •๊ทœํ™”๋กœ ๊ฐ ๋ ˆ์ด์–ด์˜ ์ž…๋ ฅ์„ ์•ˆ์ •์ ์ธ ๋ฒ”์œ„๋กœ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.
100
+ """
101
+
102
+ def __init__(self, dim: int, eps: float = 1e-6):
103
+ super().__init__()
104
+ self.eps = eps
105
+ # ฮณ (gamma): ํ•™์Šต ๊ฐ€๋Šฅํ•œ ์Šค์ผ€์ผ ํŒŒ๋ผ๋ฏธํ„ฐ, 1๋กœ ์ดˆ๊ธฐํ™”
106
+ self.weight = nn.Parameter(torch.ones(dim))
107
+
108
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
109
+ # 1) ์ž…๋ ฅ์„ float32๋กœ ๋ณ€ํ™˜ (์ˆ˜์น˜ ์•ˆ์ •์„ฑ)
110
+ # bf16/fp16 ์ƒํƒœ์—์„œ ์ œ๊ณฑํ•ฉ์„ ๊ตฌํ•˜๋ฉด ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ์œ„ํ—˜
111
+ x_float = x.float()
112
+
113
+ # 2) 1/RMS ๊ณ„์‚ฐ: rsqrt(mean(x^2) + eps)
114
+ rms = torch.rsqrt(x_float.pow(2).mean(dim=-1, keepdim=True) + self.eps)
115
+ # rsqrt = 1/sqrt(x) โ†’ ๋‚˜๋ˆ—์…ˆ ๋Œ€์‹  ๊ณฑ์…ˆ์œผ๋กœ ๋Œ€์ฒด (๋” ๋น ๋ฆ„)
116
+
117
+ # 3) ์ •๊ทœํ™” ํ›„ ์›๋ž˜ dtype์œผ๋กœ ๋ณต์›, ์Šค์ผ€์ผ ์ ์šฉ
118
+ return (x_float * rms).to(x.dtype) * self.weight
119
+
120
+
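+ # Illustrative sanity check (toy tensors, our own sketch): with the initial
+ # gamma = 1, the output of RMSNorm should itself have RMS ~ 1 per vector.
+ def _rmsnorm_sanity_check() -> None:
+     x = torch.randn(2, 8, 64) * 10.0               # deliberately large scale
+     y = RMSNorm(64)(x)
+     rms = y.pow(2).mean(dim=-1).sqrt()             # RMS over the feature dim
+     assert torch.allclose(rms, torch.ones_like(rms), atol=1e-3)
+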
121
+ # ============================================================================
122
+ # 3. Rotary Positional Embedding (RoPE)
123
+ # ============================================================================
124
+
125
+ class RotaryPositionalEmbedding(nn.Module):
126
+ """RoPE: ํšŒ์ „ ํ–‰๋ ฌ์„ ์ด์šฉํ•œ ์ƒ๋Œ€ ์œ„์น˜ ์ธ์ฝ”๋”ฉ.
127
+
128
+ ํ•ต์‹ฌ ์•„์ด๋””์–ด:
129
+ - ๊ฐ ์ฐจ์› ์Œ(2i, 2i+1)์„ 2D ํ‰๋ฉด์˜ ์ขŒํ‘œ๋กœ ๋ณด๊ณ ,
130
+ ์œ„์น˜(position)์— ๋น„๋ก€ํ•œ ๊ฐ๋„๋งŒํผ ํšŒ์ „์‹œํ‚ต๋‹ˆ๋‹ค.
131
+ - ๋‘ ํ† ํฐ์˜ ์–ดํ…์…˜ ์Šค์ฝ”์–ด(QยทK)๋Š” ์ƒ๋Œ€ ๊ฑฐ๋ฆฌ์—๋งŒ ์˜์กดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
132
+
133
+ ์™œ RoPE์ธ๊ฐ€?
134
+ - ์ ˆ๋Œ€ ์œ„์น˜ ์ž„๋ฒ ๋”ฉ: ๊ฐ ์œ„์น˜์— ๊ณ ์ • ๋ฒกํ„ฐ๋ฅผ ๋”ํ•จ โ†’ ๊ธธ์ด ์ผ๋ฐ˜ํ™” ์–ด๋ ค์›€
135
+ - ์ƒ๋Œ€ ์œ„์น˜ ์ž„๋ฒ ๋”ฉ: ๊ตฌํ˜„ ๋ณต์žก, ์ถ”๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ ํ•„์š”
136
+ - RoPE: ํŒŒ๋ผ๋ฏธํ„ฐ ์—†์ด, ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ƒ๋Œ€ ์œ„์น˜ ์ •๋ณด ์ธ์ฝ”๋”ฉ
137
+
138
+ ์ˆ˜์‹:
139
+ ฮธ_i = theta^(-2i/d) (i = 0, 1, ..., d/2-1)
140
+ RoPE(x, pos) = x๋ฅผ ๊ฐ ์ฐจ์› ์Œ์—์„œ pos ร— ฮธ_i ๋งŒํผ ํšŒ์ „
141
+ """
142
+
143
+ def __init__(self, dim: int, max_seq_len: int = 2048, theta: float = 10000.0):
144
+ super().__init__()
145
+ self.dim = dim
146
+ self.max_seq_len = max_seq_len
147
+ self.theta = theta
148
+
149
+ # ์ฃผํŒŒ์ˆ˜ ๋ฒกํ„ฐ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐ (ํ•™์Šต ๋ถˆํ•„์š” โ†’ buffer๋กœ ๋“ฑ๋ก)
150
+ # freqs[i] = 1 / (theta^(2i/dim)), i = 0, 1, ..., dim/2-1
151
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
152
+ self.register_buffer("freqs", freqs, persistent=False)
153
+
154
+ # (max_seq_len, dim/2) ํฌ๊ธฐ์˜ cos/sin ํ…Œ์ด๋ธ” ๋ฏธ๋ฆฌ ๊ณ„์‚ฐ
155
+ self._build_cache(max_seq_len)
156
+
157
+ def _build_cache(self, seq_len: int):
158
+ """cos/sin ๊ฐ’์„ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•˜์—ฌ ์บ์‹ฑํ•ฉ๋‹ˆ๋‹ค."""
159
+ t = torch.arange(seq_len, device=self.freqs.device, dtype=torch.float32)
160
+ # outer product: (seq_len,) ร— (dim/2,) โ†’ (seq_len, dim/2)
161
+ angles = torch.outer(t, self.freqs)
162
+ self.register_buffer("cos_cached", angles.cos(), persistent=False)
163
+ self.register_buffer("sin_cached", angles.sin(), persistent=False)
164
+
165
+ def forward(
166
+ self, q: torch.Tensor, k: torch.Tensor, position_offset: int = 0
167
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
168
+ """Q, K์— ํšŒ์ „ ๋ณ€ํ™˜์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
169
+
170
+ Args:
171
+ q: (batch, num_heads, seq_len, head_dim)
172
+ k: (batch, num_kv_heads, seq_len, head_dim)
173
+ position_offset: ์‹œํ€€์Šค ์‹œ์ž‘ ์œ„์น˜ ์˜คํ”„์…‹ (์ถ”๋ก  ์‹œ KV ์บ์‹œ ์‚ฌ์šฉ ์‹œ)
174
+
175
+ Returns:
176
+ ํšŒ์ „ ๋ณ€ํ™˜์ด ์ ์šฉ๋œ (q_rotated, k_rotated)
177
+ """
178
+ seq_len = q.shape[2]
179
+
180
+ # ํ•„์š” ์‹œ ์บ์‹œ ํ™•์žฅ
181
+ if position_offset + seq_len > self.cos_cached.shape[0]:
182
+ self._build_cache(position_offset + seq_len)
183
+
184
+ # ํ˜„์žฌ ์œ„์น˜์— ํ•ด๋‹นํ•˜๋Š” cos/sin ์Šฌ๋ผ์ด์Šค
185
+ cos = self.cos_cached[position_offset : position_offset + seq_len] # (seq_len, dim/2)
186
+ sin = self.sin_cached[position_offset : position_offset + seq_len]
187
+
188
+ q_rotated = self._apply_rotation(q, cos, sin)
189
+ k_rotated = self._apply_rotation(k, cos, sin)
190
+ return q_rotated, k_rotated
191
+
192
+ @staticmethod
193
+ def _apply_rotation(
194
+ x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
195
+ ) -> torch.Tensor:
196
+ """ํšŒ์ „ ๋ณ€ํ™˜ ์ ์šฉ.
197
+
198
+ 2D ํšŒ์ „ ํ–‰๋ ฌ:
199
+ [cos ฮธ, -sin ฮธ] [x1] [x1ยทcos ฮธ - x2ยทsin ฮธ]
200
+ [sin ฮธ, cos ฮธ] [x2] = [x1ยทsin ฮธ + x2ยทcos ฮธ]
201
+
202
+ ์ด๋ฅผ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์œผ๋กœ ํšจ์œจ์ ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
203
+ """
204
+ # x: (batch, heads, seq_len, head_dim)
205
+ # ์ง์ˆ˜/ํ™€์ˆ˜ ์ธ๋ฑ์Šค๋ฅผ ๋ถ„๋ฆฌ: (x0, x1, x2, x3, ...) โ†’ (x0, x2, ...), (x1, x3, ...)
206
+ x_even = x[..., 0::2] # ์ง์ˆ˜ ์ธ๋ฑ์Šค
207
+ x_odd = x[..., 1::2] # ํ™€์ˆ˜ ์ธ๋ฑ์Šค
208
+
209
+ # ๋ธŒ๋กœ๋“œ์บ์ŠคํŒ…์„ ์œ„ํ•ด ์ฐจ์› ๋งž์ถค: (seq_len, dim/2) โ†’ (1, 1, seq_len, dim/2)
210
+ cos = cos.unsqueeze(0).unsqueeze(0)
211
+ sin = sin.unsqueeze(0).unsqueeze(0)
212
+
213
+ # ํšŒ์ „ ์ ์šฉ
214
+ rotated_even = x_even * cos - x_odd * sin
215
+ rotated_odd = x_even * sin + x_odd * cos
216
+
217
+ # ๋‹ค์‹œ ์ธํ„ฐ๋ฆฌ๋น™: (even0, odd0, even1, odd1, ...)
218
+ out = torch.stack([rotated_even, rotated_odd], dim=-1)
219
+ return out.flatten(-2) # ๋งˆ์ง€๋ง‰ ๋‘ ์ฐจ์›์„ ํ•ฉ์ณ ์›๋ž˜ shape ๋ณต์›
220
+
221
+
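+ # Illustrative check of RoPE's key property (toy tensors, our own sketch):
+ # the q-k dot product depends only on relative distance, so the same pair
+ # placed at positions (0, 2) and at (5, 7) yields the same score.
+ def _rope_relative_position_check() -> None:
+     rope = RotaryPositionalEmbedding(dim=8, max_seq_len=16)
+     q = torch.randn(1, 1, 1, 8)                    # (B, heads, S=1, head_dim)
+     k = torch.randn(1, 1, 1, 8)
+     q0, _ = rope(q, k, position_offset=0)          # q rotated as position 0
+     _, k2 = rope(q, k, position_offset=2)          # k rotated as position 2
+     q5, _ = rope(q, k, position_offset=5)          # q rotated as position 5
+     _, k7 = rope(q, k, position_offset=7)          # k rotated as position 7
+     assert torch.allclose((q0 * k2).sum(), (q5 * k7).sum(), atol=1e-5)
+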
222
+ # ============================================================================
223
+ # 4. Grouped Query Attention (GQA)
224
+ # ============================================================================
225
+
226
+ class GroupedQueryAttention(nn.Module):
227
+ """GQA: Multi-Head Attention์˜ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์  ๋ณ€ํ˜•.
228
+
229
+ MHA vs GQA vs MQA:
230
+ - MHA (Multi-Head Attention): Q, K, V ๋ชจ๋‘ num_heads๊ฐœ โ†’ ๋ฉ”๋ชจ๋ฆฌ ํผ
231
+ - MQA (Multi-Query Attention): K, V๋Š” 1๊ฐœ ํ—ค๋“œ ๊ณต์œ  โ†’ ํ’ˆ์งˆ ์ €ํ•˜ ์šฐ๋ ค
232
+ - GQA (Grouped Query Attention): K, V๋ฅผ num_kv_heads๊ฐœ๋กœ ๊ทธ๋ฃนํ™”
233
+ โ†’ MHA์™€ MQA์˜ ์ค‘๊ฐ„, ์ข‹์€ ํ’ˆ์งˆ-ํšจ์œจ ๊ท ํ˜•
234
+
235
+ ์˜ˆ์‹œ (num_heads=16, num_kv_heads=4):
236
+ Q ํ—ค๋“œ: [0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15]
237
+ K/V ๊ทธ๋ฃน: [ 0 , 1 , 2 , 3 ]
238
+ โ†’ Q ํ—ค๋“œ 4๊ฐœ๊ฐ€ K/V ํ—ค๋“œ 1๊ฐœ๋ฅผ ๊ณต์œ 
239
+
240
+ Attention ์ˆ˜์‹:
241
+ Attention(Q, K, V) = softmax(QยทK^T / โˆšd_k) ยท V
242
+ """
243
+
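+     # Back-of-the-envelope KV-cache arithmetic (illustrative, assumed values):
+     #   cache bytes ~ 2 (K and V) x num_kv_heads x head_dim x seq_len
+     #                 x num_layers x bytes_per_elem
+     #   For base_1b in bf16 at seq_len=2048:
+     #     MHA-style (16 KV heads): 2 x 16 x 128 x 2048 x 22 x 2 ~ 369 MB / sequence
+     #     GQA (4 KV heads): 2 x 4 x 128 x 2048 x 22 x 2 ~ 92 MB / sequence
+     #   i.e. a num_heads / num_kv_heads = 4x reduction.
+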
244
+ def __init__(self, config: ModelConfig):
245
+ super().__init__()
246
+ self.config = config
247
+ self.head_dim = config.head_dim
248
+ self.num_heads = config.num_heads
249
+ self.num_kv_heads = config.num_kv_heads
250
+ self.num_kv_groups = config.num_kv_groups # num_heads // num_kv_heads
251
+
252
+ # Q/K/V ํ”„๋กœ์ ์…˜
253
+ # Q: hidden_dim โ†’ num_heads ร— head_dim
254
+ self.q_proj = nn.Linear(config.hidden_dim, config.num_heads * self.head_dim, bias=False)
255
+ # K, V: hidden_dim โ†’ num_kv_heads ร— head_dim (Q๋ณด๋‹ค ์ž‘์Œ!)
256
+ self.k_proj = nn.Linear(config.hidden_dim, config.num_kv_heads * self.head_dim, bias=False)
257
+ self.v_proj = nn.Linear(config.hidden_dim, config.num_kv_heads * self.head_dim, bias=False)
258
+
259
+ # ์ถœ๋ ฅ ํ”„๋กœ์ ์…˜: ๋ชจ๋“  ํ—ค๋“œ์˜ ์ถœ๋ ฅ์„ ๋‹ค์‹œ hidden_dim์œผ๋กœ
260
+ self.o_proj = nn.Linear(config.num_heads * self.head_dim, config.hidden_dim, bias=False)
261
+
262
+ # RoPE
263
+ self.rope = RotaryPositionalEmbedding(
264
+ dim=self.head_dim, max_seq_len=config.max_seq_len, theta=config.rope_theta
265
+ )
266
+
267
+ # Attention dropout (usually 0 for pretraining) is applied inside
268
+ # F.scaled_dot_product_attention via dropout_p in forward(), so no separate nn.Dropout module is needed
269
+
270
+ def forward(
271
+ self,
272
+ x: torch.Tensor,
273
+ mask: Optional[torch.Tensor] = None,
274
+ position_offset: int = 0,
275
+ ) -> torch.Tensor:
276
+ """
277
+ Args:
278
+ x: (batch_size, seq_len, hidden_dim)
279
+ mask: (seq_len, seq_len) causal mask
280
+ position_offset: ์œ„์น˜ ์˜คํ”„์…‹ (์ถ”๋ก  ์‹œ ์‚ฌ์šฉ)
281
+
282
+ Returns:
283
+ (batch_size, seq_len, hidden_dim)
284
+ """
285
+ B, S, _ = x.shape
286
+
287
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
288
+ # Step 1: Q, K, V ํ”„๋กœ์ ์…˜
289
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
290
+ q = self.q_proj(x) # (B, S, num_heads ร— head_dim)
291
+ k = self.k_proj(x) # (B, S, num_kv_heads ร— head_dim)
292
+ v = self.v_proj(x) # (B, S, num_kv_heads ร— head_dim)
293
+
294
+ # ๋ฉ€ํ‹ฐํ—ค๋“œ ํ˜•ํƒœ๋กœ reshape
295
+ q = q.view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
296
+ # โ†’ (B, num_heads, S, head_dim)
297
+ k = k.view(B, S, self.num_kv_heads, self.head_dim).transpose(1, 2)
298
+ # โ†’ (B, num_kv_heads, S, head_dim)
299
+ v = v.view(B, S, self.num_kv_heads, self.head_dim).transpose(1, 2)
300
+
301
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
302
+ # Step 2: RoPE ์ ์šฉ (Q, K์—๋งŒ! V์—๋Š” ์ ์šฉํ•˜์ง€ ์•Š์Œ)
303
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
304
+ # ์œ„์น˜ ์ •๋ณด๋Š” "์–ด๋””๋ฅผ ๋ณผ์ง€"(QยทK)์—๋งŒ ์˜ํ–ฅ์„ ์ค˜์•ผ ํ•˜๊ณ ,
305
+ # "๋ฌด์—‡์„ ๊ฐ€์ ธ์˜ฌ์ง€"(V)์—๋Š” ์˜ํ–ฅ์„ ์ฃผ๋ฉด ์•ˆ ๋ฉ๋‹ˆ๋‹ค.
306
+ q, k = self.rope(q, k, position_offset)
307
+
308
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
309
+ # Step 3: GQA - KV ํ—ค๋“œ ํ™•์žฅ (repeat)
310
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
311
+ # num_kv_heads=4 โ†’ num_heads=16: ๊ฐ KV๋ฅผ 4๋ฒˆ ๋ฐ˜๋ณต
312
+ if self.num_kv_groups > 1:
313
+ k = self._repeat_kv(k) # (B, num_heads, S, head_dim)
314
+ v = self._repeat_kv(v)
315
+
316
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
317
+ # Step 4: Scaled Dot-Product Attention
318
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
319
+ # PyTorch >= 2.0์˜ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„ ์‚ฌ์šฉ (Flash Attention ์ž๋™ ์ ์šฉ)
320
+ attn_out = F.scaled_dot_product_attention(
321
+ q, k, v,
322
+ attn_mask=mask,
323
+ dropout_p=self.config.dropout if self.training else 0.0,
324
+ is_causal=(mask is None), # mask๊ฐ€ ์—†์œผ๋ฉด ์ž๋™ causal masking
325
+ )
326
+ # โ†’ (B, num_heads, S, head_dim)
327
+
328
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
329
+ # Step 5: ํ—ค๋“œ ํ•ฉ์น˜๊ธฐ + ์ถœ๋ ฅ ํ”„๋กœ์ ์…˜
330
+ # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
331
+ attn_out = attn_out.transpose(1, 2).contiguous().view(B, S, -1)
332
+ # โ†’ (B, S, num_heads ร— head_dim)
333
+
334
+ return self.o_proj(attn_out) # โ†’ (B, S, hidden_dim)
335
+
336
+ def _repeat_kv(self, x: torch.Tensor) -> torch.Tensor:
337
+ """KV ํ—ค๋“œ๋ฅผ Q ํ—ค๋“œ ์ˆ˜์— ๋งž๊ฒŒ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.
338
+
339
+ (B, num_kv_heads, S, head_dim) โ†’ (B, num_heads, S, head_dim)
340
+
341
+ ์˜ˆ: num_kv_heads=4, num_kv_groups=4
342
+ [kv0, kv1, kv2, kv3] โ†’ [kv0,kv0,kv0,kv0, kv1,kv1,kv1,kv1, ...]
343
+ """
344
+ B, H_kv, S, D = x.shape
345
+ x = x[:, :, None, :, :] # (B, H_kv, 1, S, D)
346
+ x = x.expand(B, H_kv, self.num_kv_groups, S, D) # (B, H_kv, groups, S, D)
347
+ return x.reshape(B, self.num_heads, S, D)
348
+
349
+
350
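To put numbers on the MHA/MQA/GQA trade-off sketched in the docstring, here is a back-of-the-envelope KV-cache comparison. The helper is hypothetical, and the dimensions (22 layers, head_dim 128, bf16) are assumptions matching this project's rough 1B shape:

```python
def kv_cache_gb(num_kv_heads: int, num_layers: int = 22, batch: int = 4,
                seq_len: int = 2048, head_dim: int = 128, dtype_bytes: int = 2) -> float:
    # bytes = 2 (K and V) x batch x kv_heads x seq_len x head_dim x layers x bytes/elem
    return 2 * batch * num_kv_heads * seq_len * head_dim * num_layers * dtype_bytes / 1e9

print(f"MHA-style cache (16 KV heads): {kv_cache_gb(16):.2f} GB")  # 1.48 GB
print(f"GQA cache       ( 4 KV heads): {kv_cache_gb(4):.2f} GB")   # 0.37 GB (4x smaller)
```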
+ # ============================================================================
351
+ # 5. SwiGLU Feed-Forward Network
352
+ # ============================================================================
353
+
354
+ class SwiGLUFeedForward(nn.Module):
355
+ """SwiGLU: Gated Linear Unit with Swish ํ™œ์„ฑํ™” ํ•จ์ˆ˜.
356
+
357
+ ๊ธฐ์กด FFN:
358
+ FFN(x) = ReLU(xยทW1 + b1)ยทW2 + b2
359
+ โ†’ ๋‹จ์ˆœํ•œ ๋น„์„ ํ˜• ๋ณ€ํ™˜
360
+
361
+ SwiGLU FFN:
362
+ SwiGLU(x) = (Swish(xยทW_gate) โŠ™ (xยทW_up)) ยท W_down
363
+ โ†’ ๊ฒŒ์ดํŒ… ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ ์ •๋ณด ํ๋ฆ„์„ ์ œ์–ด
364
+
365
+ ์™œ SwiGLU๊ฐ€ ๋” ์ข‹์€๊ฐ€?
366
+ - Swish(x) = x ยท sigmoid(x): ๋ถ€๋“œ๋Ÿฌ์šด ํ™œ์„ฑํ™”, ์Œ์ˆ˜ ์˜์—ญ ์ผ๋ถ€ ํ—ˆ์šฉ
367
+ - Gate ๋ฒกํ„ฐ๊ฐ€ "์–ด๋–ค ์ •๋ณด๋ฅผ ํ†ต๊ณผ์‹œํ‚ฌ์ง€" ํ•™์Šต
368
+ - PaLM, LLaMA ๋“ฑ์—์„œ ReLU FFN ๋Œ€๋น„ ์ผ๊ด€๋œ ์„ฑ๋Šฅ ํ–ฅ์ƒ ๋ณด๊ณ 
369
+
370
+ ์ฐธ๊ณ : W_gate์™€ W_up ๋‘ ๊ฐœ์˜ up-projection์ด ์žˆ์–ด์„œ
371
+ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ ๊ธฐ์กด FFN ๋Œ€๋น„ 1.5๋ฐฐ์ด์ง€๋งŒ, intermediate_dim์„
372
+ ์กฐ์ •ํ•˜์—ฌ ์ด ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ๋งž์ถฅ๋‹ˆ๋‹ค.
373
+ """
374
+
375
+ def __init__(self, config: ModelConfig):
376
+ super().__init__()
377
+ # ๊ฒŒ์ดํŠธ ํ”„๋กœ์ ์…˜: hidden_dim โ†’ intermediate_dim
378
+ self.gate_proj = nn.Linear(config.hidden_dim, config.intermediate_dim, bias=False)
379
+ # ์—… ํ”„๋กœ์ ์…˜: hidden_dim โ†’ intermediate_dim
380
+ self.up_proj = nn.Linear(config.hidden_dim, config.intermediate_dim, bias=False)
381
+ # ๋‹ค์šด ํ”„๋กœ์ ์…˜: intermediate_dim โ†’ hidden_dim
382
+ self.down_proj = nn.Linear(config.intermediate_dim, config.hidden_dim, bias=False)
383
+
384
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
385
+ # SwiGLU(x) = (Swish(gate(x)) โŠ™ up(x)) ยท down
386
+ #
387
+ # 1) gate: ์–ด๋–ค ์ •๋ณด๋ฅผ ํ†ต๊ณผ์‹œํ‚ฌ์ง€ ๊ฒฐ์ • (Swish ํ™œ์„ฑํ™”)
388
+ gate = F.silu(self.gate_proj(x)) # silu = Swish = x * sigmoid(x)
389
+ # 2) up: ์ •๋ณด๋ฅผ ๊ณ ์ฐจ์›์œผ๋กœ ์‚ฌ์˜
390
+ up = self.up_proj(x)
391
+ # 3) element-wise ๊ณฑ (๊ฒŒ์ดํŒ…) โ†’ ๋‹ค์‹œ ์›๋ž˜ ์ฐจ์›์œผ๋กœ
392
+ return self.down_proj(gate * up)
393
+
394
+
395
+ # ============================================================================
396
+ # 6. Transformer Block (ํ•˜๋‚˜์˜ ๋ ˆ์ด์–ด)
397
+ # ============================================================================
398
+
399
+ class TransformerBlock(nn.Module):
400
+ """ํ•˜๋‚˜์˜ Transformer ๋””์ฝ”๋” ๋ธ”๋ก.
401
+
402
+ ๊ตฌ์กฐ (Pre-Norm ๋ฐฉ์‹):
403
+ x โ†’ RMSNorm โ†’ Attention โ†’ + (residual) โ†’ RMSNorm โ†’ FFN โ†’ + (residual) โ†’ out
404
+
405
+ Pre-Norm vs Post-Norm:
406
+ - Post-Norm (์›๋ž˜ Transformer): LayerNorm์ด residual ์ดํ›„
407
+ โ†’ ๊นŠ์€ ๋ชจ๋ธ์—์„œ ํ•™์Šต ๋ถˆ์•ˆ์ •
408
+ - Pre-Norm (GPT-2 ์ดํ›„ ํ‘œ์ค€): LayerNorm์ด sublayer ์ด์ „
409
+ โ†’ gradient ํ๋ฆ„์ด ์›ํ™œ, ํ•™์Šต์ด ์•ˆ์ •์ 
410
+
411
+ Residual Connection์˜ ์—ญํ• :
412
+ - ์ž…๋ ฅ์„ ์ถœ๋ ฅ์— ๋”ํ•จ โ†’ gradient๊ฐ€ ๋ ˆ์ด์–ด๋ฅผ ๊ฑด๋„ˆ๋›ธ ์ˆ˜ ์žˆ๋Š” "๊ณ ์†๋„๋กœ"
413
+ - 22๊ฐœ ๋ ˆ์ด์–ด๋ฅผ ์Œ“์•„๋„ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•œ ํ•ต์‹ฌ ์ด์œ 
414
+ """
415
+
416
+ def __init__(self, config: ModelConfig, layer_idx: int):
417
+ super().__init__()
418
+ self.layer_idx = layer_idx
419
+
420
+ # Pre-Norm: Attention ์ „ ์ •๊ทœํ™”
421
+ self.attn_norm = RMSNorm(config.hidden_dim, eps=config.norm_eps)
422
+ # Self-Attention
423
+ self.attention = GroupedQueryAttention(config)
424
+
425
+ # Pre-Norm: FFN ์ „ ์ •๊ทœํ™”
426
+ self.ffn_norm = RMSNorm(config.hidden_dim, eps=config.norm_eps)
427
+ # Feed-Forward Network
428
+ self.feed_forward = SwiGLUFeedForward(config)
429
+
430
+ def forward(
431
+ self,
432
+ x: torch.Tensor,
433
+ mask: Optional[torch.Tensor] = None,
434
+ position_offset: int = 0,
435
+ ) -> torch.Tensor:
436
+ """
437
+ Args:
438
+ x: (batch_size, seq_len, hidden_dim)
439
+ Returns:
440
+ (batch_size, seq_len, hidden_dim)
441
+ """
442
+ # โ”€โ”€ Attention sublayer with residual โ”€โ”€
443
+ # h = x + Attention(RMSNorm(x))
444
+ h = x + self.attention(self.attn_norm(x), mask, position_offset)
445
+
446
+ # โ”€โ”€ FFN sublayer with residual โ”€โ”€
447
+ # out = h + FFN(RMSNorm(h))
448
+ out = h + self.feed_forward(self.ffn_norm(h))
449
+
450
+ return out
451
+
452
+
453
+ # ============================================================================
454
+ # 7. Full Transformer Model (LLaMA-style)
455
+ # ============================================================================
456
+
457
+ class LLMModel(nn.Module):
458
+ """1B ํŒŒ๋ผ๋ฏธํ„ฐ LLaMA-style Decoder-Only Transformer.
459
+
460
+ ์ „์ฒด ๊ตฌ์กฐ:
461
+ Input Token IDs
462
+ โ†’ Token Embedding
463
+ โ†’ [TransformerBlock] ร— num_layers (+ Activation Checkpointing)
464
+ โ†’ RMSNorm (์ตœ์ข…)
465
+ โ†’ Linear Head (โ†’ vocab logits)
466
+
467
+ Weight Tying:
468
+ - The input Embedding and the output Linear head share one weight matrix
469
+ - Saves parameters (~65M) with equal or better quality
470
+ - Intuition: "representing a word's meaning" and "predicting a word" live in the same space
471
+ """
472
+
473
+ def __init__(self, config: ModelConfig):
474
+ super().__init__()
475
+ self.config = config
476
+
477
+ # โ”€โ”€ Token Embedding โ”€โ”€
478
+ self.token_embedding = nn.Embedding(config.vocab_size, config.hidden_dim)
479
+
480
+ # โ”€โ”€ Transformer Blocks โ”€โ”€
481
+ self.layers = nn.ModuleList([
482
+ TransformerBlock(config, layer_idx=i)
483
+ for i in range(config.num_layers)
484
+ ])
485
+
486
+ # โ”€โ”€ ์ตœ์ข… ์ •๊ทœํ™” โ”€โ”€
487
+ self.final_norm = RMSNorm(config.hidden_dim, eps=config.norm_eps)
488
+
489
+ # โ”€โ”€ ์ถœ๋ ฅ ํ—ค๋“œ (Weight Tying) โ”€โ”€
490
+ self.lm_head = nn.Linear(config.hidden_dim, config.vocab_size, bias=False)
491
+ # Weight Tying: lm_head์˜ ๊ฐ€์ค‘์น˜ = token_embedding์˜ ๊ฐ€์ค‘์น˜
492
+ self.lm_head.weight = self.token_embedding.weight
493
+
494
+ # ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™”
495
+ self._init_weights()
496
+
497
+ def _init_weights(self):
498
+ """๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™” ์ „๋žต.
499
+
500
+ ์™œ ์ดˆ๊ธฐํ™”๊ฐ€ ์ค‘์š”ํ•œ๊ฐ€?
501
+ - ๋„ˆ๋ฌด ํฌ๋ฉด: ํ™œ์„ฑํ™” ํญ๋ฐœ โ†’ NaN
502
+ - ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด: gradient ์†Œ๋ฉธ โ†’ ํ•™์Šต ์ •์ฒด
503
+ - ์ ์ ˆํ•œ ์ดˆ๊ธฐํ™”: ๊ฐ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ ๋ถ„์‚ฐ์„ ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€
504
+
505
+ GPT-2 ์Šคํƒ€์ผ ์ดˆ๊ธฐํ™”:
506
+ - ์ผ๋ฐ˜ Linear: N(0, 0.02)
507
+ - Residual projection: N(0, 0.02 / โˆš(2 ร— num_layers))
508
+ โ†’ ๋ ˆ์ด์–ด๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก residual ๊ธฐ์—ฌ๋ฅผ ์ค„์—ฌ ์•ˆ์ •ํ™”
509
+ """
510
+ std = 0.02
511
+ residual_std = std / math.sqrt(2 * self.config.num_layers)
512
+
513
+ for module in self.modules():
514
+ if isinstance(module, nn.Linear):
515
+ nn.init.normal_(module.weight, mean=0.0, std=std)
516
+ if module.bias is not None:
517
+ nn.init.zeros_(module.bias)
518
+ elif isinstance(module, nn.Embedding):
519
+ nn.init.normal_(module.weight, mean=0.0, std=std)
520
+
521
+ # Residual projection ๋ ˆ์ด์–ด์— ์ถ•์†Œ๋œ ์ดˆ๊ธฐํ™” ์ ์šฉ
522
+ for layer in self.layers:
523
+ nn.init.normal_(layer.attention.o_proj.weight, mean=0.0, std=residual_std)
524
+ nn.init.normal_(layer.feed_forward.down_proj.weight, mean=0.0, std=residual_std)
525
+
526
+ def forward(
527
+ self,
528
+ input_ids: torch.Tensor,
529
+ targets: Optional[torch.Tensor] = None,
530
+ position_offset: int = 0,
531
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
532
+ """
533
+ Args:
534
+ input_ids: (batch_size, seq_len) - ํ† ํฐ ID
535
+ targets: (batch_size, seq_len) - ์ •๋‹ต ํ† ํฐ ID (ํ•™์Šต ์‹œ)
536
+ position_offset: ์œ„์น˜ ์˜คํ”„์…‹ (์ถ”๋ก  ์‹œ)
537
+
538
+ Returns:
539
+ logits: (batch_size, seq_len, vocab_size)
540
+ loss: ์Šค์นผ๋ผ (targets ์ œ๊ณต ์‹œ) ๋˜๋Š” None
541
+ """
542
+ B, S = input_ids.shape
543
+
544
+ # โ”€โ”€ Step 1: Token Embedding โ”€โ”€
545
+ # ๊ฐ ํ† ํฐ ID๋ฅผ hidden_dim ์ฐจ์›์˜ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜
546
+ h = self.token_embedding(input_ids) # (B, S, hidden_dim)
547
+
548
+ # โ”€โ”€ Step 2: Transformer Blocks โ”€โ”€
549
+ # Activation Checkpointing: ํ•™์Šต ์‹œ ๋ฉ”๋ชจ๋ฆฌ ์ ˆ์•ฝ
550
+ # (์ค‘๊ฐ„ ํ™œ์„ฑํ™”๋ฅผ ์ €์žฅํ•˜์ง€ ์•Š๊ณ , backward ์‹œ ์žฌ๊ณ„์‚ฐ)
551
+ for layer in self.layers:
552
+ if self.training and torch.is_grad_enabled():
553
+ # Activation Checkpointing ์ ์šฉ
554
+ h = torch.utils.checkpoint.checkpoint(
555
+ layer, h, None, position_offset,
556
+ use_reentrant=False, # PyTorch >= 2.0 ๊ถŒ์žฅ
557
+ )
558
+ else:
559
+ h = layer(h, mask=None, position_offset=position_offset)
560
+
561
+ # โ”€โ”€ Step 3: ์ตœ์ข… ์ •๊ทœํ™” โ”€โ”€
562
+ h = self.final_norm(h)
563
+
564
+ # โ”€โ”€ Step 4: ์ถœ๋ ฅ ๋กœ์ง“ ๊ณ„์‚ฐ โ”€โ”€
565
+ logits = self.lm_head(h) # (B, S, vocab_size)
566
+
567
+ # โ”€โ”€ Step 5: Loss ๊ณ„์‚ฐ (ํ•™์Šต ์‹œ) โ”€โ”€
568
+ loss = None
569
+ if targets is not None:
570
+ # Cross-Entropy Loss: ๋‹ค์Œ ํ† ํฐ ์˜ˆ์ธก
571
+ # logits: (B, S, V) โ†’ (B*S, V)
572
+ # targets: (B, S) โ†’ (B*S,)
573
+ loss = F.cross_entropy(
574
+ logits.view(-1, self.config.vocab_size),
575
+ targets.view(-1),
576
+ ignore_index=-100, # ํŒจ๋”ฉ ํ† ํฐ ๋ฌด์‹œ
577
+ )
578
+
579
+ return logits, loss
580
+
581
+ def count_parameters(self, trainable_only: bool = True) -> int:
582
+ """๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ๊ณ„์‚ฐ."""
583
+ if trainable_only:
584
+ return sum(p.numel() for p in self.parameters() if p.requires_grad)
585
+ return sum(p.numel() for p in self.parameters())
586
+
587
+ @torch.no_grad()
588
+ def generate(
589
+ self,
590
+ input_ids: torch.Tensor,
591
+ max_new_tokens: int = 100,
592
+ temperature: float = 1.0,
593
+ top_k: int = 50,
594
+ top_p: float = 0.9,
595
+ ) -> torch.Tensor:
596
+ """ํ…์ŠคํŠธ ์ƒ์„ฑ (์ถ”๋ก ).
597
+
598
+ Autoregressive ์ƒ์„ฑ: ํ•œ ํ† ํฐ์”ฉ ์˜ˆ์ธกํ•˜์—ฌ ์ด์–ด๋ถ™์ด๊ธฐ.
599
+
600
+ Args:
601
+ input_ids: (1, prompt_len) - ์ดˆ๊ธฐ ํ”„๋กฌํ”„ํŠธ
602
+ max_new_tokens: ์ƒ์„ฑํ•  ์ตœ๋Œ€ ํ† ํฐ ์ˆ˜
603
+ temperature: ํ™•๋ฅ  ๋ถ„ํฌ ๋‚ ์นด๋กœ์›€ ์กฐ์ ˆ (๋‚ฎ์„์ˆ˜๋ก ๋ณด์ˆ˜์ )
604
+ top_k: ํ™•๋ฅ  ์ƒ์œ„ k๊ฐœ๋งŒ ๊ณ ๋ ค
605
+ top_p: ๋ˆ„์  ํ™•๋ฅ  p๊นŒ์ง€๋งŒ ๊ณ ๋ ค (nucleus sampling)
606
+ """
607
+ self.eval()
608
+ generated = input_ids
609
+
610
+ for _ in range(max_new_tokens):
611
+ # ํ˜„์žฌ ์‹œํ€€์Šค๊ฐ€ max_seq_len์„ ์ดˆ๊ณผํ•˜๋ฉด ์ž˜๋ผ๋‚ด๊ธฐ
612
+ ctx = generated[:, -self.config.max_seq_len:]
613
+
614
+ # Forward pass
615
+ logits, _ = self(ctx)
616
+ # ๋งˆ์ง€๋ง‰ ํ† ํฐ์˜ logits๋งŒ ์‚ฌ์šฉ (๋‹ค์Œ ํ† ํฐ ์˜ˆ์ธก)
617
+ next_logits = logits[:, -1, :] / temperature
618
+
619
+ # โ”€โ”€ Top-K ํ•„ํ„ฐ๋ง โ”€โ”€
620
+ if top_k > 0:
621
+ top_k_values, _ = torch.topk(next_logits, min(top_k, next_logits.size(-1)))
622
+ min_top_k = top_k_values[:, -1].unsqueeze(-1)
623
+ next_logits = next_logits.masked_fill(next_logits < min_top_k, float("-inf"))
624
+
625
+ # โ”€โ”€ Top-P (Nucleus) ํ•„ํ„ฐ๋ง โ”€โ”€
626
+ if top_p < 1.0:
627
+ sorted_logits, sorted_indices = torch.sort(next_logits, descending=True)
628
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
629
+ # ๋ˆ„์  ํ™•๋ฅ ์ด top_p๋ฅผ ์ดˆ๊ณผํ•˜๋Š” ํ† ํฐ ์ œ๊ฑฐ
630
+ remove_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) >= top_p
631
+ sorted_logits[remove_mask] = float("-inf")
632
+ # ์›๋ž˜ ์ˆœ์„œ๋กœ ๋ณต์›
633
+ next_logits = sorted_logits.scatter(1, sorted_indices, sorted_logits)
634
+
635
+ # ํ™•๋ฅ  ๋ถ„ํฌ์—์„œ ์ƒ˜ํ”Œ๋ง
636
+ probs = F.softmax(next_logits, dim=-1)
637
+ next_token = torch.multinomial(probs, num_samples=1) # (B, 1)
638
+
639
+ # ์ƒ์„ฑ๋œ ํ† ํฐ ์ด์–ด๋ถ™์ด๊ธฐ
640
+ generated = torch.cat([generated, next_token], dim=1)
641
+
642
+ return generated
643
+
644
+
645
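Weight tying is easy to break silently (for example, by assigning a fresh tensor to `lm_head.weight` when loading weights by hand), so an identity check is worth knowing. A minimal sketch using the debug config:

```python
cfg = ModelConfig.debug_10m()
model = LLMModel(cfg)

# Tied weights are the *same* tensor object, not a copy:
assert model.lm_head.weight is model.token_embedding.weight
# ... which is also why count_parameters() counts the shared matrix once.
print(f"{model.count_parameters():,}")
```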
+ # ============================================================================
646
+ # 8. ์œ ํ‹ธ๋ฆฌํ‹ฐ ํ•จ์ˆ˜
647
+ # ============================================================================
648
+
649
+ def count_parameters_detailed(model: LLMModel) -> dict:
650
+ """๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ปดํฌ๋„ŒํŠธ๋ณ„๋กœ ์ƒ์„ธ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค."""
651
+ total = 0
652
+ breakdown = {}
653
+
654
+ # Embedding
655
+ emb_params = model.token_embedding.weight.numel()
656
+ breakdown["token_embedding"] = emb_params
657
+ total += emb_params
658
+
659
+ # ๊ฐ ๋ ˆ์ด์–ด
660
+ layer_total = 0
661
+ layer_detail = {}
662
+ layer = model.layers[0]
663
+
664
+ for name, param in layer.named_parameters():
665
+ layer_detail[name] = param.numel()
666
+ layer_total += param.numel()
667
+
668
+ breakdown["per_layer"] = layer_detail
669
+ breakdown["per_layer_total"] = layer_total
670
+ breakdown["all_layers_total"] = layer_total * len(model.layers)
671
+ total += layer_total * len(model.layers)
672
+
673
+ # Final norm
674
+ norm_params = model.final_norm.weight.numel()
675
+ breakdown["final_norm"] = norm_params
676
+ total += norm_params
677
+
678
+ # LM head (weight tying์ด๋ฏ€๋กœ ์‹ค์ œ ์ถ”๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ 0)
679
+ breakdown["lm_head"] = "weight tying (0 additional)"
680
+ breakdown["total"] = total
681
+
682
+ return breakdown
683
+
684
+
685
+ def estimate_memory_gb(config: ModelConfig, batch_size: int = 4, dtype_bytes: int = 2) -> dict:
686
+ """๋ชจ๋ธ์˜ GPU ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค.
687
+
688
+ Args:
689
+ dtype_bytes: 2 (bf16/fp16) ๋˜๋Š” 4 (fp32)
690
+ """
691
+ # ๋Œ€๋žต์ ์ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ๊ณ„์‚ฐ
692
+ emb = config.vocab_size * config.hidden_dim
693
+ per_layer = (
694
+ config.hidden_dim * (config.num_heads + 2 * config.num_kv_heads) * config.head_dim # QKV
695
+ + config.num_heads * config.head_dim * config.hidden_dim # O proj
696
+ + 3 * config.hidden_dim * config.intermediate_dim # SwiGLU (gate + up + down)
697
+ + 2 * config.hidden_dim # 2 ร— RMSNorm
698
+ )
699
+ total_params = emb + per_layer * config.num_layers + config.hidden_dim
700
+
701
+ model_gb = total_params * dtype_bytes / 1e9
702
+ optimizer_gb = total_params * 8 / 1e9 # AdamW: 2 states ร— fp32
703
+ gradient_gb = total_params * dtype_bytes / 1e9
704
+
705
+ # ํ™œ์„ฑํ™” ๋ฉ”๋ชจ๋ฆฌ (activation checkpointing ์ ์šฉ ๊ฐ€์ •)
706
+ # ๋Œ€๋žต์  ์ถ”์ •: batch_size ร— seq_len ร— hidden_dim ร— num_layers ร— factor
707
+ activation_gb = (
708
+ batch_size * config.max_seq_len * config.hidden_dim * 4 # ๋ฐ”์ดํŠธ
709
+ * math.sqrt(config.num_layers) # checkpointing ํšจ๊ณผ
710
+ / 1e9
711
+ )
712
+
713
+ return {
714
+ "total_parameters": total_params,
715
+ "model_weights_gb": round(model_gb, 2),
716
+ "optimizer_states_gb": round(optimizer_gb, 2),
717
+ "gradients_gb": round(gradient_gb, 2),
718
+ "activations_estimated_gb": round(activation_gb, 2),
719
+ "total_estimated_gb": round(model_gb + optimizer_gb + gradient_gb + activation_gb, 2),
720
+ }
721
+
722
+
723
+ # ============================================================================
724
+ # 9. ๊ฒ€์ฆ ์Šคํฌ๋ฆฝํŠธ (์‹คํ–‰ ์‹œ)
725
+ # ============================================================================
726
+
727
+ if __name__ == "__main__":
728
+ print("=" * 70)
729
+ print("LLM-1B-Lab: ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜ ๊ฒ€์ฆ")
730
+ print("=" * 70)
731
+
732
+ # โ”€โ”€ ๋””๋ฒ„๊ทธ ๋ชจ๋ธ (10M) ํ…Œ์ŠคํŠธ โ”€โ”€
733
+ print("\n[1] Debug Model (~10M params)")
734
+ cfg_debug = ModelConfig.debug_10m()
735
+ model_debug = LLMModel(cfg_debug)
736
+ n_params = model_debug.count_parameters()
737
+ print(f" ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜: {n_params:,} ({n_params / 1e6:.1f}M)")
738
+
739
+ # Forward pass ํ…Œ์ŠคํŠธ
740
+ dummy_input = torch.randint(0, cfg_debug.vocab_size, (2, 64))
741
+ dummy_target = torch.randint(0, cfg_debug.vocab_size, (2, 64))
742
+ logits, loss = model_debug(dummy_input, dummy_target)
743
+ print(f" Input shape: {dummy_input.shape}")
744
+ print(f" Logits shape: {logits.shape}")
745
+ print(f" Loss: {loss.item():.4f}")
746
+ # ์ดˆ๊ธฐ loss โ‰ˆ ln(vocab_size) โ‰ˆ ln(32000) โ‰ˆ 10.37 ์ด๋ฉด ์ •์ƒ
747
+ expected_loss = math.log(cfg_debug.vocab_size)
748
+ print(f" Expected initial loss โ‰ˆ ln({cfg_debug.vocab_size}) = {expected_loss:.2f}")
749
+
750
+ # โ”€โ”€ 1B ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ํ™•์ธ โ”€โ”€
751
+ print("\n[2] Base Model (~1B params) โ€” ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋งŒ ํ™•์ธ")
752
+ cfg_1b = ModelConfig.base_1b()
753
+
754
+ # ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ meta device์—์„œ ์ƒ์„ฑ
755
+ with torch.device("meta"):
756
+ model_1b = LLMModel(cfg_1b)
757
+ n_params_1b = model_1b.count_parameters()
758
+ print(f" ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜: {n_params_1b:,} ({n_params_1b / 1e6:.1f}M โ‰ˆ {n_params_1b / 1e9:.2f}B)")
759
+
760
+ # ์ƒ์„ธ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„ํ•ด
761
+ print("\n[3] ํŒŒ๋ผ๋ฏธํ„ฐ ์ƒ์„ธ ๋ถ„ํ•ด (1B)")
762
+ detail = count_parameters_detailed(model_1b)
763
+ print(f" Token Embedding: {detail['token_embedding']:,}")
764
+ print(f" Per Layer Total: {detail['per_layer_total']:,}")
765
+ print(f" All Layers ({cfg_1b.num_layers}): {detail['all_layers_total']:,}")
766
+ print(f" Final Norm: {detail['final_norm']:,}")
767
+ print(f" LM Head: {detail['lm_head']}")
768
+ print(f" โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€")
769
+ print(f" TOTAL: {detail['total']:,}")
770
+
771
+ # ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ •
772
+ print("\n[4] GPU ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ • (A100 40GB, bf16, batch_size=4)")
773
+ mem = estimate_memory_gb(cfg_1b, batch_size=4, dtype_bytes=2)
774
+ print(f" ๋ชจ๋ธ ๊ฐ€์ค‘์น˜: {mem['model_weights_gb']} GB")
775
+ print(f" ์˜ตํ‹ฐ๋งˆ์ด์ €: {mem['optimizer_states_gb']} GB")
776
+ print(f" ๊ธฐ์šธ๊ธฐ: {mem['gradients_gb']} GB")
777
+ print(f" ํ™œ์„ฑํ™” (์ถ”์ •): {mem['activations_estimated_gb']} GB")
778
+ print(f" โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€")
779
+ print(f" ์ด ์ถ”์ •: {mem['total_estimated_gb']} GB")
780
+
781
+ # ํ…์ŠคํŠธ ์ƒ์„ฑ ํ…Œ์ŠคํŠธ (๋””๋ฒ„๊ทธ ๋ชจ๋ธ)
782
+ print("\n[5] ํ…์ŠคํŠธ ์ƒ์„ฑ ํ…Œ์ŠคํŠธ (10M debug model, ๋žœ๋ค ๊ฐ€์ค‘์น˜)")
783
+ prompt = torch.randint(0, cfg_debug.vocab_size, (1, 10))
784
+ generated = model_debug.generate(prompt, max_new_tokens=20, temperature=1.0, top_k=50)
785
+ print(f" Prompt length: {prompt.shape[1]}")
786
+ print(f" Generated length: {generated.shape[1]}")
787
+ print(f" Generated token IDs: {generated[0].tolist()}")
788
+
789
+ print("\n" + "=" * 70)
790
+ print("โœ… ๋ชจ๋“  ๊ฒ€์ฆ ํ†ต๊ณผ!")
791
+ print("=" * 70)
_archive/llm-1b-trainer.py ADDED
@@ -0,0 +1,1108 @@
1
+ """
2
+ LLM-1B-Lab: ํ•™์Šต ๋ฃจํ”„ (Training Loop)
3
+ ========================================
4
+ Gradient Accumulation, Mixed Precision, LR Scheduling,
5
+ ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ/๋ณต์›, wandb ๋กœ๊น…์„ ํฌํ•จํ•œ ์™„์ „ํ•œ ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ.
6
+
7
+ ์ „์ฒด ํ๋ฆ„:
8
+ ๋ฐฐ์น˜ ๊ฐ€์ ธ์˜ค๊ธฐ
9
+ โ†’ Forward (bf16 autocast)
10
+ โ†’ Loss / accumulation_steps (๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํ‰๊ท )
11
+ โ†’ Backward (gradient ๋ˆ„์ )
12
+ โ†’ [accumulation_steps๋งˆ๋‹ค]
13
+ โ†’ Gradient Clipping
14
+ โ†’ Optimizer Step
15
+ โ†’ LR Scheduler Step
16
+ โ†’ Logging
17
+ โ†’ [checkpoint_interval๋งˆ๋‹ค]
18
+ โ†’ ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ (Google Drive)
19
+ โ†’ [eval_interval๋งˆ๋‹ค]
20
+ โ†’ ๊ฒ€์ฆ Loss/Perplexity ์ธก์ •
21
+
22
+ ์„ค์น˜ ํ•„์š”:
23
+ pip install wandb torch
24
+ """
25
+
26
+ import os
27
+ import math
28
+ import time
29
+ import json
30
+ import shutil
31
+ from pathlib import Path
32
+ from dataclasses import dataclass, field
33
+ from typing import Optional, Dict, Any, Tuple
34
+
35
+ import torch
36
+ import torch.nn as nn
37
+ from torch.utils.data import DataLoader
38
+
39
+
40
+ # ============================================================================
41
+ # 1. ํ•™์Šต ์„ค์ •
42
+ # ============================================================================
43
+
44
+ @dataclass
45
+ class TrainConfig:
46
+ """ํ•™์Šต ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ + ์ธํ”„๋ผ ์„ค์ •.
47
+
48
+ Colab Pro+ (A100 40GB) ๊ธฐ์ค€ ์ตœ์ ํ™”๋œ ๊ธฐ๋ณธ๊ฐ’.
49
+ ๋ชจ๋“  ๊ฐ’์— '์™œ ์ด ๊ฐ’์ธ์ง€' ์„ค๋ช…์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
50
+ """
51
+
52
+ # โ”€โ”€ ์ตœ์ ํ™” โ”€โ”€
53
+ learning_rate: float = 3e-4
54
+ """Peak LR. 1B ๋ชจ๋ธ ๊ธฐ์ค€ 3e-4๊ฐ€ ํ‘œ์ค€.
55
+ GPT-3 ๋…ผ๋ฌธ์—์„œ ๋ชจ๋ธ ํฌ๊ธฐ๋ณ„ ์ตœ์  LR์„ ์ œ์‹œ:
56
+ 125M โ†’ 6e-4, 350M โ†’ 3e-4, 1.3B โ†’ 2e-4
57
+ ์šฐ๋ฆฌ ๋ชจ๋ธ(1.1B)์€ 3e-4์—์„œ ์‹œ์ž‘, ๋ถˆ์•ˆ์ •ํ•˜๋ฉด 2e-4๋กœ ํ•˜ํ–ฅ."""
58
+
59
+ min_learning_rate: float = 3e-5
60
+ """Cosine decay ์ตœ์ €์ . ๋ณดํ†ต peak์˜ 10%.
61
+ ๋„ˆ๋ฌด ๋‚ฎ์œผ๋ฉด ํ•™์Šต ํ›„๋ฐ˜ ์ •์ฒด, ๋„ˆ๋ฌด ๋†’์œผ๋ฉด ์ˆ˜๋ ด ๋ถˆ์•ˆ์ •."""
62
+
63
+ weight_decay: float = 0.1
64
+ """AdamW์˜ L2 ์ •๊ทœํ™”. 0.1์ด LLM ํ‘œ์ค€.
65
+ Embedding๊ณผ Bias์—๋Š” ์ ์šฉํ•˜์ง€ ์•Š์Œ (๊ด€๋ก€)."""
66
+
67
+ beta1: float = 0.9
68
+ beta2: float = 0.95
69
+ """Adam ๋ชจ๋ฉ˜ํ…€ ๊ณ„์ˆ˜. ฮฒ2=0.95๋Š” LLM ํ•™์Šต์—์„œ ฮฒ2=0.999๋ณด๋‹ค ์•ˆ์ •์ .
70
+ ํฐ ๋ฐฐ์น˜ + ๊ธด ํ•™์Šต์—์„œ ฮฒ2๊ฐ€ ๋„ˆ๋ฌด ํฌ๋ฉด ์ ์‘ ์†๋„๊ฐ€ ๋А๋ฆผ."""
71
+
72
+ adam_eps: float = 1e-8
73
+ grad_clip: float = 1.0
74
+ """Gradient Clipping: gradient norm์ด 1.0์„ ์ดˆ๊ณผํ•˜๋ฉด ์Šค์ผ€์ผ๋ง.
75
+ ํ•™์Šต ์ดˆ๋ฐ˜์ด๋‚˜ ๋…ธ์ด์ฆˆ ๋ฐ์ดํ„ฐ์—์„œ ๋ฐœ์ƒํ•˜๋Š” gradient spike ๋ฐฉ์ง€."""
76
+
77
+ # โ”€โ”€ ์Šค์ผ€์ค„๋ง โ”€โ”€
78
+ warmup_steps: int = 2000
79
+ """Warmup: ์ฒ˜์Œ 2000 ์Šคํ… ๋™์•ˆ LR์„ 0 โ†’ peak๋กœ ์„ ํ˜• ์ฆ๊ฐ€.
80
+ ์™œ ํ•„์š”ํ•œ๊ฐ€?
81
+ - ์ดˆ๊ธฐ ๊ฐ€์ค‘์น˜๊ฐ€ ๋žœ๋ค โ†’ ํฐ LR์€ ๋ถˆ์•ˆ์ •ํ•œ ์—…๋ฐ์ดํŠธ ์œ ๋ฐœ
82
+ - ์ž‘์€ LR๋กœ ์‹œ์ž‘ํ•ด ๋ชจ๋ธ์ด '๋ฐฉํ–ฅ'์„ ์žก๊ฒŒ ํ•œ ํ›„ ๋ณธ๊ฒฉ ํ•™์Šต
83
+ - 2000์€ ์ „์ฒด ํ•™์Šต์˜ ~10%๊ฐ€ ์ ๋‹น (๊ฒฝํ—˜์  ๊ทœ์น™)."""
84
+
85
+ total_steps: int = 20_000
86
+ """์ด ํ•™์Šต ์Šคํ… ์ˆ˜.
87
+ 10B tokens / (128 batch ร— 2048 seq_len) โ‰ˆ 38,000 ์ด์ง€๋งŒ,
88
+ gradient accumulation ํฌํ•จ effective step ๊ธฐ์ค€ ~20,000."""
89
+
90
+ # โ”€โ”€ ๋ฐฐ์น˜ โ”€โ”€
91
+ micro_batch_size: int = 4
92
+ """GPU์— ํ•œ ๋ฒˆ์— ์˜ฌ๋ฆฌ๋Š” ๋ฐฐ์น˜ ํฌ๊ธฐ.
93
+ A100 40GB์—์„œ 1B ๋ชจ๋ธ bf16 ๊ธฐ์ค€ 4๊ฐ€ ์•ˆ์ „ํ•œ ์ƒํ•œ."""
94
+
95
+ gradient_accumulation_steps: int = 32
96
+ """Gradient ๋ˆ„์  ํšŸ์ˆ˜. Effective batch = 4 ร— 32 = 128.
97
+ ์™œ ํฐ ๋ฐฐ์น˜๊ฐ€ ์ข‹์€๊ฐ€?
98
+ - gradient ์ถ”์ •์ด ์•ˆ์ •์  (๋…ธ์ด์ฆˆ ๊ฐ์†Œ)
99
+ - LLM ํ•™์Šต์€ ๋ณดํ†ต effective batch 128~512
100
+ - ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ ์‹œ ์ด ๊ฐ’์„ ๋Š˜๋ฆฌ๊ณ  micro_batch๋ฅผ ์ค„์ž„."""
101
+
102
+ # โ”€โ”€ Mixed Precision โ”€โ”€
103
+ dtype: str = "bfloat16"
104
+ """bfloat16: A100์—์„œ ์ง€์›, fp16๋ณด๋‹ค ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์šฐ์ˆ˜.
105
+ exponent ๋น„ํŠธ๊ฐ€ fp32์™€ ๋™์ผ โ†’ overflow/underflow ์œ„ํ—˜ ์ ์Œ.
106
+ T4/V100 ํด๋ฐฑ ์‹œ 'float16'์œผ๋กœ ๋ณ€๊ฒฝ."""
107
+
108
+ # โ”€โ”€ ์ฒดํฌํฌ์ธํŠธ โ”€โ”€
109
+ checkpoint_dir: str = "/content/drive/MyDrive/llm-1b-lab/checkpoints"
110
+ """Google Drive ๊ฒฝ๋กœ. Colab ์„ธ์…˜ ๋งŒ๋ฃŒ ์‹œ์—๋„ ๋ณด์กด๋จ."""
111
+
112
+ checkpoint_interval: int = 500
113
+ """500 ์Šคํ…๋งˆ๋‹ค ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ.
114
+ A100 ๊ธฐ์ค€ ~30๋ถ„ ๊ฐ„๊ฒฉ. ๋„ˆ๋ฌด ์žฆ์œผ๋ฉด I/O ์˜ค๋ฒ„ํ—ค๋“œ,
115
+ ๋„ˆ๋ฌด ๋“œ๋ฌผ๋ฉด ์„ธ์…˜ ๋งŒ๋ฃŒ ์‹œ ์†์‹ค ํผ."""
116
+
117
+ max_checkpoints: int = 3
118
+ """๋กค๋ง ๋ณด๊ด€ ์ˆ˜. ์˜ค๋ž˜๋œ ๊ฒƒ๋ถ€ํ„ฐ ์‚ญ์ œ.
119
+ ์ฒดํฌํฌ์ธํŠธ 1๊ฐœ โ‰ˆ 8-10GB โ†’ 3๊ฐœ๋ฉด ~30GB."""
120
+
121
+ # โ”€โ”€ ๋กœ๊น… โ”€โ”€
122
+ log_interval: int = 10
123
+ """10 ์Šคํ…๋งˆ๋‹ค ์ฝ˜์†” + wandb ๋กœ๊น…."""
124
+
125
+ eval_interval: int = 500
126
+ """500 ์Šคํ…๋งˆ๋‹ค ๊ฒ€์ฆ Loss ์ธก์ •."""
127
+
128
+ eval_steps: int = 20
129
+ """๊ฒ€์ฆ ์‹œ ์‚ฌ์šฉํ•  ๋ฐฐ์น˜ ์ˆ˜. 20 ร— 4 ร— 2048 โ‰ˆ 160K ํ† ํฐ."""
130
+
131
+ # โ”€โ”€ wandb โ”€โ”€
132
+ wandb_project: str = "llm-1b-lab"
133
+ wandb_run_name: Optional[str] = None
134
+ use_wandb: bool = True
135
+
136
+ # โ”€โ”€ ์žฌํ˜„์„ฑ โ”€โ”€
137
+ seed: int = 42
138
+
139
+ @property
140
+ def effective_batch_size(self) -> int:
141
+ return self.micro_batch_size * self.gradient_accumulation_steps
142
+
143
+ @property
144
+ def tokens_per_step(self) -> int:
145
+ """ํ•œ optimizer step๋‹น ์ฒ˜๋ฆฌ ํ† ํฐ ์ˆ˜."""
146
+ # max_seq_len์€ ์™ธ๋ถ€์—์„œ ์ฃผ์ž… (ModelConfig ์ฐธ์กฐ)
147
+ return self.effective_batch_size * 2048
148
+
149
+ @property
150
+ def torch_dtype(self) -> torch.dtype:
151
+ return {"bfloat16": torch.bfloat16, "float16": torch.float16, "float32": torch.float32}[self.dtype]
152
+
153
+
154
+ # ============================================================================
155
+ # 2. ํ•™์Šต๋ฅ  ์Šค์ผ€์ค„๋Ÿฌ (Cosine with Warmup)
156
+ # ============================================================================
157
+
158
+ class CosineWarmupScheduler:
159
+ """Cosine Annealing with Linear Warmup.
160
+
161
+ LR ๊ณก์„ :
162
+ โ”Œโ”€โ”€โ”€ peak_lr โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฒ
163
+ โ”‚ โ•ฒ cosine decay
164
+ โ”‚ warmup (linear) โ•ฒ
165
+ โ”‚/ โ•ฒ_______ min_lr
166
+ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ steps
167
+
168
+ ์™œ Cosine Decay์ธ๊ฐ€?
169
+ - Step decay: ๊ฐ‘์ž‘์Šค๋Ÿฌ์šด LR ํ•˜๋ฝ โ†’ Loss ๋ถˆ์•ˆ์ •
170
+ - Linear decay: ํ›„๋ฐ˜๋ถ€ LR์ด ๋„ˆ๋ฌด ๋นจ๋ฆฌ ๊ฐ์†Œ
171
+ - Cosine: ๋ถ€๋“œ๋Ÿฌ์šด ๊ฐ์†Œ, ํ•™์Šต ํ›„๋ฐ˜์—๋„ ์ ์ ˆํ•œ LR ์œ ์ง€
172
+ - GPT-3, LLaMA, Chinchilla ๋“ฑ ๋Œ€๋ถ€๋ถ„์˜ LLM์ด ์‚ฌ์šฉ
173
+
174
+ ๊ตฌํ˜„ ์ฐธ๊ณ :
175
+ PyTorch ๋‚ด์žฅ ์Šค์ผ€์ค„๋Ÿฌ(CosineAnnealingLR ๋“ฑ)๋„ ์žˆ์ง€๋งŒ,
176
+ warmup + min_lr + ์ฒดํฌํฌ์ธํŠธ ๋ณต์›์„ ์œ„ํ•ด ์ง์ ‘ ๊ตฌํ˜„์ด ๋” ์œ ์—ฐํ•ฉ๋‹ˆ๋‹ค.
177
+ """
178
+
179
+ def __init__(self, config: TrainConfig):
180
+ self.peak_lr = config.learning_rate
181
+ self.min_lr = config.min_learning_rate
182
+ self.warmup_steps = config.warmup_steps
183
+ self.total_steps = config.total_steps
184
+
185
+ def get_lr(self, step: int) -> float:
186
+ """ํ˜„์žฌ step์— ํ•ด๋‹นํ•˜๋Š” ํ•™์Šต๋ฅ ์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
187
+
188
+ Args:
189
+ step: ํ˜„์žฌ optimizer step (0-indexed)
190
+
191
+ Returns:
192
+ ํ•™์Šต๋ฅ  (float)
193
+ """
194
+ # Phase 1: Linear Warmup
195
+ if step < self.warmup_steps:
196
+ # 0 โ†’ peak_lr ์„ ํ˜• ์ฆ๊ฐ€
197
+ return self.peak_lr * (step / self.warmup_steps)
198
+
199
+ # Phase 2: Cosine Decay
200
+ # warmup ์ดํ›„ ๋‚จ์€ ์ง„ํ–‰๋ฅ  (0.0 โ†’ 1.0)
201
+ decay_steps = self.total_steps - self.warmup_steps
202
+ progress = (step - self.warmup_steps) / max(decay_steps, 1)
203
+ progress = min(progress, 1.0) # ์•ˆ์ „์žฅ์น˜
204
+
205
+ # Cosine ๊ณต์‹: min_lr + 0.5 ร— (peak - min) ร— (1 + cos(ฯ€ ร— progress))
206
+ cosine_decay = 0.5 * (1.0 + math.cos(math.pi * progress))
207
+ lr = self.min_lr + (self.peak_lr - self.min_lr) * cosine_decay
208
+
209
+ return lr
210
+
211
+ def set_lr(self, optimizer: torch.optim.Optimizer, step: int):
212
+ """Optimizer์˜ ํ•™์Šต๋ฅ ์„ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค."""
213
+ lr = self.get_lr(step)
214
+ for param_group in optimizer.param_groups:
215
+ param_group["lr"] = lr
216
+ return lr
217
+
218
+
219
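Sampling the schedule at a few steps is the quickest way to confirm the warmup/decay shape drawn above; the expected values below follow from the formula with the default config:

```python
cfg = TrainConfig()
sched = CosineWarmupScheduler(cfg)
for step in (0, 1000, 2000, 10000, 20000):
    print(f"step {step:>6d}: lr = {sched.get_lr(step):.2e}")
# step      0: lr = 0.00e+00  (warmup starts at zero)
# step   1000: lr = 1.50e-04  (halfway through warmup)
# step   2000: lr = 3.00e-04  (peak)
# step  10000: lr = 1.88e-04  (cosine decay in progress)
# step  20000: lr = 3.00e-05  (floor = min_learning_rate)
```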
+ # ============================================================================
220
+ # 3. ์ฒดํฌํฌ์ธํŠธ ๊ด€๋ฆฌ
221
+ # ============================================================================
222
+
223
+ class CheckpointManager:
224
+ """ํ•™์Šต ์ƒํƒœ ์ €์žฅ/๋ณต์› ๊ด€๋ฆฌ์ž.
225
+
226
+ Colab์—์„œ ์ฒดํฌํฌ์ธํŠธ๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ :
227
+ - ์„ธ์…˜ ๋งŒ๋ฃŒ (์ตœ๋Œ€ ~24์‹œ๊ฐ„) ์‹œ ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ์ƒํƒœ ์†Œ๋ฉธ
228
+ - Google Drive์— ์ €์žฅํ•˜๋ฉด ์„ธ์…˜ ๊ฐ„ ์—ฐ์† ํ•™์Šต ๊ฐ€๋Šฅ
229
+ - ์˜ตํ‹ฐ๋งˆ์ด์ € ์ƒํƒœ๊นŒ์ง€ ์ €์žฅํ•ด์•ผ AdamW ๋ชจ๋ฉ˜ํ…€์ด ์œ ์ง€๋จ
230
+
231
+ ์ €์žฅ ๋‚ด์šฉ:
232
+ - model_state_dict: ๋ชจ๋ธ ๊ฐ€์ค‘์น˜
233
+ - optimizer_state_dict: ์˜ตํ‹ฐ๋งˆ์ด์ € ์ƒํƒœ (m, v ๋ชจ๋ฉ˜ํ…€)
234
+ - step: ํ˜„์žฌ ํ•™์Šต ์Šคํ…
235
+ - best_val_loss: ์ตœ์ € ๊ฒ€์ฆ Loss
236
+ - config: ํ•™์Šต ์„ค์ • (์žฌํ˜„์„ฑ)
237
+ - rng_states: ๋žœ๋ค ์‹œ๋“œ ์ƒํƒœ (์™„์ „ ์žฌํ˜„)
238
+ - metrics_history: ํ•™์Šต ๋ฉ”ํŠธ๋ฆญ ๊ธฐ๋ก
239
+ - wandb_run_id: wandb ์‹คํ–‰ ID (๋กœ๊น… ์—ฐ์†์„ฑ)
240
+ """
241
+
242
+ def __init__(self, config: TrainConfig):
243
+ self.config = config
244
+ self.checkpoint_dir = Path(config.checkpoint_dir)
245
+ self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
246
+ self.max_checkpoints = config.max_checkpoints
247
+
248
+ def save(
249
+ self,
250
+ model: nn.Module,
251
+ optimizer: torch.optim.Optimizer,
252
+ step: int,
253
+ best_val_loss: float,
254
+ metrics_history: Dict[str, list],
255
+ wandb_run_id: Optional[str] = None,
256
+ ):
257
+ """์ฒดํฌํฌ์ธํŠธ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค."""
258
+ ckpt_path = self.checkpoint_dir / f"step_{step:06d}"
259
+ ckpt_path.mkdir(parents=True, exist_ok=True)
260
+
261
+ print(f"\n๐Ÿ’พ ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ: {ckpt_path}")
262
+ start = time.time()
263
+
264
+ # 1) ๋ชจ๋ธ ๊ฐ€์ค‘์น˜ (bf16 ์ƒํƒœ ๊ทธ๋Œ€๋กœ)
265
+ torch.save(model.state_dict(), ckpt_path / "model.pt")
266
+
267
+ # 2) ์˜ตํ‹ฐ๋งˆ์ด์ € ์ƒํƒœ (fp32 ๋ชจ๋ฉ˜ํ…€ ํฌํ•จ, ํฌ๊ธฐ ํผ)
268
+ torch.save(optimizer.state_dict(), ckpt_path / "optimizer.pt")
269
+
270
+ # 3) ํ•™์Šต ๋ฉ”ํƒ€ ์ •๋ณด
271
+ meta = {
272
+ "step": step,
273
+ "best_val_loss": best_val_loss,
274
+ "wandb_run_id": wandb_run_id,
275
+ "config": self.config.__dict__,
276
+ }
277
+ with open(ckpt_path / "meta.json", "w") as f:
278
+ json.dump(meta, f, indent=2)
279
+
280
+ # 4) ๋ฉ”ํŠธ๋ฆญ ๊ธฐ๋ก
281
+ torch.save(metrics_history, ckpt_path / "metrics.pt")
282
+
283
+ # 5) ๋žœ๋ค ์ƒํƒœ (์™„์ „ ์žฌํ˜„์„ ์œ„ํ•ด)
284
+ rng_states = {
285
+ "python": torch.random.get_rng_state(),
286
+ "cuda": torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
287
+ }
288
+ torch.save(rng_states, ckpt_path / "rng_states.pt")
289
+
290
+ elapsed = time.time() - start
291
+ ckpt_size = sum(f.stat().st_size for f in ckpt_path.rglob("*")) / 1e9
292
+ print(f" ์ €์žฅ ์™„๋ฃŒ: {ckpt_size:.2f} GB, {elapsed:.1f}์ดˆ")
293
+
294
+ # ์˜ค๋ž˜๋œ ์ฒดํฌํฌ์ธํŠธ ์‚ญ์ œ (๋กค๋ง)
295
+ self._cleanup_old_checkpoints()
296
+
297
+ def load_latest(
298
+ self,
299
+ model: nn.Module,
300
+ optimizer: Optional[torch.optim.Optimizer] = None,
301
+ device: torch.device = torch.device("cpu"),
302
+ ) -> Optional[Dict[str, Any]]:
303
+ """๊ฐ€์žฅ ์ตœ๊ทผ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค.
304
+
305
+ Returns:
306
+ {"step", "best_val_loss", "wandb_run_id", "metrics_history"}
307
+ ๋˜๋Š” ์ฒดํฌํฌ์ธํŠธ๊ฐ€ ์—†์œผ๋ฉด None
308
+ """
309
+ ckpt_path = self._find_latest()
310
+ if ckpt_path is None:
311
+ print("[Checkpoint] ์ €์žฅ๋œ ์ฒดํฌํฌ์ธํŠธ ์—†์Œ. ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.")
312
+ return None
313
+
314
+ print(f"\n๐Ÿ“‚ ์ฒดํฌํฌ์ธํŠธ ๋กœ๋“œ: {ckpt_path}")
315
+ start = time.time()
316
+
317
+ # 1) ๋ชจ๋ธ ๊ฐ€์ค‘์น˜
318
+ model_state = torch.load(ckpt_path / "model.pt", map_location=device, weights_only=True)
319
+ model.load_state_dict(model_state)
320
+ del model_state # ๋ฉ”๋ชจ๋ฆฌ ํ•ด์ œ
321
+
322
+ # 2) ์˜ตํ‹ฐ๋งˆ์ด์ € ์ƒํƒœ
323
+ if optimizer is not None:
324
+ optim_state = torch.load(ckpt_path / "optimizer.pt", map_location=device, weights_only=True)
325
+ optimizer.load_state_dict(optim_state)
326
+ del optim_state
327
+
328
+ # 3) ๋ฉ”ํƒ€ ์ •๋ณด
329
+ with open(ckpt_path / "meta.json", "r") as f:
330
+ meta = json.load(f)
331
+
332
+ # 4) ๋ฉ”ํŠธ๋ฆญ ๊ธฐ๋ก
333
+ metrics_history = {}
334
+ metrics_path = ckpt_path / "metrics.pt"
335
+ if metrics_path.exists():
336
+ metrics_history = torch.load(metrics_path, weights_only=False)
337
+
338
+ # 5) ๋žœ๋ค ์ƒํƒœ ๋ณต์›
339
+ rng_path = ckpt_path / "rng_states.pt"
340
+ if rng_path.exists():
341
+ rng_states = torch.load(rng_path, weights_only=False)
342
+ torch.random.set_rng_state(rng_states["torch_cpu"])
343
+ if rng_states["cuda"] is not None and torch.cuda.is_available():
344
+ torch.cuda.set_rng_state(rng_states["cuda"])
345
+
346
+ elapsed = time.time() - start
347
+ print(f" ๋กœ๋“œ ์™„๋ฃŒ: step={meta['step']}, {elapsed:.1f}์ดˆ")
348
+
349
+ return {
350
+ "step": meta["step"],
351
+ "best_val_loss": meta["best_val_loss"],
352
+ "wandb_run_id": meta.get("wandb_run_id"),
353
+ "metrics_history": metrics_history,
354
+ }
355
+
356
+ def _find_latest(self) -> Optional[Path]:
357
+ """๊ฐ€์žฅ ์ตœ๊ทผ ์ฒดํฌํฌ์ธํŠธ ๊ฒฝ๋กœ๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค."""
358
+ ckpts = sorted(self.checkpoint_dir.glob("step_*"))
359
+ return ckpts[-1] if ckpts else None
360
+
361
+ def _cleanup_old_checkpoints(self):
362
+ """์˜ค๋ž˜๋œ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค (๋กค๋ง)."""
363
+ ckpts = sorted(self.checkpoint_dir.glob("step_*"))
364
+ while len(ckpts) > self.max_checkpoints:
365
+ old = ckpts.pop(0)
366
+ print(f" ๐Ÿ—‘๏ธ ์˜ค๋ž˜๋œ ์ฒดํฌํฌ์ธํŠธ ์‚ญ์ œ: {old.name}")
367
+ shutil.rmtree(old)
368
+
369
+
370
+ # ============================================================================
371
+ # 4. ๋ฉ”ํŠธ๋ฆญ ์ถ”์ ๊ธฐ
372
+ # ============================================================================
373
+
374
+ class MetricsTracker:
375
+ """ํ•™์Šต ๋ฉ”ํŠธ๋ฆญ์„ ์ถ”์ ํ•˜๊ณ  ๋กœ๊น…ํ•ฉ๋‹ˆ๋‹ค.
376
+
377
+ ์ถ”์  ํ•ญ๋ชฉ:
378
+ - train/loss: ํ•™์Šต Loss (Cross-Entropy)
379
+ - train/lr: ํ˜„์žฌ ํ•™์Šต๋ฅ 
380
+ - train/grad_norm: Gradient L2 Norm
381
+ - train/tokens_per_sec: ์ฒ˜๋ฆฌ๋Ÿ‰
382
+ - train/gpu_mem_gb: GPU ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰
383
+ - val/loss: ๊ฒ€์ฆ Loss
384
+ - val/perplexity: ๊ฒ€์ฆ Perplexity (= exp(loss))
385
+ """
386
+
387
+ def __init__(self, config: TrainConfig):
388
+ self.config = config
389
+ self.history: Dict[str, list] = {
390
+ "step": [],
391
+ "train_loss": [],
392
+ "learning_rate": [],
393
+ "grad_norm": [],
394
+ "tokens_per_sec": [],
395
+ "gpu_mem_gb": [],
396
+ "val_loss": [],
397
+ "val_ppl": [],
398
+ }
399
+
400
+ # wandb ์ดˆ๊ธฐํ™”
401
+ self.wandb_run = None
402
+ if config.use_wandb:
403
+ self._init_wandb()
404
+
405
+ def _init_wandb(self, resume_id: Optional[str] = None):
406
+ """wandb ์ดˆ๊ธฐํ™” (์„ธ์…˜ ๊ฐ„ ์—ฐ์† ๋กœ๊น… ์ง€์›)."""
407
+ try:
408
+ import wandb
409
+
410
+ run_id = resume_id or wandb.util.generate_id()
411
+ self.wandb_run = wandb.init(
412
+ project=self.config.wandb_project,
413
+ name=self.config.wandb_run_name or f"1b-run-{run_id[:6]}",
414
+ id=run_id,
415
+ resume="allow",
416
+ config=self.config.__dict__,
417
+ )
418
+ print(f"[wandb] ์ดˆ๊ธฐํ™” ์™„๋ฃŒ: {self.wandb_run.url}")
419
+ except ImportError:
420
+ print("[wandb] ์„ค์น˜๋˜์ง€ ์•Š์Œ. ์ฝ˜์†” ๋กœ๊น…๋งŒ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.")
421
+ self.config.use_wandb = False
422
+ except Exception as e:
423
+ print(f"[wandb] ์ดˆ๊ธฐํ™” ์‹คํŒจ: {e}. ์ฝ˜์†” ๋กœ๊น…๋งŒ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.")
424
+ self.config.use_wandb = False
425
+
426
+ def resume_wandb(self, run_id: str):
427
+ """์ด์ „ wandb ์‹คํ–‰์„ ์ด์–ด์„œ ๋กœ๊น…ํ•ฉ๋‹ˆ๋‹ค."""
428
+ if self.config.use_wandb:
429
+ self._init_wandb(resume_id=run_id)
430
+
431
+ def log_train_step(
432
+ self,
433
+ step: int,
434
+ loss: float,
435
+ lr: float,
436
+ grad_norm: float,
437
+ tokens_per_sec: float,
438
+ gpu_mem_gb: float,
439
+ ):
440
+ """ํ•™์Šต ์Šคํ… ๋ฉ”ํŠธ๋ฆญ์„ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค."""
441
+ self.history["step"].append(step)
442
+ self.history["train_loss"].append(loss)
443
+ self.history["learning_rate"].append(lr)
444
+ self.history["grad_norm"].append(grad_norm)
445
+ self.history["tokens_per_sec"].append(tokens_per_sec)
446
+ self.history["gpu_mem_gb"].append(gpu_mem_gb)
447
+
448
+ if self.config.use_wandb and self.wandb_run:
449
+ import wandb
450
+
451
+ wandb.log({
452
+ "train/loss": loss,
453
+ "train/lr": lr,
454
+ "train/grad_norm": grad_norm,
455
+ "train/tokens_per_sec": tokens_per_sec,
456
+ "train/gpu_mem_gb": gpu_mem_gb,
457
+ }, step=step)
458
+
459
+ def log_eval(self, step: int, val_loss: float, val_ppl: float):
460
+ """๊ฒ€์ฆ ๋ฉ”ํŠธ๋ฆญ์„ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค."""
461
+ self.history["val_loss"].append(val_loss)
462
+ self.history["val_ppl"].append(val_ppl)
463
+
464
+ if self.config.use_wandb and self.wandb_run:
465
+ import wandb
466
+
467
+ wandb.log({
468
+ "val/loss": val_loss,
469
+ "val/perplexity": val_ppl,
470
+ }, step=step)
471
+
472
+ @property
473
+ def wandb_run_id(self) -> Optional[str]:
474
+ if self.wandb_run:
475
+ return self.wandb_run.id
476
+ return None
477
+
478
+
479
+ # ============================================================================
480
+ # 5. Optimizer ์ƒ์„ฑ (AdamW with weight decay ๋ถ„๋ฆฌ)
481
+ # ============================================================================
482
+
483
+ def create_optimizer(model: nn.Module, config: TrainConfig) -> torch.optim.AdamW:
484
+ """AdamW ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
485
+
486
+ Weight Decay ๋ถ„๋ฆฌ ๊ทœ์น™:
487
+ - Decay ์ ์šฉ: Linear ๊ฐ€์ค‘์น˜ (attention proj, FFN ๋“ฑ)
488
+ - Decay ๋ฏธ์ ์šฉ: Embedding, LayerNorm/RMSNorm, Bias
489
+
490
+ ์™œ ๋ถ„๋ฆฌํ•˜๋Š”๊ฐ€?
491
+ - Weight Decay๋Š” ํฐ ๊ฐ€์ค‘์น˜์— ํŒจ๋„ํ‹ฐ๋ฅผ ์ฃผ์–ด ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€
492
+ - ํ•˜์ง€๋งŒ Norm์˜ scale ํŒŒ๋ผ๋ฏธํ„ฐ์— ์ ์šฉํ•˜๋ฉด ์ •๊ทœํ™” ํšจ๊ณผ๋ฅผ ๋ฐฉํ•ด
493
+ - Embedding์— ์ ์šฉํ•˜๋ฉด ํฌ๊ท€ ํ† ํฐ์˜ ํ‘œํ˜„์ด 0์œผ๋กœ ์ˆ˜์ถ•
494
+ - 1D ํŒŒ๋ผ๋ฏธํ„ฐ(bias, norm weight)๋Š” decay์—์„œ ์ œ์™ธํ•˜๋Š” ๊ฒƒ์ด ๊ด€๋ก€
495
+ """
496
+ # ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ decay/no-decay ๊ทธ๋ฃน์œผ๋กœ ๋ถ„๋ฆฌ
497
+ decay_params = []
498
+ no_decay_params = []
499
+
500
+ for name, param in model.named_parameters():
501
+ if not param.requires_grad:
502
+ continue
503
+
504
+ # 1D ํ…์„œ(bias, norm weight) ๋˜๋Š” embedding โ†’ no decay
505
+ if param.dim() <= 1 or "embedding" in name:
506
+ no_decay_params.append(param)
507
+ else:
508
+ decay_params.append(param)
509
+
510
+ param_groups = [
511
+ {"params": decay_params, "weight_decay": config.weight_decay},
512
+ {"params": no_decay_params, "weight_decay": 0.0},
513
+ ]
514
+
515
+ n_decay = sum(p.numel() for p in decay_params)
516
+ n_no_decay = sum(p.numel() for p in no_decay_params)
517
+ print(f"[Optimizer] Decay ํŒŒ๋ผ๋ฏธํ„ฐ: {n_decay:,} ({n_decay/1e6:.1f}M)")
518
+ print(f"[Optimizer] No-decay ํŒŒ๋ผ๋ฏธํ„ฐ: {n_no_decay:,} ({n_no_decay/1e6:.1f}M)")
519
+
520
+ optimizer = torch.optim.AdamW(
521
+ param_groups,
522
+ lr=config.learning_rate,
523
+ betas=(config.beta1, config.beta2),
524
+ eps=config.adam_eps,
525
+ fused=torch.cuda.is_available(), # CUDA fused AdamW (๋” ๋น ๋ฆ„)
526
+ )
527
+
528
+ return optimizer
529
+
530
+
531
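The 1-D rule used above is easy to verify on a toy module before trusting it on the full model. A standalone sketch (plain PyTorch, no project code needed):

```python
import torch.nn as nn

toy = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
decay = [n for n, p in toy.named_parameters() if p.dim() > 1]
no_decay = [n for n, p in toy.named_parameters() if p.dim() <= 1]
print(decay)     # ['0.weight']                      -> weight decay applied
print(no_decay)  # ['0.bias', '1.weight', '1.bias']  -> excluded (1-D params)
```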
+ # ============================================================================
532
+ # 6. Trainer (ํ•ต์‹ฌ ํ•™์Šต ๋ฃจํ”„)
533
+ # ============================================================================
534
+
535
+ class Trainer:
536
+ """LLM ์‚ฌ์ „ํ•™์Šต ํŠธ๋ ˆ์ด๋„ˆ.
537
+
538
+ ํ•™์Šต ๋ฃจํ”„์˜ ํ•ต์‹ฌ ๊ตฌ์กฐ:
539
+ ```
540
+ for step in range(total_steps):
541
+ # โ”€โ”€ Gradient Accumulation Loop โ”€โ”€
542
+ for micro_step in range(accumulation_steps):
543
+ batch = next(dataloader)
544
+ with autocast(bf16):
545
+ logits, loss = model(input_ids, targets)
546
+ scaled_loss = loss / accumulation_steps
547
+ scaled_loss.backward() # gradient ๋ˆ„์ 
548
+
549
+ # โ”€โ”€ Optimizer Step (accumulation ์™„๋ฃŒ ํ›„) โ”€โ”€
550
+ clip_grad_norm(model, max_norm=1.0)
551
+ optimizer.step()
552
+ optimizer.zero_grad()
553
+ scheduler.set_lr(optimizer, step)
554
+ ```
555
+
556
+ Gradient Accumulation์ด๋ž€?
557
+ - GPU ๋ฉ”๋ชจ๋ฆฌ์— ํฐ ๋ฐฐ์น˜๋ฅผ ํ•œ ๋ฒˆ์— ์˜ฌ๋ฆด ์ˆ˜ ์—†์„ ๋•Œ
558
+ - ์ž‘์€ micro_batch๋กœ ์—ฌ๋Ÿฌ ๋ฒˆ forward/backward โ†’ gradient๋ฅผ ๋ˆ„์ 
559
+ - ๋ˆ„์  ํ›„ ํ•œ ๋ฒˆ์— optimizer step
560
+ - ๊ฒฐ๊ณผ์ ์œผ๋กœ ํฐ effective_batch์™€ ๋™์ผํ•œ ํšจ๊ณผ
561
+ - Loss๋ฅผ accumulation_steps๋กœ ๋‚˜๋ˆ„๋Š” ์ด์œ :
562
+ gradient์˜ ํ‰๊ท ์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด (ํ•ฉ์ด ์•„๋‹Œ ํ‰๊ท )
563
+ """
564
+
565
+ def __init__(
566
+ self,
567
+ model: nn.Module,
568
+ train_dataloader: DataLoader,
569
+ val_dataloader: Optional[DataLoader],
570
+ config: TrainConfig,
571
+ seq_len: int = 2048,
572
+ ):
573
+ self.config = config
574
+ self.seq_len = seq_len
575
+
576
+ # โ”€โ”€ ๋””๋ฐ”์ด์Šค ์„ค์ • โ”€โ”€
577
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
578
+ print(f"[Trainer] ๋””๋ฐ”์ด์Šค: {self.device}")
579
+ if torch.cuda.is_available():
580
+ print(f"[Trainer] GPU: {torch.cuda.get_device_name()}")
581
+ print(f"[Trainer] GPU ๋ฉ”๋ชจ๋ฆฌ: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
582
+
583
+ # โ”€โ”€ ๋ชจ๋ธ โ”€โ”€
584
+ self.model = model.to(self.device)
585
+ # torch.compile: PyTorch 2.0+ ๊ทธ๋ž˜ํ”„ ์ตœ์ ํ™” (์†๋„ 10-30% ํ–ฅ์ƒ)
586
+ if torch.cuda.is_available() and hasattr(torch, "compile"):
587
+ print("[Trainer] torch.compile ์ ์šฉ ์ค‘...")
588
+ self.model = torch.compile(self.model)
589
+
590
+ # โ”€โ”€ ๋ฐ์ดํ„ฐ โ”€โ”€
591
+ self.train_dataloader = train_dataloader
592
+ self.val_dataloader = val_dataloader
593
+ self.train_iter = iter(train_dataloader)
594
+
595
+ # โ”€โ”€ ์˜ตํ‹ฐ๋งˆ์ด์ € โ”€โ”€
596
+ self.optimizer = create_optimizer(self.model, config)
597
+
598
+ # โ”€โ”€ ์Šค์ผ€์ค„๋Ÿฌ โ”€โ”€
599
+ self.scheduler = CosineWarmupScheduler(config)
600
+
601
+ # โ”€โ”€ ์ฒดํฌํฌ์ธํŠธ โ”€โ”€
602
+ self.ckpt_manager = CheckpointManager(config)
603
+
604
+ # โ”€โ”€ ๋ฉ”ํŠธ๋ฆญ โ”€โ”€
605
+ self.metrics = MetricsTracker(config)
606
+
607
+ # โ”€โ”€ ํ•™์Šต ์ƒํƒœ โ”€โ”€
608
+ self.global_step = 0
609
+ self.best_val_loss = float("inf")
610
+ self.tokens_seen = 0
611
+
612
+ # โ”€โ”€ Mixed Precision โ”€โ”€
613
+ # bf16์€ GradScaler๊ฐ€ ๋ถˆํ•„์š” (fp16์ผ ๋•Œ๋งŒ ํ•„์š”)
614
+ self.use_amp = config.dtype != "float32"
615
+ self.amp_dtype = config.torch_dtype
616
+
617
+ # โ”€โ”€ ์ž๋™ ๋ณต์› ์‹œ๋„ โ”€โ”€
618
+ self._try_resume()
619
+
620
+ def _try_resume(self):
621
+ """์ด์ „ ์ฒดํฌํฌ์ธํŠธ๊ฐ€ ์žˆ์œผ๋ฉด ์ž๋™์œผ๋กœ ๋ณต์›ํ•ฉ๋‹ˆ๋‹ค."""
622
+ result = self.ckpt_manager.load_latest(
623
+ self.model, self.optimizer, self.device
624
+ )
625
+
626
+ if result is not None:
627
+ self.global_step = result["step"]
628
+ self.best_val_loss = result["best_val_loss"]
629
+ self.metrics.history = result.get("metrics_history", self.metrics.history)
630
+
631
+ # wandb ์—ฐ์† ๋กœ๊น…
632
+ if result.get("wandb_run_id"):
633
+ self.metrics.resume_wandb(result["wandb_run_id"])
634
+
635
+ self.tokens_seen = self.global_step * self.config.effective_batch_size * self.seq_len
636
+ print(f"[Trainer] ํ•™์Šต ์žฌ๊ฐœ: step={self.global_step}, "
637
+ f"tokens={self.tokens_seen/1e9:.2f}B, "
638
+ f"best_val_loss={self.best_val_loss:.4f}")
639
+
640
+ def _get_next_batch(self) -> Dict[str, torch.Tensor]:
641
+ """๋‹ค์Œ ํ•™์Šต ๋ฐฐ์น˜๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.
642
+
643
+ Streaming DataLoader๋Š” ์—ํญ ๊ฐœ๋…์ด ์—†์œผ๋ฏ€๋กœ,
644
+ StopIteration ์‹œ ์ƒˆ ์ดํ„ฐ๋ ˆ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
645
+ """
646
+ try:
647
+ batch = next(self.train_iter)
648
+ except StopIteration:
649
+ self.train_iter = iter(self.train_dataloader)
650
+ batch = next(self.train_iter)
651
+
652
+ return {
653
+ "input_ids": batch["input_ids"].to(self.device, non_blocking=True),
654
+ "targets": batch["targets"].to(self.device, non_blocking=True),
655
+ }
656
+
657
+ def _train_step(self) -> Tuple[float, float]:
658
+ """ํ•˜๋‚˜์˜ optimizer step์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
659
+
660
+ Returns:
661
+ (loss, grad_norm)
662
+ """
663
+ self.model.train()
664
+ self.optimizer.zero_grad(set_to_none=True)
665
+ # set_to_none=True: gradient๋ฅผ None์œผ๋กœ ์„ค์ • โ†’ ๋ฉ”๋ชจ๋ฆฌ ์ ˆ์•ฝ
666
+
667
+ total_loss = 0.0
668
+
669
+ # โ”€โ”€ Gradient Accumulation Loop โ”€โ”€
670
+ for micro_step in range(self.config.gradient_accumulation_steps):
671
+ batch = self._get_next_batch()
672
+
673
+ # Mixed Precision Forward
674
+ with torch.amp.autocast(device_type="cuda", dtype=self.amp_dtype, enabled=self.use_amp):
675
+ logits, loss = self.model(batch["input_ids"], batch["targets"])
676
+
677
+ # Loss ์Šค์ผ€์ผ๋ง: effective batch์˜ ํ‰๊ท ์„ ์œ„ํ•ด
678
+ scaled_loss = loss / self.config.gradient_accumulation_steps
679
+ total_loss += loss.item()
680
+
681
+ # Backward (gradient ๋ˆ„์ )
682
+ scaled_loss.backward()
683
+
684
+ # โ”€โ”€ Gradient Clipping โ”€โ”€
685
+ # ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ์˜ gradient๋ฅผ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ๋กœ ๋ณด๊ณ  L2 norm ๊ณ„์‚ฐ
686
+ # norm์ด max_norm์„ ์ดˆ๊ณผํ•˜๋ฉด ๋น„๋ก€์ ์œผ๋กœ ์Šค์ผ€์ผ ๋‹ค์šด
687
+ grad_norm = torch.nn.utils.clip_grad_norm_(
688
+ self.model.parameters(),
689
+ max_norm=self.config.grad_clip,
690
+ ).item()
691
+
692
+ # โ”€โ”€ LR update โ”€โ”€ (set BEFORE optimizer.step; otherwise the first step
693
+ # would run at the peak LR from optimizer creation, not the warmup value)
694
+ self.scheduler.set_lr(self.optimizer, self.global_step)
695
+ # โ”€โ”€ Optimizer Step โ”€โ”€
696
+ self.optimizer.step()
697
+
698
+ avg_loss = total_loss / self.config.gradient_accumulation_steps
699
+ return avg_loss, grad_norm
700
+
701
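Why dividing each micro-loss by accumulation_steps yields the *mean* gradient can be seen on a two-micro-batch toy example (standalone sketch, independent of the Trainer):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
micro_batches = [torch.tensor([2.0]), torch.tensor([4.0])]

for x in micro_batches:                      # two forward/backward passes
    loss = (w * x).sum()                     # d(loss)/dw = x
    (loss / len(micro_batches)).backward()   # scale by 1/N; gradients accumulate

print(w.grad)  # tensor([3.]) = mean of the per-micro-batch gradients (2 and 4)
```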
+ @torch.no_grad()
702
+ def _evaluate(self) -> Tuple[float, float]:
703
+ """๊ฒ€์ฆ ๋ฐ์ดํ„ฐ์—์„œ Loss์™€ Perplexity๋ฅผ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค.
704
+
705
+ Perplexity = exp(loss)
706
+ - ์ง๊ด€: "๋ชจ๋ธ์ด ๋‹ค์Œ ํ† ํฐ์„ ํ‰๊ท  ๋ช‡ ๊ฐœ์˜ ํ›„๋ณด ์ค‘์—์„œ ๊ณ ๋ฅด๋Š”๊ฐ€"
707
+ - PPL 100 โ†’ 100๊ฐœ ์ค‘ 1๊ฐœ๋ฅผ ๊ท ์ผํ•˜๊ฒŒ ๊ณ ๋ฅด๋Š” ์ˆ˜์ค€
708
+ - PPL 20 โ†’ 20๊ฐœ ์ค‘ 1๊ฐœ ์ˆ˜์ค€ (๊ฝค ์ข‹์Œ)
709
+ - PPL 10 โ†’ ๋งค์šฐ ์ž์‹ ์žˆ๊ฒŒ ์˜ˆ์ธก
710
+ """
711
+ if self.val_dataloader is None:
712
+ return float("inf"), float("inf")
713
+
714
+ self.model.eval()
715
+ total_loss = 0.0
716
+ num_batches = 0
717
+
718
+ for i, batch in enumerate(self.val_dataloader):
719
+ if i >= self.config.eval_steps:
720
+ break
721
+
722
+ input_ids = batch["input_ids"].to(self.device)
723
+ targets = batch["targets"].to(self.device)
724
+
725
+ with torch.amp.autocast(device_type="cuda", dtype=self.amp_dtype, enabled=self.use_amp):
726
+ _, loss = self.model(input_ids, targets)
727
+
728
+ total_loss += loss.item()
729
+ num_batches += 1
730
+
731
+ avg_loss = total_loss / max(num_batches, 1)
732
+ perplexity = math.exp(min(avg_loss, 20)) # overflow ๋ฐฉ์ง€ (exp(20) โ‰ˆ 5์–ต)
733
+
734
+ return avg_loss, perplexity
735
+
736
+ def train(self):
737
+ """๋ฉ”์ธ ํ•™์Šต ๋ฃจํ”„.
738
+
739
+ ์ด ๋ฉ”์„œ๋“œ๊ฐ€ ์ „์ฒด ํ•™์Šต์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.
740
+ Colab ์„ธ์…˜ ๋งŒ๋ฃŒ ์‹œ ์ค‘๋‹จ๋˜์–ด๋„ ์ฒดํฌํฌ์ธํŠธ์—์„œ ์ž๋™ ์žฌ๊ฐœ๋ฉ๋‹ˆ๋‹ค.
741
+ """
742
+ config = self.config
743
+
744
+ print("\n" + "=" * 70)
745
+ print("๐Ÿš€ ํ•™์Šต ์‹œ์ž‘")
746
+ print("=" * 70)
747
+ print(f" ์ด ์Šคํ…: {config.total_steps:,}")
748
+ print(f" ์‹œ์ž‘ ์Šคํ…: {self.global_step}")
749
+ print(f" Effective batch size: {config.effective_batch_size}")
750
+ print(f" ํ† ํฐ/์Šคํ…: {config.effective_batch_size * self.seq_len:,}")
751
+ print(f" ์ด ํ•™์Šต ํ† ํฐ (์˜ˆ์ƒ): {config.total_steps * config.effective_batch_size * self.seq_len / 1e9:.1f}B")
752
+ print(f" Mixed Precision: {config.dtype}")
753
+ print(f" Gradient Accumulation: {config.gradient_accumulation_steps}")
754
+ print(f" ์ฒดํฌํฌ์ธํŠธ: {config.checkpoint_dir}")
755
+ print("=" * 70 + "\n")
756
+
757
+ step_start_time = time.time()
758
+ tokens_at_log_start = self.tokens_seen
759
+
760
+ # โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
761
+ # ๋ฉ”์ธ ๋ฃจํ”„
762
+ # โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
763
+
764
+ while self.global_step < config.total_steps:
765
+
766
+ # โ”€โ”€ Train Step โ”€โ”€
767
+ loss, grad_norm = self._train_step()
768
+ self.global_step += 1
769
+ self.tokens_seen += config.effective_batch_size * self.seq_len
770
+
771
+ # โ”€โ”€ Logging โ”€โ”€
772
+             if self.global_step % config.log_interval == 0:
+                 elapsed = time.time() - step_start_time
+                 tokens_delta = self.tokens_seen - tokens_at_log_start
+                 tokens_per_sec = tokens_delta / max(elapsed, 1e-6)
+
+                 # GPU memory
+                 gpu_mem_gb = 0.0
+                 if torch.cuda.is_available():
+                     gpu_mem_gb = torch.cuda.max_memory_allocated() / 1e9
+
+                 # Current LR
+                 current_lr = self.scheduler.get_lr(self.global_step)
+
+                 # Remaining-time estimate
+                 remaining_steps = config.total_steps - self.global_step
+                 steps_per_sec = config.log_interval / max(elapsed, 1e-6)
+                 eta_seconds = remaining_steps / max(steps_per_sec, 1e-6)
+                 eta_hours = eta_seconds / 3600
+
+                 # Console output
+                 print(
+                     f"  Step {self.global_step:>6d}/{config.total_steps} โ”‚ "
+                     f"Loss {loss:.4f} โ”‚ "
+                     f"LR {current_lr:.2e} โ”‚ "
+                     f"Grad {grad_norm:.2f} โ”‚ "
+                     f"{tokens_per_sec:,.0f} tok/s โ”‚ "
+                     f"GPU {gpu_mem_gb:.1f}GB โ”‚ "
+                     f"ETA {eta_hours:.1f}h โ”‚ "
+                     f"Tokens {self.tokens_seen/1e9:.2f}B"
+                 )
+
+                 # wandb logging
+                 self.metrics.log_train_step(
+                     step=self.global_step,
+                     loss=loss,
+                     lr=current_lr,
+                     grad_norm=grad_norm,
+                     tokens_per_sec=tokens_per_sec,
+                     gpu_mem_gb=gpu_mem_gb,
+                 )
+
+                 step_start_time = time.time()
+                 tokens_at_log_start = self.tokens_seen
+
+             # โ”€โ”€ Evaluation โ”€โ”€
+             if self.global_step % config.eval_interval == 0:
+                 val_loss, val_ppl = self._evaluate()
+
+                 print(f"\n  ๐Ÿ“Š Eval @ Step {self.global_step}: "
+                       f"Val Loss = {val_loss:.4f}, "
+                       f"Val PPL = {val_ppl:.2f}")
+
+                 self.metrics.log_eval(self.global_step, val_loss, val_ppl)
+
+                 if val_loss < self.best_val_loss:
+                     self.best_val_loss = val_loss
+                     print(f"  ๐Ÿ† New best val loss: {val_loss:.4f}")
+
+                 print()
+
+             # โ”€โ”€ Checkpoint โ”€โ”€
+             if self.global_step % config.checkpoint_interval == 0:
+                 self.ckpt_manager.save(
+                     model=self.model,
+                     optimizer=self.optimizer,
+                     step=self.global_step,
+                     best_val_loss=self.best_val_loss,
+                     metrics_history=self.metrics.history,
+                     wandb_run_id=self.metrics.wandb_run_id,
+                 )
+
+         # โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
+         # Training complete
+         # โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
+
+         print("\n" + "=" * 70)
+         print("๐ŸŽ‰ Training complete!")
+         print("=" * 70)
+         print(f"  Total steps: {self.global_step:,}")
+         print(f"  Total tokens: {self.tokens_seen/1e9:.2f}B")
+         print(f"  Best val loss: {self.best_val_loss:.4f}")
+         print(f"  Best val PPL: {math.exp(min(self.best_val_loss, 20)):.2f}")
+         print("=" * 70)
+
+         # Save the final checkpoint
+         self.ckpt_manager.save(
+             model=self.model,
+             optimizer=self.optimizer,
+             step=self.global_step,
+             best_val_loss=self.best_val_loss,
+             metrics_history=self.metrics.history,
+             wandb_run_id=self.metrics.wandb_run_id,
+         )
+
+         if self.config.use_wandb and self.metrics.wandb_run:
+             import wandb
+             wandb.finish()
+
+
+ # ============================================================================
+ # 7. Auto-detect the GPU environment and adjust the config
+ # ============================================================================
+
+ def auto_configure(config: TrainConfig) -> TrainConfig:
+     """Adjusts the config automatically based on the GPU type.
+
+     Colab Pro+ does not always allocate an A100.
+     When a T4 or V100 is assigned instead, the settings adapt automatically.
+
+     Returns:
+         The adjusted TrainConfig
+     """
+     if not torch.cuda.is_available():
+         print("โš ๏ธ No GPU! Running in CPU mode (very slow)")
+         config.dtype = "float32"
+         config.micro_batch_size = 1
+         config.gradient_accumulation_steps = 4
+         return config
+
+     gpu_name = torch.cuda.get_device_name().lower()
+     gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
+
+     print(f"\n๐Ÿ” GPU detected: {torch.cuda.get_device_name()} ({gpu_mem:.1f} GB)")
+
+     if "a100" in gpu_name:
+         # A100 40GB: keep the defaults (optimal)
+         print("  โ†’ A100 detected: using defaults (bf16, batch=4)")
+         config.dtype = "bfloat16"
+         config.micro_batch_size = 4
+
+     elif "v100" in gpu_name:
+         # V100 16GB: no bf16 support, shrink the batch
+         print("  โ†’ V100 detected: fp16 mode, reduced batch")
+         config.dtype = "float16"
+         config.micro_batch_size = 2
+         config.gradient_accumulation_steps = 64  # keep the effective batch
+
+     elif "t4" in gpu_name:
+         # T4 16GB: no bf16 support, smallest batch
+         print("  โ†’ T4 detected: fp16 mode, minimum batch")
+         config.dtype = "float16"
+         config.micro_batch_size = 1
+         config.gradient_accumulation_steps = 128
+
+     elif "l4" in gpu_name:
+         # L4 24GB: bf16 supported
+         print("  โ†’ L4 detected: bf16 mode, adjusted batch")
+         config.dtype = "bfloat16"
+         config.micro_batch_size = 2
+         config.gradient_accumulation_steps = 64
+
+     else:
+         print("  โ†’ Unknown GPU. Adjusting by memory size")
+         if gpu_mem >= 30:
+             config.micro_batch_size = 4
+         elif gpu_mem >= 16:
+             config.micro_batch_size = 2
+         else:
+             config.micro_batch_size = 1
+             config.gradient_accumulation_steps = 128
+
+     print(f"  โ†’ dtype: {config.dtype}")
+     print(f"  โ†’ micro_batch: {config.micro_batch_size}")
+     print(f"  โ†’ grad_accum: {config.gradient_accumulation_steps}")
+     print(f"  โ†’ effective_batch: {config.effective_batch_size}")
+
+     return config
+
+
+ # ============================================================================
+ # 8. Quick Start (for running in Colab)
+ # ============================================================================
+
+ def start_training(
+     model: nn.Module,
+     train_dataloader: DataLoader,
+     val_dataloader: Optional[DataLoader] = None,
+     config: Optional[TrainConfig] = None,
+     seq_len: int = 2048,
+     auto_config: bool = True,
+ ) -> Trainer:
+     """Starts training (one-line entry point).
+
+     Usage (Colab):
+         ```python
+         from model import LLMModel, ModelConfig
+         from data_pipeline import setup_data_pipeline, DataConfig
+         from trainer import start_training, TrainConfig
+
+         # 1. Create the model
+         model_config = ModelConfig.base_1b()
+         model = LLMModel(model_config)
+
+         # 2. Data pipeline
+         tok, train_dl, val_dl = setup_data_pipeline("pretrained")
+
+         # 3. Start training (checkpoints resume automatically)
+         trainer = start_training(model, train_dl, val_dl)
+         ```
+     """
+     config = config or TrainConfig()
+
+     # Auto-detect the GPU and adjust the config
+     if auto_config:
+         config = auto_configure(config)
+
+     # Check the Google Drive mount (Colab)
+     if "/content/drive" in config.checkpoint_dir:
+         drive_path = Path("/content/drive/MyDrive")
+         if not drive_path.exists():
+             print("\nโš ๏ธ Google Drive is not mounted!")
+             print("   In Colab, run: from google.colab import drive; drive.mount('/content/drive')")
+             print("   Falling back to a local path.")
+             config.checkpoint_dir = "./checkpoints"
+
+     # Seed for reproducibility
+     torch.manual_seed(config.seed)
+     if torch.cuda.is_available():
+         torch.cuda.manual_seed(config.seed)
+
+     # Create the Trainer (includes automatic checkpoint resume)
+     trainer = Trainer(model, train_dataloader, val_dataloader, config, seq_len)
+
+     # Run training
+     trainer.train()
+
+     return trainer
+
+
+ # ============================================================================
+ # 9. Verification script
+ # ============================================================================
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("LLM-1B-Lab: Trainer verification")
+     print("=" * 70)
+
+     # โ”€โ”€ Verify the training loop with a mini model โ”€โ”€
+     print("\n[Test 1] Mini-model training-loop check")
+
+     # A simple dummy model
+     class TinyModel(nn.Module):
+         def __init__(self, vocab_size=100, dim=64):
+             super().__init__()
+             self.emb = nn.Embedding(vocab_size, dim)
+             self.linear = nn.Linear(dim, vocab_size)
+             self.linear.weight = self.emb.weight  # weight tying
+
+         def forward(self, input_ids, targets=None):
+             import torch.nn.functional as F
+
+             h = self.emb(input_ids)
+             logits = self.linear(h)
+             loss = None
+             if targets is not None:
+                 loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+             return logits, loss
+
+         def count_parameters(self, trainable_only=True):
+             return sum(p.numel() for p in self.parameters() if p.requires_grad)
+
+     model = TinyModel()
+     print(f"  Model parameters: {model.count_parameters():,}")
+
+     # Generate dummy data
+     def dummy_dataloader(num_batches=100, batch_size=4, seq_len=32, vocab=100):
+         for _ in range(num_batches):
+             ids = torch.randint(0, vocab, (batch_size, seq_len + 1))
+             yield {
+                 "input_ids": ids[:, :-1],
+                 "targets": ids[:, 1:],
+             }
+
+     # Config (a very short run)
+     config = TrainConfig(
+         total_steps=20,
+         warmup_steps=5,
+         micro_batch_size=4,
+         gradient_accumulation_steps=2,
+         log_interval=5,
+         eval_interval=10,
+         checkpoint_interval=10,
+         checkpoint_dir="./test_checkpoints",
+         use_wandb=False,
+         dtype="float32",  # CPU test
+     )
+
+     # Scheduler test
+     print("\n[Test 2] LR scheduler check")
+     scheduler = CosineWarmupScheduler(config)
+     test_steps = [0, 2, 5, 10, 15, 20]
+     for s in test_steps:
+         lr = scheduler.get_lr(s)
+         phase = "warmup" if s < config.warmup_steps else "cosine"
+         print(f"  Step {s:3d}: LR = {lr:.6f} ({phase})")
+
+     # Optimizer test
+     print("\n[Test 3] Optimizer creation check")
+     optimizer = create_optimizer(model, config)
+     print(f"  Number of parameter groups: {len(optimizer.param_groups)}")
+     for i, pg in enumerate(optimizer.param_groups):
+         n_params = sum(p.numel() for p in pg["params"])
+         print(f"  Group {i}: {n_params:,} params, weight_decay={pg['weight_decay']}")
+
+     # Training-loop test (short version)
+     print("\n[Test 4] Running the training loop (20 steps)")
+     train_dl = list(dummy_dataloader(num_batches=200))
+
+     # Simulate a DataLoader
+     class SimpleLoader:
+         def __init__(self, data):
+             self.data = data
+
+         def __iter__(self):
+             return iter(self.data)
+
+     trainer = Trainer(
+         model=model,
+         train_dataloader=SimpleLoader(train_dl),
+         val_dataloader=SimpleLoader(train_dl[:20]),
+         config=config,
+         seq_len=32,
+     )
+     trainer.train()
+
+     # Cleanup
+     import shutil
+     if os.path.exists("./test_checkpoints"):
+         shutil.rmtree("./test_checkpoints")
+
+     print("\n" + "=" * 70)
+     print("โœ… Trainer verification complete!")
+     print()
+     print("To run a real training job:")
+     print("  trainer = start_training(model, train_dl, val_dl)")
+     print("=" * 70)
llm_lab/__init__.py ADDED
@@ -0,0 +1,30 @@
+ """
+ LLM-1B-Lab: 1B Parameter LLaMA-style Transformer (from scratch)
+ ================================================================
+ An educational implementation for deep-learning beginners.
+ Every component carries detailed comments explaining *why* it is built this way.
+
+ Module layout:
+     llm_lab.config      โ€” all settings (ModelConfig, DataConfig, TrainConfig, EvalConfig)
+     llm_lab.model       โ€” model architecture (RMSNorm, RoPE, GQA, SwiGLU, Transformer)
+     llm_lab.data        โ€” data pipeline (tokenizer, streaming, packing)
+     llm_lab.training    โ€” training loop (Trainer, scheduler, checkpoints)
+     llm_lab.evaluation  โ€” evaluation (perplexity, generation, scaling laws, attention)
+     llm_lab.utils       โ€” shared utilities (device detection, seeding)
+
+ Quick Start:
+     from llm_lab.config import ModelConfig, DataConfig, TrainConfig
+     from llm_lab.model import LLMModel
+     from llm_lab.data import setup_data_pipeline
+     from llm_lab.training import start_training
+     from llm_lab.evaluation import run_evaluation
+ """
+
+ __version__ = "0.1.0"
+
+ from .config import ModelConfig, DataConfig, TrainConfig, EvalConfig
+ from .model import LLMModel
+ from .data import setup_data_pipeline
+ from .training import start_training
+ from .evaluation import run_evaluation
+ from .utils import get_device, auto_configure
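Read end to end, those Quick Start imports wire together as below — a sketch assuming the `debug_10m` preset and the "pretrained" tokenizer path (both defined later in this diff), not a verbatim notebook cell:

```python
from llm_lab.config import DataConfig, ModelConfig, TrainConfig
from llm_lab.model import LLMModel
from llm_lab.data import setup_data_pipeline
from llm_lab.training import start_training

# Tiny preset so the full loop can be smoke-tested before the 1.1B run.
model = LLMModel(ModelConfig.debug_10m())

# Match the data sequence length to the debug model's max_seq_len.
data_cfg = DataConfig(max_seq_len=512)
tok, train_dl, val_dl = setup_data_pipeline("pretrained", config=data_cfg)

config = TrainConfig(total_steps=50, warmup_steps=5, use_wandb=False)
trainer = start_training(model, train_dl, val_dl, config=config, seq_len=512)
```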
llm_lab/config/__init__.py ADDED
@@ -0,0 +1,7 @@
+ """Config module โ€” every hyperparameter is managed in one place."""
+ from .model_config import ModelConfig
+ from .data_config import DataConfig
+ from .train_config import TrainConfig
+ from .eval_config import EvalConfig
+
+ __all__ = ["ModelConfig", "DataConfig", "TrainConfig", "EvalConfig"]
llm_lab/config/data_config.py ADDED
@@ -0,0 +1,41 @@
+ from dataclasses import dataclass
+ from typing import Optional
+
+
+ @dataclass
+ class DataConfig:
+     """Data-pipeline settings.
+
+     Defaults chosen for the constraints of a Colab Pro+ environment:
+     - Streaming mode to minimize disk usage
+     - Sequence packing to maximize GPU utilization with no padding
+     - On-the-fly preprocessing to save memory
+     """
+     # โ”€โ”€ Dataset โ”€โ”€
+     dataset_name: str = "HuggingFaceFW/fineweb-edu"
+     dataset_subset: str = "sample-10BT"  # 10B-token sample
+     dataset_split: str = "train"
+     text_column: str = "text"  # column that holds the text
+
+     # โ”€โ”€ Tokenizer โ”€โ”€
+     tokenizer_type: str = "sentencepiece"  # "sentencepiece" or "hf"
+     # Path to an already-trained tokenizer (a new one is trained if absent)
+     tokenizer_path: Optional[str] = None
+     vocab_size: int = 32_000
+
+     # โ”€โ”€ Sequences โ”€โ”€
+     max_seq_len: int = 2048
+     # Whether to insert a document-separator token (marks boundaries when packing)
+     use_eos_separator: bool = True
+
+     # โ”€โ”€ Batching โ”€โ”€
+     batch_size: int = 4        # micro batch (per GPU)
+     num_workers: int = 2       # number of DataLoader workers
+     prefetch_factor: int = 4   # batches to prepare ahead of time
+
+     # โ”€โ”€ Tokenizer training (when training a new one) โ”€โ”€
+     tokenizer_train_samples: int = 50_000  # number of documents to train on
+     tokenizer_save_dir: str = "./tokenizer"
+
+     # โ”€โ”€ Validation data โ”€โ”€
+     val_ratio: float = 0.001  # hold out 0.1% of the data for validation
llm_lab/config/eval_config.py ADDED
@@ -0,0 +1,20 @@
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class EvalConfig:
+     """Evaluation parameters."""
+     # โ”€โ”€ Perplexity โ”€โ”€
+     eval_batch_size: int = 4
+     max_eval_batches: int = 100  # cap on the number of evaluation batches
+
+     # โ”€โ”€ Generation โ”€โ”€
+     max_new_tokens: int = 200
+     temperature: float = 0.8
+     top_k: int = 50
+     top_p: float = 0.9
+     num_samples: int = 3  # generations per prompt
+
+     # โ”€โ”€ Output โ”€โ”€
+     save_dir: str = "./eval_results"
+     plot_dpi: int = 150
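The generation knobs above (temperature, top_k, top_p) combine in the standard filtering order: sharpen with temperature, keep the k most likely tokens, then truncate the nucleus tail. A minimal sketch of one decoding step under those defaults — the function name and shapes are illustrative, not the repo's generation code:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """One decoding step: temperature -> top-k -> top-p (nucleus) -> sample.
    `logits` is the last-position logit vector of shape (vocab_size,)."""
    logits = logits / temperature                 # sharpen/flatten the distribution
    # Top-k: keep only the k highest logits.
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = F.softmax(topk_vals, dim=-1)
    # Top-p: drop the tail once cumulative probability reaches p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p      # the first token always survives
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return topk_idx[sorted_idx[choice]].item()

# Usage: next_id = sample_next_token(logits[0, -1])
```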
llm_lab/config/model_config.py ADDED
@@ -0,0 +1,53 @@
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class ModelConfig:
+     """Keeps every model hyperparameter in a single dataclass.
+
+     Presets by scale:
+     - debug: ~10M  (pipeline verification)
+     - small: ~100M (intermediate check)
+     - base:  ~1.1B (the final target)
+     """
+     vocab_size: int = 32_000
+     hidden_dim: int = 2048        # d_model: the model's base width
+     num_layers: int = 22          # number of Transformer blocks
+     num_heads: int = 16           # number of query heads
+     num_kv_heads: int = 4         # number of key/value heads (GQA)
+     intermediate_dim: int = 5632  # FFN inner width (โ‰ˆ 2.75 ร— hidden_dim)
+     max_seq_len: int = 2048       # maximum sequence length
+     dropout: float = 0.0          # usually 0 for pretraining
+     rope_theta: float = 10000.0   # RoPE frequency base
+     norm_eps: float = 1e-6        # RMSNorm epsilon
+
+     @property
+     def head_dim(self) -> int:
+         """Width of each attention head."""
+         return self.hidden_dim // self.num_heads
+
+     @property
+     def num_kv_groups(self) -> int:
+         """Number of query heads served by one KV head in GQA."""
+         return self.num_heads // self.num_kv_heads
+
+     @classmethod
+     def debug_10m(cls) -> "ModelConfig":
+         """~10M parameters - for fast debugging."""
+         return cls(
+             hidden_dim=256, num_layers=6, num_heads=8,
+             num_kv_heads=4, intermediate_dim=704, max_seq_len=512,
+         )
+
+     @classmethod
+     def small_100m(cls) -> "ModelConfig":
+         """~100M parameters - intermediate check."""
+         return cls(
+             hidden_dim=768, num_layers=12, num_heads=12,
+             num_kv_heads=4, intermediate_dim=2048, max_seq_len=1024,
+         )
+
+     @classmethod
+     def base_1b(cls) -> "ModelConfig":
+         """~1.1B parameters - the final training target."""
+         return cls()  # the defaults are the 1B setting
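As a sanity check on the "~1.1B" label, the defaults can be tallied by hand. A sketch, assuming tied input/output embeddings, a LLaMA-style SwiGLU FFN with three projections, and GQA-shaped K/V projections (all consistent with the module names elsewhere in this commit):

```python
def approx_params(vocab=32_000, d=2048, layers=22, heads=16, kv_heads=4, ffn=5632):
    head_dim = d // heads                      # 128
    emb = vocab * d                            # tied embedding, counted once
    attn = d * d                               # q_proj
    attn += 2 * d * (kv_heads * head_dim)      # k_proj + v_proj (GQA: 4 KV heads)
    attn += d * d                              # o_proj
    swiglu = 3 * d * ffn                       # gate, up, down projections
    return emb + layers * (attn + swiglu)

print(f"{approx_params() / 1e9:.2f}B")  # -> ~1.06B, i.e. "~1.1B" once norms etc. are added
```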
llm_lab/config/train_config.py ADDED
@@ -0,0 +1,114 @@
+ from dataclasses import dataclass
+ from typing import Optional
+
+ import torch
+
+
+ @dataclass
+ class TrainConfig:
+     """Training hyperparameters + infrastructure settings.
+
+     Defaults tuned for Colab Pro+ (A100 40GB).
+     Every value carries a note on *why* it is set that way.
+     """
+
+     # โ”€โ”€ Optimization โ”€โ”€
+     learning_rate: float = 3e-4
+     """Peak LR. 3e-4 is the standard for a ~1B model.
+     The GPT-3 paper lists optimal LR by model size:
+         125M โ†’ 6e-4, 350M โ†’ 3e-4, 1.3B โ†’ 2e-4
+     Our model (1.1B) starts at 3e-4; drop to 2e-4 if training is unstable."""
+
+     min_learning_rate: float = 3e-5
+     """Floor of the cosine decay, typically 10% of the peak.
+     Too low and training stalls late on; too high and convergence gets unstable."""
+
+     weight_decay: float = 0.1
+     """AdamW L2 regularization. 0.1 is the LLM standard.
+     Not applied to embeddings or biases (by convention)."""
+
+     beta1: float = 0.9
+     beta2: float = 0.95
+     """Adam momentum coefficients. ฮฒ2=0.95 is more stable than ฮฒ2=0.999 for LLM training.
+     With large batches and long runs, too large a ฮฒ2 makes adaptation sluggish."""
+
+     adam_eps: float = 1e-8
+     grad_clip: float = 1.0
+     """Gradient clipping: rescale whenever the gradient norm exceeds 1.0.
+     Guards against gradient spikes early in training or on noisy data."""
+
+     # โ”€โ”€ Scheduling โ”€โ”€
+     warmup_steps: int = 2000
+     """Warmup: LR rises linearly 0 โ†’ peak over the first 2000 steps.
+     Why is it needed?
+     - the initial weights are random โ†’ a large LR makes unstable updates
+     - starting small lets the model find its 'direction' before training in earnest
+     - 2000 is roughly 10% of the run, a reasonable rule of thumb."""
+
+     total_steps: int = 20_000
+     """Total number of training steps.
+     10B tokens / (128 effective batch ร— 2048 seq_len) โ‰ˆ 38,000 steps would
+     consume the full sample; ~20,000 effective (post-accumulation) steps is
+     the practical budget here."""
+
+     # โ”€โ”€ Batching โ”€โ”€
+     micro_batch_size: int = 4
+     """Batch size placed on the GPU at once.
+     4 is the safe ceiling for a 1B model in bf16 on an A100 40GB."""
+
+     gradient_accumulation_steps: int = 32
+     """Number of gradient accumulations. Effective batch = 4 ร— 32 = 128.
+     Why is a large batch good?
+     - the gradient estimate is more stable (less noise)
+     - LLM training typically uses an effective batch of 128~512
+     - when memory runs short, raise this and lower micro_batch."""
+
+     # โ”€โ”€ Mixed precision โ”€โ”€
+     dtype: str = "bfloat16"
+     """bfloat16: supported on A100, numerically more stable than fp16.
+     Its exponent bits match fp32 โ†’ little overflow/underflow risk.
+     Switch to 'float16' when falling back to T4/V100."""
+
+     # โ”€โ”€ Checkpoints โ”€โ”€
+     checkpoint_dir: str = "/content/drive/MyDrive/llm-1b-lab/checkpoints"
+     """Google Drive path, so checkpoints survive Colab session expiry."""
+
+     checkpoint_interval: int = 500
+     """Save a checkpoint every 500 steps.
+     Roughly every ~30 minutes on an A100. Too frequent means I/O overhead,
+     too rare means a big loss when the session expires."""
+
+     max_checkpoints: int = 3
+     """Rolling retention count; the oldest is deleted first.
+     One checkpoint โ‰ˆ 8-10GB โ†’ three of them โ‰ˆ 30GB."""
+
+     # โ”€โ”€ Logging โ”€โ”€
+     log_interval: int = 10
+     """Console + wandb logging every 10 steps."""
+
+     eval_interval: int = 500
+     """Measure validation loss every 500 steps."""
+
+     eval_steps: int = 20
+     """Number of batches per validation pass. 20 ร— 4 ร— 2048 โ‰ˆ 160K tokens."""
+
+     # โ”€โ”€ wandb โ”€โ”€
+     wandb_project: str = "llm-1b-lab"
+     wandb_run_name: Optional[str] = None
+     use_wandb: bool = True
+
+     # โ”€โ”€ Reproducibility โ”€โ”€
+     seed: int = 42
+
+     @property
+     def effective_batch_size(self) -> int:
+         return self.micro_batch_size * self.gradient_accumulation_steps
+
+     @property
+     def tokens_per_step(self) -> int:
+         """Tokens processed per optimizer step."""
+         # max_seq_len lives in ModelConfig; 2048 is assumed here
+         return self.effective_batch_size * 2048
+
+     @property
+     def torch_dtype(self) -> torch.dtype:
+         return {"bfloat16": torch.bfloat16, "float16": torch.float16, "float32": torch.float32}[self.dtype]
llm_lab/data/__init__.py ADDED
@@ -0,0 +1,11 @@
+ """Data-pipeline module โ€” tokenizer, streaming, sequence packing."""
+ from .tokenizer import Tokenizer
+ from .dataset import PackedStreamingDataset, ValidationDataset
+ from .pipeline import create_train_dataloader, train_tokenizer_from_dataset, setup_data_pipeline
+ from .diagnostics import DataPipelineDiagnostics
+
+ __all__ = [
+     "Tokenizer", "PackedStreamingDataset", "ValidationDataset",
+     "create_train_dataloader", "train_tokenizer_from_dataset",
+     "setup_data_pipeline", "DataPipelineDiagnostics",
+ ]
llm_lab/data/dataset.py ADDED
@@ -0,0 +1,218 @@
+ """Streaming datasets โ€” sequence packing and the validation dataset."""
+
+ from typing import Iterator, List, Dict, Optional
+
+ import torch
+ from torch.utils.data import IterableDataset, DataLoader
+
+ from llm_lab.config import DataConfig
+ from .tokenizer import Tokenizer
+
+
+ class PackedStreamingDataset(IterableDataset):
+     """Streaming + sequence-packing dataset.
+
+     Why sequence packing?
+     - The usual approach: cut each document to max_seq_len and pad โ†’ wasted GPU
+     - Sequence packing: concatenate documents until max_seq_len is exactly full โ†’ 100% utilization
+
+     How it works:
+         doc1 (300 tokens) + doc2 (1500 tokens) + doc3 (248 tokens) = 2048 tokens
+         โ†’ [doc1][EOS][doc2][EOS][doc3][EOS][...an exact fit, no padding]
+
+     Why streaming?
+     - The FineWeb-Edu 10B sample is tens of GB even compressed
+     - It cannot be fully downloaded within Colab's disk limit (~200GB)
+     - Streaming: read from the network only as much as needed
+
+     Caveats for training:
+     - An EOS token is inserted at document boundaries so the model can see where documents end
+     - Even without masking attention across document boundaries, EOS acts as a natural separator
+     """
+
+     def __init__(
+         self,
+         tokenizer: Tokenizer,
+         config: DataConfig,
+         split: str = "train",
+         seed: int = 42,
+     ):
+         super().__init__()
+         self.tokenizer = tokenizer
+         self.config = config
+         self.split = split
+         self.seed = seed
+         self.max_seq_len = config.max_seq_len
+
+     def _load_dataset(self):
+         """Loads the HuggingFace dataset in streaming mode."""
+         from datasets import load_dataset
+
+         ds = load_dataset(
+             self.config.dataset_name,
+             name=self.config.dataset_subset,
+             split=self.config.dataset_split,
+             streaming=True,  # the key: streaming mode
+             trust_remote_code=True,
+         )
+
+         # Shuffle (streaming supports only approximate, buffer-based shuffling)
+         ds = ds.shuffle(seed=self.seed, buffer_size=10_000)
+
+         return ds
+
+     def _tokenize_and_pack(self, dataset) -> Iterator[Dict[str, torch.Tensor]]:
+         """Tokenizes documents and packs them into sequences.
+
+         Yields:
+             {"input_ids": (max_seq_len,), "targets": (max_seq_len,)}
+
+         targets = input_ids shifted by one position:
+             input_ids: [A, B, C, D, E]
+             targets:   [B, C, D, E, F]
+             โ†’ the model sees A and predicts B, sees B and predicts C, ...
+         """
+         buffer: List[int] = []  # token buffer
+
+         for example in dataset:
+             text = example[self.config.text_column]
+             if not text or not text.strip():
+                 continue
+
+             # Tokenize (without special tokens)
+             token_ids = self.tokenizer.encode(text, add_special_tokens=False)
+
+             if not token_ids:
+                 continue
+
+             # Append an EOS token (marks the document boundary)
+             if self.config.use_eos_separator:
+                 token_ids.append(self.tokenizer.eos_id)
+
+             # Add to the buffer
+             buffer.extend(token_ids)
+
+             # Emit sequences once the buffer has filled up enough;
+             # the +1 is for building targets (input + the next token)
+             while len(buffer) >= self.max_seq_len + 1:
+                 # take max_seq_len + 1 tokens
+                 chunk = buffer[: self.max_seq_len + 1]
+                 buffer = buffer[self.max_seq_len + 1 :]
+
+                 # input_ids: first token through the second-to-last
+                 input_ids = torch.tensor(chunk[:-1], dtype=torch.long)
+                 # targets: second token through the last (shifted by one)
+                 targets = torch.tensor(chunk[1:], dtype=torch.long)
+
+                 yield {"input_ids": input_ids, "targets": targets}
+
+     def __iter__(self) -> Iterator[Dict[str, torch.Tensor]]:
+         """The iterator invoked by the DataLoader.
+
+         Multi-worker support:
+         - each worker processes a stream shuffled with a different seed
+         - minimizes data duplication across workers
+         """
+         worker_info = torch.utils.data.get_worker_info()
+
+         if worker_info is not None:
+             # multiple workers: give each a distinct seed
+             worker_seed = self.seed + worker_info.id
+         else:
+             worker_seed = self.seed
+
+         # Load the dataset with the per-worker seed
+         self.seed = worker_seed
+         dataset = self._load_dataset()
+
+         return self._tokenize_and_pack(dataset)
+
+
+ class ValidationDataset:
+     """The validation dataset.
+
+     Pre-fetches a fixed amount from the streaming dataset and keeps it in memory.
+     Evaluating on identical data every time is what makes comparisons meaningful.
+     """
+
+     def __init__(
+         self,
+         tokenizer: Tokenizer,
+         config: DataConfig,
+         num_samples: int = 100,
+         seed: int = 9999,
+     ):
+         self.tokenizer = tokenizer
+         self.config = config
+         self.num_samples = num_samples
+         self.samples: List[Dict[str, torch.Tensor]] = []
+
+         self._prepare(seed)
+
+     def _prepare(self, seed: int):
+         """Pre-extracts validation samples from the dataset."""
+         from datasets import load_dataset
+
+         print(f"[Validation] Preparing {self.num_samples} validation samples...")
+
+         ds = load_dataset(
+             self.config.dataset_name,
+             name=self.config.dataset_subset,
+             split=self.config.dataset_split,
+             streaming=True,
+             trust_remote_code=True,
+         )
+         # Shuffle with a different seed so the samples are unlikely to overlap the training stream
+         ds = ds.shuffle(seed=seed, buffer_size=5_000)
+
+         buffer: List[int] = []
+         count = 0
+
+         for example in ds:
+             if count >= self.num_samples:
+                 break
+
+             text = example[self.config.text_column]
+             if not text or not text.strip():
+                 continue
+
+             token_ids = self.tokenizer.encode(text, add_special_tokens=False)
+             if not token_ids:
+                 continue
+
+             token_ids.append(self.tokenizer.eos_id)
+             buffer.extend(token_ids)
+
+             while len(buffer) >= self.config.max_seq_len + 1 and count < self.num_samples:
+                 chunk = buffer[: self.config.max_seq_len + 1]
+                 buffer = buffer[self.config.max_seq_len + 1 :]
+
+                 self.samples.append({
+                     "input_ids": torch.tensor(chunk[:-1], dtype=torch.long),
+                     "targets": torch.tensor(chunk[1:], dtype=torch.long),
+                 })
+                 count += 1
+
+         print(f"[Validation] {len(self.samples)} samples ready")
+
+     def get_dataloader(self, batch_size: int) -> DataLoader:
+         """Returns the validation DataLoader."""
+         return DataLoader(
+             self.samples,
+             batch_size=batch_size,
+             shuffle=False,
+             num_workers=0,
+             collate_fn=_collate_fn,
+         )
+
+
+ def _collate_fn(batch: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
+     """Stacks the samples in a batch into single tensors.
+
+     Thanks to sequence packing every sample has the same length (max_seq_len),
+     so no extra padding is needed.
+     """
+     return {
+         "input_ids": torch.stack([s["input_ids"] for s in batch]),
+         "targets": torch.stack([s["targets"] for s in batch]),
+     }
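The shift relation the docstring describes can be checked on a toy buffer. A sketch (independent of the classes above) showing that packing max_seq_len + 1 tokens yields aligned input/target pairs:

```python
import torch

max_seq_len = 8
buffer = list(range(20))               # stand-in for tokenized, EOS-joined documents

chunk = buffer[: max_seq_len + 1]      # take max_seq_len + 1 tokens
input_ids = torch.tensor(chunk[:-1])   # [0..7]
targets = torch.tensor(chunk[1:])      # [1..8]

# The invariant inspect_batch() later verifies: targets[i] == input_ids[i+1]
assert torch.equal(input_ids[1:], targets[:-1])
```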
llm_lab/data/diagnostics.py ADDED
@@ -0,0 +1,153 @@
+ """Diagnostics for the data pipeline."""
+
+ import time
+ from typing import Dict
+
+ import torch
+ from torch.utils.data import DataLoader
+
+ from llm_lab.config import DataConfig
+ from .tokenizer import Tokenizer
+
+
+ class DataPipelineDiagnostics:
+     """Diagnoses the performance and quality of the data pipeline.
+
+     Things to check before training, without fail:
+     1) Tokenizer quality: average tokens/document, unknown-token ratio
+     2) Packing efficiency: real-token ratio vs padding ratio
+     3) Throughput: tokens/sec (spot data-loading bottlenecks)
+     4) Batch shape: shape and dtype correctness
+     """
+
+     @staticmethod
+     def check_tokenizer_quality(
+         tokenizer: Tokenizer,
+         config: DataConfig,
+         num_samples: int = 1000,
+     ):
+         """Diagnoses tokenizer quality."""
+         from datasets import load_dataset
+
+         print("\n" + "=" * 60)
+         print("๐Ÿ“Š Tokenizer quality check")
+         print("=" * 60)
+
+         ds = load_dataset(
+             config.dataset_name,
+             name=config.dataset_subset,
+             split=config.dataset_split,
+             streaming=True,
+             trust_remote_code=True,
+         )
+
+         token_counts = []
+         char_counts = []
+         sample_count = 0
+
+         for example in ds:
+             if sample_count >= num_samples:
+                 break
+             text = example[config.text_column]
+             if not text or not text.strip():
+                 continue
+
+             tokens = tokenizer.encode(text)
+             token_counts.append(len(tokens))
+             char_counts.append(len(text))
+             sample_count += 1
+
+         avg_tokens = sum(token_counts) / len(token_counts)
+         avg_chars = sum(char_counts) / len(char_counts)
+         compression_ratio = avg_chars / avg_tokens  # chars per token
+
+         print(f"  Documents analyzed: {len(token_counts):,}")
+         print(f"  Avg tokens/document: {avg_tokens:.1f}")
+         print(f"  Avg chars/document: {avg_chars:.1f}")
+         print(f"  Compression ratio (chars/token): {compression_ratio:.2f}")
+         print("    โ†’ 3.5~4.5 is normal for English")
+         print(f"  Min tokens: {min(token_counts)}, max: {max(token_counts)}")
+
+         # Encode/decode round-trip test
+         test_text = "The quick brown fox jumps over the lazy dog."
+         encoded = tokenizer.encode(test_text)
+         decoded = tokenizer.decode(encoded)
+         roundtrip_ok = test_text.strip() in decoded.strip()
+         print(f"\n  Round-trip test: {'โœ… passed' if roundtrip_ok else 'โŒ failed'}")
+         print(f"    original: {test_text}")
+         print(f"    encoded:  {encoded[:20]}{'...' if len(encoded) > 20 else ''}")
+         print(f"    decoded:  {decoded}")
+
+     @staticmethod
+     def benchmark_throughput(
+         dataloader: DataLoader,
+         num_batches: int = 50,
+         seq_len: int = 2048,
+     ):
+         """Measures data-loading throughput.
+
+         The key diagnostic for whether data loading bottlenecks GPU training speed.
+         Goal: data loading must outpace the GPU compute (loading โ‰  bottleneck).
+         """
+         print("\n" + "=" * 60)
+         print("โšก Data-loading throughput benchmark")
+         print("=" * 60)
+
+         total_tokens = 0
+         start_time = time.time()
+
+         for i, batch in enumerate(dataloader):
+             if i >= num_batches:
+                 break
+             batch_tokens = batch["input_ids"].numel()
+             total_tokens += batch_tokens
+
+             if (i + 1) % 10 == 0:
+                 elapsed = time.time() - start_time
+                 tps = total_tokens / elapsed
+                 print(f"  Batch {i+1}: {tps:,.0f} tokens/sec")
+
+         elapsed = time.time() - start_time
+         tps = total_tokens / elapsed
+
+         print(f"\n  Total batches: {num_batches}")
+         print(f"  Total tokens: {total_tokens:,}")
+         print(f"  Elapsed: {elapsed:.2f}s")
+         print(f"  Average throughput: {tps:,.0f} tokens/sec")
+         print("\n  ๐Ÿ’ก Against an A100 training throughput of ~50-80K tokens/sec:")
+         if tps > 80_000:
+             print("    โœ… data loading is not the bottleneck")
+         elif tps > 30_000:
+             print("    โš ๏ธ borderline - consider raising num_workers")
+         else:
+             print("    โŒ data loading IS the bottleneck! tune num_workers/prefetch")
+
+     @staticmethod
+     def inspect_batch(batch: Dict[str, torch.Tensor], tokenizer: Tokenizer):
+         """Inspects a single batch in detail."""
+         print("\n" + "=" * 60)
+         print("๐Ÿ” Batch inspection")
+         print("=" * 60)
+
+         input_ids = batch["input_ids"]
+         targets = batch["targets"]
+
+         print(f"  input_ids shape: {input_ids.shape}")
+         print(f"  targets shape: {targets.shape}")
+         print(f"  dtype: {input_ids.dtype}")
+         print(f"  value range: [{input_ids.min().item()}, {input_ids.max().item()}]")
+
+         # Verify the shift relation: targets[i] == input_ids[i+1]
+         shift_correct = (input_ids[:, 1:] == targets[:, :-1]).float().mean().item()
+         print(f"  Shift consistency: {shift_correct*100:.1f}% (should be 100%)")
+
+         # EOS-token distribution (document boundaries)
+         eos_count = (input_ids == tokenizer.eos_id).sum().item()
+         total_tokens = input_ids.numel()
+         print(f"  EOS tokens: {eos_count} / {total_tokens} ({eos_count/total_tokens*100:.2f}%)")
+
+         # Decoded preview of the first sample
+         first_sample = input_ids[0][:100].tolist()
+         decoded_preview = tokenizer.decode(first_sample)
+         print("\n  First sample decoded (first 100 tokens):")
+         print(f"    {decoded_preview[:300]}...")
llm_lab/data/pipeline.py ADDED
@@ -0,0 +1,156 @@
+ """Data-pipeline glue โ€” DataLoader creation, tokenizer training, Quick Start."""
+
+ from typing import Optional
+
+ import torch
+ from torch.utils.data import DataLoader
+
+ from llm_lab.config import DataConfig
+ from .tokenizer import Tokenizer
+ from .dataset import PackedStreamingDataset, ValidationDataset, _collate_fn
+
+
+ def create_train_dataloader(
+     tokenizer: Tokenizer,
+     config: DataConfig,
+     seed: int = 42,
+ ) -> DataLoader:
+     """Creates the training DataLoader.
+
+     Returns:
+         An endlessly repeating streaming DataLoader
+
+     Usage:
+         dataloader = create_train_dataloader(tokenizer, config)
+         for step, batch in enumerate(dataloader):
+             input_ids = batch["input_ids"].to(device)  # (B, seq_len)
+             targets = batch["targets"].to(device)      # (B, seq_len)
+             logits, loss = model(input_ids, targets)
+             ...
+     """
+     dataset = PackedStreamingDataset(
+         tokenizer=tokenizer,
+         config=config,
+         split="train",
+         seed=seed,
+     )
+
+     dataloader = DataLoader(
+         dataset,
+         batch_size=config.batch_size,
+         num_workers=config.num_workers,
+         prefetch_factor=config.prefetch_factor if config.num_workers > 0 else None,
+         pin_memory=True,  # speeds up host-to-GPU transfer
+         collate_fn=_collate_fn,
+     )
+
+     return dataloader
+
+
+ def train_tokenizer_from_dataset(config: DataConfig) -> Tokenizer:
+     """Trains a BPE tokenizer on the dataset.
+
+     There is no need to use the full data; 50K documents are plenty,
+     since the tokenizer vocab only has to reflect the overall statistics.
+     """
+     from datasets import load_dataset
+
+     print(f"[Train Tokenizer] Training a tokenizer on {config.dataset_name}")
+     print(f"[Train Tokenizer] Training documents: {config.tokenizer_train_samples:,}")
+
+     # Build a text iterator
+     ds = load_dataset(
+         config.dataset_name,
+         name=config.dataset_subset,
+         split=config.dataset_split,
+         streaming=True,
+         trust_remote_code=True,
+     )
+
+     def text_iterator():
+         count = 0
+         for example in ds:
+             if count >= config.tokenizer_train_samples:
+                 break
+             text = example[config.text_column]
+             if text and text.strip():
+                 yield text
+                 count += 1
+                 if count % 10_000 == 0:
+                     print(f"  ... {count:,} documents processed")
+
+     # Train the tokenizer
+     tokenizer = Tokenizer(config)
+     tokenizer.train_bpe(text_iterator(), save_dir=config.tokenizer_save_dir)
+
+     return tokenizer
+
+
+ def setup_data_pipeline(
+     tokenizer_mode: str = "train_new",
+     tokenizer_path: Optional[str] = None,
+     config: Optional[DataConfig] = None,
+ ) -> tuple:
+     """Sets up the whole data pipeline in one call.
+
+     Args:
+         tokenizer_mode:
+             "train_new"    - train a fresh BPE tokenizer
+             "load_trained" - load a previously trained tokenizer
+             "pretrained"   - use a pretrained HuggingFace tokenizer
+         tokenizer_path:
+             "train_new"    โ†’ save path (default: ./tokenizer)
+             "load_trained" โ†’ path of the saved tokenizer
+             "pretrained"   โ†’ HF model name (default: mistralai/Mistral-7B-v0.1)
+
+     Returns:
+         (tokenizer, train_dataloader, val_dataloader)
+
+     Examples (Colab):
+         # Option 1: train a new tokenizer
+         tok, train_dl, val_dl = setup_data_pipeline("train_new")
+
+         # Option 2: load an existing tokenizer
+         tok, train_dl, val_dl = setup_data_pipeline("load_trained", "./tokenizer")
+
+         # Option 3: pretrained tokenizer (simplest)
+         tok, train_dl, val_dl = setup_data_pipeline("pretrained")
+     """
+     config = config or DataConfig()
+
+     print("=" * 60)
+     print("๐Ÿš€ Data-pipeline setup")
+     print("=" * 60)
+
+     # โ”€โ”€ Step 1: Tokenizer โ”€โ”€
+     tokenizer = Tokenizer(config)
+
+     if tokenizer_mode == "train_new":
+         tokenizer = train_tokenizer_from_dataset(config)
+     elif tokenizer_mode == "load_trained":
+         path = tokenizer_path or config.tokenizer_save_dir
+         tokenizer.load_trained_hf(path)
+     elif tokenizer_mode == "pretrained":
+         name = tokenizer_path or "mistralai/Mistral-7B-v0.1"
+         tokenizer.load_pretrained_hf(name)
+     else:
+         raise ValueError(f"Unknown tokenizer_mode: {tokenizer_mode}")
+
+     # โ”€โ”€ Step 2: Training DataLoader โ”€โ”€
+     print("\n[DataLoader] Creating the training DataLoader...")
+     train_dataloader = create_train_dataloader(tokenizer, config)
+
+     # โ”€โ”€ Step 3: Validation DataLoader โ”€โ”€
+     print("\n[DataLoader] Creating the validation DataLoader...")
+     val_dataset = ValidationDataset(tokenizer, config, num_samples=100)
+     val_dataloader = val_dataset.get_dataloader(batch_size=config.batch_size)
+
+     print("\n" + "=" * 60)
+     print("โœ… Data pipeline ready!")
+     print(f"  Tokenizer vocab: {tokenizer.vocab_size:,}")
+     print(f"  Sequence length: {config.max_seq_len}")
+     print(f"  Batch size: {config.batch_size}")
+     print(f"  Tokens/batch: {config.batch_size * config.max_seq_len:,}")
+     print("=" * 60)
+
+     return tokenizer, train_dataloader, val_dataloader
llm_lab/data/tokenizer.py ADDED
@@ -0,0 +1,196 @@
+ """Tokenizer wrapper โ€” unifies SentencePiece / HuggingFace BPE."""
+
+ import os
+ import json
+ from typing import Optional, Iterator, List
+
+ from llm_lab.config import DataConfig
+
+
+ class Tokenizer:
+     """Unified tokenizer wrapper.
+
+     Three supported paths:
+     1) load an existing SentencePiece model
+     2) train a new tokenizer with the HuggingFace tokenizers library
+     3) load a pretrained HF tokenizer (e.g. the LLaMA tokenizer)
+
+     Why not implement it ourselves?
+     - Training a BPE tokenizer is large-scale text statistics, and has
+       little to do with understanding the model architecture.
+     - Still, the working principle (BPE merge rules) is worth understanding.
+
+     BPE (Byte Pair Encoding) in a nutshell:
+     1) split text into bytes/characters
+     2) repeatedly merge the most frequent adjacent pair
+     3) repeat until vocab_size is reached
+     โ†’ frequent words become one token; rare words split into several
+     """
+
+     def __init__(self, config: DataConfig):
+         self.config = config
+         self._tokenizer = None
+         self.vocab_size = config.vocab_size
+
+         # Special-token IDs (set after initialization)
+         self.bos_id: int = 1  # Beginning of Sequence
+         self.eos_id: int = 2  # End of Sequence
+         self.pad_id: int = 0  # Padding
+
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+     # Path 1: load a SentencePiece model
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+
+     def load_sentencepiece(self, model_path: str):
+         """Loads an existing SentencePiece model."""
+         import sentencepiece as spm
+
+         self._tokenizer = spm.SentencePieceProcessor()
+         self._tokenizer.Load(model_path)
+
+         self.vocab_size = self._tokenizer.GetPieceSize()
+         self.bos_id = self._tokenizer.bos_id()
+         self.eos_id = self._tokenizer.eos_id()
+         self.pad_id = self._tokenizer.pad_id()
+         self._encode_fn = self._tokenizer.Encode
+         self._decode_fn = self._tokenizer.Decode
+
+         print(f"[Tokenizer] SentencePiece loaded: vocab_size={self.vocab_size}")
+
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+     # Path 2: train BPE with HuggingFace tokenizers
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+
+     def train_bpe(self, text_iterator: Iterator[str], save_dir: Optional[str] = None):
+         """Trains a BPE tokenizer from scratch.
+
+         Args:
+             text_iterator: iterator that yields the training text
+             save_dir: where to save the result
+
+         Teaching points:
+         - larger vocab_size: common expressions become single tokens โ†’ shorter sequences
+         - smaller vocab_size: fewer embedding parameters, but longer sequences
+         - 32K is a good balance for English
+         """
+         from tokenizers import Tokenizer as HFTokenizer
+         from tokenizers.models import BPE
+         from tokenizers.trainers import BpeTrainer
+         from tokenizers.pre_tokenizers import ByteLevel
+         from tokenizers.processors import TemplateProcessing
+
+         print("[Tokenizer] Starting BPE tokenizer training...")
+
+         # Create the BPE model
+         tokenizer = HFTokenizer(BPE(unk_token="<unk>"))
+         tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
+
+         # Define the special tokens
+         special_tokens = ["<pad>", "<s>", "</s>", "<unk>"]
+
+         # Configure the trainer
+         trainer = BpeTrainer(
+             vocab_size=self.config.vocab_size,
+             special_tokens=special_tokens,
+             min_frequency=2,  # only merge pairs seen at least twice
+             show_progress=True,
+         )
+
+         # Run the training
+         tokenizer.train_from_iterator(text_iterator, trainer=trainer)
+
+         # Post-processing: add BOS/EOS automatically
+         tokenizer.post_processor = TemplateProcessing(
+             single="<s> $A </s>",
+             special_tokens=[("<s>", 1), ("</s>", 2)],
+         )
+
+         self._tokenizer = tokenizer
+         self.vocab_size = tokenizer.get_vocab_size()
+         self.pad_id = 0
+         self.bos_id = 1
+         self.eos_id = 2
+
+         self._encode_fn = lambda text: tokenizer.encode(text).ids
+         self._decode_fn = lambda ids: tokenizer.decode(ids)
+
+         # Save
+         save_dir = save_dir or self.config.tokenizer_save_dir
+         os.makedirs(save_dir, exist_ok=True)
+         tokenizer.save(os.path.join(save_dir, "tokenizer.json"))
+         # Save the metadata too
+         meta = {
+             "vocab_size": self.vocab_size,
+             "bos_id": self.bos_id,
+             "eos_id": self.eos_id,
+             "pad_id": self.pad_id,
+         }
+         with open(os.path.join(save_dir, "tokenizer_meta.json"), "w") as f:
+             json.dump(meta, f, indent=2)
+
+         print(f"[Tokenizer] Training complete: vocab_size={self.vocab_size}")
+         print(f"[Tokenizer] Saved to: {save_dir}")
+
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+     # Path 3: load a pretrained HF tokenizer
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+
+     def load_pretrained_hf(self, name_or_path: str = "meta-llama/Llama-2-7b-hf"):
+         """Loads a pretrained tokenizer from HuggingFace.
+
+         The simplest route. The LLaMA tokenizer is a 32K-vocab, BPE-based tokenizer.
+         Note: meta-llama models may require HF approval.
+         Alternative: mistralai/Mistral-7B-v0.1 (no approval needed)
+         """
+         from transformers import AutoTokenizer
+
+         print(f"[Tokenizer] Loading HF tokenizer: {name_or_path}")
+         tokenizer = AutoTokenizer.from_pretrained(name_or_path)
+
+         self._tokenizer = tokenizer
+         self.vocab_size = tokenizer.vocab_size
+         self.bos_id = tokenizer.bos_token_id or 1
+         self.eos_id = tokenizer.eos_token_id or 2
+         self.pad_id = tokenizer.pad_token_id or 0
+
+         self._encode_fn = lambda text: tokenizer.encode(text, add_special_tokens=False)
+         self._decode_fn = lambda ids: tokenizer.decode(ids)
+
+         print(f"[Tokenizer] Loaded: vocab_size={self.vocab_size}")
+
+     def load_trained_hf(self, path: str):
+         """Re-loads a tokenizer trained with train_bpe()."""
+         from tokenizers import Tokenizer as HFTokenizer
+
+         tokenizer = HFTokenizer.from_file(os.path.join(path, "tokenizer.json"))
+         with open(os.path.join(path, "tokenizer_meta.json"), "r") as f:
+             meta = json.load(f)
+
+         self._tokenizer = tokenizer
+         self.vocab_size = meta["vocab_size"]
+         self.bos_id = meta["bos_id"]
+         self.eos_id = meta["eos_id"]
+         self.pad_id = meta["pad_id"]
+
+         self._encode_fn = lambda text: tokenizer.encode(text).ids
+         self._decode_fn = lambda ids: tokenizer.decode(ids)
+
+         print(f"[Tokenizer] Loaded: vocab_size={self.vocab_size}")
+
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+     # Shared interface
+     # โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
+
+     def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
+         """Text โ†’ list of token IDs."""
+         ids = self._encode_fn(text)
+         if add_special_tokens:
+             ids = [self.bos_id] + ids + [self.eos_id]
+         return ids
+
+     def decode(self, ids: List[int]) -> str:
+         """List of token IDs โ†’ text."""
+         return self._decode_fn(ids)
+
+     def __len__(self) -> int:
+         return self.vocab_size
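The "merge the most frequent adjacent pair" loop in the docstring is small enough to write out. A toy sketch of one BPE training round over a tiny corpus — illustrative only; the real training is delegated to the `tokenizers` library above:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of symbol sequences, e.g. [["l","o","w"], ...]."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)   # ('l','o'), tied with ('o','w') at 3; first seen wins
corpus = merge_pair(corpus, pair)   # [['lo','w','e','r'], ['lo','w','e','s','t'], ['lo','w']]
```

Repeating this until the vocabulary hits vocab_size is exactly why frequent words end up as single tokens while rare words stay split.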
llm_lab/evaluation/__init__.py ADDED
@@ -0,0 +1,21 @@
+ """Evaluation module โ€” perplexity, text generation, scaling laws, attention visualization."""
+
+ from .perplexity import PerplexityEvaluator
+ from .generation import GenerationEvaluator
+ from .scaling import ScalingAnalyzer
+ from .dynamics import TrainingDynamicsAnalyzer
+ from .attention_viz import AttentionVisualizer
+ from .full_evaluator import FullEvaluator
+ from .checklist import InsightChecklist
+ from .runner import run_evaluation
+
+ __all__ = [
+     "PerplexityEvaluator",
+     "GenerationEvaluator",
+     "ScalingAnalyzer",
+     "TrainingDynamicsAnalyzer",
+     "AttentionVisualizer",
+     "FullEvaluator",
+     "InsightChecklist",
+     "run_evaluation",
+ ]
llm_lab/evaluation/attention_viz.py ADDED
@@ -0,0 +1,176 @@
+ """Attention-pattern visualization."""
+
+ import math
+ from pathlib import Path
+ from typing import List, Optional
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ try:
+     import matplotlib
+     matplotlib.use("Agg")
+     import matplotlib.pyplot as plt
+     HAS_MATPLOTLIB = True
+ except ImportError:
+     HAS_MATPLOTLIB = False
+
+
+ class AttentionVisualizer:
+     """Visualizes attention patterns.
+
+     Teaching points:
+     - Causal mask: lower-triangular pattern (future tokens are invisible)
+     - Head specialization: some heads attend locally (neighbors), others globally (distant tokens)
+     - Syntactic patterns: high attention verbโ†’subject, pronounโ†’antecedent, etc.
+
+     Caution: storing every attention map of a 1B model runs out of memory!
+     โ†’ visualize only selected layers/heads.
+     """
+
+     def __init__(self, save_dir: str = "./eval_results"):
+         self.save_dir = Path(save_dir)
+         self.save_dir.mkdir(parents=True, exist_ok=True)
+
+     @torch.no_grad()
+     def extract_attention(
+         self,
+         model: nn.Module,
+         input_ids: torch.Tensor,
+         layer_idx: int = 0,
+         device: torch.device = torch.device("cpu"),
+     ) -> torch.Tensor:
+         """Extracts the attention weights of one layer.
+
+         Temporarily patches the model's attention module so the
+         attention weights can be captured.
+
+         Returns:
+             attention_weights: (num_heads, seq_len, seq_len)
+         """
+         model.eval()
+         captured_attn = {}
+
+         # Capture attention weights via a hook
+         target_layer = model.layers[layer_idx].attention
+
+         # Replace scaled_dot_product_attention with a manual implementation
+         original_forward = target_layer.forward
+
+         def hooked_forward(x, mask=None, position_offset=0):
+             B, S, _ = x.shape
+             hd = target_layer.head_dim
+
+             q = target_layer.q_proj(x).view(B, S, target_layer.num_heads, hd).transpose(1, 2)
+             k = target_layer.k_proj(x).view(B, S, target_layer.num_kv_heads, hd).transpose(1, 2)
+             v = target_layer.v_proj(x).view(B, S, target_layer.num_kv_heads, hd).transpose(1, 2)
+
+             q, k = target_layer.rope(q, k, position_offset)
+
+             if target_layer.num_kv_groups > 1:
+                 k = target_layer._repeat_kv(k)
+                 v = target_layer._repeat_kv(v)
+
+             # Manual attention computation (to expose the weights)
+             scale = 1.0 / math.sqrt(hd)
+             scores = torch.matmul(q, k.transpose(-2, -1)) * scale
+
+             # Causal mask
+             causal = torch.triu(torch.ones(S, S, device=x.device, dtype=torch.bool), diagonal=1)
+             scores.masked_fill_(causal.unsqueeze(0).unsqueeze(0), float("-inf"))
+
+             attn_weights = F.softmax(scores, dim=-1)
+             captured_attn["weights"] = attn_weights[0].cpu()  # first batch element only
+
+             out = torch.matmul(attn_weights, v)
+             out = out.transpose(1, 2).contiguous().view(B, S, -1)
+             return target_layer.o_proj(out)
+
+         # Install the hook
+         target_layer.forward = hooked_forward
+
+         try:
+             model(input_ids.to(device))
+         finally:
+             target_layer.forward = original_forward
+
+         return captured_attn.get("weights")  # (num_heads, S, S)
+
+     def plot_attention_heatmap(
+         self,
+         attn_weights: torch.Tensor,
+         tokens: List[str],
+         head_idx: int = 0,
+         save_path: Optional[str] = None,
+         title: str = "Attention Weights",
+     ):
+         """Draws an attention heatmap."""
+         if not HAS_MATPLOTLIB:
+             print("โš ๏ธ matplotlib is required")
+             return
+
+         weights = attn_weights[head_idx].numpy()
+         max_len = min(len(tokens), 50)  # show at most 50 tokens
+         weights = weights[:max_len, :max_len]
+         display_tokens = tokens[:max_len]
+
+         fig, ax = plt.subplots(figsize=(12, 10))
+         im = ax.imshow(weights, cmap="Blues", aspect="auto")
+
+         ax.set_xticks(range(max_len))
+         ax.set_yticks(range(max_len))
+         ax.set_xticklabels(display_tokens, rotation=90, fontsize=7)
+         ax.set_yticklabels(display_tokens, fontsize=7)
+
+         ax.set_xlabel("Key (attended to)", fontsize=11)
+         ax.set_ylabel("Query (attending from)", fontsize=11)
+         ax.set_title(f"{title} โ€” Head {head_idx}", fontsize=13, fontweight="bold")
+
+         fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
+         plt.tight_layout()
+
+         save_path = save_path or str(self.save_dir / f"attention_head{head_idx}.png")
+         fig.savefig(save_path, dpi=150, bbox_inches="tight")
+         print(f"  ๐Ÿ“Š Attention heatmap saved: {save_path}")
+         plt.close(fig)
+
+     def plot_multi_head_summary(
+         self,
+         attn_weights: torch.Tensor,
+         num_heads_to_show: int = 8,
+         save_path: Optional[str] = None,
+     ):
+         """Compares the attention patterns of several heads side by side."""
+         if not HAS_MATPLOTLIB:
+             return
+
+         n_heads = min(attn_weights.shape[0], num_heads_to_show)
+         cols = 4
+         rows = math.ceil(n_heads / cols)
+
+         fig, axes = plt.subplots(rows, cols, figsize=(16, 4 * rows))
+         if rows == 1:
+             axes = axes.reshape(1, -1)
+
+         for idx in range(n_heads):
+             r, c = idx // cols, idx % cols
+             ax = axes[r, c]
+             w = attn_weights[idx].numpy()
+             ax.imshow(w, cmap="Blues", aspect="auto")
+             ax.set_title(f"Head {idx}", fontsize=10)
+             ax.set_xticks([])
+             ax.set_yticks([])
+
+         # Hide the empty subplots
+         for idx in range(n_heads, rows * cols):
+             r, c = idx // cols, idx % cols
+             axes[r, c].axis("off")
+
+         fig.suptitle("Attention Patterns by Head", fontsize=14, fontweight="bold")
+         plt.tight_layout()
+
+         save_path = save_path or str(self.save_dir / "attention_multi_head.png")
+         fig.savefig(save_path, dpi=150, bbox_inches="tight")
+         print(f"  ๐Ÿ“Š Multi-head summary saved: {save_path}")
+         plt.close(fig)
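A sketch of how the two plotting entry points chain together, assuming a trained LLMModel and tokenizer are already in scope. Decoding each ID individually to get per-token labels is an assumption about the tokenizer, not repo code:

```python
import torch

viz = AttentionVisualizer(save_dir="./eval_results")

prompt = "The quick brown fox jumps over the lazy dog."
ids = torch.tensor([tokenizer.encode(prompt)])           # (1, S)
tokens = [tokenizer.decode([i]) for i in ids[0].tolist()]

attn = viz.extract_attention(model, ids, layer_idx=0)    # (num_heads, S, S)
viz.plot_attention_heatmap(attn, tokens, head_idx=0)
viz.plot_multi_head_summary(attn, num_heads_to_show=8)
```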
llm_lab/evaluation/checklist.py ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Validator for the training-insight checklist."""
+
+ from typing import Any, Dict, Optional
+
+
+ class InsightChecklist:
+     """Verifies the training-insight checklist defined in the PRD, automatically and manually.
+
+     Items that can be checked automatically are judged from metrics;
+     manual items are presented as questions.
+     """
+
+     @staticmethod
+     def run_checklist(
+         report: Dict[str, Any],
+         metrics_history: Optional[Dict[str, list]] = None,
+     ):
+         """Run the checklist."""
+         print("\n" + "=" * 70)
+         print("✅ Training Insight Checklist")
+         print("=" * 70)
+
+         checks = {
+             "passed": [],
+             "failed": [],
+             "manual": [],
+         }
+
+         # ── Automatic checks ──
+
+         # 1. Loss convergence
+         if report.get("perplexity", {}).get("loss", 99) < 4.0:
+             checks["passed"].append("Model loss converged below 4.0")
+         else:
+             checks["failed"].append("Model loss did not converge below 4.0")
+
+         # 2. Loss spikes
+         spikes = report.get("training_dynamics", {}).get("loss", {}).get("spikes", [])
+         if len(spikes) < 5:
+             checks["passed"].append(f"{len(spikes)} loss spikes (< 5)")
+         else:
+             checks["failed"].append(f"{len(spikes)} loss spikes (≥ 5, stability needs work)")
+
+         # 3. Per-position loss pattern
+         if report.get("position_losses"):
+             early = report["position_losses"]["early_avg"]
+             late = report["position_losses"]["late_avg"]
+             if early > late:
+                 checks["passed"].append("Per-position loss decreases (context is being used)")
+             else:
+                 checks["failed"].append("Unexpected per-position loss pattern (context unused?)")
+
+         # 4. Generation repetition rate
+         rep = report.get("generation", {}).get("avg_metrics", {}).get("repetition_rate", 1.0)
+         if rep < 0.3:
+             checks["passed"].append(f"Generation repetition rate {rep:.1%} (< 30%)")
+         else:
+             checks["failed"].append(f"Generation repetition rate {rep:.1%} (≥ 30%, tune temperature/top_p)")
+
+         # 5. Gradient clipping ratio
+         if metrics_history and metrics_history.get("grad_norm"):
+             gnorms = metrics_history["grad_norm"]
+             clip_rate = sum(1 for g in gnorms if g >= 0.99) / max(len(gnorms), 1)
+             if clip_rate < 0.3:
+                 checks["passed"].append(f"Gradient clipping ratio {clip_rate:.1%} (healthy)")
+             else:
+                 checks["failed"].append(f"Gradient clipping ratio {clip_rate:.1%} (too frequent)")
+
+         # ── Manual items ──
+         manual_items = [
+             "Can you explain the roles of Q, K, and V in self-attention?",
+             "Do you understand the math behind how RoPE encodes position?",
+             "Can you explain how GQA saves memory compared to MHA?",
+             "Do you understand how SwiGLU's gating differs from a ReLU FFN?",
+             "Did you observe why learning-rate warmup is needed?",
+             "Do you understand how gradient accumulation simulates a large batch?",
+             "Did you measure the memory/speed effect of mixed precision (bf16)?",
+             "Do you understand the memory-compute trade-off of activation checkpointing?",
+         ]
+         checks["manual"] = manual_items
+
+         # ── Output ──
+         total_auto = len(checks["passed"]) + len(checks["failed"])
+         passed_auto = len(checks["passed"])
+
+         print(f"\n  Automatic checks: {passed_auto}/{total_auto} passed")
+         for item in checks["passed"]:
+             print(f"    ✅ {item}")
+         for item in checks["failed"]:
+             print(f"    ❌ {item}")
+
+         print(f"\n  Manual checks ({len(manual_items)} items):")
+         for i, item in enumerate(manual_items, 1):
+             print(f"    {i}. [ ] {item}")
+
+         print(f"\n  Overall progress: {passed_auto}/{total_auto + len(manual_items)} "
+               f"(counting manual items)")
+
+         return checks
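
Since `run_checklist` only reads plain dictionaries, it can be exercised with a hand-built report before any real training run. A minimal sketch with made-up numbers chosen so all five automatic checks pass:

```python
from llm_lab.evaluation.checklist import InsightChecklist

toy_report = {
    "perplexity": {"loss": 3.2},
    "training_dynamics": {"loss": {"spikes": [{"step": 120, "loss": 5.1}]}},
    "position_losses": {"early_avg": 3.8, "late_avg": 3.1},
    "generation": {"avg_metrics": {"repetition_rate": 0.12}},
}
toy_history = {"grad_norm": [0.4, 0.6, 1.0, 0.5]}  # one clipped step out of four → 25%

checks = InsightChecklist.run_checklist(toy_report, toy_history)
assert len(checks["passed"]) == 5  # all five automatic checks pass on these numbers
```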
llm_lab/evaluation/dynamics.py ADDED
@@ -0,0 +1,242 @@
+ """Training dynamics analyzer."""
+
+ import math
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ try:
+     import matplotlib
+     matplotlib.use("Agg")
+     import matplotlib.pyplot as plt
+     HAS_MATPLOTLIB = True
+ except ImportError:
+     HAS_MATPLOTLIB = False
+
+
+ class TrainingDynamicsAnalyzer:
+     """Analyzes and visualizes the metrics of a training run.
+
+     What it looks at:
+     - Loss curve: convergence pattern, spike detection
+     - LR schedule: confirm warmup + cosine decay
+     - Gradient norm: training stability, explosion/vanishing detection
+     - Throughput: tokens/sec stability, bottleneck detection
+     """
+
+     def __init__(self, save_dir: str = "./eval_results"):
+         self.save_dir = Path(save_dir)
+         self.save_dir.mkdir(parents=True, exist_ok=True)
+
+     def analyze_metrics(self, metrics_history: Dict[str, list]) -> Dict[str, Any]:
+         """Analyze the training metrics.
+
+         Args:
+             metrics_history: the Trainer.metrics.history dictionary
+
+         Returns:
+             Analysis results
+         """
+         print("\n" + "=" * 70)
+         print("🔬 Training Dynamics Analysis")
+         print("=" * 70)
+
+         analysis = {}
+
+         # ── Loss analysis ──
+         if metrics_history.get("train_loss"):
+             losses = metrics_history["train_loss"]
+             analysis["loss"] = {
+                 "initial": round(losses[0], 4),
+                 "final": round(losses[-1], 4),
+                 "minimum": round(min(losses), 4),
+                 "total_reduction": round(losses[0] - losses[-1], 4),
+             }
+
+             # Spike detection (a jump of 50%+ over the previous value)
+             spikes = []
+             for i in range(1, len(losses)):
+                 if losses[i] > losses[i-1] * 1.5:
+                     step = metrics_history["step"][i] if "step" in metrics_history else i
+                     spikes.append({"step": step, "loss": round(losses[i], 4)})
+
+             analysis["loss"]["spikes"] = spikes
+
+             print(f"\n  📉 Loss analysis:")
+             print(f"     Initial:   {analysis['loss']['initial']:.4f}")
+             print(f"     Final:     {analysis['loss']['final']:.4f}")
+             print(f"     Minimum:   {analysis['loss']['minimum']:.4f}")
+             print(f"     Reduction: {analysis['loss']['total_reduction']:.4f}")
+             print(f"     Spikes:    {len(spikes)}")
+             if spikes:
+                 for s in spikes[:5]:
+                     print(f"       Step {s['step']}: Loss = {s['loss']}")
+
+         # ── Gradient norm analysis ──
+         if metrics_history.get("grad_norm"):
+             gnorms = metrics_history["grad_norm"]
+             analysis["grad_norm"] = {
+                 "mean": round(sum(gnorms) / len(gnorms), 4),
+                 "max": round(max(gnorms), 4),
+                 "min": round(min(gnorms), 4),
+                 "clipped_pct": round(sum(1 for g in gnorms if g >= 0.99) / len(gnorms) * 100, 1),
+             }
+
+             print(f"\n  📏 Gradient norm analysis:")
+             print(f"     Mean: {analysis['grad_norm']['mean']:.4f}")
+             print(f"     Max:  {analysis['grad_norm']['max']:.4f}")
+             print(f"     Clipping ratio: {analysis['grad_norm']['clipped_pct']:.1f}%")
+             if analysis["grad_norm"]["clipped_pct"] > 30:
+                 print(f"     ⚠️ Clipping is frequent → consider lowering the LR or extending warmup")
+
+         # ── Throughput analysis ──
+         if metrics_history.get("tokens_per_sec"):
+             tps = metrics_history["tokens_per_sec"]
+             tps_valid = [t for t in tps if t > 0]
+             if tps_valid:
+                 analysis["throughput"] = {
+                     "mean": round(sum(tps_valid) / len(tps_valid)),
+                     "std": round((sum((t - sum(tps_valid)/len(tps_valid))**2 for t in tps_valid) / len(tps_valid))**0.5),
+                     "min": round(min(tps_valid)),
+                     "max": round(max(tps_valid)),
+                 }
+
+                 print(f"\n  ⚡ Throughput analysis:")
+                 print(f"     Mean: {analysis['throughput']['mean']:,} tokens/sec")
+                 print(f"     Std:  {analysis['throughput']['std']:,}")
+                 print(f"     Range: [{analysis['throughput']['min']:,}, {analysis['throughput']['max']:,}]")
+
+         return analysis
+
+     def plot_training_curves(
+         self,
+         metrics_history: Dict[str, list],
+         save_path: Optional[str] = None,
+     ):
+         """Visualize the training curves as a 4-panel chart."""
+         if not HAS_MATPLOTLIB:
+             print("⚠️ matplotlib is required: pip install matplotlib")
+             return
+
+         fig, axes = plt.subplots(2, 2, figsize=(16, 10))
+         fig.suptitle("Training Dynamics", fontsize=16, fontweight="bold")
+
+         steps = metrics_history.get("step", list(range(len(metrics_history.get("train_loss", [])))))
+
+         # ── (1) Loss ──
+         ax = axes[0, 0]
+         if metrics_history.get("train_loss"):
+             ax.plot(steps[:len(metrics_history["train_loss"])],
+                     metrics_history["train_loss"],
+                     color="#2563eb", alpha=0.6, linewidth=0.8, label="Train Loss")
+
+             # Moving average (smoothing)
+             if len(metrics_history["train_loss"]) > 20:
+                 window = min(50, len(metrics_history["train_loss"]) // 5)
+                 smoothed = self._moving_average(metrics_history["train_loss"], window)
+                 ax.plot(steps[window-1:len(smoothed)+window-1],
+                         smoothed, color="#1d4ed8", linewidth=2, label=f"Smoothed (window={window})")
+
+         if metrics_history.get("val_loss"):
+             val_steps = [steps[i] for i in range(0, len(steps),
+                          max(1, len(steps)//len(metrics_history["val_loss"])))][:len(metrics_history["val_loss"])]
+             ax.plot(val_steps, metrics_history["val_loss"],
+                     "o-", color="#dc2626", linewidth=2, markersize=5, label="Val Loss")
+
+         ax.set_xlabel("Step")
+         ax.set_ylabel("Loss")
+         ax.set_title("Training & Validation Loss")
+         ax.legend()
+         ax.grid(True, alpha=0.3)
+
+         # ── (2) Learning Rate ──
+         ax = axes[0, 1]
+         if metrics_history.get("learning_rate"):
+             ax.plot(steps[:len(metrics_history["learning_rate"])],
+                     metrics_history["learning_rate"],
+                     color="#059669", linewidth=2)
+         ax.set_xlabel("Step")
+         ax.set_ylabel("Learning Rate")
+         ax.set_title("Learning Rate Schedule")
+         ax.ticklabel_format(style="scientific", axis="y", scilimits=(0, 0))
+         ax.grid(True, alpha=0.3)
+
+         # ── (3) Gradient Norm ──
+         ax = axes[1, 0]
+         if metrics_history.get("grad_norm"):
+             ax.plot(steps[:len(metrics_history["grad_norm"])],
+                     metrics_history["grad_norm"],
+                     color="#d97706", alpha=0.6, linewidth=0.8)
+             ax.axhline(y=1.0, color="red", linestyle="--", alpha=0.5, label="Clip threshold")
+             ax.legend()
+         ax.set_xlabel("Step")
+         ax.set_ylabel("Gradient Norm")
+         ax.set_title("Gradient Norm (clipped at 1.0)")
+         ax.grid(True, alpha=0.3)
+
+         # ── (4) Throughput ──
+         ax = axes[1, 1]
+         if metrics_history.get("tokens_per_sec"):
+             tps = metrics_history["tokens_per_sec"]
+             ax.plot(steps[:len(tps)], tps, color="#7c3aed", alpha=0.6, linewidth=0.8)
+             if tps:
+                 avg_tps = sum(tps) / len(tps)
+                 ax.axhline(y=avg_tps, color="#7c3aed", linestyle="--", alpha=0.5,
+                            label=f"Avg: {avg_tps:,.0f}")
+                 ax.legend()
+         ax.set_xlabel("Step")
+         ax.set_ylabel("Tokens/sec")
+         ax.set_title("Training Throughput")
+         ax.grid(True, alpha=0.3)
+
+         plt.tight_layout()
+
+         save_path = save_path or str(self.save_dir / "training_curves.png")
+         fig.savefig(save_path, dpi=150, bbox_inches="tight")
+         print(f"\n  📊 Training curves saved: {save_path}")
+         plt.close(fig)
+
+     def plot_position_loss(
+         self,
+         position_losses: List[float],
+         save_path: Optional[str] = None,
+     ):
+         """Visualize the loss distribution by sequence position."""
+         if not HAS_MATPLOTLIB:
+             return
+
+         fig, ax = plt.subplots(figsize=(12, 5))
+
+         positions = list(range(len(position_losses)))
+         ax.plot(positions, position_losses, color="#2563eb", linewidth=1.5)
+         ax.fill_between(positions, position_losses, alpha=0.1, color="#2563eb")
+
+         ax.set_xlabel("Position in Sequence", fontsize=12)
+         ax.set_ylabel("Cross-Entropy Loss", fontsize=12)
+         ax.set_title("Loss by Position (earlier positions have less context)", fontsize=13, fontweight="bold")
+         ax.grid(True, alpha=0.3)
+
+         # Mark the key regions
+         if len(position_losses) > 100:
+             early_avg = sum(position_losses[:50]) / 50
+             late_avg = sum(position_losses[-200:]) / len(position_losses[-200:])
+             ax.axhline(y=early_avg, color="red", linestyle="--", alpha=0.4,
+                        label=f"Early avg (0-50): {early_avg:.2f}")
+             ax.axhline(y=late_avg, color="green", linestyle="--", alpha=0.4,
+                        label=f"Late avg (-200): {late_avg:.2f}")
+             ax.legend()
+
+         plt.tight_layout()
+
+         save_path = save_path or str(self.save_dir / "position_loss.png")
+         fig.savefig(save_path, dpi=150, bbox_inches="tight")
+         print(f"  📊 Per-position loss saved: {save_path}")
+         plt.close(fig)
+
+     @staticmethod
+     def _moving_average(data: list, window: int) -> list:
+         """Compute a simple trailing moving average."""
+         result = []
+         for i in range(window - 1, len(data)):
+             avg = sum(data[i - window + 1 : i + 1]) / window
+             result.append(avg)
+         return result
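
The spike rule above is purely local: any step whose loss exceeds 1.5× the previous step is flagged. A tiny sketch of how the detector and the smoother behave on synthetic data:

```python
from llm_lab.evaluation.dynamics import TrainingDynamicsAnalyzer

history = {
    "step": [0, 10, 20, 30, 40, 50],
    "train_loss": [8.0, 5.0, 4.0, 6.5, 3.6, 3.4],  # 4.0 → 6.5 is a >50% jump
}
analyzer = TrainingDynamicsAnalyzer(save_dir="./eval_results")
analysis = analyzer.analyze_metrics(history)
print(analysis["loss"]["spikes"])  # [{'step': 30, 'loss': 6.5}]

# The smoother is a plain trailing window:
print(TrainingDynamicsAnalyzer._moving_average([1, 2, 3, 4], window=2))  # [1.5, 2.5, 3.5]
```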
llm_lab/evaluation/full_evaluator.py ADDED
@@ -0,0 +1,222 @@
+ """End-to-end evaluation runner."""
+
+ import json
+ import time
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ import torch
+ import torch.nn as nn
+ from torch.utils.data import DataLoader
+
+ from llm_lab.config import EvalConfig
+ from .perplexity import PerplexityEvaluator
+ from .generation import GenerationEvaluator
+ from .dynamics import TrainingDynamicsAnalyzer
+ from .attention_viz import AttentionVisualizer
+
+
+ class FullEvaluator:
+     """Runs every evaluation in one pass and produces a report.
+
+     Usage:
+     ```python
+     evaluator = FullEvaluator(model, tokenizer, val_dataloader, device)
+     report = evaluator.run_full_evaluation()
+     ```
+     """
+
+     def __init__(
+         self,
+         model: nn.Module,
+         tokenizer: Any,
+         val_dataloader: DataLoader,
+         device: torch.device,
+         config: Optional[EvalConfig] = None,
+         dtype: torch.dtype = torch.bfloat16,
+         metrics_history: Optional[Dict[str, list]] = None,
+     ):
+         self.model = model
+         self.tokenizer = tokenizer
+         self.val_dataloader = val_dataloader
+         self.device = device
+         self.config = config or EvalConfig()
+         self.dtype = dtype
+         self.metrics_history = metrics_history
+
+         self.save_dir = Path(self.config.save_dir)
+         self.save_dir.mkdir(parents=True, exist_ok=True)
+
+     def run_full_evaluation(self) -> Dict[str, Any]:
+         """Run the full evaluation suite."""
+         report = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S")}
+
+         print("\n" + "=" * 70)
+         print("🔍 Starting full evaluation")
+         print("=" * 70)
+
+         # ── 1. Perplexity ──
+         print("\n" + "━" * 40)
+         print("Phase 1/4: Measuring perplexity")
+         print("━" * 40)
+         ppl_evaluator = PerplexityEvaluator(self.config)
+         report["perplexity"] = ppl_evaluator.evaluate(
+             self.model, self.val_dataloader, self.device, self.dtype
+         )
+
+         # Per-position loss
+         print("\n  Measuring per-position loss...")
+         position_losses = ppl_evaluator.evaluate_per_position(
+             self.model, self.val_dataloader, self.device, self.dtype
+         )
+         report["position_losses"] = {
+             "early_avg": round(sum(position_losses[:50]) / max(len(position_losses[:50]), 1), 4),
+             "late_avg": round(sum(position_losses[-200:]) / max(len(position_losses[-200:]), 1), 4),
+         }
+
+         # Per-position loss plot
+         dynamics = TrainingDynamicsAnalyzer(str(self.save_dir))
+         dynamics.plot_position_loss(position_losses, str(self.save_dir / "position_loss.png"))
+
+         # ── 2. Text generation ──
+         print("\n" + "━" * 40)
+         print("Phase 2/4: Text generation")
+         print("━" * 40)
+         gen_evaluator = GenerationEvaluator(self.config)
+         gen_results = gen_evaluator.generate_samples(
+             self.model, self.tokenizer, self.device
+         )
+         report["generation"] = {
+             "num_prompts": len(gen_results),
+             "avg_metrics": self._average_gen_metrics(gen_results),
+         }
+
+         # ── 3. Training dynamics ──
+         if self.metrics_history:
+             print("\n" + "━" * 40)
+             print("Phase 3/4: Training dynamics analysis")
+             print("━" * 40)
+             report["training_dynamics"] = dynamics.analyze_metrics(self.metrics_history)
+             dynamics.plot_training_curves(self.metrics_history,
+                                           str(self.save_dir / "training_curves.png"))
+         else:
+             print("\n  Phase 3/4: skipped (no metrics_history)")
+
+         # ── 4. Attention visualization (sample) ──
+         print("\n" + "━" * 40)
+         print("Phase 4/4: Attention visualization")
+         print("━" * 40)
+         try:
+             self._visualize_attention_sample()
+         except Exception as e:
+             print(f"  ⚠️ Attention visualization failed: {e}")
+
+         # ── Save the report ──
+         report_path = self.save_dir / "eval_report.json"
+         with open(report_path, "w") as f:
+             json.dump(report, f, indent=2, default=str)
+         print(f"\n📋 Report saved: {report_path}")
+
+         # ── Print the summary ──
+         self._print_summary(report)
+
+         return report
+
+     def _visualize_attention_sample(self):
+         """Visualize attention on a sample text."""
+         viz = AttentionVisualizer(str(self.save_dir))
+
+         sample_text = "The cat sat on the mat and looked at the bird."
+         token_ids = self.tokenizer.encode(sample_text, add_special_tokens=False)
+         input_tensor = torch.tensor([token_ids], dtype=torch.long)
+
+         # Token strings (labels for the plot)
+         tokens_str = []
+         for tid in token_ids:
+             decoded = self.tokenizer.decode([tid])
+             tokens_str.append(decoded.replace("\n", "\\n"))
+
+         # Extract layer-0 attention
+         attn_weights = viz.extract_attention(
+             self.model, input_tensor, layer_idx=0, device=self.device
+         )
+
+         if attn_weights is not None:
+             viz.plot_attention_heatmap(
+                 attn_weights, tokens_str, head_idx=0,
+                 title="Layer 0 Attention"
+             )
+             viz.plot_multi_head_summary(attn_weights)
+
+     @staticmethod
+     def _average_gen_metrics(gen_results: List[Dict]) -> Dict[str, float]:
+         """Average the generation metrics across all prompts."""
+         if not gen_results:
+             return {}
+
+         all_metrics = [r["metrics"] for r in gen_results if r.get("metrics")]
+         if not all_metrics:
+             return {}
+
+         keys = all_metrics[0].keys()
+         return {
+             k: round(sum(m.get(k, 0) for m in all_metrics) / len(all_metrics), 3)
+             for k in keys
+         }
+
+     def _print_summary(self, report: Dict[str, Any]):
+         """Print the final summary."""
+         print("\n" + "=" * 70)
+         print("📋 Evaluation Summary")
+         print("=" * 70)
+
+         # Perplexity
+         if "perplexity" in report:
+             ppl = report["perplexity"]
+             print(f"\n  🎯 Perplexity:")
+             print(f"     Loss: {ppl['loss']:.4f}")
+             print(f"     PPL:  {ppl['perplexity']:.2f}")
+
+             # Grading
+             ppl_val = ppl["perplexity"]
+             if ppl_val < 20:
+                 grade = "🌟 Strong"
+             elif ppl_val < 35:
+                 grade = "✅ Good"
+             elif ppl_val < 60:
+                 grade = "⚠️ Fair"
+             else:
+                 grade = "❌ Poor (needs more training)"
+             print(f"     Grade: {grade}")
+
+         # Per-position loss
+         if "position_losses" in report:
+             pl = report["position_losses"]
+             print(f"\n  📏 Per-position loss:")
+             print(f"     Early (0-50): {pl['early_avg']:.4f}")
+             print(f"     Late (-200):  {pl['late_avg']:.4f}")
+             print(f"     Context effect: {pl['early_avg'] - pl['late_avg']:.4f} lower")
+
+         # Generation quality
+         if "generation" in report and report["generation"].get("avg_metrics"):
+             gm = report["generation"]["avg_metrics"]
+             print(f"\n  ✍️ Generation quality:")
+             print(f"     Avg length: {gm.get('avg_length', 0):.0f} chars")
+             print(f"     Repetition rate: {gm.get('repetition_rate', 0):.1%}")
+             print(f"     Lexical diversity: {gm.get('lexical_diversity', 0):.3f}")
+
+         # Training dynamics
+         if "training_dynamics" in report:
+             td = report["training_dynamics"]
+             if "loss" in td:
+                 print(f"\n  📉 Training dynamics:")
+                 print(f"     Loss: {td['loss']['initial']:.4f} → {td['loss']['final']:.4f}")
+                 print(f"     Spikes: {len(td['loss']['spikes'])}")
+
+         # Output files
+         print(f"\n  📂 Result files:")
+         for f in sorted(self.save_dir.glob("*")):
+             size = f.stat().st_size / 1024
+             print(f"     {f.name} ({size:.1f} KB)")
+
+         print("\n" + "=" * 70)
llm_lab/evaluation/generation.py ADDED
@@ -0,0 +1,200 @@
+ """Text generation evaluator."""
+
+ from typing import Any, Dict, List, Optional
+
+ import torch
+ import torch.nn as nn
+
+ from llm_lab.config import EvalConfig
+
+
+ class GenerationEvaluator:
+     """Generates text from a range of prompts and scores its quality.
+
+     Evaluation criteria:
+     1) Grammaticality: does it produce well-formed English sentences?
+     2) Coherence: does it stay on topic as it continues?
+     3) Diversity: does the same prompt yield different outputs?
+     4) Repetition avoidance: does it avoid repeating the same phrases?
+     5) Knowledge: is knowledge from the training data reflected?
+
+     Realistic expectations for a 1B model:
+     - Grammatically correct English sentences ✅
+     - Coherence within a short paragraph ✅
+     - Complex reasoning or long chains of logic ❌ (needs a larger model)
+     - Factual accuracy is not guaranteed ⚠️
+     """
+
+     # Test prompts across several domains
+     DEFAULT_PROMPTS = [
+         # ── General knowledge ──
+         "The theory of relativity states that",
+         "In the history of computer science,",
+         "The human brain is remarkable because",
+
+         # ── Explanation / education ──
+         "To understand machine learning, one must first",
+         "The water cycle begins when",
+         "Photosynthesis is the process by which",
+
+         # ── Narrative / story ──
+         "Once upon a time, in a small village near the mountains,",
+         "The detective looked at the evidence and realized that",
+
+         # ── Code / technical ──
+         "def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number.\"\"\"\n",
+         "The most important data structures in programming are",
+
+         # ── Short completions ──
+         "The capital of France is",
+         "Water boils at a temperature of",
+
+         # ── Long context ──
+         ("Artificial intelligence has transformed many industries. "
+          "In healthcare, AI is used for diagnosis and drug discovery. "
+          "In finance, it powers algorithmic trading and fraud detection. "
+          "Looking ahead, the most promising application of AI is"),
+     ]
+
+     def __init__(self, config: EvalConfig):
+         self.config = config
+
+     @torch.no_grad()
+     def generate_samples(
+         self,
+         model: nn.Module,
+         tokenizer: Any,
+         device: torch.device,
+         prompts: Optional[List[str]] = None,
+         verbose: bool = True,
+     ) -> List[Dict[str, Any]]:
+         """Generate text for each prompt.
+
+         Returns:
+             [{"prompt": str, "generations": [str, ...], "metrics": {...}}, ...]
+         """
+         model.eval()
+         prompts = prompts or self.DEFAULT_PROMPTS
+         results = []
+
+         if verbose:
+             print("\n" + "=" * 70)
+             print("📝 Text Generation Evaluation")
+             print("=" * 70)
+
+         for idx, prompt in enumerate(prompts):
+             prompt_results = {
+                 "prompt": prompt,
+                 "generations": [],
+                 "metrics": {},
+             }
+
+             if verbose:
+                 print(f"\n{'─'*60}")
+                 print(f"Prompt [{idx+1}/{len(prompts)}]:")
+                 print(f"  \"{prompt[:80]}{'...' if len(prompt) > 80 else ''}\"")
+                 print(f"{'─'*60}")
+
+             # Encode the prompt
+             prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
+             input_tensor = torch.tensor([prompt_ids], dtype=torch.long, device=device)
+
+             all_texts = []
+             for sample_idx in range(self.config.num_samples):
+                 # Generate
+                 generated_ids = model.generate(
+                     input_tensor,
+                     max_new_tokens=self.config.max_new_tokens,
+                     temperature=self.config.temperature,
+                     top_k=self.config.top_k,
+                     top_p=self.config.top_p,
+                 )
+
+                 # Decode (only the part after the prompt)
+                 new_ids = generated_ids[0][len(prompt_ids):].tolist()
+                 generated_text = tokenizer.decode(new_ids)
+                 all_texts.append(generated_text)
+
+                 prompt_results["generations"].append(generated_text)
+
+                 if verbose:
+                     print(f"\n  ✍️ Sample #{sample_idx+1}:")
+                     # Tidy output (preserving line breaks)
+                     display_text = generated_text[:500]
+                     for line in display_text.split("\n"):
+                         print(f"     {line}")
+                     if len(generated_text) > 500:
+                         print(f"     ... ({len(generated_text)} chars total)")
+
+             # Generation quality metrics
+             prompt_results["metrics"] = self._compute_generation_metrics(all_texts)
+
+             if verbose and prompt_results["metrics"]:
+                 m = prompt_results["metrics"]
+                 print(f"\n  📊 Metrics: "
+                       f"avg length={m['avg_length']:.0f} chars, "
+                       f"repetition={m['repetition_rate']:.1%}, "
+                       f"lexical diversity={m['lexical_diversity']:.2f}")
+
+             results.append(prompt_results)
+
+         return results
+
+     @staticmethod
+     def _compute_generation_metrics(texts: List[str]) -> Dict[str, float]:
+         """Compute quality metrics for the generated texts.
+
+         Metrics:
+         - avg_length: average generation length (chars)
+         - avg_word_count: average word count
+         - repetition_rate: n-gram repetition rate (lower is better)
+         - lexical_diversity: unique-word ratio (higher is more diverse)
+         - sample_diversity: diversity across samples (how different the generations are)
+         """
+         if not texts:
+             return {}
+
+         # Length
+         lengths = [len(t) for t in texts]
+         word_counts = [len(t.split()) for t in texts]
+
+         # Repetition rate (4-gram based)
+         rep_rates = []
+         for text in texts:
+             words = text.lower().split()
+             if len(words) < 4:
+                 rep_rates.append(0.0)
+                 continue
+             ngrams = [tuple(words[i:i+4]) for i in range(len(words)-3)]
+             unique_ratio = len(set(ngrams)) / len(ngrams) if ngrams else 1.0
+             rep_rates.append(1.0 - unique_ratio)  # repetition rate = 1 - unique ratio
+
+         # Lexical diversity (type-token ratio)
+         diversities = []
+         for text in texts:
+             words = text.lower().split()
+             if words:
+                 diversities.append(len(set(words)) / len(words))
+             else:
+                 diversities.append(0.0)
+
+         # Cross-sample diversity (one minus mean Jaccard similarity)
+         sample_div = 0.0
+         if len(texts) > 1:
+             word_sets = [set(t.lower().split()) for t in texts]
+             similarities = []
+             for i in range(len(word_sets)):
+                 for j in range(i+1, len(word_sets)):
+                     inter = len(word_sets[i] & word_sets[j])
+                     union = len(word_sets[i] | word_sets[j])
+                     if union > 0:
+                         similarities.append(inter / union)
+             sample_div = 1.0 - (sum(similarities) / max(len(similarities), 1))
+
+         return {
+             "avg_length": sum(lengths) / len(lengths),
+             "avg_word_count": sum(word_counts) / len(word_counts),
+             "repetition_rate": sum(rep_rates) / len(rep_rates),
+             "lexical_diversity": sum(diversities) / len(diversities),
+             "sample_diversity": round(sample_div, 3),
+         }
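
To make the metric definitions concrete, here is the 4-gram repetition computation applied by hand to one degenerate string (standalone, no project imports):

```python
# Worked example of the 4-gram repetition metric from _compute_generation_metrics
text = "the cat sat on the mat the cat sat on the mat"
words = text.lower().split()                                   # 12 words
ngrams = [tuple(words[i:i+4]) for i in range(len(words) - 3)]  # 9 4-grams, 6 unique
repetition_rate = 1.0 - len(set(ngrams)) / len(ngrams)         # 1 - 6/9
print(f"{repetition_rate:.3f}")  # 0.333 → above the 0.3 checklist threshold

# Type-token ratio (lexical_diversity) on the same string
print(len(set(words)) / len(words))  # 5 unique / 12 total ≈ 0.417
```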
llm_lab/evaluation/perplexity.py ADDED
@@ -0,0 +1,172 @@
+ """Perplexity (PPL) evaluator."""
+
+ import math
+ import time
+ from typing import Dict, List
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.utils.data import DataLoader
+
+ from llm_lab.config import EvalConfig
+
+
+ class PerplexityEvaluator:
+     """Measures perplexity (PPL).
+
+     What is perplexity?
+         PPL = exp(average cross-entropy loss)
+
+     Intuition:
+     - PPL = 1: perfect prediction (impossible)
+     - PPL = 10: like choosing among 10 candidates each step
+     - PPL = 100: like choosing among 100 candidates (close to random)
+     - PPL = 32000: random choice over the whole vocab (an untrained model)
+
+     Reference points for a good 1B model (English web text):
+     - trained on 5B tokens:  PPL ~30-40
+     - trained on 10B tokens: PPL ~20-30
+     - trained on 20B tokens: PPL ~15-25
+
+     How it is measured:
+     - Compute cross-entropy over every token of the validation set
+     - Average per token, then apply exp()
+     - Padding tokens are excluded (ignore_index=-100)
+     """
+
+     def __init__(self, config: EvalConfig):
+         self.config = config
+
+     @torch.no_grad()
+     def evaluate(
+         self,
+         model: nn.Module,
+         dataloader: DataLoader,
+         device: torch.device,
+         dtype: torch.dtype = torch.bfloat16,
+         desc: str = "Evaluation",
+     ) -> Dict[str, float]:
+         """Measure perplexity.
+
+         Returns:
+             {
+                 "loss": average cross-entropy loss,
+                 "perplexity": exp(loss),
+                 "num_tokens": total tokens used for evaluation,
+                 "num_batches": batches used for evaluation,
+             }
+         """
+         model.eval()
+
+         total_loss = 0.0
+         total_tokens = 0
+         num_batches = 0
+
+         print(f"\n📊 {desc}")
+         start_time = time.time()
+
+         for i, batch in enumerate(dataloader):
+             if i >= self.config.max_eval_batches:
+                 break
+
+             input_ids = batch["input_ids"].to(device)
+             targets = batch["targets"].to(device)
+
+             with torch.amp.autocast(device_type=device.type, dtype=dtype, enabled=(dtype != torch.float32)):
+                 logits, _ = model(input_ids)
+
+             # Per-token cross-entropy (reduction='none')
+             # logits:  (B, S, V) → (B*S, V)
+             # targets: (B, S)   → (B*S,)
+             loss_per_token = F.cross_entropy(
+                 logits.view(-1, logits.size(-1)),
+                 targets.view(-1),
+                 ignore_index=-100,
+                 reduction="none",
+             )
+
+             # Count only valid tokens (those not set to -100)
+             valid_mask = (targets.view(-1) != -100)
+             valid_tokens = valid_mask.sum().item()
+
+             total_loss += loss_per_token[valid_mask].sum().item()
+             total_tokens += valid_tokens
+             num_batches += 1
+
+             if (i + 1) % 20 == 0:
+                 running_ppl = math.exp(min(total_loss / max(total_tokens, 1), 20))
+                 print(f"  Batch {i+1}/{self.config.max_eval_batches}: running PPL = {running_ppl:.2f}")
+
+         elapsed = time.time() - start_time
+         avg_loss = total_loss / max(total_tokens, 1)
+         perplexity = math.exp(min(avg_loss, 100))  # guard against overflow
+
+         results = {
+             "loss": round(avg_loss, 4),
+             "perplexity": round(perplexity, 2),
+             "num_tokens": total_tokens,
+             "num_batches": num_batches,
+             "eval_time_sec": round(elapsed, 1),
+         }
+
+         print(f"  ────────────────────────────────")
+         print(f"  Loss:        {results['loss']:.4f}")
+         print(f"  Perplexity:  {results['perplexity']:.2f}")
+         print(f"  Eval tokens: {total_tokens:,}")
+         print(f"  Elapsed:     {elapsed:.1f}s")
+
+         return results
+
+     @torch.no_grad()
+     def evaluate_per_position(
+         self,
+         model: nn.Module,
+         dataloader: DataLoader,
+         device: torch.device,
+         dtype: torch.dtype = torch.bfloat16,
+         max_batches: int = 50,
+     ) -> List[float]:
+         """Measure the loss at each position within the sequence.
+
+         Learning point:
+         - Positions 0-10: loss is high (little context available)
+         - Positions 100+: loss settles lower (context is exploited)
+         - This pattern illustrates the Transformer's in-context learning ability
+         """
+         model.eval()
+         seq_len = None
+         position_loss_sum = None
+         position_count = None
+
+         for i, batch in enumerate(dataloader):
+             if i >= max_batches:
+                 break
+
+             input_ids = batch["input_ids"].to(device)
+             targets = batch["targets"].to(device)
+             B, S = targets.shape
+
+             if seq_len is None:
+                 seq_len = S
+                 position_loss_sum = torch.zeros(S, device=device)
+                 position_count = torch.zeros(S, device=device)
+
+             with torch.amp.autocast(device_type=device.type, dtype=dtype, enabled=(dtype != torch.float32)):
+                 logits, _ = model(input_ids)
+
+             # Per-token loss shaped (B, S)
+             loss_per_token = F.cross_entropy(
+                 logits.view(-1, logits.size(-1)),
+                 targets.view(-1),
+                 ignore_index=-100,
+                 reduction="none",
+             ).view(B, S)
+
+             valid_mask = (targets != -100).float()
+             position_loss_sum += (loss_per_token * valid_mask).sum(dim=0)
+             position_count += valid_mask.sum(dim=0)
+
+         # Average loss per position
+         position_avg_loss = (position_loss_sum / position_count.clamp(min=1)).cpu().tolist()
+         return position_avg_loss
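
A worked example of the bookkeeping above: PPL is exp of the *token-weighted* mean loss, never the mean of per-batch perplexities. The batch numbers are made up for illustration:

```python
import math

# Two batches with different valid-token counts (padding excluded via -100)
batch_losses = [(2.8, 900), (3.4, 300)]  # (mean loss, valid tokens) — assumed numbers

total_loss = sum(loss * n for loss, n in batch_losses)  # 2520 + 1020 = 3540
total_tokens = sum(n for _, n in batch_losses)          # 1200
avg_loss = total_loss / total_tokens                    # 2.95
print(math.exp(avg_loss))                               # ≈ 19.1 → the correct PPL
print(sum(math.exp(l) for l, _ in batch_losses) / 2)    # ≈ 23.2 → wrong: mean of PPLs
```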
llm_lab/evaluation/runner.py ADDED
@@ -0,0 +1,56 @@
+ """Evaluation runner helper (quick start)."""
+
+ from typing import Any, Dict, Optional
+
+ import torch
+ import torch.nn as nn
+ from torch.utils.data import DataLoader
+
+ from llm_lab.config import EvalConfig
+ from .full_evaluator import FullEvaluator
+ from .checklist import InsightChecklist
+
+
+ def run_evaluation(
+     model: nn.Module,
+     tokenizer: Any,
+     val_dataloader: DataLoader,
+     device: Optional[torch.device] = None,
+     dtype: torch.dtype = torch.bfloat16,
+     metrics_history: Optional[Dict[str, list]] = None,
+     config: Optional[EvalConfig] = None,
+ ) -> Dict[str, Any]:
+     """Run the whole evaluation in one call.
+
+     Usage (Colab):
+     ```python
+     from llm_lab.evaluation import run_evaluation
+
+     # after training completes
+     report = run_evaluation(
+         model=trainer.model,
+         tokenizer=tokenizer,
+         val_dataloader=val_dl,
+         metrics_history=trainer.metrics.history,
+     )
+     ```
+     """
+     if device is None:
+         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+     evaluator = FullEvaluator(
+         model=model,
+         tokenizer=tokenizer,
+         val_dataloader=val_dataloader,
+         device=device,
+         config=config,
+         dtype=dtype,
+         metrics_history=metrics_history,
+     )
+
+     report = evaluator.run_full_evaluation()
+
+     # Insight checklist
+     InsightChecklist.run_checklist(report, metrics_history)
+
+     return report
llm_lab/evaluation/scaling.py ADDED
@@ -0,0 +1,153 @@
+ """Scaling-law analyzer."""
+
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ try:
+     import matplotlib
+     matplotlib.use("Agg")
+     import matplotlib.pyplot as plt
+     HAS_MATPLOTLIB = True
+ except ImportError:
+     HAS_MATPLOTLIB = False
+
+ try:
+     import numpy as np
+     HAS_NUMPY = True
+ except ImportError:
+     HAS_NUMPY = False
+
+
+ class ScalingAnalyzer:
+     """Analyzes the scaling law across the 10M → 100M → 1B models.
+
+     Chinchilla scaling law (Hoffmann et al., 2022):
+     - Compute-optimal training: tokens ≈ 20 × parameters
+     - Loss falls as a power law in model size N and data D
+     - α ≈ 0.076, β ≈ 0.095 (the Kaplan et al. 2020 exponent fits)
+
+     Purpose of this analysis:
+     - Check whether our models follow the scaling law
+     - Predict the effect of a larger model / more data
+     - Understand the optimal allocation of compute
+     """
+
+     def __init__(self, save_dir: str = "./eval_results"):
+         self.save_dir = Path(save_dir)
+         self.save_dir.mkdir(parents=True, exist_ok=True)
+
+     def analyze(
+         self,
+         model_results: List[Dict[str, Any]],
+     ) -> Dict[str, Any]:
+         """Compare results across several model sizes.
+
+         Args:
+             model_results: [
+                 {"name": "10M",  "params": 10e6,  "tokens": 1e9,  "loss": 4.2, "ppl": 66.7},
+                 {"name": "100M", "params": 100e6, "tokens": 5e9,  "loss": 3.5, "ppl": 33.1},
+                 {"name": "1B",   "params": 1.1e9, "tokens": 10e9, "loss": 3.0, "ppl": 20.1},
+             ]
+
+         Returns:
+             Analysis results dictionary
+         """
+         if len(model_results) < 2:
+             print("⚠️ Scaling analysis needs results from at least 2 models.")
+             return {}
+
+         print("\n" + "=" * 70)
+         print("📈 Scaling Law Analysis")
+         print("=" * 70)
+
+         # ── Results table ──
+         print(f"\n  {'Model':<8} {'Params':>12} {'Tokens':>10} {'Loss':>8} {'PPL':>8}")
+         print(f"  {'─'*52}")
+         for r in model_results:
+             params_str = f"{r['params']/1e6:.0f}M" if r["params"] < 1e9 else f"{r['params']/1e9:.1f}B"
+             tokens_str = f"{r['tokens']/1e9:.1f}B"
+             print(f"  {r['name']:<8} {params_str:>12} {tokens_str:>10} {r['loss']:>8.4f} {r['ppl']:>8.2f}")
+
+         # ── Scaling efficiency ──
+         analysis = {"models": model_results, "scaling_efficiency": []}
+
+         for i in range(1, len(model_results)):
+             prev = model_results[i-1]
+             curr = model_results[i]
+
+             param_ratio = curr["params"] / prev["params"]
+             loss_reduction = prev["loss"] - curr["loss"]
+             ppl_reduction = (prev["ppl"] - curr["ppl"]) / prev["ppl"]
+
+             efficiency = {
+                 "from": prev["name"],
+                 "to": curr["name"],
+                 "param_multiplier": round(param_ratio, 1),
+                 "loss_reduction": round(loss_reduction, 4),
+                 "ppl_reduction_pct": round(ppl_reduction * 100, 1),
+             }
+             analysis["scaling_efficiency"].append(efficiency)
+
+             print(f"\n  {prev['name']} → {curr['name']}:")
+             print(f"     Parameters ×{param_ratio:.1f}")
+             print(f"     Loss reduction: {loss_reduction:.4f}")
+             print(f"     PPL reduction:  {ppl_reduction*100:.1f}%")
+
+         # ── Chinchilla-optimality check ──
+         print(f"\n  Chinchilla-optimality check (tokens ≈ 20 × parameters):")
+         for r in model_results:
+             actual_ratio = r["tokens"] / r["params"]
+             status = "✅ optimal range" if 15 <= actual_ratio <= 25 else "⚠️ out of range"
+             print(f"     {r['name']}: tokens/params = {actual_ratio:.1f}x "
+                   f"(optimal: 20x) {status}")
+
+         analysis["chinchilla_ratios"] = [
+             {"name": r["name"], "ratio": round(r["tokens"] / r["params"], 1)}
+             for r in model_results
+         ]
+
+         return analysis
+
+     def plot_scaling_curves(
+         self,
+         model_results: List[Dict[str, Any]],
+         save_path: Optional[str] = None,
+     ):
+         """Plot the scaling curves."""
+         if not HAS_MATPLOTLIB or not HAS_NUMPY:
+             print("⚠️ matplotlib/numpy are required: pip install matplotlib numpy")
+             return
+
+         fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+
+         params = [r["params"] for r in model_results]
+         losses = [r["loss"] for r in model_results]
+         ppls = [r["ppl"] for r in model_results]
+         names = [r["name"] for r in model_results]
+
+         # ── Loss vs Parameters (log-log) ──
+         ax = axes[0]
+         ax.loglog(params, losses, "o-", color="#2563eb", linewidth=2, markersize=10)
+         for p, l, n in zip(params, losses, names):
+             ax.annotate(f" {n}\n Loss={l:.2f}", (p, l), fontsize=9)
+         ax.set_xlabel("Parameters", fontsize=12)
+         ax.set_ylabel("Validation Loss", fontsize=12)
+         ax.set_title("Loss vs Model Size (log-log)", fontsize=13, fontweight="bold")
+         ax.grid(True, alpha=0.3)
+
+         # ── PPL vs Parameters (log-log) ──
+         ax = axes[1]
+         ax.loglog(params, ppls, "s-", color="#dc2626", linewidth=2, markersize=10)
+         for p, pp, n in zip(params, ppls, names):
+             ax.annotate(f" {n}\n PPL={pp:.1f}", (p, pp), fontsize=9)
+         ax.set_xlabel("Parameters", fontsize=12)
+         ax.set_ylabel("Perplexity", fontsize=12)
+         ax.set_title("Perplexity vs Model Size (log-log)", fontsize=13, fontweight="bold")
+         ax.grid(True, alpha=0.3)
+
+         plt.tight_layout()
+
+         save_path = save_path or str(self.save_dir / "scaling_curves.png")
+         fig.savefig(save_path, dpi=150, bbox_inches="tight")
+         print(f"\n  📊 Scaling curves saved: {save_path}")
+         plt.close(fig)
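
As a quick sanity check of the 20-tokens-per-parameter rule used above, this is what compute-optimal token budgets look like for the three lab model sizes (illustrative arithmetic, not measured results):

```python
# Chinchilla-style budget check: tokens ≈ 20 × parameters
for name, params in [("10M", 10e6), ("100M", 100e6), ("1B", 1.1e9)]:
    optimal_tokens = 20 * params
    print(f"{name}: {optimal_tokens/1e9:.1f}B tokens for compute-optimal training")
# 10M:  0.2B tokens
# 100M: 2.0B tokens
# 1B:   22.0B tokens
```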
llm_lab/model/__init__.py ADDED
@@ -0,0 +1,14 @@
+ """Model architecture module — LLaMA-style Decoder-Only Transformer."""
+ from .norm import RMSNorm
+ from .rope import RotaryPositionalEmbedding
+ from .attention import GroupedQueryAttention
+ from .feedforward import SwiGLUFeedForward
+ from .transformer_block import TransformerBlock
+ from .llm_model import LLMModel
+ from .utils import count_parameters_detailed, estimate_memory_gb
+
+ __all__ = [
+     "RMSNorm", "RotaryPositionalEmbedding", "GroupedQueryAttention",
+     "SwiGLUFeedForward", "TransformerBlock", "LLMModel",
+     "count_parameters_detailed", "estimate_memory_gb",
+ ]
llm_lab/model/attention.py ADDED
@@ -0,0 +1,134 @@
+ """Grouped Query Attention (GQA)."""
+
+ from typing import Optional
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ from llm_lab.config import ModelConfig
+ from .rope import RotaryPositionalEmbedding
+
+
+ class GroupedQueryAttention(nn.Module):
+     """GQA: a memory-efficient variant of multi-head attention.
+
+     MHA vs GQA vs MQA:
+     - MHA (Multi-Head Attention): num_heads heads for Q, K, and V → large memory
+     - MQA (Multi-Query Attention): K and V share a single head → quality may drop
+     - GQA (Grouped Query Attention): K and V grouped into num_kv_heads heads
+       → sits between MHA and MQA; a good quality-efficiency balance
+
+     Example (num_heads=16, num_kv_heads=4):
+         Q heads:    [0,1,2,3,  4,5,6,7,  8,9,10,11,  12,13,14,15]
+         K/V groups: [   0   ,     1   ,      2    ,       3     ]
+         → every 4 Q heads share 1 K/V head
+
+     Attention formula:
+         Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
+     """
+
+     def __init__(self, config: ModelConfig):
+         super().__init__()
+         self.config = config
+         self.head_dim = config.head_dim
+         self.num_heads = config.num_heads
+         self.num_kv_heads = config.num_kv_heads
+         self.num_kv_groups = config.num_kv_groups  # num_heads // num_kv_heads
+
+         # Q/K/V projections
+         # Q: hidden_dim → num_heads × head_dim
+         self.q_proj = nn.Linear(config.hidden_dim, config.num_heads * self.head_dim, bias=False)
+         # K, V: hidden_dim → num_kv_heads × head_dim (smaller than Q!)
+         self.k_proj = nn.Linear(config.hidden_dim, config.num_kv_heads * self.head_dim, bias=False)
+         self.v_proj = nn.Linear(config.hidden_dim, config.num_kv_heads * self.head_dim, bias=False)
+
+         # Output projection: all heads back to hidden_dim
+         self.o_proj = nn.Linear(config.num_heads * self.head_dim, config.hidden_dim, bias=False)
+
+         # RoPE
+         self.rope = RotaryPositionalEmbedding(
+             dim=self.head_dim, max_seq_len=config.max_seq_len, theta=config.rope_theta
+         )
+
+         # Attention dropout (usually 0 for pretraining)
+         self.attn_dropout = nn.Dropout(config.dropout)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         mask: Optional[torch.Tensor] = None,
+         position_offset: int = 0,
+     ) -> torch.Tensor:
+         """
+         Args:
+             x: (batch_size, seq_len, hidden_dim)
+             mask: (seq_len, seq_len) causal mask
+             position_offset: position offset (used during inference)
+
+         Returns:
+             (batch_size, seq_len, hidden_dim)
+         """
+         B, S, _ = x.shape
+
+         # ────────────────────────────────────────────────
+         # Step 1: Q, K, V projections
+         # ────────────────────────────────────────────────
+         q = self.q_proj(x)  # (B, S, num_heads × head_dim)
+         k = self.k_proj(x)  # (B, S, num_kv_heads × head_dim)
+         v = self.v_proj(x)  # (B, S, num_kv_heads × head_dim)
+
+         # Reshape into multi-head form
+         q = q.view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
+         # → (B, num_heads, S, head_dim)
+         k = k.view(B, S, self.num_kv_heads, self.head_dim).transpose(1, 2)
+         # → (B, num_kv_heads, S, head_dim)
+         v = v.view(B, S, self.num_kv_heads, self.head_dim).transpose(1, 2)
+
+         # ────────────────────────────────────────────────
+         # Step 2: Apply RoPE (to Q and K only — never to V)
+         # ────────────────────────────────────────────────
+         # Position information should only affect "where to look" (Q·K),
+         # not "what to fetch" (V).
+         q, k = self.rope(q, k, position_offset)
+
+         # ────────────────────────────────────────────────
+         # Step 3: GQA — expand the KV heads (repeat)
+         # ────────────────────────────────────────────────
+         # num_kv_heads=4 → num_heads=16: repeat each KV head 4 times
+         if self.num_kv_groups > 1:
+             k = self._repeat_kv(k)  # (B, num_heads, S, head_dim)
+             v = self._repeat_kv(v)
+
+         # ────────────────────────────────────────────────
+         # Step 4: Scaled dot-product attention
+         # ────────────────────────────────────────────────
+         # Uses the optimized PyTorch >= 2.0 kernel (Flash Attention when available)
+         attn_out = F.scaled_dot_product_attention(
+             q, k, v,
+             attn_mask=mask,
+             dropout_p=self.config.dropout if self.training else 0.0,
+             is_causal=(mask is None),  # automatic causal masking when no mask is given
+         )
+         # → (B, num_heads, S, head_dim)
+
+         # ────────────────────────────────────────────────
+         # Step 5: Merge heads + output projection
+         # ────────────────────────────────────────────────
+         attn_out = attn_out.transpose(1, 2).contiguous().view(B, S, -1)
+         # → (B, S, num_heads × head_dim)
+
+         return self.o_proj(attn_out)  # → (B, S, hidden_dim)
+
+     def _repeat_kv(self, x: torch.Tensor) -> torch.Tensor:
+         """Repeat the KV heads to match the number of Q heads.
+
+         (B, num_kv_heads, S, head_dim) → (B, num_heads, S, head_dim)
+
+         E.g. num_kv_heads=4, num_kv_groups=4:
+             [kv0, kv1, kv2, kv3] → [kv0,kv0,kv0,kv0, kv1,kv1,kv1,kv1, ...]
+         """
+         B, H_kv, S, D = x.shape
+         x = x[:, :, None, :, :]                          # (B, H_kv, 1, S, D)
+         x = x.expand(B, H_kv, self.num_kv_groups, S, D)  # (B, H_kv, groups, S, D)
+         return x.reshape(B, self.num_heads, S, D)
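
A back-of-the-envelope sketch of what the grouping buys, using the head layout from the docstring (num_heads=16, num_kv_heads=4) with an *assumed* hidden_dim=2048 and head_dim=128:

```python
hidden_dim, head_dim = 2048, 128  # assumed 1B-lab-style sizes
num_heads, num_kv_heads = 16, 4

# K/V projection weights: hidden_dim × (heads × head_dim), no bias
kv_params_mha = 2 * hidden_dim * num_heads * head_dim     # K and V at full width
kv_params_gqa = 2 * hidden_dim * num_kv_heads * head_dim  # grouped K and V
print(kv_params_mha, kv_params_gqa)  # 8,388,608 vs 2,097,152 → 4× smaller

# The same 4× factor applies to the KV cache at inference time:
# per token, 2 × num_kv_heads × head_dim values instead of 2 × num_heads × head_dim
```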
llm_lab/model/feedforward.py ADDED
@@ -0,0 +1,48 @@
+ """SwiGLU Feed-Forward Network."""
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ from llm_lab.config import ModelConfig
+
+
+ class SwiGLUFeedForward(nn.Module):
+     """SwiGLU: a gated linear unit with the Swish activation.
+
+     Classic FFN:
+         FFN(x) = ReLU(x·W1 + b1)·W2 + b2
+         → a simple nonlinear transform
+
+     SwiGLU FFN:
+         SwiGLU(x) = (Swish(x·W_gate) ⊙ (x·W_up)) · W_down
+         → a gating mechanism controls the flow of information
+
+     Why is SwiGLU better?
+     - Swish(x) = x · sigmoid(x): smooth, and lets part of the negative range through
+     - The gate vector learns "which information to pass"
+     - PaLM, LLaMA, and others report consistent gains over ReLU FFNs
+
+     Note: with two up-projections (W_gate and W_up) the parameter count is
+     1.5× a classic FFN, so intermediate_dim is shrunk to keep the total
+     parameter count matched.
+     """
+
+     def __init__(self, config: ModelConfig):
+         super().__init__()
+         # Gate projection: hidden_dim → intermediate_dim
+         self.gate_proj = nn.Linear(config.hidden_dim, config.intermediate_dim, bias=False)
+         # Up projection: hidden_dim → intermediate_dim
+         self.up_proj = nn.Linear(config.hidden_dim, config.intermediate_dim, bias=False)
+         # Down projection: intermediate_dim → hidden_dim
+         self.down_proj = nn.Linear(config.intermediate_dim, config.hidden_dim, bias=False)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         # SwiGLU(x) = (Swish(gate(x)) ⊙ up(x)) · down
+         #
+         # 1) gate: decides what information passes through (Swish activation)
+         gate = F.silu(self.gate_proj(x))  # silu = Swish = x * sigmoid(x)
+         # 2) up: projects the input to the higher dimension
+         up = self.up_proj(x)
+         # 3) element-wise product (gating) → back to the original dimension
+         return self.down_proj(gate * up)
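
The parameter-matching note at the top of the file can be checked directly: a classic FFN with the usual 4× expansion uses 2 × hidden × 4·hidden weights, so a three-matrix SwiGLU matches it at intermediate_dim ≈ (8/3) × hidden_dim. A sketch with an assumed hidden_dim of 2048:

```python
hidden = 2048  # assumed

ffn_classic = 2 * hidden * (4 * hidden)  # W1 + W2 with 4× expansion
inter = int(8 / 3 * hidden)              # ≈ 5461; in practice often rounded to a multiple of 256
ffn_swiglu = 3 * hidden * inter          # W_gate + W_up + W_down

print(ffn_classic, ffn_swiglu)  # 33,554,432 vs 33,552,384 → nearly identical
```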
llm_lab/model/llm_model.py ADDED
@@ -0,0 +1,200 @@
1
+ """Full Transformer Model (LLaMA-style)."""
2
+
3
+ import math
4
+ from typing import Optional, Tuple
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+
10
+ from llm_lab.config import ModelConfig
11
+ from .norm import RMSNorm
12
+ from .transformer_block import TransformerBlock
13
+
14
+
15
+ class LLMModel(nn.Module):
16
+ """1B ํŒŒ๋ผ๋ฏธํ„ฐ LLaMA-style Decoder-Only Transformer.
17
+
18
+ ์ „์ฒด ๊ตฌ์กฐ:
19
+ Input Token IDs
20
+ โ†’ Token Embedding
21
+ โ†’ [TransformerBlock] ร— num_layers (+ Activation Checkpointing)
22
+ โ†’ RMSNorm (์ตœ์ข…)
23
+ โ†’ Linear Head (โ†’ vocab logits)
24
+
25
+ Weight Tying:
26
+ - ์ž…๋ ฅ Embedding๊ณผ ์ถœ๋ ฅ Linear Head์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณต์œ 
27
+ - ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ์ ˆ์•ฝ (~65M) + ์„ฑ๋Šฅ ์œ ์ง€/ํ–ฅ์ƒ
28
+ - ์ง๊ด€: "๋‹จ์–ด์˜ ์˜๋ฏธ ํ‘œํ˜„"๊ณผ "๋‹จ์–ด ์˜ˆ์ธก"์ด ๊ฐ™์€ ๊ณต๊ฐ„์„ ์‚ฌ์šฉ
29
+ """
30
+
31
+ def __init__(self, config: ModelConfig):
32
+ super().__init__()
33
+ self.config = config
34
+
35
+ # โ”€โ”€ Token Embedding โ”€โ”€
36
+ self.token_embedding = nn.Embedding(config.vocab_size, config.hidden_dim)
37
+
38
+ # โ”€โ”€ Transformer Blocks โ”€โ”€
39
+ self.layers = nn.ModuleList([
40
+ TransformerBlock(config, layer_idx=i)
41
+ for i in range(config.num_layers)
42
+ ])
43
+
44
+ # โ”€โ”€ ์ตœ์ข… ์ •๊ทœํ™” โ”€โ”€
45
+ self.final_norm = RMSNorm(config.hidden_dim, eps=config.norm_eps)
46
+
47
+ # โ”€โ”€ ์ถœ๋ ฅ ํ—ค๋“œ (Weight Tying) โ”€โ”€
48
+ self.lm_head = nn.Linear(config.hidden_dim, config.vocab_size, bias=False)
49
+ # Weight Tying: lm_head์˜ ๊ฐ€์ค‘์น˜ = token_embedding์˜ ๊ฐ€์ค‘์น˜
50
+ self.lm_head.weight = self.token_embedding.weight
51
+
52
+ # ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™”
53
+ self._init_weights()
54
+
55
+ def _init_weights(self):
56
+ """๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™” ์ „๋žต.
57
+
58
+ ์™œ ์ดˆ๊ธฐํ™”๊ฐ€ ์ค‘์š”ํ•œ๊ฐ€?
59
+ - ๋„ˆ๋ฌด ํฌ๋ฉด: ํ™œ์„ฑํ™” ํญ๋ฐœ โ†’ NaN
60
+ - ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด: gradient ์†Œ๋ฉธ โ†’ ํ•™์Šต ์ •์ฒด
61
+ - ์ ์ ˆํ•œ ์ดˆ๊ธฐํ™”: ๊ฐ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ ๋ถ„์‚ฐ์„ ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€
62
+
63
+ GPT-2 ์Šคํƒ€์ผ ์ดˆ๊ธฐํ™”:
64
+ - ์ผ๋ฐ˜ Linear: N(0, 0.02)
65
+ - Residual projection: N(0, 0.02 / โˆš(2 ร— num_layers))
66
+ โ†’ ๋ ˆ์ด์–ด๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก residual ๊ธฐ์—ฌ๋ฅผ ์ค„์—ฌ ์•ˆ์ •ํ™”
67
+ """
68
+ std = 0.02
69
+ residual_std = std / math.sqrt(2 * self.config.num_layers)
70
+
71
+ for module in self.modules():
72
+ if isinstance(module, nn.Linear):
73
+ nn.init.normal_(module.weight, mean=0.0, std=std)
74
+ if module.bias is not None:
75
+ nn.init.zeros_(module.bias)
76
+ elif isinstance(module, nn.Embedding):
77
+ nn.init.normal_(module.weight, mean=0.0, std=std)
78
+
79
+ # Residual projection ๋ ˆ์ด์–ด์— ์ถ•์†Œ๋œ ์ดˆ๊ธฐํ™” ์ ์šฉ
80
+ for layer in self.layers:
81
+ nn.init.normal_(layer.attention.o_proj.weight, mean=0.0, std=residual_std)
82
+ nn.init.normal_(layer.feed_forward.down_proj.weight, mean=0.0, std=residual_std)
83
+
84
+ def forward(
85
+ self,
86
+ input_ids: torch.Tensor,
87
+ targets: Optional[torch.Tensor] = None,
88
+ position_offset: int = 0,
89
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
90
+ """
91
+        Args:
+            input_ids: (batch_size, seq_len), token IDs
+            targets: (batch_size, seq_len), target token IDs (during training)
+            position_offset: position offset (during inference)
+
+        Returns:
+            logits: (batch_size, seq_len, vocab_size)
+            loss: scalar (when targets are given) or None
+        """
+        B, S = input_ids.shape
+
+        # ── Step 1: Token Embedding ──
+        # Convert each token ID into a hidden_dim-dimensional vector
+        h = self.token_embedding(input_ids)  # (B, S, hidden_dim)
+
+        # ── Step 2: Transformer Blocks ──
+        # Activation checkpointing: saves memory during training
+        # (intermediate activations are not stored; they are recomputed in backward)
+        for layer in self.layers:
+            if self.training and torch.is_grad_enabled():
+                # Apply activation checkpointing
+                h = torch.utils.checkpoint.checkpoint(
+                    layer, h, None, position_offset,
+                    use_reentrant=False,  # recommended for PyTorch >= 2.0
+                )
+            else:
+                h = layer(h, mask=None, position_offset=position_offset)
+
+        # ── Step 3: Final normalization ──
+        h = self.final_norm(h)
+
+        # ── Step 4: Output logits ──
+        logits = self.lm_head(h)  # (B, S, vocab_size)
+
+        # ── Step 5: Loss (during training) ──
+        loss = None
+        if targets is not None:
+            # Cross-entropy loss: next-token prediction
+            # logits: (B, S, V) → (B*S, V)
+            # targets: (B, S) → (B*S,)
+            loss = F.cross_entropy(
+                logits.view(-1, self.config.vocab_size),
+                targets.view(-1),
+                ignore_index=-100,  # ignore padding tokens
+            )
+
+        return logits, loss
+
+    def count_parameters(self, trainable_only: bool = True) -> int:
+        """Count model parameters."""
+        if trainable_only:
+            return sum(p.numel() for p in self.parameters() if p.requires_grad)
+        return sum(p.numel() for p in self.parameters())
+
+    @torch.no_grad()
+    def generate(
+        self,
+        input_ids: torch.Tensor,
+        max_new_tokens: int = 100,
+        temperature: float = 1.0,
+        top_k: int = 50,
+        top_p: float = 0.9,
+    ) -> torch.Tensor:
+        """Generate text (inference).
+
+        Autoregressive generation: predict one token at a time and append it.
+
+        Args:
+            input_ids: (1, prompt_len), initial prompt
+            max_new_tokens: maximum number of tokens to generate
+            temperature: sharpness of the distribution (lower = more conservative)
+            top_k: consider only the k most probable tokens
+            top_p: consider tokens only up to cumulative probability p (nucleus sampling)
+        """
+        self.eval()
+        generated = input_ids
+
+        for _ in range(max_new_tokens):
+            # If the current sequence exceeds max_seq_len, truncate it
+            ctx = generated[:, -self.config.max_seq_len:]
+
+            # Forward pass
+            logits, _ = self(ctx)
+            # Use only the last token's logits (next-token prediction)
+            next_logits = logits[:, -1, :] / temperature
+
+            # ── Top-K filtering ──
+            if top_k > 0:
+                top_k_values, _ = torch.topk(next_logits, min(top_k, next_logits.size(-1)))
+                min_top_k = top_k_values[:, -1].unsqueeze(-1)
+                next_logits = next_logits.masked_fill(next_logits < min_top_k, float("-inf"))
+
+            # ── Top-P (nucleus) filtering ──
+            if top_p < 1.0:
+                sorted_logits, sorted_indices = torch.sort(next_logits, descending=True)
+                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                # Drop tokens whose cumulative probability, excluding themselves,
+                # already reaches top_p (the most likely token always survives)
+                remove_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) >= top_p
+                sorted_logits[remove_mask] = float("-inf")
+                # Restore the original ordering (sorted_indices is a full permutation,
+                # so scatter overwrites every position)
+                next_logits = sorted_logits.scatter(1, sorted_indices, sorted_logits)
+
+            # Sample from the probability distribution
+            probs = F.softmax(next_logits, dim=-1)
+            next_token = torch.multinomial(probs, num_samples=1)  # (B, 1)
+
+            # Append the generated token
+            generated = torch.cat([generated, next_token], dim=1)
+
+        return generated
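
For readers who want to poke at the sampling filters in isolation, here is a standalone sketch of the same top-k/top-p masking on a made-up logits vector (plain PyTorch, no model required; the tensor values are invented for illustration):

```python
import torch
import torch.nn.functional as F

# Toy next-token logits for a vocabulary of 6 tokens (values are made up).
next_logits = torch.tensor([[2.0, 1.0, 0.5, 0.0, -1.0, -2.0]])
top_k, top_p = 4, 0.9

# Top-K: keep only the 4 largest logits, mask the rest to -inf.
kth_value = torch.topk(next_logits, top_k).values[:, -1, None]
next_logits = next_logits.masked_fill(next_logits < kth_value, float("-inf"))

# Top-P: sort, accumulate probabilities, drop tokens once the mass
# *before* them already reaches top_p (the top token always survives).
sorted_logits, sorted_idx = torch.sort(next_logits, descending=True)
sorted_probs = F.softmax(sorted_logits, dim=-1)
cum_probs = torch.cumsum(sorted_probs, dim=-1)
remove = cum_probs - sorted_probs >= top_p
sorted_logits[remove] = float("-inf")
next_logits = sorted_logits.scatter(1, sorted_idx, sorted_logits)

probs = F.softmax(next_logits, dim=-1)
print(probs)  # masked tokens have probability exactly 0
```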
llm_lab/model/norm.py ADDED
@@ -0,0 +1,40 @@
+"""RMSNorm (Root Mean Square Layer Normalization)."""
+
+import torch
+import torch.nn as nn
+
+
+class RMSNorm(nn.Module):
+    """RMSNorm: a lightweight variant of LayerNorm.
+
+    Differences from standard LayerNorm:
+    - Does not subtract the mean → fewer operations
+    - Normalizes by the RMS (root mean square) instead of the variance
+    - No bias parameter
+
+    Formula:
+        RMSNorm(x) = (x / RMS(x)) * γ
+        RMS(x) = sqrt(mean(x²) + ε)
+
+    Why is normalization needed?
+    → As layers get deeper, activation scales explode or vanish.
+    → Normalization keeps each layer's input in a stable range.
+    """
+
+    def __init__(self, dim: int, eps: float = 1e-6):
+        super().__init__()
+        self.eps = eps
+        # γ (gamma): learnable scale parameter, initialized to 1
+        self.weight = nn.Parameter(torch.ones(dim))
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # 1) Cast the input to float32 (numerical stability):
+        #    summing squares in bf16/fp16 risks overflow
+        x_float = x.float()
+
+        # 2) Inverse RMS: 1 / sqrt(mean(x²) + ε)
+        #    rsqrt replaces division with multiplication (faster)
+        rms = torch.rsqrt(x_float.pow(2).mean(dim=-1, keepdim=True) + self.eps)
+
+        # 3) Normalize, cast back to the original dtype, apply the scale
+        return (x_float * rms).to(x.dtype) * self.weight
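
A minimal sanity check, assuming the class above is importable as `llm_lab.model.norm`, that the forward pass matches the docstring formula on random data:

```python
import torch

from llm_lab.model.norm import RMSNorm  # path as defined in this commit

x = torch.randn(2, 8, 16)
norm = RMSNorm(dim=16)

# Manual version of the docstring formula: (x / sqrt(mean(x²) + ε)) * γ
manual = x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + norm.eps) * norm.weight

assert torch.allclose(norm(x), manual, atol=1e-6)
```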
llm_lab/model/rope.py ADDED
@@ -0,0 +1,103 @@
+"""Rotary Positional Embedding (RoPE)."""
+
+from typing import Tuple
+
+import torch
+import torch.nn as nn
+
+
+class RotaryPositionalEmbedding(nn.Module):
+    """RoPE: relative position encoding via rotation matrices.
+
+    Core idea:
+    - Treat each dimension pair (2i, 2i+1) as coordinates in a 2D plane
+      and rotate them by an angle proportional to the position.
+    - The attention score between two tokens (Q·K) then depends only on
+      their relative distance.
+
+    Why RoPE?
+    - Absolute position embeddings: add a fixed vector per position
+      → generalizes poorly to longer sequences
+    - Relative position embeddings: complex to implement, extra parameters
+    - RoPE: encodes relative position naturally, with no parameters
+
+    Formula:
+        θ_i = theta^(-2i/d)          (i = 0, 1, ..., d/2-1)
+        RoPE(x, pos) = rotate each dimension pair of x by pos × θ_i
+    """
+
+    def __init__(self, dim: int, max_seq_len: int = 2048, theta: float = 10000.0):
+        super().__init__()
+        self.dim = dim
+        self.max_seq_len = max_seq_len
+        self.theta = theta
+
+        # Precompute the frequency vector (not learned → register as a buffer)
+        # freqs[i] = 1 / (theta^(2i/dim)), i = 0, 1, ..., dim/2-1
+        freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
+        self.register_buffer("freqs", freqs, persistent=False)
+
+        # Precompute (max_seq_len, dim/2) cos/sin tables
+        self._build_cache(max_seq_len)
+
+    def _build_cache(self, seq_len: int):
+        """Precompute and cache the cos/sin values."""
+        t = torch.arange(seq_len, device=self.freqs.device, dtype=torch.float32)
+        # Outer product: (seq_len,) × (dim/2,) → (seq_len, dim/2)
+        angles = torch.outer(t, self.freqs)
+        self.register_buffer("cos_cached", angles.cos(), persistent=False)
+        self.register_buffer("sin_cached", angles.sin(), persistent=False)
+
+    def forward(
+        self, q: torch.Tensor, k: torch.Tensor, position_offset: int = 0
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Apply the rotation to Q and K.
+
+        Args:
+            q: (batch, num_heads, seq_len, head_dim)
+            k: (batch, num_kv_heads, seq_len, head_dim)
+            position_offset: offset of the sequence start (when using a KV cache at inference)
+
+        Returns:
+            (q_rotated, k_rotated) with the rotation applied
+        """
+        seq_len = q.shape[2]
+
+        # Grow the cache if needed
+        if position_offset + seq_len > self.cos_cached.shape[0]:
+            self._build_cache(position_offset + seq_len)
+
+        # Slice the cos/sin values for the current positions
+        cos = self.cos_cached[position_offset : position_offset + seq_len]  # (seq_len, dim/2)
+        sin = self.sin_cached[position_offset : position_offset + seq_len]
+
+        q_rotated = self._apply_rotation(q, cos, sin)
+        k_rotated = self._apply_rotation(k, cos, sin)
+        return q_rotated, k_rotated
+
+    @staticmethod
+    def _apply_rotation(
+        x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
+    ) -> torch.Tensor:
+        """Apply the rotation.
+
+        2D rotation matrix:
+            [cos θ, -sin θ] [x1]   [x1·cos θ - x2·sin θ]
+            [sin θ,  cos θ] [x2] = [x1·sin θ + x2·cos θ]
+
+        Implemented efficiently with vectorized operations.
+        """
+        # x: (batch, heads, seq_len, head_dim)
+        # Split even/odd indices: (x0, x1, x2, x3, ...) → (x0, x2, ...), (x1, x3, ...)
+        x_even = x[..., 0::2]  # even indices
+        x_odd = x[..., 1::2]   # odd indices
+
+        # Reshape for broadcasting: (seq_len, dim/2) → (1, 1, seq_len, dim/2)
+        cos = cos.unsqueeze(0).unsqueeze(0)
+        sin = sin.unsqueeze(0).unsqueeze(0)
+
+        # Apply the rotation
+        rotated_even = x_even * cos - x_odd * sin
+        rotated_odd = x_even * sin + x_odd * cos
+
+        # Re-interleave: (even0, odd0, even1, odd1, ...)
+        out = torch.stack([rotated_even, rotated_odd], dim=-1)
+        return out.flatten(-2)  # merge the last two dims to restore the original shape
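
The claim that attention scores depend only on relative distance can be checked directly. A minimal sketch, assuming the module above is importable as `llm_lab.model.rope`: rotating the same Q/K at positions shifted by a constant offset leaves Q·K unchanged.

```python
import torch

from llm_lab.model.rope import RotaryPositionalEmbedding  # path as in this commit

rope = RotaryPositionalEmbedding(dim=64)
q = torch.randn(1, 1, 8, 64)
k = torch.randn(1, 1, 8, 64)

# Same Q/K rotated at positions [0..7] vs. positions [100..107].
q0, k0 = rope(q, k, position_offset=0)
q1, k1 = rope(q, k, position_offset=100)

# Scores depend only on relative distance, so the two match.
s0 = q0 @ k0.transpose(-2, -1)
s1 = q1 @ k1.transpose(-2, -1)
assert torch.allclose(s0, s1, atol=1e-4)
```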
llm_lab/model/transformer_block.py ADDED
@@ -0,0 +1,65 @@
+"""Transformer block (a single layer)."""
+
+from typing import Optional
+
+import torch
+import torch.nn as nn
+
+from llm_lab.config import ModelConfig
+from .norm import RMSNorm
+from .attention import GroupedQueryAttention
+from .feedforward import SwiGLUFeedForward
+
+
+class TransformerBlock(nn.Module):
+    """A single Transformer decoder block.
+
+    Structure (Pre-Norm):
+        x → RMSNorm → Attention → + (residual) → RMSNorm → FFN → + (residual) → out
+
+    Pre-Norm vs Post-Norm:
+    - Post-Norm (original Transformer): LayerNorm after the residual
+      → unstable training in deep models
+    - Pre-Norm (standard since GPT-2): LayerNorm before the sublayer
+      → smoother gradient flow, more stable training
+
+    Role of the residual connection:
+    - Adds the input to the output → a "highway" that lets gradients
+      skip over layers
+    - The key reason a 22-layer stack remains trainable
+    """
+
+    def __init__(self, config: ModelConfig, layer_idx: int):
+        super().__init__()
+        self.layer_idx = layer_idx
+
+        # Pre-Norm: normalize before attention
+        self.attn_norm = RMSNorm(config.hidden_dim, eps=config.norm_eps)
+        # Self-attention
+        self.attention = GroupedQueryAttention(config)
+
+        # Pre-Norm: normalize before the FFN
+        self.ffn_norm = RMSNorm(config.hidden_dim, eps=config.norm_eps)
+        # Feed-forward network
+        self.feed_forward = SwiGLUFeedForward(config)
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        mask: Optional[torch.Tensor] = None,
+        position_offset: int = 0,
+    ) -> torch.Tensor:
+        """
+        Args:
+            x: (batch_size, seq_len, hidden_dim)
+        Returns:
+            (batch_size, seq_len, hidden_dim)
+        """
+        # ── Attention sublayer with residual ──
+        # h = x + Attention(RMSNorm(x))
+        h = x + self.attention(self.attn_norm(x), mask, position_offset)
+
+        # ── FFN sublayer with residual ──
+        # out = h + FFN(RMSNorm(h))
+        out = h + self.feed_forward(self.ffn_norm(h))
+
+        return out
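
As a quick sanity check of the pre-norm residual structure, a sketch assuming the `ModelConfig.debug_10m()` preset that the notebooks use: both sublayers are shape-preserving, so the block maps its input shape to itself.

```python
import torch

from llm_lab.config import ModelConfig
from llm_lab.model.transformer_block import TransformerBlock  # path as in this commit

cfg = ModelConfig.debug_10m()  # small preset used by the notebooks
block = TransformerBlock(cfg, layer_idx=0)

x = torch.randn(2, 16, cfg.hidden_dim)
out = block(x)
assert out.shape == x.shape  # residual blocks preserve the shape
```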
llm_lab/model/utils.py ADDED
@@ -0,0 +1,85 @@
+"""Model utility functions."""
+
+from __future__ import annotations
+
+import math
+from typing import TYPE_CHECKING
+
+from llm_lab.config import ModelConfig
+
+if TYPE_CHECKING:
+    from .llm_model import LLMModel
+
+
+def count_parameters_detailed(model: "LLMModel") -> dict:
+    """Break down the model's parameter count by component."""
+    total = 0
+    breakdown = {}
+
+    # Embedding
+    emb_params = model.token_embedding.weight.numel()
+    breakdown["token_embedding"] = emb_params
+    total += emb_params
+
+    # Per-layer parameters (all layers are identical, so inspect the first)
+    layer_total = 0
+    layer_detail = {}
+    layer = model.layers[0]
+
+    for name, param in layer.named_parameters():
+        layer_detail[name] = param.numel()
+        layer_total += param.numel()
+
+    breakdown["per_layer"] = layer_detail
+    breakdown["per_layer_total"] = layer_total
+    breakdown["all_layers_total"] = layer_total * len(model.layers)
+    total += layer_total * len(model.layers)
+
+    # Final norm
+    norm_params = model.final_norm.weight.numel()
+    breakdown["final_norm"] = norm_params
+    total += norm_params
+
+    # LM head (weight tying → zero additional parameters)
+    breakdown["lm_head"] = "weight tying (0 additional)"
+    breakdown["total"] = total
+
+    return breakdown
+
+
+def estimate_memory_gb(config: ModelConfig, batch_size: int = 4, dtype_bytes: int = 2) -> dict:
+    """Estimate the model's GPU memory usage.
+
+    Args:
+        dtype_bytes: 2 (bf16/fp16) or 4 (fp32)
+    """
+    # Rough parameter count
+    emb = config.vocab_size * config.hidden_dim
+    per_layer = (
+        config.hidden_dim * (config.num_heads + 2 * config.num_kv_heads) * config.head_dim  # QKV
+        + config.num_heads * config.head_dim * config.hidden_dim  # O proj
+        + 3 * config.hidden_dim * config.intermediate_dim  # SwiGLU (gate + up + down)
+        + 2 * config.hidden_dim  # 2 × RMSNorm
+    )
+    total_params = emb + per_layer * config.num_layers + config.hidden_dim
+
+    model_gb = total_params * dtype_bytes / 1e9
+    optimizer_gb = total_params * 8 / 1e9  # AdamW: 2 states × fp32
+    gradient_gb = total_params * dtype_bytes / 1e9
+
+    # Activation memory (assuming activation checkpointing)
+    # Rough estimate: batch_size × seq_len × hidden_dim × sqrt(num_layers)
+    activation_gb = (
+        batch_size * config.max_seq_len * config.hidden_dim * 4  # bytes
+        * math.sqrt(config.num_layers)  # checkpointing effect
+        / 1e9
+    )
+
+    return {
+        "total_parameters": total_params,
+        "model_weights_gb": round(model_gb, 2),
+        "optimizer_states_gb": round(optimizer_gb, 2),
+        "gradients_gb": round(gradient_gb, 2),
+        "activations_estimated_gb": round(activation_gb, 2),
+        "total_estimated_gb": round(model_gb + optimizer_gb + gradient_gb + activation_gb, 2),
+    }
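
A usage sketch for the estimator above (the same call `02_model.ipynb` makes); the resulting numbers depend on the `ModelConfig` presets, so none are asserted here:

```python
from llm_lab.config import ModelConfig
from llm_lab.model import estimate_memory_gb  # exported as in the notebooks

# bf16 training estimate for the 1.1B preset at micro-batch 4.
mem = estimate_memory_gb(ModelConfig.base_1b(), batch_size=4, dtype_bytes=2)
for key, value in mem.items():
    print(f"{key}: {value}")
# Weights and gradients scale with dtype_bytes; AdamW states stay fp32 (8 bytes/param).
```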
llm_lab/training/__init__.py ADDED
@@ -0,0 +1,12 @@
+"""Training module: gradient accumulation, mixed precision, checkpointing, wandb logging."""
+from .scheduler import CosineWarmupScheduler
+from .checkpoint import CheckpointManager
+from .metrics import MetricsTracker
+from .optimizer import create_optimizer
+from .trainer import Trainer
+from .runner import start_training
+
+__all__ = [
+    "CosineWarmupScheduler", "CheckpointManager", "MetricsTracker",
+    "create_optimizer", "Trainer", "start_training",
+]
llm_lab/training/checkpoint.py ADDED
@@ -0,0 +1,159 @@
+"""Manager for saving/restoring training state."""
+
+import json
+import shutil
+import time
+from pathlib import Path
+from typing import Any, Dict, Optional
+
+import torch
+import torch.nn as nn
+
+from llm_lab.config import TrainConfig
+
+
+class CheckpointManager:
+    """Manager for saving/restoring training state.
+
+    Why checkpoints matter on Colab:
+    - When a session expires (after at most ~24 hours), all in-memory state is lost
+    - Saving to Google Drive allows training to continue across sessions
+    - The optimizer state must be saved too, so AdamW momentum is preserved
+
+    What gets saved:
+    - model_state_dict: model weights
+    - optimizer_state_dict: optimizer state (m, v moments)
+    - step: current training step
+    - best_val_loss: lowest validation loss so far
+    - config: training configuration (for reproducibility)
+    - rng_states: random seed states (for exact reproduction)
+    - metrics_history: training metric history
+    - wandb_run_id: wandb run ID (for logging continuity)
+    """
+
+    def __init__(self, config: TrainConfig):
+        self.config = config
+        self.checkpoint_dir = Path(config.checkpoint_dir)
+        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
+        self.max_checkpoints = config.max_checkpoints
+
+    def save(
+        self,
+        model: nn.Module,
+        optimizer: torch.optim.Optimizer,
+        step: int,
+        best_val_loss: float,
+        metrics_history: Dict[str, list],
+        wandb_run_id: Optional[str] = None,
+    ):
+        """Save a checkpoint."""
+        ckpt_path = self.checkpoint_dir / f"step_{step:06d}"
+        ckpt_path.mkdir(parents=True, exist_ok=True)
+
+        print(f"\n💾 Saving checkpoint: {ckpt_path}")
+        start = time.time()
+
+        # 1) Model weights (kept in bf16 as-is)
+        torch.save(model.state_dict(), ckpt_path / "model.pt")
+
+        # 2) Optimizer state (includes fp32 moments, so it is large)
+        torch.save(optimizer.state_dict(), ckpt_path / "optimizer.pt")
+
+        # 3) Training metadata
+        meta = {
+            "step": step,
+            "best_val_loss": best_val_loss,
+            "wandb_run_id": wandb_run_id,
+            "config": self.config.__dict__,
+        }
+        with open(ckpt_path / "meta.json", "w") as f:
+            json.dump(meta, f, indent=2)
+
+        # 4) Metric history
+        torch.save(metrics_history, ckpt_path / "metrics.pt")
+
+        # 5) RNG states (for exact reproduction)
+        rng_states = {
+            "python": torch.random.get_rng_state(),
+            "cuda": torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
+        }
+        torch.save(rng_states, ckpt_path / "rng_states.pt")
+
+        elapsed = time.time() - start
+        ckpt_size = sum(f.stat().st_size for f in ckpt_path.rglob("*")) / 1e9
+        print(f"   Saved: {ckpt_size:.2f} GB in {elapsed:.1f}s")
+
+        # Delete old checkpoints (rolling window)
+        self._cleanup_old_checkpoints()
+
+    def load_latest(
+        self,
+        model: nn.Module,
+        optimizer: Optional[torch.optim.Optimizer] = None,
+        device: torch.device = torch.device("cpu"),
+    ) -> Optional[Dict[str, Any]]:
+        """Load the most recent checkpoint.
+
+        Returns:
+            {"step", "best_val_loss", "wandb_run_id", "metrics_history"},
+            or None if no checkpoint exists
+        """
+        ckpt_path = self._find_latest()
+        if ckpt_path is None:
+            print("[Checkpoint] No saved checkpoint. Starting from scratch.")
+            return None
+
+        print(f"\n📂 Loading checkpoint: {ckpt_path}")
+        start = time.time()
+
+        # 1) Model weights
+        model_state = torch.load(ckpt_path / "model.pt", map_location=device, weights_only=True)
+        model.load_state_dict(model_state)
+        del model_state  # free memory
+
+        # 2) Optimizer state
+        if optimizer is not None:
+            optim_state = torch.load(ckpt_path / "optimizer.pt", map_location=device, weights_only=True)
+            optimizer.load_state_dict(optim_state)
+            del optim_state
+
+        # 3) Metadata
+        with open(ckpt_path / "meta.json", "r") as f:
+            meta = json.load(f)
+
+        # 4) Metric history
+        metrics_history = {}
+        metrics_path = ckpt_path / "metrics.pt"
+        if metrics_path.exists():
+            metrics_history = torch.load(metrics_path, weights_only=False)
+
+        # 5) Restore RNG states
+        rng_path = ckpt_path / "rng_states.pt"
+        if rng_path.exists():
+            rng_states = torch.load(rng_path, weights_only=False)
+            torch.random.set_rng_state(rng_states["python"])
+            if rng_states["cuda"] is not None and torch.cuda.is_available():
+                torch.cuda.set_rng_state(rng_states["cuda"])
+
+        elapsed = time.time() - start
+        print(f"   Loaded: step={meta['step']} in {elapsed:.1f}s")
+
+        return {
+            "step": meta["step"],
+            "best_val_loss": meta["best_val_loss"],
+            "wandb_run_id": meta.get("wandb_run_id"),
+            "metrics_history": metrics_history,
+        }
+
+    def _find_latest(self) -> Optional[Path]:
+        """Find the most recent checkpoint path."""
+        ckpts = sorted(self.checkpoint_dir.glob("step_*"))
+        return ckpts[-1] if ckpts else None
+
+    def _cleanup_old_checkpoints(self):
+        """Delete old checkpoints (rolling window)."""
+        ckpts = sorted(self.checkpoint_dir.glob("step_*"))
+        while len(ckpts) > self.max_checkpoints:
+            old = ckpts.pop(0)
+            print(f"   🗑️ Deleting old checkpoint: {old.name}")
+            shutil.rmtree(old)
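
A round-trip sketch on a toy module; `SimpleNamespace` stands in for `TrainConfig`, supplying only the two fields `CheckpointManager` actually reads (a demo assumption, not the real config class):

```python
from types import SimpleNamespace

import torch
import torch.nn as nn

from llm_lab.training.checkpoint import CheckpointManager  # path as in this commit

# Stand-in for TrainConfig: only checkpoint_dir and max_checkpoints are read.
cfg = SimpleNamespace(checkpoint_dir="./ckpt_demo", max_checkpoints=2)

model = nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

mgr = CheckpointManager(cfg)
mgr.save(model, opt, step=10, best_val_loss=3.5, metrics_history={"step": [10]})

# A fresh model picks up where the old one left off.
state = mgr.load_latest(nn.Linear(4, 4))
print(state["step"], state["best_val_loss"])  # 10 3.5
```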
llm_lab/training/metrics.py ADDED
@@ -0,0 +1,112 @@
+"""Training metric tracking and logging."""
+
+from typing import Dict, Optional
+
+from llm_lab.config import TrainConfig
+
+
+class MetricsTracker:
+    """Tracks and logs training metrics.
+
+    Tracked items:
+    - train/loss: training loss (cross-entropy)
+    - train/lr: current learning rate
+    - train/grad_norm: gradient L2 norm
+    - train/tokens_per_sec: throughput
+    - train/gpu_mem_gb: GPU memory usage
+    - val/loss: validation loss
+    - val/perplexity: validation perplexity (= exp(loss))
+    """
+
+    def __init__(self, config: TrainConfig):
+        self.config = config
+        self.history: Dict[str, list] = {
+            "step": [],
+            "train_loss": [],
+            "learning_rate": [],
+            "grad_norm": [],
+            "tokens_per_sec": [],
+            "gpu_mem_gb": [],
+            "val_loss": [],
+            "val_ppl": [],
+        }
+
+        # Initialize wandb
+        self.wandb_run = None
+        if config.use_wandb:
+            self._init_wandb()
+
+    def _init_wandb(self, resume_id: Optional[str] = None):
+        """Initialize wandb (supports resumed logging across sessions)."""
+        try:
+            import wandb
+
+            run_id = resume_id or wandb.util.generate_id()
+            self.wandb_run = wandb.init(
+                project=self.config.wandb_project,
+                name=self.config.wandb_run_name or f"1b-run-{run_id[:6]}",
+                id=run_id,
+                resume="allow",
+                config=self.config.__dict__,
+            )
+            print(f"[wandb] Initialized: {self.wandb_run.url}")
+        except ImportError:
+            print("[wandb] Not installed. Using console logging only.")
+            self.config.use_wandb = False
+        except Exception as e:
+            print(f"[wandb] Initialization failed: {e}. Using console logging only.")
+            self.config.use_wandb = False
+
+    def resume_wandb(self, run_id: str):
+        """Resume logging to a previous wandb run."""
+        if self.config.use_wandb:
+            self._init_wandb(resume_id=run_id)
+
+    def log_train_step(
+        self,
+        step: int,
+        loss: float,
+        lr: float,
+        grad_norm: float,
+        tokens_per_sec: float,
+        gpu_mem_gb: float,
+    ):
+        """Record metrics for a training step."""
+        self.history["step"].append(step)
+        self.history["train_loss"].append(loss)
+        self.history["learning_rate"].append(lr)
+        self.history["grad_norm"].append(grad_norm)
+        self.history["tokens_per_sec"].append(tokens_per_sec)
+        self.history["gpu_mem_gb"].append(gpu_mem_gb)
+
+        if self.config.use_wandb and self.wandb_run:
+            import wandb
+
+            wandb.log({
+                "train/loss": loss,
+                "train/lr": lr,
+                "train/grad_norm": grad_norm,
+                "train/tokens_per_sec": tokens_per_sec,
+                "train/gpu_mem_gb": gpu_mem_gb,
+            }, step=step)
+
+    def log_eval(self, step: int, val_loss: float, val_ppl: float):
+        """Record validation metrics."""
+        self.history["val_loss"].append(val_loss)
+        self.history["val_ppl"].append(val_ppl)
+
+        if self.config.use_wandb and self.wandb_run:
+            import wandb
+
+            wandb.log({
+                "val/loss": val_loss,
+                "val/perplexity": val_ppl,
+            }, step=step)
+
+    @property
+    def wandb_run_id(self) -> Optional[str]:
+        if self.wandb_run:
+            return self.wandb_run.id
+        return None
llm_lab/training/optimizer.py ADDED
@@ -0,0 +1,54 @@
+"""Create the AdamW optimizer (with weight-decay separation)."""
+
+import torch
+import torch.nn as nn
+
+from llm_lab.config import TrainConfig
+
+
+def create_optimizer(model: nn.Module, config: TrainConfig) -> torch.optim.AdamW:
+    """Create the AdamW optimizer.
+
+    Weight-decay separation rules:
+    - Decay applied: linear weights (attention projections, FFN, etc.)
+    - Decay not applied: embeddings, LayerNorm/RMSNorm, biases
+
+    Why separate them?
+    - Weight decay penalizes large weights to prevent overfitting
+    - Applied to norm scale parameters, it interferes with normalization
+    - Applied to embeddings, it shrinks rare-token representations toward 0
+    - Excluding 1D parameters (bias, norm weight) from decay is the convention
+    """
+    # Split parameters into decay / no-decay groups
+    decay_params = []
+    no_decay_params = []
+
+    for name, param in model.named_parameters():
+        if not param.requires_grad:
+            continue
+
+        # 1D tensors (bias, norm weight) or embeddings → no decay
+        if param.dim() <= 1 or "embedding" in name:
+            no_decay_params.append(param)
+        else:
+            decay_params.append(param)
+
+    param_groups = [
+        {"params": decay_params, "weight_decay": config.weight_decay},
+        {"params": no_decay_params, "weight_decay": 0.0},
+    ]
+
+    n_decay = sum(p.numel() for p in decay_params)
+    n_no_decay = sum(p.numel() for p in no_decay_params)
+    print(f"[Optimizer] Decay parameters: {n_decay:,} ({n_decay/1e6:.1f}M)")
+    print(f"[Optimizer] No-decay parameters: {n_no_decay:,} ({n_no_decay/1e6:.1f}M)")
+
+    optimizer = torch.optim.AdamW(
+        param_groups,
+        lr=config.learning_rate,
+        betas=(config.beta1, config.beta2),
+        eps=config.adam_eps,
+        fused=torch.cuda.is_available(),  # CUDA fused AdamW (faster)
+    )
+
+    return optimizer
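
The grouping rule is easy to verify on a toy model; `TinyModel` below is hypothetical and exists only to show which parameters land in which group:

```python
import torch.nn as nn

# Hypothetical toy model covering all three cases of the grouping rule.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(100, 8)  # name contains "embedding" → no decay
        self.proj = nn.Linear(8, 8)                  # 2D weight → decay; 1D bias → no decay
        self.norm = nn.LayerNorm(8)                  # 1D weight/bias → no decay

model = TinyModel()
for name, p in model.named_parameters():
    group = "no_decay" if p.dim() <= 1 or "embedding" in name else "decay"
    print(f"{name:30s} dim={p.dim()} → {group}")
# Only proj.weight receives weight decay under the rule above.
```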
llm_lab/training/runner.py ADDED
@@ -0,0 +1,68 @@
+"""Training launch helper (quick start)."""
+
+from pathlib import Path
+from typing import Optional
+
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader
+
+from llm_lab.config import TrainConfig
+from .trainer import Trainer
+from llm_lab.utils import auto_configure
+
+
+def start_training(
+    model: nn.Module,
+    train_dataloader: DataLoader,
+    val_dataloader: Optional[DataLoader] = None,
+    config: Optional[TrainConfig] = None,
+    seq_len: int = 2048,
+    auto_config: bool = True,
+) -> Trainer:
+    """Start training (one-line launch).
+
+    Usage (Colab):
+    ```python
+    from llm_lab.config import ModelConfig
+    from llm_lab.model import LLMModel
+    from llm_lab.data import setup_data_pipeline
+    from llm_lab.training import start_training
+
+    # 1. Create the model
+    model_config = ModelConfig.base_1b()
+    model = LLMModel(model_config)
+
+    # 2. Data pipeline
+    tok, train_dl, val_dl = setup_data_pipeline("pretrained")
+
+    # 3. Start training (auto-resumes from checkpoints)
+    trainer = start_training(model, train_dl, val_dl)
+    ```
+    """
+    config = config or TrainConfig()
+
+    # Auto-detect the GPU and adjust settings
+    if auto_config:
+        config = auto_configure(config)
+
+    # Check that Google Drive is mounted (Colab)
+    if "/content/drive" in config.checkpoint_dir:
+        drive_path = Path("/content/drive/MyDrive")
+        if not drive_path.exists():
+            print("\n⚠️ Google Drive is not mounted!")
+            print("   In Colab, run: from google.colab import drive; drive.mount('/content/drive')")
+            print("   Falling back to a local path.")
+            config.checkpoint_dir = "./checkpoints"
+
+    # Seed for reproducibility
+    torch.manual_seed(config.seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed(config.seed)
+
+    # Create the Trainer (includes automatic checkpoint resume)
+    trainer = Trainer(model, train_dataloader, val_dataloader, config, seq_len)
+
+    # Run training
+    trainer.train()
+
+    return trainer
llm_lab/training/scheduler.py ADDED
@@ -0,0 +1,68 @@
+"""Cosine annealing with linear warmup scheduler."""
+
+import math
+
+import torch
+
+from llm_lab.config import TrainConfig
+
+
+class CosineWarmupScheduler:
+    """Cosine annealing with linear warmup.
+
+    LR curve:
+        ┌─── peak_lr ───────╲
+        │                    ╲  cosine decay
+        │ warmup (linear)     ╲
+        │/                     ╲_______ min_lr
+        └──────────────────────────────────→ steps
+
+    Why cosine decay?
+    - Step decay: abrupt LR drops → unstable loss
+    - Linear decay: LR shrinks too fast late in training
+    - Cosine: smooth decay, keeps a useful LR late in training
+    - Used by most LLMs: GPT-3, LLaMA, Chinchilla, etc.
+
+    Implementation note:
+        PyTorch has built-in schedulers (CosineAnnealingLR, etc.), but a
+        hand-rolled version is more flexible for combining warmup, min_lr,
+        and checkpoint restoration.
+    """
+
+    def __init__(self, config: TrainConfig):
+        self.peak_lr = config.learning_rate
+        self.min_lr = config.min_learning_rate
+        self.warmup_steps = config.warmup_steps
+        self.total_steps = config.total_steps
+
+    def get_lr(self, step: int) -> float:
+        """Return the learning rate for the given step.
+
+        Args:
+            step: current optimizer step (0-indexed)
+
+        Returns:
+            learning rate (float)
+        """
+        # Phase 1: linear warmup
+        if step < self.warmup_steps:
+            # Linear increase from 0 → peak_lr
+            return self.peak_lr * (step / self.warmup_steps)
+
+        # Phase 2: cosine decay
+        # Progress through the post-warmup phase (0.0 → 1.0)
+        decay_steps = self.total_steps - self.warmup_steps
+        progress = (step - self.warmup_steps) / max(decay_steps, 1)
+        progress = min(progress, 1.0)  # safety clamp
+
+        # Cosine formula: min_lr + 0.5 × (peak - min) × (1 + cos(π × progress))
+        cosine_decay = 0.5 * (1.0 + math.cos(math.pi * progress))
+        lr = self.min_lr + (self.peak_lr - self.min_lr) * cosine_decay
+
+        return lr
+
+    def set_lr(self, optimizer: torch.optim.Optimizer, step: int):
+        """Update the optimizer's learning rate."""
+        lr = self.get_lr(step)
+        for param_group in optimizer.param_groups:
+            param_group["lr"] = lr
+        return lr
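
A quick numeric check of the three phases; `SimpleNamespace` stands in for `TrainConfig`, supplying only the four fields the scheduler reads (a demo assumption):

```python
from types import SimpleNamespace

from llm_lab.training.scheduler import CosineWarmupScheduler  # path as in this commit

cfg = SimpleNamespace(
    learning_rate=3e-4, min_learning_rate=3e-5,
    warmup_steps=100, total_steps=1000,
)
sched = CosineWarmupScheduler(cfg)

print(sched.get_lr(0))     # 0.0       (warmup start)
print(sched.get_lr(50))    # 1.5e-4    (halfway up the warmup ramp)
print(sched.get_lr(100))   # 3e-4      (peak, cosine phase begins)
print(sched.get_lr(550))   # 1.65e-4   (midpoint of the cosine decay)
print(sched.get_lr(1000))  # 3e-5      (min_lr at the end)
```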
llm_lab/training/trainer.py ADDED
@@ -0,0 +1,351 @@
+"""LLM pretraining trainer."""
+
+import math
+import time
+from typing import Dict, Optional, Tuple
+
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader
+
+from llm_lab.config import TrainConfig
+from .scheduler import CosineWarmupScheduler
+from .checkpoint import CheckpointManager
+from .metrics import MetricsTracker
+from .optimizer import create_optimizer
+
+
+class Trainer:
+    """LLM pretraining trainer.
+
+    Core structure of the training loop:
+    ```
+    for step in range(total_steps):
+        # ── Gradient Accumulation Loop ──
+        for micro_step in range(accumulation_steps):
+            batch = next(dataloader)
+            with autocast(bf16):
+                logits, loss = model(input_ids, targets)
+            scaled_loss = loss / accumulation_steps
+            scaled_loss.backward()  # accumulate gradients
+
+        # ── Optimizer Step (after accumulation) ──
+        clip_grad_norm(model, max_norm=1.0)
+        optimizer.step()
+        optimizer.zero_grad()
+        scheduler.set_lr(optimizer, step)
+    ```
+
+    What is gradient accumulation?
+    - Used when a large batch cannot fit in GPU memory at once
+    - Run forward/backward on several small micro-batches → accumulate gradients
+    - Take a single optimizer step after accumulating
+    - The result is equivalent to one large effective batch
+    - Why divide the loss by accumulation_steps:
+      to average the gradients (mean rather than sum)
+    """
+
+    def __init__(
+        self,
+        model: nn.Module,
+        train_dataloader: DataLoader,
+        val_dataloader: Optional[DataLoader],
+        config: TrainConfig,
+        seq_len: int = 2048,
+    ):
+        self.config = config
+        self.seq_len = seq_len
+
+        # ── Device setup ──
+        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        print(f"[Trainer] Device: {self.device}")
+        if torch.cuda.is_available():
+            print(f"[Trainer] GPU: {torch.cuda.get_device_name()}")
+            print(f"[Trainer] GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
+
+        # ── Model ──
+        self.model = model.to(self.device)
+        # torch.compile: PyTorch 2.0+ graph optimization (10-30% speedup)
+        if torch.cuda.is_available() and hasattr(torch, "compile"):
+            print("[Trainer] Applying torch.compile...")
+            self.model = torch.compile(self.model)
+
+        # ── Data ──
+        self.train_dataloader = train_dataloader
+        self.val_dataloader = val_dataloader
+        self.train_iter = iter(train_dataloader)
+
+        # ── Optimizer ──
+        self.optimizer = create_optimizer(self.model, config)
+
+        # ── Scheduler ──
+        self.scheduler = CosineWarmupScheduler(config)
+
+        # ── Checkpointing ──
+        self.ckpt_manager = CheckpointManager(config)
+
+        # ── Metrics ──
+        self.metrics = MetricsTracker(config)
+
+        # ── Training state ──
+        self.global_step = 0
+        self.best_val_loss = float("inf")
+        self.tokens_seen = 0
+
+        # ── Mixed precision ──
+        # bf16 needs no GradScaler (only fp16 does; omitted here for simplicity)
+        self.use_amp = config.dtype != "float32"
+        self.amp_dtype = config.torch_dtype
+
+        # ── Attempt auto-resume ──
+        self._try_resume()
+
+    def _try_resume(self):
+        """Automatically restore from a previous checkpoint if one exists."""
+        result = self.ckpt_manager.load_latest(
+            self.model, self.optimizer, self.device
+        )
+
+        if result is not None:
+            self.global_step = result["step"]
+            self.best_val_loss = result["best_val_loss"]
+            self.metrics.history = result.get("metrics_history", self.metrics.history)
+
+            # Resume wandb logging
+            if result.get("wandb_run_id"):
+                self.metrics.resume_wandb(result["wandb_run_id"])
+
+            self.tokens_seen = self.global_step * self.config.effective_batch_size * self.seq_len
+            print(f"[Trainer] Resuming training: step={self.global_step}, "
+                  f"tokens={self.tokens_seen/1e9:.2f}B, "
+                  f"best_val_loss={self.best_val_loss:.4f}")
+
+    def _get_next_batch(self) -> Dict[str, torch.Tensor]:
+        """Fetch the next training batch.
+
+        A streaming DataLoader has no notion of epochs, so on
+        StopIteration a fresh iterator is created.
+        """
+        try:
+            batch = next(self.train_iter)
+        except StopIteration:
+            self.train_iter = iter(self.train_dataloader)
+            batch = next(self.train_iter)
+
+        return {
+            "input_ids": batch["input_ids"].to(self.device, non_blocking=True),
+            "targets": batch["targets"].to(self.device, non_blocking=True),
+        }
+
+    def _train_step(self) -> Tuple[float, float]:
+        """Perform a single optimizer step.
+
+        Returns:
+            (loss, grad_norm)
+        """
+        self.model.train()
+        self.optimizer.zero_grad(set_to_none=True)
+        # set_to_none=True: sets gradients to None → saves memory
+
+        total_loss = 0.0
+
+        # ── Gradient Accumulation Loop ──
+        for micro_step in range(self.config.gradient_accumulation_steps):
+            batch = self._get_next_batch()
+
+            # Mixed-precision forward
+            with torch.amp.autocast(device_type="cuda", dtype=self.amp_dtype, enabled=self.use_amp):
+                logits, loss = self.model(batch["input_ids"], batch["targets"])
+
+            # Loss scaling: yields the mean over the effective batch
+            scaled_loss = loss / self.config.gradient_accumulation_steps
+            total_loss += loss.item()
+
+            # Backward (accumulate gradients)
+            scaled_loss.backward()
+
+        # ── Gradient clipping ──
+        # Treat all parameters' gradients as one vector and compute its L2 norm;
+        # if the norm exceeds max_norm, scale it down proportionally
+        grad_norm = torch.nn.utils.clip_grad_norm_(
+            self.model.parameters(),
+            max_norm=self.config.grad_clip,
+        ).item()
+
+        # ── Optimizer step ──
+        self.optimizer.step()
+
+        # ── LR update ──
+        self.scheduler.set_lr(self.optimizer, self.global_step)
+
+        avg_loss = total_loss / self.config.gradient_accumulation_steps
+        return avg_loss, grad_norm
+
+    @torch.no_grad()
+    def _evaluate(self) -> Tuple[float, float]:
+        """Measure loss and perplexity on the validation data.
+
+        Perplexity = exp(loss)
+        - Intuition: "from how many candidates does the model effectively
+          choose the next token, on average"
+        - PPL 100 → like picking uniformly among 100 options
+        - PPL 20  → among 20 options (quite good)
+        - PPL 10  → very confident predictions
+        """
+        if self.val_dataloader is None:
+            return float("inf"), float("inf")
+
+        self.model.eval()
+        total_loss = 0.0
+        num_batches = 0
+
+        for i, batch in enumerate(self.val_dataloader):
+            if i >= self.config.eval_steps:
+                break
+
+            input_ids = batch["input_ids"].to(self.device)
+            targets = batch["targets"].to(self.device)
+
+            with torch.amp.autocast(device_type="cuda", dtype=self.amp_dtype, enabled=self.use_amp):
+                _, loss = self.model(input_ids, targets)
+
+            total_loss += loss.item()
+            num_batches += 1
+
+        avg_loss = total_loss / max(num_batches, 1)
+        perplexity = math.exp(min(avg_loss, 20))  # overflow guard (exp(20) ≈ 5e8)
+
+        return avg_loss, perplexity
+
+    def train(self):
+        """Main training loop.
+
+        This method runs the entire training. If a Colab session expires,
+        training resumes automatically from the latest checkpoint.
+        """
+        config = self.config
+
+        print("\n" + "=" * 70)
+        print("🚀 Training started")
+        print("=" * 70)
+        print(f"  Total steps: {config.total_steps:,}")
+        print(f"  Starting step: {self.global_step}")
+        print(f"  Effective batch size: {config.effective_batch_size}")
+        print(f"  Tokens/step: {config.effective_batch_size * self.seq_len:,}")
+        print(f"  Total training tokens (projected): {config.total_steps * config.effective_batch_size * self.seq_len / 1e9:.1f}B")
+        print(f"  Mixed precision: {config.dtype}")
+        print(f"  Gradient accumulation: {config.gradient_accumulation_steps}")
+        print(f"  Checkpoints: {config.checkpoint_dir}")
+        print("=" * 70 + "\n")
+
+        step_start_time = time.time()
+        tokens_at_log_start = self.tokens_seen
+
+        # ════════════════════════════════════════════
+        # Main loop
+        # ════════════════════════════════════════════
+
+        while self.global_step < config.total_steps:
+
+            # ── Train step ──
+            loss, grad_norm = self._train_step()
+            self.global_step += 1
+            self.tokens_seen += config.effective_batch_size * self.seq_len
+
+            # ── Logging ──
+            if self.global_step % config.log_interval == 0:
+                elapsed = time.time() - step_start_time
+                tokens_delta = self.tokens_seen - tokens_at_log_start
+                tokens_per_sec = tokens_delta / max(elapsed, 1e-6)
+
+                # GPU memory
+                gpu_mem_gb = 0.0
+                if torch.cuda.is_available():
+                    gpu_mem_gb = torch.cuda.max_memory_allocated() / 1e9
+
+                # Current LR
+                current_lr = self.scheduler.get_lr(self.global_step)
+
+                # Estimate remaining time
+                remaining_steps = config.total_steps - self.global_step
+                steps_per_sec = config.log_interval / max(elapsed, 1e-6)
+                eta_seconds = remaining_steps / max(steps_per_sec, 1e-6)
+                eta_hours = eta_seconds / 3600
+
+                # Console output
+                print(
+                    f"  Step {self.global_step:>6d}/{config.total_steps} │ "
+                    f"Loss {loss:.4f} │ "
+                    f"LR {current_lr:.2e} │ "
+                    f"Grad {grad_norm:.2f} │ "
+                    f"{tokens_per_sec:,.0f} tok/s │ "
+                    f"GPU {gpu_mem_gb:.1f}GB │ "
+                    f"ETA {eta_hours:.1f}h │ "
+                    f"Tokens {self.tokens_seen/1e9:.2f}B"
+                )
+
+                # wandb logging
+                self.metrics.log_train_step(
+                    step=self.global_step,
+                    loss=loss,
+                    lr=current_lr,
+                    grad_norm=grad_norm,
+                    tokens_per_sec=tokens_per_sec,
+                    gpu_mem_gb=gpu_mem_gb,
+                )
+
+                step_start_time = time.time()
+                tokens_at_log_start = self.tokens_seen
+
+            # ── Evaluation ──
+            if self.global_step % config.eval_interval == 0:
+                val_loss, val_ppl = self._evaluate()
+
+                print(f"\n  📊 Eval @ Step {self.global_step}: "
+                      f"Val Loss = {val_loss:.4f}, "
+                      f"Val PPL = {val_ppl:.2f}")
+
+                self.metrics.log_eval(self.global_step, val_loss, val_ppl)
+
+                if val_loss < self.best_val_loss:
+                    self.best_val_loss = val_loss
+                    print(f"  🏆 New best val loss: {val_loss:.4f}")
+
+                print()
+
+            # ── Checkpoint ──
+            if self.global_step % config.checkpoint_interval == 0:
+                self.ckpt_manager.save(
+                    model=self.model,
+                    optimizer=self.optimizer,
+                    step=self.global_step,
+                    best_val_loss=self.best_val_loss,
+                    metrics_history=self.metrics.history,
+                    wandb_run_id=self.metrics.wandb_run_id,
+                )
+
+        # ════════════════════════════════════════════
+        # Training complete
+        # ════════════════════════════════════════════
+
+        print("\n" + "=" * 70)
+        print("🎉 Training complete!")
+        print("=" * 70)
+        print(f"  Total steps: {self.global_step:,}")
+        print(f"  Total tokens: {self.tokens_seen/1e9:.2f}B")
+        print(f"  Best val loss: {self.best_val_loss:.4f}")
+        print(f"  Best val PPL: {math.exp(min(self.best_val_loss, 20)):.2f}")
+        print("=" * 70)
+
+        # Save the final checkpoint
+        self.ckpt_manager.save(
+            model=self.model,
+            optimizer=self.optimizer,
+            step=self.global_step,
+            best_val_loss=self.best_val_loss,
+            metrics_history=self.metrics.history,
+            wandb_run_id=self.metrics.wandb_run_id,
+        )
+
+        if self.config.use_wandb and self.metrics.wandb_run:
+            import wandb
+            wandb.finish()
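
The "loss / accumulation_steps gives the mean gradient" claim from the class docstring can be demonstrated without any project code: accumulated micro-batch gradients match a single full-batch backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model_a = nn.Linear(4, 1)
model_b = nn.Linear(4, 1)
model_b.load_state_dict(model_a.state_dict())  # identical weights

x = torch.randn(8, 4)
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()

# (a) One full-batch backward over all 8 samples.
loss_fn(model_a(x), y).backward()

# (b) Four micro-batches of 2, each loss divided by accumulation_steps;
#     backward() accumulates into .grad across the loop.
accumulation_steps = 4
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    (loss_fn(model_b(xb), yb) / accumulation_steps).backward()

assert torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6)
```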
llm_lab/utils/__init__.py ADDED
@@ -0,0 +1,5 @@
+"""Common utilities: device detection, seed setup."""
+from .device import get_device, detect_gpu_info, auto_configure
+from .seed import set_seed
+
+__all__ = ["get_device", "detect_gpu_info", "auto_configure", "set_seed"]
llm_lab/utils/device.py ADDED
@@ -0,0 +1,94 @@
+"""Device detection and auto-configuration utilities."""
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+import torch
+
+if TYPE_CHECKING:
+    from llm_lab.config import TrainConfig
+
+
+def get_device() -> torch.device:
+    """Return the available device (cuda or cpu)."""
+    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+
+def detect_gpu_info() -> dict:
+    """Return the GPU name and memory information.
+
+    Returns:
+        {"name": str, "memory_gb": float}, or an empty dict if no GPU
+    """
+    if not torch.cuda.is_available():
+        return {}
+    return {
+        "name": torch.cuda.get_device_name(),
+        "memory_gb": round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1),
+    }
+
+
+def auto_configure(config: "TrainConfig") -> "TrainConfig":
+    """Adjust the configuration automatically based on the GPU type.
+
+    Even on Colab Pro+, an A100 is not always assigned.
+    If a T4 or V100 is assigned instead, the settings adapt automatically.
+
+    Returns:
+        The adjusted TrainConfig
+    """
+    if not torch.cuda.is_available():
+        print("⚠️ No GPU! CPU mode (very slow)")
+        config.dtype = "float32"
+        config.micro_batch_size = 1
+        config.gradient_accumulation_steps = 4
+        return config
+
+    gpu_name = torch.cuda.get_device_name().lower()
+    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
+
+    print(f"\n🔍 GPU detected: {torch.cuda.get_device_name()} ({gpu_mem:.1f} GB)")
+
+    if "a100" in gpu_name:
+        # A100 40GB: keep the defaults (optimal)
+        print("   → A100 detected: using defaults (bf16, batch=4)")
+        config.dtype = "bfloat16"
+        config.micro_batch_size = 4
+
+    elif "v100" in gpu_name:
+        # V100 16GB: no bf16 support, smaller batch
+        print("   → V100 detected: fp16 mode, reduced batch")
+        config.dtype = "float16"
+        config.micro_batch_size = 2
+        config.gradient_accumulation_steps = 64  # keeps the effective batch
+
+    elif "t4" in gpu_name:
+        # T4 16GB: no bf16 support, minimal batch
+        print("   → T4 detected: fp16 mode, minimal batch")
+        config.dtype = "float16"
+        config.micro_batch_size = 1
+        config.gradient_accumulation_steps = 128
+
+    elif "l4" in gpu_name:
+        # L4 24GB: bf16 supported
+        print("   → L4 detected: bf16 mode, adjusted batch")
+        config.dtype = "bfloat16"
+        config.micro_batch_size = 2
+        config.gradient_accumulation_steps = 64
+
+    else:
+        print("   → Unknown GPU. Adjusting settings based on memory")
+        if gpu_mem >= 30:
+            config.micro_batch_size = 4
+        elif gpu_mem >= 16:
+            config.micro_batch_size = 2
+        else:
+            config.micro_batch_size = 1
+            config.gradient_accumulation_steps = 128
+
+    print(f"   → dtype: {config.dtype}")
+    print(f"   → micro_batch: {config.micro_batch_size}")
+    print(f"   → grad_accum: {config.gradient_accumulation_steps}")
+    print(f"   → effective_batch: {config.effective_batch_size}")
+
+    return config
llm_lab/utils/seed.py ADDED
@@ -0,0 +1,9 @@
+"""Seed utilities for reproducibility."""
+import torch
+
+
+def set_seed(seed: int = 42):
+    """Set the seed for reproducibility."""
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed(seed)
notebooks/01_data_pipeline.ipynb ADDED
@@ -0,0 +1,169 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 01. Data Pipeline\n",
+    "\n",
+    "Tokenizer preparation → data streaming → sequence packing → batch construction\n",
+    "\n",
+    "**Pipeline flow:**\n",
+    "```\n",
+    "FineWeb-Edu (HuggingFace)\n",
+    "  → load via streaming (nothing stored on disk)\n",
+    "  → tokenization (BPE, vocab=32K)\n",
+    "  → sequence packing (concatenate documents up to max_seq_len)\n",
+    "  → batch construction (input_ids, targets)\n",
+    "  → transfer to GPU\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install required packages\n",
+    "!pip install datasets tokenizers sentencepiece transformers -q"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')\n",
+    "\n",
+    "from llm_lab.config import DataConfig\n",
+    "from llm_lab.data import (\n",
+    "    Tokenizer, setup_data_pipeline,\n",
+    "    DataPipelineDiagnostics\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Data Configuration\n",
+    "\n",
+    "Adjust the values below to your environment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_config = DataConfig(\n",
+    "    dataset_name=\"HuggingFaceFW/fineweb-edu\",\n",
+    "    dataset_subset=\"sample-10BT\",\n",
+    "    vocab_size=32_000,\n",
+    "    max_seq_len=2048,\n",
+    "    batch_size=4,\n",
+    "    num_workers=2,\n",
+    ")\n",
+    "\n",
+    "print(f\"Dataset: {data_config.dataset_name} ({data_config.dataset_subset})\")\n",
+    "print(f\"Sequence length: {data_config.max_seq_len}\")\n",
+    "print(f\"Batch size: {data_config.batch_size}\")\n",
+    "print(f\"Tokens/batch: {data_config.batch_size * data_config.max_seq_len:,}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Tokenizer Setup\n",
+    "\n",
+    "Choose one of three modes:\n",
+    "- `\"pretrained\"`: use a pretrained HuggingFace tokenizer (simplest)\n",
+    "- `\"train_new\"`: train a new BPE tokenizer\n",
+    "- `\"load_trained\"`: load a previously trained tokenizer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokenizer, train_dl, val_dl = setup_data_pipeline(\n",
+    "    tokenizer_mode=\"pretrained\",  # switch to \"train_new\" or \"load_trained\" if needed\n",
+    "    config=data_config,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Pipeline Diagnostics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Tokenizer quality diagnostics\n",
+    "DataPipelineDiagnostics.check_tokenizer_quality(tokenizer, data_config)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Data-loading throughput benchmark\n",
+    "DataPipelineDiagnostics.benchmark_throughput(train_dl, num_batches=50)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Batch Inspection"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fetch the first batch and inspect it in detail\n",
+    "batch = next(iter(train_dl))\n",
+    "DataPipelineDiagnostics.inspect_batch(batch, tokenizer)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "**Next step:** build the model architecture in `02_model.ipynb`."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
notebooks/02_model.ipynb ADDED
@@ -0,0 +1,212 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# 02. ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜\n",
8
+ "\n",
9
+ "1.1B ํŒŒ๋ผ๋ฏธํ„ฐ LLaMA-style Decoder-Only Transformer ์ƒ์„ฑ ๋ฐ ๊ฒ€์ฆ.\n",
10
+ "\n",
11
+ "**๋ชจ๋ธ ๊ตฌ์กฐ:**\n",
12
+ "```\n",
13
+ "Input Token IDs\n",
14
+ " โ†’ Token Embedding\n",
15
+ " โ†’ [TransformerBlock] ร— num_layers\n",
16
+ " โ”‚ โ”œโ”€โ”€ RMSNorm โ†’ GroupedQueryAttention (+ RoPE) โ†’ Residual\n",
17
+ " โ”‚ โ””โ”€โ”€ RMSNorm โ†’ SwiGLU FFN โ†’ Residual\n",
18
+ " โ†’ RMSNorm (์ตœ์ข…)\n",
19
+ " โ†’ Linear Head (Weight Tying)\n",
20
+ " โ†’ Vocab Logits\n",
21
+ "```"
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": null,
27
+ "metadata": {},
28
+ "outputs": [],
29
+ "source": [
30
+ "import sys\n",
31
+ "sys.path.insert(0, '..')\n",
32
+ "\n",
33
+ "import torch\n",
34
+ "import math\n",
35
+ "from llm_lab.config import ModelConfig\n",
36
+ "from llm_lab.model import LLMModel, count_parameters_detailed, estimate_memory_gb"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "markdown",
41
+ "metadata": {},
42
+ "source": [
43
+ "## 1. ๋ชจ๋ธ ์„ค์ • ์„ ํƒ\n",
44
+ "\n",
45
+ "| ํ”„๋ฆฌ์…‹ | ํŒŒ๋ผ๋ฏธํ„ฐ | ์šฉ๋„ |\n",
46
+ "|--------|----------|------|\n",
47
+ "| `debug_10m()` | ~10M | ํŒŒ์ดํ”„๋ผ์ธ ๊ฒ€์ฆ |\n",
48
+ "| `small_100m()` | ~100M | ์ค‘๊ฐ„ ๊ฒ€์ฆ |\n",
49
+ "| `base_1b()` | ~1.1B | ์ตœ์ข… ํ•™์Šต |"
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "# --- ๋ชจ๋ธ ์Šค์ผ€์ผ ์„ ํƒ ---\n",
59
+ "# model_config = ModelConfig.debug_10m() # ~10M (๋น ๋ฅธ ๊ฒ€์ฆ)\n",
60
+ "# model_config = ModelConfig.small_100m() # ~100M (์ค‘๊ฐ„ ๊ฒ€์ฆ)\n",
61
+ "model_config = ModelConfig.base_1b() # ~1.1B (์ตœ์ข… ๋ชฉํ‘œ)\n",
62
+ "\n",
63
+ "print(f\"hidden_dim: {model_config.hidden_dim}\")\n",
64
+ "print(f\"num_layers: {model_config.num_layers}\")\n",
65
+ "print(f\"num_heads: {model_config.num_heads}\")\n",
66
+ "print(f\"num_kv_heads: {model_config.num_kv_heads} (GQA ๊ทธ๋ฃน: {model_config.num_kv_groups})\")\n",
67
+ "print(f\"intermediate_dim: {model_config.intermediate_dim}\")\n",
68
+ "print(f\"max_seq_len: {model_config.max_seq_len}\")"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "markdown",
73
+ "metadata": {},
74
+ "source": [
75
+ "## 2. ๋ชจ๋ธ ์ƒ์„ฑ ๋ฐ ํŒŒ๋ผ๋ฏธํ„ฐ ํ™•์ธ"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "code",
80
+ "execution_count": null,
81
+ "metadata": {},
82
+ "outputs": [],
83
+ "source": [
84
+ "# Debug ๋ชจ๋ธ ์‹ค์ œ ์ƒ์„ฑ (๋ฉ”๋ชจ๋ฆฌ ํ™•์ธ ์šฉ๋„)\n",
85
+ "debug_config = ModelConfig.debug_10m()\n",
86
+ "model = LLMModel(debug_config)\n",
87
+ "print(f\"Debug ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜: {model.count_parameters():,}\")\n",
88
+ "\n",
89
+ "# 1B ๋ชจ๋ธ์€ meta device์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋งŒ ํ™•์ธ\n",
90
+ "with torch.device(\"meta\"):\n",
91
+ " model_1b = LLMModel(ModelConfig.base_1b())\n",
92
+ "n_params_1b = model_1b.count_parameters()\n",
93
+ "print(f\"1B ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜: {n_params_1b:,} ({n_params_1b/1e9:.2f}B)\")"
94
+ ]
95
+ },
96
+ {
97
+ "cell_type": "markdown",
98
+ "metadata": {},
99
+ "source": [
100
+ "## 3. ์ƒ์„ธ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„ํ•ด"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "metadata": {},
107
+ "outputs": [],
108
+ "source": [
109
+ "detail = count_parameters_detailed(model_1b)\n",
110
+ "cfg_1b = ModelConfig.base_1b()\n",
111
+ "\n",
112
+ "print(f\"Token Embedding: {detail['token_embedding']:,}\")\n",
113
+ "print(f\"Per Layer Total: {detail['per_layer_total']:,}\")\n",
114
+ "print(f\"All Layers ({cfg_1b.num_layers}): {detail['all_layers_total']:,}\")\n",
115
+ "print(f\"Final Norm: {detail['final_norm']:,}\")\n",
116
+ "print(f\"LM Head: {detail['lm_head']}\")\n",
117
+ "print(f\"{'โ”€' * 30}\")\n",
118
+ "print(f\"TOTAL: {detail['total']:,}\")"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "markdown",
123
+ "metadata": {},
124
+ "source": [
125
+ "## 4. GPU ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ •"
126
+ ]
127
+ },
128
+ {
129
+ "cell_type": "code",
130
+ "execution_count": null,
131
+ "metadata": {},
132
+ "outputs": [],
133
+ "source": [
134
+ "mem = estimate_memory_gb(ModelConfig.base_1b(), batch_size=4, dtype_bytes=2)\n",
135
+ "\n",
136
+ "print(f\"๋ชจ๋ธ ๊ฐ€์ค‘์น˜: {mem['model_weights_gb']} GB\")\n",
137
+ "print(f\"์˜ตํ‹ฐ๋งˆ์ด์ €: {mem['optimizer_states_gb']} GB\")\n",
138
+ "print(f\"๊ธฐ์šธ๊ธฐ: {mem['gradients_gb']} GB\")\n",
139
+ "print(f\"ํ™œ์„ฑํ™” (์ถ”์ •): {mem['activations_estimated_gb']} GB\")\n",
140
+ "print(f\"{'โ”€' * 30}\")\n",
141
+ "print(f\"์ด ์ถ”์ •: {mem['total_estimated_gb']} GB\")"
142
+ ]
143
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Forward Pass Verification"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Verify forward/backward with the debug model\n",
+ "dummy_input = torch.randint(0, debug_config.vocab_size, (2, 64))\n",
+ "dummy_target = torch.randint(0, debug_config.vocab_size, (2, 64))\n",
+ "logits, loss = model(dummy_input, dummy_target)\n",
+ "\n",
+ "print(f\"Input shape: {dummy_input.shape}\")\n",
+ "print(f\"Logits shape: {logits.shape}\")\n",
+ "print(f\"Loss: {loss.item():.4f}\")\n",
+ "expected_loss = math.log(debug_config.vocab_size)\n",
+ "print(f\"Expected initial loss (ln({debug_config.vocab_size})): {expected_loss:.2f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6. Text Generation Test (Random Weights)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "prompt = torch.randint(0, debug_config.vocab_size, (1, 10))\n",
+ "generated = model.generate(prompt, max_new_tokens=20, temperature=1.0, top_k=50)\n",
+ "\n",
+ "print(f\"Prompt length: {prompt.shape[1]}\")\n",
+ "print(f\"Generated length: {generated.shape[1]}\")\n",
+ "print(f\"Token IDs: {generated[0].tolist()}\")"
+ ]
+ },
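+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For intuition about the `temperature` and `top_k` arguments, a standalone sketch of one top-k sampling step — not `model.generate`'s actual internals, which may differ:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# One top-k sampling step (sketch; generate()'s internals may differ)\n",
+ "fake_logits = torch.randn(debug_config.vocab_size)  # stand-in for next-token logits\n",
+ "fake_logits = fake_logits / 1.0                     # temperature: <1 sharpens, >1 flattens\n",
+ "topk_vals, topk_idx = torch.topk(fake_logits, k=50) # keep the 50 most likely tokens\n",
+ "probs = torch.softmax(topk_vals, dim=-1)            # renormalize over the top-k only\n",
+ "next_token = topk_idx[torch.multinomial(probs, num_samples=1)]\n",
+ "print(f\"sampled token id: {next_token.item()}\")"
+ ]
+ },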
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "**Next step:** run training in `03_training.ipynb`."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
notebooks/03_training.ipynb ADDED
@@ -0,0 +1,211 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 03. Training\n",
+ "\n",
+ "A training pipeline with gradient accumulation, mixed precision, cosine LR scheduling,\n",
+ "checkpoint save/restore, and wandb logging.\n",
+ "\n",
+ "**Training flow (a minimal code sketch follows below):**\n",
+ "```\n",
+ "Fetch batch\n",
+ " → Forward (bf16 autocast)\n",
+ " → Loss / accumulation_steps\n",
+ " → Backward (accumulate gradients)\n",
+ " → [every accumulation_steps] Gradient Clipping → Optimizer Step → LR Update\n",
+ " → [every checkpoint_interval] Save checkpoint (Google Drive)\n",
+ " → [every eval_interval] Measure validation loss/perplexity\n",
+ "```"
+ ]
+ },
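+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The sketch below condenses the flow above into one optimizer step — simplified; the real loop, with checkpointing, eval, and wandb logging, is what `llm_lab.training`'s `Trainer`/`start_training` implement. It assumes the model returns `(logits, loss)`, as in notebook 02."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "\n",
+ "# One optimizer step with gradient accumulation + bf16 autocast (simplified sketch)\n",
+ "def accumulation_step(model, optimizer, scheduler, micro_batches, accumulation_steps, grad_clip=1.0):\n",
+ "    optimizer.zero_grad(set_to_none=True)\n",
+ "    for input_ids, targets in micro_batches:  # accumulation_steps micro-batches\n",
+ "        with torch.autocast(device_type=\"cuda\", dtype=torch.bfloat16):\n",
+ "            _, loss = model(input_ids, targets)\n",
+ "        (loss / accumulation_steps).backward()  # scale so gradients average, not sum\n",
+ "    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)\n",
+ "    optimizer.step()   # one update per accumulation window\n",
+ "    scheduler.step()   # cosine LR update"
+ ]
+ },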
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install wandb -q"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "sys.path.insert(0, '..')\n",
+ "\n",
+ "from llm_lab.config import ModelConfig, DataConfig, TrainConfig\n",
+ "from llm_lab.model import LLMModel\n",
+ "from llm_lab.data import setup_data_pipeline\n",
+ "from llm_lab.training import start_training, Trainer\n",
+ "from llm_lab.utils import auto_configure, get_device"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 0. Mount Google Drive (Colab)\n",
+ "\n",
+ "Checkpoints are saved to Google Drive so they survive session expiry."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run on Colab only\n",
+ "# from google.colab import drive\n",
+ "# drive.mount('/content/drive')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Configuration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- Model config ---\n",
+ "model_config = ModelConfig.debug_10m() # debug for verification; base_1b() for the real run\n",
+ "\n",
+ "# --- Data config ---\n",
+ "data_config = DataConfig(\n",
+ " max_seq_len=model_config.max_seq_len,\n",
+ " batch_size=4,\n",
+ ")\n",
+ "\n",
+ "# --- Training config ---\n",
+ "train_config = TrainConfig(\n",
+ " total_steps=20_000,\n",
+ " warmup_steps=2_000,\n",
+ " learning_rate=3e-4,\n",
+ " min_learning_rate=3e-5,\n",
+ " weight_decay=0.1,\n",
+ " grad_clip=1.0,\n",
+ " micro_batch_size=4,\n",
+ " gradient_accumulation_steps=32,\n",
+ " checkpoint_dir=\"/content/drive/MyDrive/llm-1b-lab/checkpoints\",\n",
+ " checkpoint_interval=500,\n",
+ " eval_interval=500,\n",
+ " log_interval=10,\n",
+ " use_wandb=True,\n",
+ ")\n",
+ "\n",
+ "print(f\"Effective batch size: {train_config.effective_batch_size}\")\n",
+ "print(f\"Total steps: {train_config.total_steps:,}\")"
+ ]
+ },
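+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A quick arithmetic check on these settings — a sketch with the values hard-coded; the 2048 sequence length is an assumption, so use `model_config.max_seq_len` for the real value:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sanity arithmetic for the config above (sketch; seq_len=2048 is assumed)\n",
+ "effective_batch = 4 * 32                  # micro_batch_size * gradient_accumulation_steps = 128\n",
+ "tokens_per_step = effective_batch * 2048  # ~262K tokens per optimizer step\n",
+ "total_tokens = tokens_per_step * 20_000   # ~5.2B tokens over the whole run\n",
+ "print(f\"{effective_batch} seqs/step, {tokens_per_step:,} tokens/step, ~{total_tokens/1e9:.1f}B tokens total\")"
+ ]
+ },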
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Automatic GPU Detection\n",
+ "\n",
+ "Automatically adjusts dtype, batch_size, and gradient_accumulation for the detected GPU (A100/V100/T4/L4)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_config = auto_configure(train_config)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Model + Data Initialization"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create the model\n",
+ "model = LLMModel(model_config)\n",
+ "print(f\"Model parameters: {model.count_parameters():,}\")\n",
+ "\n",
+ "# Data pipeline\n",
+ "tokenizer, train_dl, val_dl = setup_data_pipeline(\n",
+ " tokenizer_mode=\"pretrained\",\n",
+ " config=data_config,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Start Training\n",
+ "\n",
+ "If a checkpoint exists, it is restored automatically and training resumes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer = start_training(\n",
+ " model=model,\n",
+ " train_dataloader=train_dl,\n",
+ " val_dataloader=val_dl,\n",
+ " config=train_config,\n",
+ " seq_len=model_config.max_seq_len,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Resuming Training (After Session Expiry)\n",
+ "\n",
+ "When you rerun the notebook after a Colab session expires, CheckpointManager automatically finds and restores the latest checkpoint.\n",
+ "\n",
+ "Just rerun the cells above in order."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "**Next step:** evaluate the trained model in `04_evaluation.ipynb`."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
notebooks/04_evaluation.ipynb ADDED
@@ -0,0 +1,188 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 04. Evaluation\n",
+ "\n",
+ "Evaluates the trained model from multiple angles.\n",
+ "\n",
+ "**Evaluation areas:**\n",
+ "1. Perplexity — the standard quantitative metric for language models (see the sketch below)\n",
+ "2. Text generation quality — qualitative evaluation across varied prompts\n",
+ "3. Scaling-law analysis — comparing 10M → 100M → 1B\n",
+ "4. Attention visualization — analyzing where the model \"looks\"\n",
+ "5. Insight checklist — verifying the learning goals were met"
+ ]
+ },
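+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a reminder of what item 1 measures — a sketch with made-up numbers; `PerplexityEvaluator` does this over the real validation set — perplexity is the exponential of the mean per-token cross-entropy:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import math\n",
+ "\n",
+ "# Perplexity = exp(mean per-token negative log-likelihood); hypothetical NLLs below\n",
+ "token_nlls = [3.1, 2.8, 3.4, 2.9]\n",
+ "ppl = math.exp(sum(token_nlls) / len(token_nlls))\n",
+ "print(f\"perplexity: {ppl:.1f}\")  # ~21.1 for a mean NLL of 3.05"
+ ]
+ },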
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install matplotlib numpy -q"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "sys.path.insert(0, '..')\n",
+ "\n",
+ "import torch\n",
+ "from llm_lab.config import ModelConfig, EvalConfig\n",
+ "from llm_lab.model import LLMModel\n",
+ "from llm_lab.evaluation import (\n",
+ " run_evaluation, PerplexityEvaluator, GenerationEvaluator,\n",
+ " ScalingAnalyzer, AttentionVisualizer, InsightChecklist\n",
+ ")\n",
+ "from llm_lab.utils import get_device"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Load the Model\n",
+ "\n",
+ "Loads model weights from a trained checkpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "device = get_device()\n",
+ "model_config = ModelConfig.base_1b()\n",
+ "model = LLMModel(model_config).to(device)\n",
+ "\n",
+ "# Load a checkpoint (change the path to your actual checkpoint)\n",
+ "# ckpt = torch.load(\"path/to/step_XXXXXX/model.pt\", map_location=device)\n",
+ "# model.load_state_dict(ckpt)\n",
+ "\n",
+ "print(f\"Model parameters: {model.count_parameters():,}\")\n",
+ "print(f\"Device: {device}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Comprehensive Evaluation (One-Line Run)\n",
+ "\n",
+ "Runs perplexity, text generation, training dynamics, and attention visualization in one call."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Requires the tokenizer, val_dl, and metrics_history used during training\n",
+ "# report = run_evaluation(\n",
+ "# model=model,\n",
+ "# tokenizer=tokenizer,\n",
+ "# val_dataloader=val_dl,\n",
+ "# metrics_history=trainer.metrics.history,\n",
+ "# )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Scaling Law Analysis\n",
+ "\n",
+ "Compares the 10M → 100M → 1B models to check the scaling law (a worked example of the rule of thumb follows below).\n",
+ "\n",
+ "Chinchilla scaling law: optimal training tokens ≈ 20 × parameter count"
+ ]
+ },
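+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A worked instance of the 20-tokens-per-parameter rule of thumb (approximate — Chinchilla's fitted exponents vary with setup). Note that the placeholder numbers in the next cell train the 1B model on fewer tokens than this optimum."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Chinchilla rule of thumb: compute-optimal tokens ~ 20 x parameters (approximate)\n",
+ "for name, params in [(\"10M\", 10e6), (\"100M\", 100e6), (\"1B\", 1.1e9)]:\n",
+ "    print(f\"{name}: ~{20 * params / 1e9:.1f}B tokens compute-optimal\")"
+ ]
+ },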
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "analyzer = ScalingAnalyzer()\n",
+ "\n",
+ "# Enter each model's results (replace with your actual training results)\n",
+ "scaling_results = [\n",
+ " {\"name\": \"10M\", \"params\": 10e6, \"tokens\": 1e9, \"loss\": 4.2, \"ppl\": 66.7},\n",
+ " {\"name\": \"100M\", \"params\": 100e6, \"tokens\": 5e9, \"loss\": 3.5, \"ppl\": 33.1},\n",
+ " {\"name\": \"1B\", \"params\": 1.1e9, \"tokens\": 10e9, \"loss\": 3.0, \"ppl\": 20.1},\n",
+ "]\n",
+ "\n",
+ "analysis = analyzer.analyze(scaling_results)\n",
+ "analyzer.plot_scaling_curves(scaling_results)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Attention Visualization\n",
+ "\n",
+ "Visualizes where the model \"looks\" for each token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# viz = AttentionVisualizer()\n",
+ "# sample_text = \"The cat sat on the mat and looked at the bird.\"\n",
+ "# token_ids = tokenizer.encode(sample_text)\n",
+ "# input_tensor = torch.tensor([token_ids], dtype=torch.long)\n",
+ "# \n",
+ "# attn_weights = viz.extract_attention(model, input_tensor, layer_idx=0, device=device)\n",
+ "# if attn_weights is not None:\n",
+ "# tokens_str = [tokenizer.decode([tid]) for tid in token_ids]\n",
+ "# viz.plot_attention_heatmap(attn_weights, tokens_str, head_idx=0)\n",
+ "# viz.plot_multi_head_summary(attn_weights)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Insight Checklist\n",
+ "\n",
+ "Checks automatically and manually whether the learning goals were met."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Run the checklist if a report is available\n",
+ "# InsightChecklist.run_checklist(report, metrics_history)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ torch>=2.0.0
+ datasets
+ tokenizers
+ sentencepiece
+ transformers
+ wandb
+ matplotlib
+ numpy