drlee1 committed · Commit 2a8d8b7 · verified · 1 Parent(s): ae24256

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,162 @@
+ ---
+ language:
+ - ko
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - korean
+ - causal-lm
+ - chat
+ - conversational
+ - knowledge-distillation
+ - small-language-model
+ pipeline_tag: text-generation
+ base_model: drlee1/hanforge-base
+ ---
+
+ # HanForge 47M SFT: Korean Conversational Model
+
+ A Korean chat model fine-tuned from [`drlee1/hanforge-base`](https://huggingface.co/drlee1/hanforge-base) with **knowledge distillation** on **24,693 Korean question-answer pairs** spanning five everyday domains.
+
+ The model produces longer, more naturally phrased Korean responses than a templated baseline, but it is less reliable under greedy decoding, so **sampled decoding is recommended**.
+
+ ## Highlights
+
+ - **Longer, more natural Korean responses**: averaging 130 characters (2–3 sentences)
+ - **Five everyday domains**: greetings & conversation, food & cooking, Korean culture & geography, health & habits, emotional support
+ - **Pure Korean output**: 100% Hangul ratio, zero foreign-script leakage
+ - **Compact**: 47M parameters
+
+ ## Intended Use
+
+ Suitable for:
+
+ - **Korean chat applications** within everyday-conversation domains, where natural-sounding replies matter
+ - **Resource-constrained deployments** needing a small Korean model
+ - **Research** into small-LM knowledge distillation and instruction tuning
+
+ Not suitable for:
+
+ - Factual question answering requiring high accuracy (the synthetic data is not fact-checked)
+ - Multi-step reasoning, coding, or technical tasks
+ - Open-domain conversation outside the five training domains
+ - Any safety-critical application
+
+ ## How to Use
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "drlee1/hanforge-47M-SFT"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()
+
+ USER, ASSISTANT = "<|user|>", "<|assistant|>"
+
+ def chat(prompt: str, max_new_tokens: int = 200, seed: int = 42) -> str:
+     torch.manual_seed(seed)
+     text = f"{USER}\n{prompt}\n{ASSISTANT}\n"
+     inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
+
+     # Add BOS manually (the tokenizer's default special tokens would also append EOS)
+     bos = inputs["input_ids"].new_full((1, 1), tokenizer.bos_token_id)
+     inputs["input_ids"] = torch.cat([bos, inputs["input_ids"]], dim=1)
+     inputs["attention_mask"] = torch.cat(
+         [inputs["attention_mask"].new_ones((1, 1)), inputs["attention_mask"]], dim=1
+     )
+
+     out = model.generate(
+         **inputs,
+         max_new_tokens=max_new_tokens,
+         do_sample=True,  # Sampled decoding is recommended
+         temperature=0.8,
+         top_p=0.9,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+     return tokenizer.decode(out[0, inputs["input_ids"].size(1):], skip_special_tokens=True).strip()
+
+ print(chat("한국에서 가 볼 만한 여행지를 추천해 주세요."))  # "Please recommend places worth visiting in Korea."
+ ```
+
+ ### Decoding tips
+
+ - **Use sampling, not greedy.** Greedy decoding is prone to repetition with this model. Recommended settings: `temperature=0.8`, `top_p=0.9`.
+ - **Try multiple seeds.** Some prompts produce a noticeably better answer on the second or third sampled generation.
+ - **Cap output length.** 150–200 new tokens is usually enough; longer generations rarely improve quality.
+
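The recommended settings can be read as a two-step filter on the next-token distribution. The following is a minimal, library-independent sketch of temperature scaling followed by nucleus (top-p) truncation. `nucleus_distribution` and the toy logits are hypothetical, written only for illustration; this is not the code path `model.generate` runs internally, and boundary handling may differ from the library's.

```python
import math


def nucleus_distribution(logits, temperature=0.8, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p truncation."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for a 4-token vocabulary
nucleus = nucleus_distribution(toy_logits)
```

Greedy decoding would always pick index 0 here, whereas sampling from `nucleus` sometimes picks index 1. That variation is what lets sampled decoding escape the repetition loops described above.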
+ ## Training Data
+
+ Fine-tuned on **24,693 Korean question-answer pairs** prepared through a **knowledge-distillation** approach. The dataset spans 200 (domain, topic) pairs across five everyday domains, with each pair contributing roughly 100 diverse user-style questions paired with concise polite Korean answers.
+
+ The five training domains are:
+
+ | Domain | Topics covered |
+ |---|---|
+ | Daily greetings & conversation | greetings, thanks, apologies, introductions, mood, comfort, requests |
+ | Food & cooking basics | Korean dishes, ingredients, simple recipes, recommendations |
+ | Korean culture & geography | cities, mountains, traditional clothing, holidays, traditions |
+ | Health & lifestyle habits | exercise, sleep, nutrition, stress, daily routines |
+ | Emotions & empathy | sadness, loneliness, anxiety, joy, gratitude, comfort |
+
+ After filtering for polite-ending and language-purity constraints (about 8.5% drop rate), the final training set carries 100% Hangul ratio, a consistent polite voice, and an average response length of ~134 characters.
+
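The filtering script is not part of this commit, so the following is only a sketch of what a polite-ending and language-purity filter could look like. It assumes that the Hangul ratio is measured over letter characters only (digits, spaces, and punctuation are ignored) and that polite speech can be approximated by a few sentence endings. `hangul_ratio`, `is_polite`, `keep_pair`, and the ending list are all hypothetical.

```python
import unicodedata

HANGUL_SYLLABLES = (0xAC00, 0xD7A3)  # precomposed Hangul syllable block
POLITE_ENDINGS = ("요", "니다", "니까")  # assumed markers of polite speech levels


def hangul_ratio(text: str) -> float:
    """Share of letter characters that are precomposed Hangul syllables."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0
    hangul = sum(HANGUL_SYLLABLES[0] <= ord(ch) <= HANGUL_SYLLABLES[1] for ch in letters)
    return hangul / len(letters)


def is_polite(answer: str) -> bool:
    """True if the answer ends in one of the assumed polite endings."""
    stripped = answer.rstrip(" .!?~\n")
    return stripped.endswith(POLITE_ENDINGS)


def keep_pair(answer: str) -> bool:
    """Keep only answers that are pure Hangul and end politely."""
    return hangul_ratio(answer) == 1.0 and is_polite(answer)


samples = [
    "비빔밥은 밥 위에 나물과 고추장을 올려 비벼 먹는 음식이에요.",  # polite, pure Hangul
    "김치는 fermented 음식입니다.",  # Latin-script leakage
    "오늘은 일찍 자.",  # casual ending
]
kept = [s for s in samples if keep_pair(s)]
```

Only the first sample survives: the second contains Latin letters and the third uses a casual ending.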
+ ## Training Procedure
+
+ Fine-tuned on top of [`drlee1/hanforge-base`](https://huggingface.co/drlee1/hanforge-base) using full-parameter SFT with response-only loss masking.
+
+ | Setting | Value |
+ |---|---|
+ | **Training samples** | 24,693 |
+ | **Epochs** | 5 |
+ | **Effective batch size** | 16 |
+ | **Learning rate** | 5e-5 (cosine, 3% warmup) |
+ | **Sequence length** | 384 |
+ | **Precision** | bf16 mixed |
+ | **Final training loss** | 10.4 |
+ | **Validation perplexity** | ~25 |
+ | **Wall-clock time** | ~19 minutes (Mac MPS) |
+
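Response-only loss masking means that prompt tokens are fed to the model but excluded from the loss. A minimal sketch, assuming the usual convention of labelling masked positions with `-100` (the `ignore_index` that `modeling_hanforge.py` passes to `F.cross_entropy`). `build_labels`, `mean_loss`, and the per-token losses are hypothetical, and the sketch ignores the one-position shift the model applies internally.

```python
import math

IGNORE_INDEX = -100  # same value the model's forward() passes to F.cross_entropy


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def mean_loss(token_losses, labels):
    """Average per-token loss over positions whose label is not masked."""
    kept = [loss for loss, lab in zip(token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


# BOS (1), <|user|> (4), two made-up prompt tokens, <|assistant|> (5),
# then a made-up three-token reply followed by EOS (2).
prompt_ids = [1, 4, 900, 901, 5]
response_ids = [1200, 1201, 1202, 2]
input_ids, labels = build_labels(prompt_ids, response_ids)

# Hypothetical per-position losses; only the last four count toward the mean.
token_losses = [9.0, 9.0, 9.0, 9.0, 9.0, 3.0, 3.5, 3.0, 3.5]
loss = mean_loss(token_losses, labels)
perplexity = math.exp(loss)  # perplexity is exp(mean cross-entropy in nats)
```

With this scheme the model is never rewarded for reproducing the user's question, only for the reply and its terminating EOS.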
+ ## Evaluation
+
+ Evaluated on 20 prompts (14 in-distribution, 6 out-of-distribution) under both greedy and sampled decoding.
+
+ | Metric (sampled, temperature 0.8) | Result |
+ |---|---|
+ | Korean character ratio | 100% |
+ | Foreign-script leakage | 0% |
+ | End-of-sequence within 128 tokens | 90% |
+ | Average response length | ~120 chars |
+
+ | Metric (greedy) | Result |
+ |---|---|
+ | Korean character ratio | 100% |
+ | Foreign-script leakage | 0% |
+ | End-of-sequence within 128 tokens | 55% |
+ | Maximum repeated-token run | up to ~200 (collapse risk) |
+
+ The model is reliable on in-distribution Korean conversation but **not on out-of-distribution topics**. For abstract or domain-specific questions, responses are often well-formed Korean but semantically off.
+
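The evaluation script is not included in this commit. As a rough sketch of how the two stability metrics in the tables might be computed, assume "repeated-token run" means the longest streak of identical consecutive token ids, and "end-of-sequence within 128 tokens" means EOS appears among the first 128 generated ids. `max_repeated_run`, `ended_with_eos`, and the sample ids are hypothetical.

```python
from itertools import groupby


def max_repeated_run(token_ids):
    """Length of the longest streak of identical consecutive token ids (0 if empty)."""
    return max((sum(1 for _ in group) for _, group in groupby(token_ids)), default=0)


def ended_with_eos(token_ids, eos_token_id=2, budget=128):
    """True if EOS appears within the first `budget` generated tokens."""
    return eos_token_id in token_ids[:budget]


healthy = [812, 95, 4031, 77, 2]      # made-up ids ending in EOS (id 2 in config.json)
collapsed = [812, 95] + [4031] * 126  # made-up ids stuck on a single token

runs = (max_repeated_run(healthy), max_repeated_run(collapsed))
```

A long run together with a missing EOS is the signature of the greedy-decoding collapse reported in the second table.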
+ ## Limitations and Bias
+
+ - **Distilled-data origin**: Training answers were prepared via knowledge distillation. Facts, recommendations, and explanations may be incorrect, stale, or biased; do not rely on the model for accurate information.
+ - **Domain restriction**: The five training domains define the model's reliable scope. Out-of-domain prompts produce responses that may look fluent but are often off-topic.
+ - **Greedy decoding instability**: Small-scale models trained on longer responses tend to fall into repetition under greedy decoding. This model is no exception, so always use sampling.
+ - **No alignment / safety tuning**: The model has had no RLHF and no harmful-content filtering. Inputs designed to elicit unsafe content may produce unsafe Korean text.
+ - **Distillation bias**: Any biases present in the distillation source are inherited by the model.
+
+ ## License
+
+ Released under the **Apache License 2.0**.
+
+ ## Citation
+
+ ```bibtex
+ @misc{hanforge_47m_sft_2026,
+   author = {DongRyeol Lee},
+   title = {HanForge 47M SFT: A Korean Conversational Model Trained via Knowledge Distillation},
+   year = {2026},
+   note = {Fine-tuned from drlee1/hanforge-base on 24.7k Korean Q\&A pairs across five everyday domains}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architectures": [
+     "HanForgeForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 512,
+   "initializer_range": 0.02,
+   "intermediate_size": 1408,
+   "max_position_embeddings": 4096,
+   "model_type": "hanforge",
+   "num_attention_heads": 8,
+   "num_hidden_layers": 8,
+   "num_key_value_heads": 2,
+   "pad_token_id": 0,
+   "rms_norm_eps": 1e-06,
+   "rope_theta": 50000.0,
+   "tie_word_embeddings": false,
+   "transformers_version": "5.5.1",
+   "unk_token_id": 3,
+   "use_cache": false,
+   "vocab_size": 24000,
+   "auto_map": {
+     "AutoConfig": "configuration_hanforge.HanForgeConfig",
+     "AutoModelForCausalLM": "modeling_hanforge.HanForgeForCausalLM"
+   },
+   "torch_dtype": "float32"
+ }
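As a cross-check of the 47M figure in the README, the parameter count can be derived from this config together with the layer definitions in `modeling_hanforge.py` (bias-free linear layers, two RMSNorm weights per decoder layer, one final RMSNorm, untied input and output embeddings). `count_parameters` is a hypothetical helper written for this sketch. The rotary `inv_freq` buffer is registered as non-persistent and is therefore not counted.

```python
def count_parameters(cfg: dict) -> int:
    """Count trainable parameters for the architecture described in modeling_hanforge.py."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_hidden = cfg["num_key_value_heads"] * head_dim

    embed = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * h

    attention = h * h + 2 * h * kv_hidden + h * h  # q, k, v, o projections (no bias)
    mlp = 3 * h * cfg["intermediate_size"]         # gate, up, down projections
    norms = 2 * h                                  # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    return embed + lm_head + cfg["num_hidden_layers"] * per_layer + h  # + final RMSNorm


config = {
    "vocab_size": 24000,
    "hidden_size": 512,
    "intermediate_size": 1408,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "num_key_value_heads": 2,
    "tie_word_embeddings": False,
}
total = count_parameters(config)
print(f"{total:,} parameters, about {total * 4 / 1e6:.1f} MB in float32")
```

This gives 47,129,088 parameters, or 188,516,352 bytes at 4 bytes each, which is close to the 188,524,624-byte size recorded in the `model.safetensors` pointer of this commit. The small remainder is consistent with file header overhead, though that is an inference rather than something the commit states.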
configuration_hanforge.py ADDED
@@ -0,0 +1,87 @@
+ from __future__ import annotations
+
+ from transformers import PretrainedConfig
+
+
+ class HanForgeConfig(PretrainedConfig):
+     model_type = "hanforge"
+
+     # <<< disabled (refactor 20260423, §4.1 hybrid local/global attention unused)
+     # Preserved design assets: sliding_window / global_layer_interval / is_global_layer.
+     # This refactor uses full causal attention only.
+     # sliding_window: int = 256
+     # global_layer_interval: int = 4
+     # def is_global_layer(self, layer_idx: int) -> bool:
+     #     return layer_idx % self.global_layer_interval == 0
+     # >>> end disabled
+
+     # <<< disabled (refactor 20260423, §4.2 YaRN unused)
+     # rope_scaling / original_max_position_embeddings were fields that presupposed YaRN extension.
+     # Plain RoPE is sufficient for from-scratch 4k-context training.
+     # original_max_position_embeddings: int = 4096
+     # rope_scaling: dict | None = None
+     # >>> end disabled
+
+     def __init__(
+         self,
+         vocab_size: int = 32000,
+         hidden_size: int = 384,
+         intermediate_size: int = 1024,
+         num_hidden_layers: int = 8,
+         num_attention_heads: int = 6,
+         num_key_value_heads: int = 2,
+         max_position_embeddings: int = 4096,
+         rope_theta: float = 50_000.0,
+         rms_norm_eps: float = 1e-6,
+         hidden_dropout_prob: float = 0.0,
+         attention_dropout: float = 0.0,
+         initializer_range: float = 0.02,
+         pad_token_id: int = 0,
+         bos_token_id: int = 1,
+         eos_token_id: int = 2,
+         unk_token_id: int = 3,
+         use_cache: bool = False,
+         **kwargs,
+     ):
+         # Back-compat: ignore the disabled fields even if older scripts/checkpoints still pass them.
+         kwargs.pop("sliding_window", None)
+         kwargs.pop("global_layer_interval", None)
+         kwargs.pop("original_max_position_embeddings", None)
+         kwargs.pop("rope_scaling", None)
+
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+         self.max_position_embeddings = max_position_embeddings
+         self.rope_theta = rope_theta
+         self.rms_norm_eps = rms_norm_eps
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_dropout = attention_dropout
+         self.initializer_range = initializer_range
+         self.use_cache = use_cache
+         tie_word_embeddings = kwargs.pop("tie_word_embeddings", True)
+
+         if hidden_size % num_attention_heads != 0:
+             raise ValueError("hidden_size must be divisible by num_attention_heads")
+         if num_attention_heads % num_key_value_heads != 0:
+             raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             unk_token_id=unk_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+     @property
+     def head_dim(self) -> int:
+         return self.hidden_size // self.num_attention_heads
+
+     @property
+     def num_key_value_groups(self) -> int:
+         return self.num_attention_heads // self.num_key_value_heads
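To make the derived quantities concrete, the sketch below evaluates `head_dim`, `num_key_value_groups`, and the plain RoPE inverse frequencies for the values shipped in `config.json`. The formula mirrors `_compute_rope_parameters` in `modeling_hanforge.py` but uses plain Python floats instead of tensors. `rope_inv_freq` and `longest_wavelength` are hypothetical names introduced for illustration.

```python
import math


def rope_inv_freq(head_dim: int, rope_theta: float):
    """inv_freq[i] = rope_theta ** (-2i / head_dim), one entry per rotated pair."""
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


hidden_size, num_attention_heads, num_key_value_heads = 512, 8, 2  # values from config.json
head_dim = hidden_size // num_attention_heads            # HanForgeConfig.head_dim
kv_groups = num_attention_heads // num_key_value_heads   # HanForgeConfig.num_key_value_groups

inv_freq = rope_inv_freq(head_dim, rope_theta=50_000.0)
longest_wavelength = 2 * math.pi / inv_freq[-1]  # positions per full rotation of the slowest pair
```

With these values each of the 2 key/value heads is shared by 4 query heads, and the slowest rotary pair completes a full rotation only after far more than the 4,096 positions allowed by `max_position_embeddings`.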
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1a1332ef89814c58a2fbfa3a76dff9667f8844bd118f0f8b27c34a28dbcf5e2
+ size 188524624
modeling_hanforge.py ADDED
@@ -0,0 +1,338 @@
+ from __future__ import annotations
+
+ import math
+ from typing import Optional
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import PreTrainedModel
+ from transformers.generation import GenerationMixin
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+
+ try:
+     from configuration_hanforge import HanForgeConfig
+ except ImportError:
+     from .configuration_hanforge import HanForgeConfig
+
+
+ def rotate_half(x: torch.Tensor) -> torch.Tensor:
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
+     cos = cos.unsqueeze(1)
+     sin = sin.unsqueeze(1)
+     q = (q * cos) + (rotate_half(q) * sin)
+     k = (k * cos) + (rotate_half(k) * sin)
+     return q, k
+
+
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     if n_rep == 1:
+         return hidden_states
+     batch, num_key_value_heads, seq_len, head_dim = hidden_states.shape
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, seq_len, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, seq_len, head_dim)
+
+
+ # DISABLED (refactor 20260423, §4.2): YaRN body disabled. Not needed for from-scratch 4k context.
+ # The signature is kept and only the body is commented out, for reference if context is extended later.
+ def _compute_yarn_parameters(config: HanForgeConfig, device=None):
+     raise NotImplementedError(
+         "YaRN is disabled in this refactor (see research/refactor_plan_20260423.md §4.2)."
+     )
+     # <<< disabled (refactor 20260423, §4.2)
+     # rope_params = dict(config.rope_scaling or {})
+     # dim = config.head_dim
+     # base = config.rope_theta
+     # if not rope_params or rope_params.get("rope_type", "default") == "default":
+     #     inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
+     #     return inv_freq, 1.0
+     #
+     # factor = float(rope_params["factor"])
+     # beta_fast = float(rope_params.get("beta_fast", 32.0))
+     # beta_slow = float(rope_params.get("beta_slow", 1.0))
+     # mscale = rope_params.get("mscale")
+     # mscale_all_dim = rope_params.get("mscale_all_dim")
+     # original_max = int(rope_params["original_max_position_embeddings"])
+     #
+     # def get_mscale(scale, scale_factor=1.0):
+     #     if scale <= 1:
+     #         return 1.0
+     #     return 0.1 * scale_factor * math.log(scale) + 1.0
+     #
+     # if mscale is not None and mscale_all_dim is not None:
+     #     attention_factor = float(get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dim))
+     # else:
+     #     attention_factor = float(get_mscale(factor))
+     #
+     # def find_correction_dim(num_rotations, local_dim, local_base, max_position_embeddings):
+     #     return (local_dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (
+     #         2 * math.log(local_base)
+     #     )
+     #
+     # def find_correction_range(low_rot, high_rot, local_dim, local_base, max_position_embeddings):
+     #     low = math.floor(find_correction_dim(low_rot, local_dim, local_base, max_position_embeddings))
+     #     high = math.ceil(find_correction_dim(high_rot, local_dim, local_base, max_position_embeddings))
+     #     return max(low, 0), min(high, local_dim - 1)
+     #
+     # def linear_ramp_factor(min_idx, max_idx, local_dim):
+     #     if min_idx == max_idx:
+     #         max_idx += 0.001
+     #     linear_func = (torch.arange(local_dim, dtype=torch.float32, device=device) - min_idx) / (max_idx - min_idx)
+     #     return torch.clamp(linear_func, 0, 1)
+     #
+     # pos_freqs = base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim)
+     # inv_freq_extrapolation = 1.0 / pos_freqs
+     # inv_freq_interpolation = 1.0 / (factor * pos_freqs)
+     # low, high = find_correction_range(beta_fast, beta_slow, dim, base, original_max)
+     # ramp = 1.0 - linear_ramp_factor(low, high, dim // 2)
+     # inv_freq = (inv_freq_interpolation * (1.0 - ramp)) + (inv_freq_extrapolation * ramp)
+     # return inv_freq, attention_factor
+     # >>> end disabled
+
+
+ def _compute_rope_parameters(config: HanForgeConfig, device=None):
+     dim = config.head_dim
+     base = config.rope_theta
+     inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
+     return inv_freq
+
+
+ class HanForgeRMSNorm(nn.Module):
+     def __init__(self, hidden_size: int, eps: float = 1e-6):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.eps = eps
+
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(dim=-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
+         return self.weight * hidden_states.to(input_dtype)
+
+
+ class HanForgeRotaryEmbedding(nn.Module):
+     def __init__(self, config: HanForgeConfig):
+         super().__init__()
+         inv_freq = _compute_rope_parameters(config)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+     def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+         position_ids_expanded = position_ids[:, None, :].float()
+         freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
+         emb = torch.cat((freqs, freqs), dim=-1)
+         cos = emb.cos()
+         sin = emb.sin()
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+ class HanForgeAttention(nn.Module):
+     def __init__(self, config: HanForgeConfig, layer_idx: int):
+         super().__init__()
+         self.layer_idx = layer_idx
+         self.num_heads = config.num_attention_heads
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = config.num_key_value_groups
+         self.head_dim = config.head_dim
+         # DISABLED (refactor 20260423, §4.1): hybrid local/global attention disabled
+         # self.is_global = config.is_global_layer(layer_idx)
+         # self.sliding_window = config.sliding_window
+         self.q_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+         kv_hidden = config.num_key_value_heads * self.head_dim
+         self.k_proj = nn.Linear(config.hidden_size, kv_hidden, bias=False)
+         self.v_proj = nn.Linear(config.hidden_size, kv_hidden, bias=False)
+         self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+         self.dropout = nn.Dropout(config.attention_dropout)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         cos: torch.Tensor,
+         sin: torch.Tensor,
+         attention_mask: Optional[torch.Tensor],
+     ) -> torch.Tensor:
+         batch_size, seq_len, hidden_size = hidden_states.shape
+         q = self.q_proj(hidden_states).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+         k = self.k_proj(hidden_states).view(batch_size, seq_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         v = self.v_proj(hidden_states).view(batch_size, seq_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         q, k = apply_rotary_pos_emb(q, k, cos, sin)
+         k = repeat_kv(k, self.num_key_value_groups)
+         v = repeat_kv(v, self.num_key_value_groups)
+         scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
+         if attention_mask is not None:
+             scores = scores.masked_fill(~attention_mask, torch.finfo(scores.dtype).min)
+         attn = F.softmax(scores, dim=-1)
+         attn = self.dropout(attn)
+         out = attn @ v
+         out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_size)
+         return self.o_proj(out)
+
+
+ class HanForgeMLP(nn.Module):
+     def __init__(self, config: HanForgeConfig):
+         super().__init__()
+         self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
+
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         return self.down_proj(F.silu(self.gate_proj(hidden_states)) * self.up_proj(hidden_states))
+
+
+ class HanForgeDecoderLayer(nn.Module):
+     def __init__(self, config: HanForgeConfig, layer_idx: int):
+         super().__init__()
+         # DISABLED (refactor 20260423, §4.1): hybrid local/global layer branching disabled.
+         # Every layer runs the causal full-attention path.
+         # self.is_global = config.is_global_layer(layer_idx)
+         self.input_layernorm = HanForgeRMSNorm(config.hidden_size, config.rms_norm_eps)
+         self.self_attn = HanForgeAttention(config, layer_idx)
+         self.post_attention_layernorm = HanForgeRMSNorm(config.hidden_size, config.rms_norm_eps)
+         self.mlp = HanForgeMLP(config)
+
+     def forward(self, hidden_states: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, attention_mask: torch.Tensor):
+         hidden_states = hidden_states + self.self_attn(self.input_layernorm(hidden_states), cos, sin, attention_mask)
+         hidden_states = hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))
+         return hidden_states
+
+
+ class HanForgePreTrainedModel(PreTrainedModel):
+     config_class = HanForgeConfig
+     base_model_prefix = "model"
+     _no_split_modules = ["HanForgeDecoderLayer"]
+
+     def _init_weights(self, module):
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+
+
+ class HanForgeModel(HanForgePreTrainedModel):
+     def __init__(self, config: HanForgeConfig):
+         super().__init__(config)
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+         self.layers = nn.ModuleList([HanForgeDecoderLayer(config, idx) for idx in range(config.num_hidden_layers)])
+         self.norm = HanForgeRMSNorm(config.hidden_size, config.rms_norm_eps)
+         self.rotary_emb = HanForgeRotaryEmbedding(config)
+         self.post_init()
+
+     def _build_causal_mask(self, batch_size: int, seq_len: int, device: torch.device) -> torch.Tensor:
+         base = torch.tril(torch.ones(seq_len, seq_len, device=device, dtype=torch.bool))
+         return base.unsqueeze(0).unsqueeze(0).expand(batch_size, 1, seq_len, seq_len)
+
+     # DISABLED (refactor 20260423, §4.1): sliding-window local mask disabled.
+     # def _build_local_mask(self, batch_size: int, seq_len: int, device: torch.device) -> torch.Tensor:
+     #     row = torch.arange(seq_len, device=device)[:, None]
+     #     col = torch.arange(seq_len, device=device)[None, :]
+     #     causal = col <= row
+     #     window = col >= (row - self.config.sliding_window + 1)
+     #     mask = (causal & window).to(torch.bool)
+     #     return mask.unsqueeze(0).unsqueeze(0).expand(batch_size, 1, seq_len, seq_len)
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         return_dict: bool = True,
+         **_: dict,
+     ):
+         batch_size, seq_len = input_ids.shape
+         hidden_states = self.embed_tokens(input_ids)
+         if position_ids is None:
+             position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, -1)
+         cos, sin = self.rotary_emb(hidden_states, position_ids)
+         full_mask = self._build_causal_mask(batch_size, seq_len, hidden_states.device)
+         if attention_mask is not None:
+             key_mask = attention_mask[:, None, None, :].to(torch.bool)
+             full_mask = full_mask & key_mask
+
+         # DISABLED (refactor 20260423, §4.1): every layer uses the full causal mask.
+         # The local_mask branch is only needed if hybrid attention is reintroduced.
+         for layer in self.layers:
+             hidden_states = layer(hidden_states, cos, sin, full_mask)
+
+         hidden_states = self.norm(hidden_states)
+         if not return_dict:
+             return (hidden_states,)
+         return BaseModelOutputWithPast(last_hidden_state=hidden_states)
+
+
+ class HanForgeForCausalLM(HanForgePreTrainedModel, GenerationMixin):
+     # refactor 20260507 (§format/EOS): _tied_weights_keys removed entirely.
+     # During Phase 1 debugging, the _tied_weights_keys mechanism in transformers 5.x caused a bug where
+     # from_pretrained silently ignored the trained weights in the .bin file and kept the random
+     # initialization. Together with config tie_word_embeddings=False, the two weights are handled as explicitly separate.
+     # (Where possible, save trained models with tie_word_embeddings=False. The base model is temporarily at risk.)
+     _tied_weights_keys = None
+
+     def __init__(self, config: HanForgeConfig):
+         super().__init__(config)
+         self.model = HanForgeModel(config)
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+         self.post_init()
+         # refactor 20260423 (§9): tie lm_head.weight to embed_tokens.weight
+         # PreTrainedModel.tie_weights() attempts the same thing inside post_init, but it is done
+         # explicitly here to guarantee the parameter savings for a small model with a 32k vocab.
+         if getattr(config, "tie_word_embeddings", True):
+             self.lm_head.weight = self.model.embed_tokens.weight
+
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+
+     def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **kwargs):
+         if attention_mask is None:
+             attention_mask = torch.ones_like(input_ids, dtype=torch.long)
+         position_ids = attention_mask.long().cumsum(-1) - 1
+         position_ids = position_ids.clamp_min(0)
+         return {
+             "input_ids": input_ids,
+             "attention_mask": attention_mask,
+             "position_ids": position_ids,
+         }
+
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         return_dict: bool = True,
+         **kwargs,
+     ):
+         outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, return_dict=True, **kwargs)
+         hidden_states = outputs.last_hidden_state
+         logits = self.lm_head(hidden_states)
+         loss = None
+         if labels is not None:
+             shift_logits = logits[:, :-1, :].contiguous()
+             shift_labels = labels[:, 1:].contiguous()
+             loss = F.cross_entropy(
+                 shift_logits.view(-1, shift_logits.size(-1)),
+                 shift_labels.view(-1),
+                 ignore_index=-100,
+             )
+         if not return_dict:
+             result = (logits,)
+             if loss is not None:
+                 result = (loss,) + result
+             return result
+         return CausalLMOutputWithPast(loss=loss, logits=logits)
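Two pieces of the file above interact during generation with padded batches: `prepare_inputs_for_generation` derives position ids from the attention mask, and `HanForgeModel.forward` combines the lower-triangular causal mask with the key padding mask. The sketch below replays both steps in plain Python for one left-padded sequence. It is meant as a reading aid under that single-sequence assumption, not a replacement for the tensor code, and the helper names are hypothetical.

```python
def position_ids_from_mask(attention_mask):
    """cumsum(mask) - 1, clamped at 0, as in prepare_inputs_for_generation."""
    positions, running = [], 0
    for bit in attention_mask:
        running += bit
        positions.append(max(running - 1, 0))
    return positions


def combined_mask(attention_mask):
    """allowed[q][k] is True when query position q may attend to key position k."""
    n = len(attention_mask)
    return [
        [k <= q and bool(attention_mask[k]) for k in range(n)]  # causal AND key is not padding
        for q in range(n)
    ]


mask = [0, 0, 1, 1, 1]  # two left-padding tokens, then three real tokens
positions = position_ids_from_mask(mask)
allowed = combined_mask(mask)
```

The first real token gets position 0 regardless of how much padding precedes it. Rows belonging to padding queries are fully masked; because the attention layer fills masked scores with the dtype's minimum rather than negative infinity, the softmax over such a row stays finite.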
special_tokens_map.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "unk_token": "<unk>",
+   "pad_token": "<pad>",
+   "additional_special_tokens": [
+     "<|user|>",
+     "<|assistant|>",
+     "<|mode:direct|>",
+     "<|mode:think|>",
+     "<think>",
+     "</think>",
+     "<answer>",
+     "</answer>"
+   ]
+ }
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c095b5e3b1b804920c2e482323371d204fbb234d9b07c8d0fe34dca93bb21b89
+ size 647397
tokenizer.vocab ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,124 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "<|mode:direct|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "7": {
+       "content": "<|mode:think|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "8": {
+       "content": "<think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "9": {
+       "content": "</think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "10": {
+       "content": "<answer>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "11": {
+       "content": "</answer>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "backend": "custom",
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": [
+     "<|user|>",
+     "<|assistant|>",
+     "<|mode:direct|>",
+     "<|mode:think|>",
+     "<think>",
+     "</think>",
+     "<answer>",
+     "</answer>"
+   ],
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "tokenizer_class": "HanForgeTokenizer",
+   "unk_token": "<unk>",
+   "vocab_size": 24000,
+   "auto_map": {
+     "AutoTokenizer": [
+       "tokenizer_hanforge.HanForgeTokenizer",
+       null
+     ]
+   }
+ }
tokenizer_hanforge.py ADDED
@@ -0,0 +1,75 @@
+ from __future__ import annotations
+
+ import shutil
+ from pathlib import Path
+
+ import sentencepiece as spm
+ from transformers import PreTrainedTokenizer
+
+
+ class HanForgeTokenizer(PreTrainedTokenizer):
+     vocab_files_names = {"vocab_file": "tokenizer.model"}
+     model_input_names = ["input_ids", "attention_mask"]
+
+     def __init__(
+         self,
+         vocab_file: str,
+         bos_token: str = "<s>",
+         eos_token: str = "</s>",
+         unk_token: str = "<unk>",
+         pad_token: str = "<pad>",
+         additional_special_tokens: list[str] | None = None,
+         **kwargs,
+     ):
+         self.vocab_file = vocab_file
+         self.sp_model = spm.SentencePieceProcessor(model_file=vocab_file)
+         super().__init__(
+             bos_token=bos_token,
+             eos_token=eos_token,
+             unk_token=unk_token,
+             pad_token=pad_token,
+             additional_special_tokens=additional_special_tokens or [],
+             **kwargs,
+         )
+
+     @property
+     def vocab_size(self) -> int:
+         return int(self.sp_model.vocab_size())
+
+     def get_vocab(self) -> dict[str, int]:
+         vocab = {self.sp_model.id_to_piece(i): i for i in range(self.vocab_size)}
+         vocab.update(self.added_tokens_encoder)
+         return vocab
+
+     def _tokenize(self, text: str) -> list[str]:
+         return list(self.sp_model.encode(text, out_type=str))
+
+     def _convert_token_to_id(self, token: str) -> int:
+         return int(self.sp_model.piece_to_id(token))
+
+     def _convert_id_to_token(self, index: int) -> str:
+         return str(self.sp_model.id_to_piece(index))
+
+     def convert_tokens_to_string(self, tokens: list[str]) -> str:
+         return self.sp_model.decode_pieces(tokens)
+
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         output = [self.bos_token_id] + list(token_ids_0)
+         if token_ids_1 is not None:
+             output += list(token_ids_1)
+         output += [self.eos_token_id]
+         return output
+
+     def save_vocabulary(self, save_directory: str, filename_prefix: str | None = None):
+         save_dir = Path(save_directory)
+         save_dir.mkdir(parents=True, exist_ok=True)
+         out_name = f"{filename_prefix + '-' if filename_prefix else ''}tokenizer.model"
+         out_path = save_dir / out_name
+         if Path(self.vocab_file).resolve() != out_path.resolve():
+             shutil.copy2(self.vocab_file, out_path)
+         vocab_src = Path(self.vocab_file).with_suffix(".vocab")
+         if vocab_src.exists():
+             vocab_out = save_dir / f"{filename_prefix + '-' if filename_prefix else ''}tokenizer.vocab"
+             if vocab_src.resolve() != vocab_out.resolve():
+                 shutil.copy2(vocab_src, vocab_out)
+         return (str(out_path),)
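`build_inputs_with_special_tokens` wraps every encoded sequence as BOS + tokens + EOS. That is presumably why the README example passes `add_special_tokens=False` and prepends BOS by hand: a trailing EOS after the `<|assistant|>` marker would tell the model the reply is already finished. The sketch below reproduces the method's logic with plain lists. Ids 1, 2, 4, and 5 come from `config.json` and `tokenizer_config.json`; the text-token ids are made up.

```python
BOS_ID, EOS_ID = 1, 2  # bos_token_id / eos_token_id from config.json


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """Plain-list copy of the tokenizer method: BOS + ids (+ second segment) + EOS."""
    output = [BOS_ID] + list(token_ids_0)
    if token_ids_1 is not None:
        output += list(token_ids_1)
    output += [EOS_ID]
    return output


prompt_ids = [4, 900, 901, 5]  # <|user|>, two made-up text tokens, <|assistant|>

default_encoding = build_inputs_with_special_tokens(prompt_ids)  # default special-token wrapping
generation_prompt = [BOS_ID] + prompt_ids                        # what the README builds by hand
```

The hand-built prompt ends on `<|assistant|>`, leaving the model free to generate the reply and emit EOS itself.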