# HanForge 47M SFT: Korean Conversational Model
A Korean chat model fine-tuned from drlee1/HanForge-base with knowledge distillation on 24,693 Korean question-answer pairs spanning five everyday domains.
The model produces longer, more naturally phrased Korean responses than a templated baseline, but is less reliable under greedy decoding; sampled decoding is recommended.
## Highlights
- Longer, more natural Korean responses, averaging about 130 characters (2–3 sentences)
- Five everyday domains: greetings & conversation, food & cooking, Korean culture & geography, health & habits, emotional support
- Pure Korean output: 100% Hangul ratio, zero foreign-script leakage
- Compact: 47M parameters
## Intended Use
Suitable for:
- Korean chat applications within everyday-conversation domains, where natural-sounding replies matter
- Resource-constrained deployments needing a small Korean model
- Research into small-LM knowledge distillation and instruction tuning
Not suitable for:
- Factual question answering requiring high accuracy (the synthetic data is not fact-checked)
- Multi-step reasoning, coding, or technical tasks
- Open-domain conversation outside the five training domains
- Any safety-critical application
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "drlee1/HanForge-47M-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

USER, ASSISTANT = "<|user|>", "<|assistant|>"

def chat(prompt: str, max_new_tokens: int = 200, seed: int = 42) -> str:
    torch.manual_seed(seed)
    text = f"{USER}\n{prompt}\n{ASSISTANT}\n"
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

    # Prepend BOS manually (add_special_tokens=False above skips it)
    bos = inputs["input_ids"].new_full((1, 1), tokenizer.bos_token_id)
    inputs["input_ids"] = torch.cat([bos, inputs["input_ids"]], dim=1)
    inputs["attention_mask"] = torch.cat(
        [inputs["attention_mask"].new_ones((1, 1)), inputs["attention_mask"]], dim=1
    )

    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampled decoding is recommended
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens
    return tokenizer.decode(out[0, inputs["input_ids"].size(1):], skip_special_tokens=True).strip()

# "Please recommend travel destinations worth visiting in Korea."
print(chat("한국에서 가 볼 만한 여행지를 추천해 주세요."))
```
### Decoding tips
- Use sampling, not greedy. Greedy decoding is prone to repetition with this model. Recommended settings: `temperature=0.8`, `top_p=0.9`.
- Try multiple seeds. Some prompts produce a noticeably better answer on the second or third sampled generation; see the sketch below.
- Cap output length. 150–200 new tokens is usually enough; longer generations rarely improve quality.
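A minimal sketch of the multi-seed tip, reusing the `chat()` helper above. The `chat_best_of` name and the pick-the-longest heuristic are illustrative assumptions, not part of the released model:

```python
def chat_best_of(prompt: str, seeds=(42, 43, 44), max_new_tokens: int = 200) -> str:
    # Sample the same prompt with several seeds and keep one reply.
    # Picking the longest candidate is only a stand-in heuristic; substitute
    # manual review or any other selection rule that fits your application.
    candidates = [chat(prompt, max_new_tokens=max_new_tokens, seed=s) for s in seeds]
    return max(candidates, key=len)

# "I'm feeling a bit down today; what should I do?"
print(chat_best_of("오늘 기분이 좀 우울한데 어떻게 하면 좋을까요?"))
```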
## Training Data
Fine-tuned on 24,693 Korean question-answer pairs prepared through a knowledge-distillation approach. The dataset spans 200 (domain, topic) pairs across five everyday domains, with each pair contributing roughly 100 diverse user-style questions paired with concise polite Korean answers.
The five training domains are:
| Domain | Topics covered |
|---|---|
| Daily greetings & conversation | greetings, thanks, apologies, introductions, mood, comfort, requests |
| Food & cooking basics | Korean dishes, ingredients, simple recipes, recommendations |
| Korean culture & geography | cities, mountains, traditional clothing, holidays, traditions |
| Health & lifestyle habits | exercise, sleep, nutrition, stress, daily routines |
| Emotions & empathy | sadness, loneliness, anxiety, joy, gratitude, comfort |
After filtering for polite-ending and language-purity constraints (about 8.5% drop rate), the final training set carries 100% Hangul ratio, a consistent polite voice, and an average response length of ~134 characters.
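As a rough illustration of what such filtering might involve, here is a hypothetical sketch; the polite-ending list, the foreign-script regex, and the threshold are assumptions, not the actual preprocessing code:

```python
import re

POLITE_ENDINGS = ("요.", "요!", "요?", "다.")  # assumed polite sentence endings
FOREIGN_SCRIPT = re.compile(r"[A-Za-z\u3040-\u30ff\u4e00-\u9fff]")  # Latin, kana, CJK ideographs

def hangul_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hangul syllables or jamo."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hangul = [c for c in letters if "가" <= c <= "힣" or "ㄱ" <= c <= "ㅣ"]
    return len(hangul) / len(letters)

def keep_example(answer: str) -> bool:
    # Keep only answers that are purely Korean and end in a polite form.
    return (
        FOREIGN_SCRIPT.search(answer) is None
        and hangul_ratio(answer) == 1.0
        and answer.strip().endswith(POLITE_ENDINGS)
    )
```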
## Training Procedure
Fine-tuned on top of drlee1/HanForge-base using full-parameter SFT with response-only loss masking.
| Setting | Value |
|---|---|
| Training samples | 24,693 |
| Epochs | 5 |
| Effective batch size | 16 |
| Learning rate | 5e-5 (cosine, 3% warmup) |
| Sequence length | 384 |
| Precision | bf16 mixed |
| Final training loss | 10.4 |
| Validation perplexity | ~25 |
| Wall-clock time | ~19 minutes (Mac MPS) |
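A minimal sketch of the response-only loss masking mentioned above, assuming a single tokenized example and the `<|assistant|>` marker from the chat template; illustrative only, not the exact training code:

```python
import torch

IGNORE_INDEX = -100  # labels equal to -100 are ignored by PyTorch cross-entropy

def build_labels(input_ids: torch.Tensor, assistant_token_id: int) -> torch.Tensor:
    """For a 1-D sequence, mask every token up to and including the <|assistant|>
    marker so the loss is computed only on the response tokens."""
    labels = input_ids.clone()
    marker_positions = (input_ids == assistant_token_id).nonzero(as_tuple=True)[0]
    if marker_positions.numel() == 0:
        raise ValueError("sequence contains no <|assistant|> marker")
    labels[: marker_positions[0] + 1] = IGNORE_INDEX
    return labels
```

With labels built this way, prompt tokens contribute nothing to the SFT loss and only the assistant reply is learned.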
## Evaluation
Evaluated on 20 prompts (14 in-distribution, 6 out-of-distribution) under both greedy and sampled decoding.
| Metric (sampled, t=0.8) | Result |
|---|---|
| Korean character ratio | 100% |
| Foreign-script leakage | 0% |
| End-of-sequence within 128 tokens | 90% |
| Average response length | ~120 chars |

| Metric (greedy) | Result |
|---|---|
| Korean character ratio | 100% |
| Foreign-script leakage | 0% |
| End-of-sequence within 128 tokens | 55% |
| Maximum repeated-token run | up to ~200 (collapse risk) |
The model is reliable on in-distribution Korean conversation but not on out-of-distribution topics. For abstract or domain-specific questions, responses are often well-formed Korean but semantically off.
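The repetition figure in the greedy table can be approximated with a simple pass over the generated token ids; a small sketch of one way to compute it (an assumed helper, not the original evaluation script):

```python
def max_repeated_token_run(token_ids: list[int]) -> int:
    """Length of the longest run of a single repeated token id.
    Large values under greedy decoding indicate repetition collapse."""
    if not token_ids:
        return 0
    best = run = 1
    for prev, cur in zip(token_ids, token_ids[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best
```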
## Limitations and Bias
- Distilled-data origin: Training answers were prepared via knowledge distillation. Facts, recommendations, and explanations may be incorrect, stale, or biased; do not rely on the model for accurate information.
- Domain restriction: The five training domains define the model's reliable scope. Out-of-domain prompts produce responses that may look fluent but are often off-topic.
- Greedy decoding instability: Small-scale models trained on longer responses tend to fall into repetition under greedy decoding. This model is no exception; always use sampling.
- No alignment / safety tuning: The model has not undergone RLHF or harmful-content filtering. Inputs designed to elicit unsafe content may produce unsafe Korean text.
- Distillation bias: Any biases present in the distillation source are inherited by the model.
## License
Released under the Apache License 2.0.
## Citation
```bibtex
@misc{hanforge_47m_sft_2026,
  author = {DongRyeol Lee},
  title  = {HanForge 47M SFT: A Korean Conversational Model Trained via Knowledge Distillation},
  year   = {2026},
  note   = {Fine-tuned from drlee1/HanForge-base on 24.7k Korean Q\&A pairs across five everyday domains}
}
```