KoHRM-Text-1.4B
English
KoHRM-Text-1.4B is a scratch-pretrained Korean/English/code/terminal/tool-use model built from the sapientinc/HRM-Text PrefixLM training stack.
This is not a continued finetune of sapientinc/HRM-Text-1B. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.
Current Status
This repository is a rolling latest public model export. Training is still in progress.
- Main repo: `LLM-OS-Models/KoHRM-Text-1.4B`
- Current public files: `model.safetensors`, `config.json`, tokenizer files, and this `README.md`
- Raw FSDP2 resume checkpoints: `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints`
- Prepared data: `LLM-OS-Models/KoHRM-Text-1.4B-prepared-data`
- Project code: https://github.com/LLM-OS-Models/KoHRM-text
- Upstream HRM-Text code: https://github.com/sapientinc/HRM-Text
- HRM-Text paper: https://arxiv.org/html/2605.20613
- Tokenizer repo: `LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K`
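The main branch is overwritten with the newest converted EMA safetensors export each time a new training checkpoint is converted and uploaded (every 10,000 steps during training), so the file names stay the same. To test the latest public weights, download with `revision="main"`.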
Training Method At A Glance
KoHRM-Text is best understood as instruction pretraining from scratch.
It is not ordinary raw-text causal LM pretraining, and it is not only a small SFT pass on top of an existing base model.
raw data -> tokenizer -> V1Dataset -> PrefixLM batches
-> HRM H/L recurrence -> LM head -> response-only loss
The input context is handled as a PrefixLM prefix:
instruction / prefix: bidirectional attention, no loss
response: causal attention, response-only CE loss
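The attention and loss layout above can be sketched in plain Python. This is an illustrative sketch, not the upstream implementation: the function names `prefix_lm_mask` and `response_loss_mask` are hypothetical, and the first `prefix_len` positions are assumed to be the instruction prefix.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Return mask[i][j]: True if query position i may attend to key position j.

    Every position sees the whole prefix (bidirectional inside the prefix);
    response positions additionally see earlier response positions (causal).
    """
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def response_loss_mask(prefix_len, total_len):
    """Return True for positions that contribute to the cross-entropy loss."""
    return [i >= prefix_len for i in range(total_len)]


if __name__ == "__main__":
    # Two prefix tokens followed by two response tokens.
    for row in prefix_lm_mask(2, 4):
        print("".join("x" if allowed else "." for allowed in row))
    print(response_loss_mask(2, 4))
```

Compared with an ordinary causal mask, the only difference is the `j < prefix_len` term, which lets prefix tokens attend to later prefix tokens.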
The architecture keeps the upstream HRM-Text recurrent design:
H module: slower strategic state
L module: faster execution state
schedule: H2L3 recurrent computation
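As a rough illustration of the schedule, the sketch below lists the order in which the two modules would be updated. It assumes that "H2L3" means two H cycles with three L steps inside each cycle; that reading is an assumption, not something this card states, and the function name `h2l3_update_order` is hypothetical.

```python
def h2l3_update_order(h_cycles=2, l_steps=3):
    """Return the module update order for one recurrent forward pass.

    ASSUMPTION: "H2L3" = 2 H cycles, each made of 3 L steps followed by
    one H update. The real upstream schedule may differ in detail.
    """
    order = []
    for h in range(h_cycles):
        for l in range(l_steps):
            # Fast execution state refines under the current strategic state.
            order.append(f"L{h}.{l}")
        # Slow strategic state updates once per cycle.
        order.append(f"H{h}")
    return order


if __name__ == "__main__":
    print(" -> ".join(h2l3_update_order()))
```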
For a readable full explanation of the training method, architecture, PT/SFT distinction, staged continuation, and checkpoint naming, see the project document:
MODEL_TRAINING_ARCHITECTURE_GUIDE_2026-05-28.md in https://github.com/LLM-OS-Models/KoHRM-text
Important Compatibility Note
The public repo currently contains the converted model weights and tokenizer, but it does not yet include a Hugging Face trust_remote_code modeling implementation for HrmTextForCausalLM.
What works today:
- Download the latest public weights.
- Load the tokenizer directly with `tokenizers.Tokenizer.from_file("tokenizer.json")`.
- Inspect `config.json`.
- Verify `model.safetensors` on CPU or Colab T4.
What is not supported yet in plain Transformers:
- `AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")`
- One-line hosted text generation from this repo
Expected reason: model_type: "hrm_text" is a custom HRM-Text architecture. Public generation will require adding the compatible HrmTextForCausalLM remote-code files to this model repo or releasing a standard wrapper.
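One way to verify `model.safetensors` on a CPU-only machine is to read its header without loading any tensors. The sketch below relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `count_parameters` are hypothetical and not part of the project code.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the tensor table of a .safetensors file (name -> dtype/shape/offsets)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    return sum(prod(entry["shape"]) for entry in header.values())


# Example usage (path is an assumption; point it at the downloaded file):
# header = read_safetensors_header("model.safetensors")
# print(len(header), "tensors,", f"{count_parameters(header):,}", "parameters")
```

If the export matches the Model Details table, the parameter count printed here should be close to the 1,384,120,320 listed there.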
Model Details
| Field | Value |
|---|---|
| Model id | LLM-OS-Models/KoHRM-Text-1.4B |
| Standard name | KoHRM-Text-1.4B |
| Training origin | scratch |
| Architecture family | HRM-Text PrefixLM |
| Architecture size | XL |
| Parameters | 1,384,120,320 |
| Context length | 4,096 tokens |
| Training dtype | bfloat16 |
| Public export dtype | bfloat16 EMA safetensors |
| Tokenizer | byte-level BPE, NFC normalization |
| Vocabulary size | 131,072 |
| Objective | PrefixLM response-only loss |
| Optimizer | Adam-atan2 from upstream HRM-Text |
| EMA | 0.9999 |
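For readers unfamiliar with the EMA row, the sketch below shows a standard exponential moving average of weights with decay 0.9999. It is a generic illustration under that assumption; the upstream training code may add details such as warmup or bias correction, and the function name `ema_update` is hypothetical.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: keep `decay` of the average and blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def ema_horizon(decay=0.9999):
    """Rough number of recent steps that dominate the average."""
    return 1.0 / (1.0 - decay)


if __name__ == "__main__":
    ema = [0.0]
    for _ in range(10_000):
        ema = ema_update(ema, [1.0])
    print(round(ema_horizon()), ema[0])
```

With decay 0.9999 the average looks back over roughly 10,000 steps, which is why the public EMA export can lag behind the raw training weights.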
Converted config highlights:
```json
{
  "model_type": "hrm_text",
  "architectures": ["HrmTextForCausalLM"],
  "vocab_size": 131072,
  "hidden_size": 1536,
  "num_hidden_layers": 32,
  "num_attention_heads": 12,
  "max_position_embeddings": 4096,
  "prefix_lm": true
}
```
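A few derived numbers follow directly from the config values and the parameter count in the table. The sketch below is plain arithmetic; whether the output head shares weights with the input embedding is not stated in this card, so only the input embedding is counted.

```python
config = {"vocab_size": 131072, "hidden_size": 1536, "num_attention_heads": 12}
total_params = 1_384_120_320  # from the Model Details table

# Per-head width of the attention layers.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Parameters in the input token embedding alone (vocab x hidden).
embedding_params = config["vocab_size"] * config["hidden_size"]
embedding_share = embedding_params / total_params

# bfloat16 stores each parameter in 2 bytes.
bf16_bytes = total_params * 2
bf16_gib = bf16_bytes / 2**30

if __name__ == "__main__":
    print("head_dim:", head_dim)
    print("embedding params:", f"{embedding_params:,}", f"({embedding_share:.1%})")
    print("bf16 weights:", f"{bf16_bytes / 1e9:.2f} GB", f"({bf16_gib:.2f} GiB)")
```

The embedding share helps explain why moving from a 65K to a 131K vocabulary noticeably changes the parameter budget at this model size.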
Compared With The HRM-Text Paper
This run can take longer than the paper recipe even on 8 x H200 because the setup is not identical:
- The paper reference used 16 x H100; this run uses 8 x H200.
- KoHRM uses a larger 131K tokenizer vocabulary, compared with the upstream 65K tokenizer.
- The public KoHRM size is about 1.38B parameters.
- The stable long-run batch is 180,224 tokens/step after OOM probing; larger batches ran briefly but were not chosen, for reliability.
- The continuation includes extra Korean, terminal, tool-call, legal, finance, wiki, and repeated HRM-cleaned stages.
This does not automatically guarantee better benchmark scores. The expected upside is domain-specific: Korean tokenization efficiency, Korean legal/finance/wiki coverage, terminal trajectories, tool-call formatting, and code-oriented behavior should have a better chance than the upstream English/general checkpoint. Final claims require evaluation after the planned continuation and SFT finish.
Tokenizer
The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It keeps common chat/tool special tokens as stable single tokens where possible.
| Sample bucket | chars/token |
|---|---|
| Korean general text | 2.60 |
| Korean legal text | 2.36 |
| Korean terminal instruction | 2.18 |
| shell command | 2.68 |
| tool-call JSON | 3.32 |
| Python code | 3.37 |
| English | 4.40 |
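The chars/token figures above can be reproduced on your own samples with a small helper. This is a sketch: `chars_per_token` and `approx_chars_in_context` are hypothetical names, and the helper takes any `encode` callable so it does not depend on a specific tokenizer library.

```python
def chars_per_token(texts, encode):
    """Average characters per token over `texts`, given an `encode(text) -> ids` callable."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(encode(text)) for text in texts)
    return total_chars / total_tokens


def approx_chars_in_context(ratio, context_length=4096):
    """Rough number of characters that fit in the context window at a given ratio."""
    return ratio * context_length


# Example usage with the public tokenizer (requires the `tokenizers` package
# and a local copy of tokenizer.json):
# from tokenizers import Tokenizer
# tok = Tokenizer.from_file("tokenizer.json")
# print(chars_per_token(["your sample text"], lambda t: tok.encode(t).ids))
```

At 2.60 chars/token, the 4,096-token context holds on the order of ten thousand Korean characters.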
Formatting tokens:
<|im_start|> instruction start
<|im_end|> instruction end
<|box_end|> response/end marker
<|object_ref_start|> direct condition
<|object_ref_end|> chain-of-thought style condition
<|quad_start|> noisy condition
<|quad_end|> synthetic condition
Prompt format used by the project-side inference code:
<|im_start|><|object_ref_start|>YOUR_PROMPT_HERE<|im_end|>
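The prompt format and condition tokens above can be combined in a small string builder. This is a minimal sketch: the function `build_prompt` and the short condition keys other than `"direct"` are illustrative assumptions, not the project's API.

```python
# Condition tokens as listed in the formatting-token table above.
# ASSUMPTION: the short keys ("cot", "noisy", "synthetic") are made up for this sketch.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synthetic": "<|quad_end|>",
}


def build_prompt(text, condition="direct"):
    """Wrap `text` as an instruction prefix with the chosen condition token."""
    return f"<|im_start|>{CONDITION_TOKENS[condition]}{text}<|im_end|>"


if __name__ == "__main__":
    print(build_prompt("YOUR_PROMPT_HERE"))
```

The model's response is expected to follow the `<|im_end|>` marker and end at `<|box_end|>`.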
Colab T4 Long Knowledge Probe
A ready-to-run Colab notebook is available in the project repo (https://github.com/LLM-OS-Models/KoHRM-text).
The notebook downloads the latest public files and runs long-form generation prompts that match the current pretraining data style. It is intended to inspect knowledge signal, Korean fluency, repetition, and runtime correctness after pretraining-stage checkpoints.
This is not a final chat/SFT benchmark. It intentionally avoids format-constrained SFT-style tests because the public checkpoint is still a pretraining-stage model and has not been behavior-aligned by SFT/LoRA/RL.
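The notebook also avoids `transformers`, `AutoTokenizer`, and `AutoModelForCausalLM`, because in some Colab environments `transformers` raises a `torchvision::nms` import error or cannot find the custom architecture. Instead, it uses:
- `tokenizers.Tokenizer.from_file("tokenizer.json")`
- `safetensors.torch.load_file("model.safetensors")`
- `kohrm_colab_generate.py`, a small PyTorch SDPA runtime for the HRM-Text architecture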
```python
# Colab/Jupyter cell: lines starting with "!" run as shell commands.
!pip -q install -U huggingface_hub hf_transfer safetensors
!pip -q install --force-reinstall -q "tokenizers>=0.22.0,<0.23.1"

from pathlib import Path
import json
import importlib.util
import sys

from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

repo_id = "LLM-OS-Models/KoHRM-Text-1.4B"

# Fetch only the files the helper needs from the rolling "main" revision.
repo_dir = Path(snapshot_download(
    repo_id,
    revision="main",
    allow_patterns=[
        "README.md",
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "model.safetensors",
        "kohrm_colab_generate.py",
    ],
))
print("Downloaded to:", repo_dir)

# Sanity-check the converted config before loading weights.
config = json.loads((repo_dir / "config.json").read_text())
print("model_type:", config["model_type"])
print("hidden_size:", config["hidden_size"])
print("vocab_size:", config["vocab_size"])
print("context:", config["max_position_embeddings"])

# Import the standalone runtime from the downloaded snapshot.
spec = importlib.util.spec_from_file_location(
    "kohrm_colab_generate",
    repo_dir / "kohrm_colab_generate.py",
)
kohrm = importlib.util.module_from_spec(spec)
sys.modules["kohrm_colab_generate"] = kohrm
spec.loader.exec_module(kohrm)

model, tokenizer, cfg = kohrm.load_kohrm(repo_dir, max_gpu_memory_gib=14.0)

# Sampling settings shared by all prompts.
settings = dict(
    max_seq_len=1536,
    temperature=0.65,
    top_p=0.92,
    repetition_penalty=1.05,
    no_repeat_ngram_size=0,
    condition="direct",
)

# Korean prompts written in the style of the pretraining data.
prompts = {
    # "How do exchange-rate fluctuations affect personal investment, and how can one prepare?"
    "finance": "환율 변동이 개인 투자에 미치는 영향과 대비 전략은 무엇인가요?",
    # Korean Wikipedia-style header for the article "훈민정음" (Hunminjeongeum), part 1/1.
    "kowiki_style": """다음은 한국어 위키백과 문서 원문 일부입니다. 백과사전식 한국어, 고유명사, 날짜, 기술/사회/문화 지식을 그대로 학습하십시오.
[문서명]
훈민정음
[부분]
1/1""",
    # Korean statute-style header for "형법" (Criminal Act), part 1/1.
    "legal_style": """다음은 대한민국 법령/자치법규 원문 일부입니다. 법률 한국어, 조문 구조, 번호 체계, 기관명, 시행일자 표현을 그대로 학습하십시오.
[자료종류]
law
[문서명]
형법
[경로]
kr/형법/법률.md
[부분]
1/1""",
}

# Generate one long-form continuation per prompt.
for name, prompt in prompts.items():
    print("=" * 80)
    print(name)
    output = kohrm.generate_from_loaded(
        model,
        tokenizer,
        cfg,
        prompt,
        max_new_tokens=384,
        min_new_tokens=160,
        **settings,
    )
    print(output)
```
Expected result:
- `model_type` should be `hrm_text`.
- `vocab_size` should be `131072`.
- The helper should load the 1.38B public `model.safetensors` export.
- On Colab T4, generation runs in fp16 through PyTorch scaled-dot-product attention.
- First generation can take a few minutes because it downloads and loads the full weight file (roughly 2.8 GB).
- This is a rolling pretraining checkpoint. Compare later checkpoints with the same long prompts before drawing final conclusions.
Prompt format used by the helper, matching upstream InferenceCheckpoint.tokenize_prompt():
<|im_start|><|object_ref_start|>PROMPT<|im_end|>
Plain AutoModelForCausalLM.generate() is still not the supported path. This model is a custom hrm_text architecture, so ordinary Transformers generation requires a future trust_remote_code wrapper. Use the notebook/helper above for public model.safetensors generation today.
Internal Raw-Checkpoint Generation
For training-machine debugging and exact raw FSDP2 checkpoint recovery, the project still includes the upstream-style inference path:
- `simple_inference_engine.py`
- raw checkpoints from `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints`
- CUDA/FlashAttention-oriented execution
That path is mainly for internal continuation/evaluation, not the easiest Colab test.
Training Data
Prepared data artifacts are uploaded to:
https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data
The training objective is PrefixLM response-only loss. Instruction/prompt tokens are visible as context, while loss is applied to the response span.
Major prepared data groups:
| Dataset group | Tokens | Use |
|---|---|---|
| koterm_pretrain_mix_v1 | 711.3M | stage-0/stage0b |
| HRM cleaned fast-cap stage1/stage1b | 14.55B | HRM-style instruction pretraining |
| HRM cleaned full/no-cap stage2 | 14.55B | completed continuation |
| HRM cleaned full/no-cap extra stage2b | 14.55B | active continuation |
| Local terminal conversations | 9.39B | terminal/code/tool-heavy continuation |
| Korean tool/legal/wiki/finance mix | 3.02B | Korean domain and tool continuation |
| BCAI Finance Korean | 857.7M | Korean finance/domain data |
| Korean legal/admin task data | 629.0M | Korean legal/admin data |
| Korean Wikipedia | 462.5M | Korean general text |
| ToolBench train tool-call data | 127.0M | tool-call pretraining |
| SWE-ZERO + GLM reasoning subsets | 251.2M | code/reasoning data |
Evaluation-like datasets are excluded where identified, including ToolBench eval, Terminal Bench style evaluation data, and benchmark-oriented chi-bench data.
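The token counts in the table can be turned into rough optimizer-step counts using the 180,224 tokens/step batch size reported earlier in this card. The sketch below assumes a single pass over each group and that the table's token counts are what the batch size measures; both are assumptions, and the table values are rounded, so treat the results as estimates.

```python
TOKENS_PER_STEP = 180_224  # stable long-run batch from this card
CONTEXT_LENGTH = 4_096
NUM_GPUS = 8


def steps_for(tokens):
    """Optimizer steps needed to consume `tokens` once (ceiling division)."""
    return -(-tokens // TOKENS_PER_STEP)


# Rounded token counts copied from the table above.
group_tokens = {
    "koterm_pretrain_mix_v1": 711_300_000,
    "HRM cleaned (one stage)": 14_550_000_000,
    "Local terminal conversations": 9_390_000_000,
}

if __name__ == "__main__":
    for name, tokens in group_tokens.items():
        print(f"{name}: about {steps_for(tokens):,} steps")
    print("full-length sequences per step:", TOKENS_PER_STEP // CONTEXT_LENGTH)
    print("tokens per GPU per step:", TOKENS_PER_STEP // NUM_GPUS)
```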
Training Run
The current run uses staged continuation:
stage0
-> stage0b
-> stage1
-> stage2
-> stage3
-> stage4
-> stage1b
-> stage2b
-> stage3b
-> stage4b
-> stage1c
-> stage2c
-> stage3c
-> stage4c
The checkpoint carries model weights, optimizer state, EMA weights, and recurrent carry state. resume_step_offset and total_steps_override are used so the learning-rate schedule follows the intended longer run instead of resetting at each stage.
As of 2026-05-27, stage2b is active. The continuation watcher is scheduled to launch stage3b -> stage4b -> stage1c -> stage2c -> stage3c -> stage4c after each completed checkpoint. The handoff reads the actual epoch_1_info.json global_step from each completed checkpoint before starting the next stage.
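The effect of `resume_step_offset` and `total_steps_override` can be illustrated with a toy schedule. This is a sketch under stated assumptions: the warmup-plus-cosine shape, `peak_lr`, `warmup_steps`, and the function name `lr_at` are placeholders chosen for illustration, since this card does not specify the actual schedule.

```python
import math


def lr_at(local_step, resume_step_offset, total_steps, peak_lr=1e-4, warmup_steps=1_000):
    """Learning rate for a stage-local step, evaluated on the global schedule.

    `resume_step_offset` plays the role of the global_step read from the
    previous stage's checkpoint; `total_steps` plays the role of
    total_steps_override. Shape and defaults are ASSUMPTIONS.
    """
    global_step = resume_step_offset + local_step
    if global_step < warmup_steps:
        return peak_lr * global_step / warmup_steps
    progress = (global_step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    total = 400_000
    # Last step of one stage and first step of the next land on the same curve.
    print(lr_at(50_000, 0, total), lr_at(0, 50_000, total))
    # Without the offset, a new stage would restart from warmup.
    print(lr_at(0, 0, total))
```

Passing the previous stage's `global_step` as the offset keeps the learning rate continuous across the handoff instead of restarting warmup at every stage.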
Intended Use
This checkpoint is intended for:
- continued pretraining experiments
- Korean tokenizer and HRM-Text architecture experiments
- terminal/tool-call/code pretraining research
- checkpoint conversion and evaluation work
It is not yet intended as a finished assistant model.
Limitations
- This is an intermediate checkpoint, not a final aligned instruct model.
- The full planned continuation has not finished.
- Final SFT and safety tuning have not been completed.
- Public benchmark scores for this new checkpoint are not final.
- Plain Transformers generation requires adding the custom `hrm_text` modeling wrapper or remote-code files.
- Tool-call JSON validity and terminal action safety must be evaluated before production use.
Citation
This work builds on HRM-Text:
- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text