KoHRM-Text-1.4B

Language: English | Korean

English

KoHRM-Text-1.4B is a scratch-pretrained Korean/English/code/terminal/tool-use model built from the sapientinc/HRM-Text PrefixLM training stack.

This is not a continued finetune of sapientinc/HRM-Text-1B. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.

Current Status

This repository is a rolling latest public model export. Training is still in progress.

The main branch is overwritten with the newest converted EMA safetensors export as training checkpoints are uploaded. To test the latest public weights, download with revision="main".

Training Method At A Glance

KoHRM-Text is best understood as instruction pretraining from scratch.

It is not ordinary raw-text causal LM pretraining, and it is not only a small SFT pass on top of an existing base model.

raw data -> tokenizer -> V1Dataset -> PrefixLM batches
         -> HRM H/L recurrence -> LM head -> response-only loss

The input context is handled as a PrefixLM prefix:

instruction / prefix: bidirectional attention, no loss
response:             causal attention, response-only CE loss
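The attention rule above can be sketched as a boolean mask. This is a minimal pure-Python illustration, not the project's implementation; the function name `prefix_lm_mask` and the toy lengths are assumptions made for the example.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            if q < prefix_len:
                # Prefix queries see the whole prefix (bidirectional), never the response.
                row.append(k < prefix_len)
            else:
                # Response queries see the prefix and earlier-or-equal positions (causal).
                row.append(k <= q)
        mask.append(row)
    return mask


# Toy example: 3 prefix tokens followed by 2 response tokens.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position: the first three rows share the same fully visible prefix block, and the last two rows grow causally.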

The architecture keeps the upstream HRM-Text recurrent design:

H module: slower strategic state
L module: faster execution state
schedule: H2L3 recurrent computation
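One possible reading of the schedule, shown only as a sketch: it assumes "H2L3" means 2 slow H updates, each preceded by 3 fast L steps. That interpretation, the function name `hrm_schedule`, and the step labels are assumptions; the project guide is the authoritative description.

```python
def hrm_schedule(h_cycles: int = 2, l_steps: int = 3) -> list[str]:
    """Return the update order for a nested slow/fast recurrence (assumed reading of H2L3)."""
    order = []
    for h in range(h_cycles):
        for l in range(l_steps):
            # Fast module refines its state several times per slow cycle.
            order.append(f"L{h}.{l}")
        # Slow module updates once after the fast steps of this cycle.
        order.append(f"H{h}")
    return order


print(hrm_schedule())
```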

For a readable full explanation of the training method, architecture, PT/SFT distinction, staged continuation, and checkpoint naming, see the project document:

MODEL_TRAINING_ARCHITECTURE_GUIDE_2026-05-28.md in https://github.com/LLM-OS-Models/KoHRM-text

Important Compatibility Note

The public repo currently contains the converted model weights and tokenizer, but it does not yet include a Hugging Face trust_remote_code modeling implementation for HrmTextForCausalLM.

What works today:

  • Download the latest public weights.
  • Load the tokenizer directly with tokenizers.Tokenizer.from_file("tokenizer.json").
  • Inspect config.json.
  • Verify model.safetensors on CPU or Colab T4.
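As a sketch of the verification step, the safetensors header can be read without loading any tensor data: the file starts with an 8-byte little-endian header length followed by a JSON header. The helper names below are hypothetical, and the demo writes a tiny stand-in file rather than touching the real export.

```python
import json
import struct
import tempfile
from math import prod
from pathlib import Path


def read_safetensors_header(path: Path) -> dict:
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        return json.loads(f.read(header_len))


def count_parameters(header: dict) -> int:
    """Sum the element counts of all tensors listed in the header."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    )


# Demo on a tiny stand-in file: one 2x3 BF16 tensor (12 bytes of payload).
demo_header = {"w": {"dtype": "BF16", "shape": [2, 3], "data_offsets": [0, 12]}}
header_bytes = json.dumps(demo_header).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.safetensors"
    demo_path.write_bytes(struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(12))
    demo_count = count_parameters(read_safetensors_header(demo_path))
print(demo_count)  # 6
```

Pointing `read_safetensors_header` at the downloaded model.safetensors and comparing `count_parameters` with the documented parameter count is one cheap consistency check that runs on CPU.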

What is not supported yet in plain Transformers:

  • AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")
  • One-line hosted text generation from this repo

Expected reason: model_type: "hrm_text" is a custom HRM-Text architecture. Public generation will require adding the compatible HrmTextForCausalLM remote-code files to this model repo or releasing a standard wrapper.

Model Details

Field                Value
Model id             LLM-OS-Models/KoHRM-Text-1.4B
Standard name        KoHRM-Text-1.4B
Training origin      scratch
Architecture family  HRM-Text PrefixLM
Architecture size    XL
Parameters           1,384,120,320
Context length       4,096 tokens
Training dtype       bfloat16
Public export dtype  bfloat16 EMA safetensors
Tokenizer            byte-level BPE, NFC normalization
Vocabulary size      131,072
Objective            PrefixLM response-only loss
Optimizer            Adam-atan2 from upstream HRM-Text
EMA                  0.9999

Converted config highlights:

{
  "model_type": "hrm_text",
  "architectures": ["HrmTextForCausalLM"],
  "vocab_size": 131072,
  "hidden_size": 1536,
  "num_hidden_layers": 32,
  "num_attention_heads": 12,
  "max_position_embeddings": 4096,
  "prefix_lm": true
}
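A few quantities follow directly from these config values. The sketch below assumes the usual convention that the per-head dimension is hidden_size divided by num_attention_heads and that there is one vocab_size x hidden_size embedding matrix; whether the LM head shares that matrix is not stated here.

```python
config = {"vocab_size": 131072, "hidden_size": 1536, "num_attention_heads": 12}
total_parameters = 1_384_120_320  # from the Model Details table

# Assumed convention: heads split the hidden dimension evenly.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Size of a single token-embedding matrix (vocab_size x hidden_size).
embedding_params = config["vocab_size"] * config["hidden_size"]
embedding_share = embedding_params / total_parameters

print(head_dim)                  # 128
print(embedding_params)          # 201326592
print(f"{embedding_share:.1%}")  # 14.5%
```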

Compared With The HRM-Text Paper

This run can take longer than the paper recipe even on 8 x H200 because the setup is not identical:

  • The paper reference used 16 x H100; this run uses 8 x H200.
  • KoHRM uses a larger 131K tokenizer vocabulary, compared with the upstream 65K tokenizer.
  • The public KoHRM size is about 1.38B parameters.
  • The stable long-run batch is 180,224 tokens/step, chosen after OOM probing; larger batches ran briefly but were rejected for reliability.
  • The continuation includes extra Korean, terminal, tool-call, legal, finance, wiki, and repeated HRM-cleaned stages.
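To give a feel for the batch size, the arithmetic below relates tokens/step to the context length and to one 14.55B-token stage. It is a rough sketch that assumes full-length 4,096-token sequences and a single pass over the stage; the actual packing and epoch settings may differ.

```python
tokens_per_step = 180_224
context_length = 4_096
stage_tokens = 14.55e9  # size of one "HRM cleaned" stage from the Training Data table

# How many full-length sequences fit into one optimizer step.
sequences_per_step, remainder = divmod(tokens_per_step, context_length)

# Approximate number of steps for one pass over the stage.
steps_for_stage = stage_tokens / tokens_per_step

print(sequences_per_step, remainder)  # 44 0
print(round(steps_for_stage))         # 80733
```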

This does not automatically guarantee better benchmark scores. The expected upside is domain-specific: Korean tokenization efficiency, Korean legal/finance/wiki coverage, terminal trajectories, tool-call formatting, and code-oriented behavior should have a better chance than the upstream English/general checkpoint. Final claims require evaluation after the planned continuation and SFT finish.

Tokenizer

The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It keeps common chat/tool special tokens as stable single tokens where possible.

Sample bucket                chars/token
Korean general text          2.60
Korean legal text            2.36
Korean terminal instruction  2.18
shell command                2.68
tool-call JSON               3.32
Python code                  3.37
English                      4.40
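A chars/token figure like the ones above can be measured with a small helper. This is a sketch: how the published numbers count characters is not specified, so Python string length is an assumption, and `whitespace_encode` is a toy stand-in for the real tokenizer.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of characters covered by one token for the given text."""
    token_ids = encode(text)
    if not token_ids:
        raise ValueError("encoder returned no tokens")
    return len(text) / len(token_ids)


def whitespace_encode(text: str) -> list[int]:
    """Toy encoder for the demo: one token id per whitespace-separated word."""
    return list(range(len(text.split())))


print(chars_per_token("ls -la /tmp", whitespace_encode))
```

With the real tokenizer, the `encode` argument could be `lambda s: tok.encode(s).ids`, where `tok` is loaded via tokenizers.Tokenizer.from_file("tokenizer.json").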

Formatting tokens:

<|im_start|>         instruction start
<|im_end|>           instruction end
<|box_end|>          response/end marker
<|object_ref_start|> direct condition
<|object_ref_end|>   chain-of-thought style condition
<|quad_start|>       noisy condition
<|quad_end|>         synthetic condition

Prompt format used by the project-side inference code:

<|im_start|><|object_ref_start|>YOUR_PROMPT_HERE<|im_end|>
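The format can be assembled with a small helper. In this sketch only the "direct" layout is taken from the text above; placing the other condition tokens in the same slot, the dictionary keys other than "direct", and the name `build_prompt` are assumptions.

```python
# Condition name -> condition token. Only "direct" appears in the documented
# prompt format; the other keys are hypothetical labels for the listed tokens.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synthetic": "<|quad_end|>",
}


def build_prompt(prompt: str, condition: str = "direct") -> str:
    """Wrap a prompt as <|im_start|> + condition token + prompt + <|im_end|>."""
    return f"<|im_start|>{CONDITION_TOKENS[condition]}{prompt}<|im_end|>"


print(build_prompt("YOUR_PROMPT_HERE"))
# <|im_start|><|object_ref_start|>YOUR_PROMPT_HERE<|im_end|>
```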

Colab T4 Long Knowledge Probe

A ready-to-run Colab notebook is available in the project repo:

https://github.com/LLM-OS-Models/KoHRM-text/blob/main/notebooks/KoHRM_Text_1_4B_Colab_T4_Long_Knowledge_Probe.ipynb

The notebook downloads the latest public files and runs long-form generation prompts that match the current pretraining data style. It is intended to inspect knowledge signal, Korean fluency, repetition, and runtime correctness after pretraining-stage checkpoints.

This is not a final chat/SFT benchmark. It intentionally avoids format-constrained SFT-style tests because the public checkpoint is still a pretraining-stage model and has not been behavior-aligned by SFT/LoRA/RL.

It intentionally avoids transformers, AutoTokenizer, and AutoModelForCausalLM. Instead, it uses:

  • tokenizers.Tokenizer.from_file("tokenizer.json")
  • safetensors.torch.load_file("model.safetensors")
  • kohrm_colab_generate.py, a small PyTorch SDPA runtime for the HRM-Text architecture

!pip -q install -U huggingface_hub hf_transfer safetensors
!pip -q install --force-reinstall -q "tokenizers>=0.22.0,<0.23.1"

from pathlib import Path
import json
import importlib.util
import sys
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

repo_id = "LLM-OS-Models/KoHRM-Text-1.4B"

repo_dir = Path(snapshot_download(
    repo_id,
    revision="main",
    allow_patterns=[
        "README.md",
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "model.safetensors",
        "kohrm_colab_generate.py",
    ],
))

print("Downloaded to:", repo_dir)
config = json.loads((repo_dir / "config.json").read_text())
print("model_type:", config["model_type"])
print("hidden_size:", config["hidden_size"])
print("vocab_size:", config["vocab_size"])
print("context:", config["max_position_embeddings"])

spec = importlib.util.spec_from_file_location(
    "kohrm_colab_generate",
    repo_dir / "kohrm_colab_generate.py",
)
kohrm = importlib.util.module_from_spec(spec)
sys.modules["kohrm_colab_generate"] = kohrm
spec.loader.exec_module(kohrm)

model, tokenizer, cfg = kohrm.load_kohrm(repo_dir, max_gpu_memory_gib=14.0)

settings = dict(
    max_seq_len=1536,
    temperature=0.65,
    top_p=0.92,
    repetition_penalty=1.05,
    no_repeat_ngram_size=0,
    condition="direct",
)

prompts = {
    "finance": "환율 변동이 개인 투자에 미치는 영향과 대비 전략은 무엇인가요?",
    "kowiki_style": """다음은 한국어 위키백과 문서 원문 일부입니다. 백과사전식 한국어, 고유명사, 날짜, 기술/사회/문화 지식을 그대로 학습하십시오.

[문서명]
훈민정음

[부분]
1/1""",
    "legal_style": """다음은 대한민국 법령/자치법규 원문 일부입니다. 법률 한국어, 조문 구조, 번호 체계, 기관명, 시행일자 표현을 그대로 학습하십시오.

[자료종류]
law

[문서명]
형법

[경로]
kr/형법/법률.md

[부분]
1/1""",
}

for name, prompt in prompts.items():
    print("=" * 80)
    print(name)
    output = kohrm.generate_from_loaded(
        model,
        tokenizer,
        cfg,
        prompt,
        max_new_tokens=384,
        min_new_tokens=160,
        **settings,
    )
    print(output)

Expected result:

  • model_type should be hrm_text.
  • vocab_size should be 131072.
  • The helper should load the 1.38B public model.safetensors export.
  • On Colab T4, generation runs in fp16 through PyTorch scaled-dot-product attention.
  • First generation can take a few minutes because it downloads and loads the full weight file.
  • This is a rolling pretraining checkpoint. Compare later checkpoints with the same long prompts before drawing final conclusions.
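As a back-of-the-envelope check on download size and T4 memory, the tensor payload can be estimated from the parameter count. The sketch assumes every parameter is stored in a 2-byte type (bfloat16 on disk, fp16 at runtime) and ignores file headers, activations, and any KV cache.

```python
parameters = 1_384_120_320
bytes_per_param = 2  # bfloat16 and float16 both use 2 bytes per value

weight_bytes = parameters * bytes_per_param

print(f"{weight_bytes / 1e9:.2f} GB")     # 2.77 GB
print(f"{weight_bytes / 2**30:.2f} GiB")  # 2.58 GiB
```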

Prompt format used by the helper, matching upstream InferenceCheckpoint.tokenize_prompt():

<|im_start|><|object_ref_start|>PROMPT<|im_end|>

Plain AutoModelForCausalLM.generate() is still not the supported path. This model is a custom hrm_text architecture, so ordinary Transformers generation requires a future trust_remote_code wrapper. Use the notebook/helper above for public model.safetensors generation today.

Internal Raw-Checkpoint Generation

For training-machine debugging and exact raw FSDP2 checkpoint recovery, the project still includes the upstream-style inference path:

  • simple_inference_engine.py
  • raw checkpoints from LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints
  • CUDA/FlashAttention-oriented execution

That path is mainly for internal continuation/evaluation, not the easiest Colab test.

Training Data

Prepared data artifacts are uploaded to:

https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data

The training objective is PrefixLM response-only loss. Instruction/prompt tokens are visible as context, while loss is applied to the response span.
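The response-only objective can be illustrated with a toy calculation. This sketch assumes the loss is a plain mean of per-token negative log-likelihood over response positions; the actual normalization used by the training stack is not stated here, and `response_only_ce` is a hypothetical name.

```python
import math


def response_only_ce(token_log_probs: list[float], prefix_len: int) -> float:
    """Mean negative log-likelihood over response positions only."""
    response = token_log_probs[prefix_len:]  # prefix positions contribute no loss
    return -sum(response) / len(response)


# Toy sequence: 2 prefix tokens, then 2 response tokens with probabilities 0.5 and 0.25.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
loss = response_only_ce(log_probs, prefix_len=2)
print(loss)
```

Changing the two prefix log-probabilities leaves `loss` unchanged, which is the point of masking the prefix.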

Major prepared data groups:

Dataset group                          Tokens  Use
koterm_pretrain_mix_v1                 711.3M  stage0/stage0b
HRM cleaned fast-cap stage1/stage1b    14.55B  HRM-style instruction pretraining
HRM cleaned full/no-cap stage2         14.55B  completed continuation
HRM cleaned full/no-cap extra stage2b  14.55B  active continuation
Local terminal conversations           9.39B   terminal/code/tool-heavy continuation
Korean tool/legal/wiki/finance mix     3.02B   Korean domain and tool continuation
BCAI Finance Korean                    857.7M  Korean finance/domain data
Korean legal/admin task data           629.0M  Korean legal/admin data
Korean Wikipedia                       462.5M  Korean general text
ToolBench train tool-call data         127.0M  tool-call pretraining
SWE-ZERO + GLM reasoning subsets       251.2M  code/reasoning data

Evaluation-like datasets are excluded where identified, including ToolBench eval, Terminal Bench style evaluation data, and benchmark-oriented chi-bench data.

Training Run

The current run uses staged continuation:

stage0
-> stage0b
-> stage1
-> stage2
-> stage3
-> stage4
-> stage1b
-> stage2b
-> stage3b
-> stage4b
-> stage1c
-> stage2c
-> stage3c
-> stage4c

The checkpoint carries model weights, optimizer state, EMA weights, and recurrent carry state. resume_step_offset and total_steps_override are used so the learning-rate schedule follows the intended longer run instead of resetting at each stage.
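The effect of the two settings can be sketched with a generic schedule. The warmup-plus-cosine shape, the peak learning rate, and every numeric value below are illustrative assumptions; only the names resume_step_offset and total_steps_override come from the text above.

```python
import math


def lr_at(local_step: int, *, resume_step_offset: int, total_steps_override: int,
          peak_lr: float = 1e-4, warmup_steps: int = 1000, min_ratio: float = 0.1) -> float:
    """Learning rate computed from the global step instead of the stage-local step."""
    global_step = resume_step_offset + local_step
    if global_step < warmup_steps:
        return peak_lr * (global_step + 1) / warmup_steps  # linear warmup
    span = max(1, total_steps_override - warmup_steps)
    progress = min(1.0, (global_step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


# A new stage that starts at global step 50,000 of a planned 200,000-step run.
continued = lr_at(0, resume_step_offset=50_000, total_steps_override=200_000)
restarted = lr_at(0, resume_step_offset=0, total_steps_override=200_000)
print(continued, restarted)
```

With the offset, the first step of the new stage picks up where the long schedule left off; without it, the stage would fall back into warmup.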

As of 2026-05-27, stage2b is active. The continuation watcher is scheduled to launch stage3b -> stage4b -> stage1c -> stage2c -> stage3c -> stage4c after each completed checkpoint. The handoff reads the actual epoch_1_info.json global_step from each completed checkpoint before starting the next stage.
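The handoff step can be pictured as reading one integer from the finished stage. This sketch assumes epoch_1_info.json is a JSON object with a top-level "global_step" key; the function name and the demo directory are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_global_step(checkpoint_dir: Path) -> int:
    """Read the global step recorded by a completed stage (assumed file layout)."""
    info = json.loads((checkpoint_dir / "epoch_1_info.json").read_text())
    return int(info["global_step"])


# Demo with a stand-in checkpoint directory.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    (ckpt_dir / "epoch_1_info.json").write_text(json.dumps({"global_step": 123456}))
    demo_step = read_global_step(ckpt_dir)
print(demo_step)  # 123456
```

The returned value would then serve as the resume_step_offset for the next stage.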

Intended Use

This checkpoint is intended for:

  • continued pretraining experiments
  • Korean tokenizer and HRM-Text architecture experiments
  • terminal/tool-call/code pretraining research
  • checkpoint conversion and evaluation work

It is not yet intended as a finished assistant model.

Limitations

  • This is an intermediate checkpoint, not a final aligned instruct model.
  • The full planned continuation has not finished.
  • Final SFT and safety tuning have not been completed.
  • Public benchmark scores for this new checkpoint are not final.
  • Plain Transformers generation requires adding the custom hrm_text modeling wrapper or remote-code files.
  • Tool-call JSON validity and terminal action safety must be evaluated before production use.

Citation

This work builds on the HRM-Text architecture and training stack (code: https://github.com/sapientinc/HRM-Text, paper: https://arxiv.org/html/2605.20613).

Korean

KoHRM-Text-1.4B is a Korean/English/code/terminal/tool-call model being trained from scratch on the PrefixLM training stack of sapientinc/HRM-Text.

This model is not a continued finetune of sapientinc/HRM-Text-1B. It uses a new 131K byte-level BPE tokenizer built for Korean and terminal/tool-call formats, and its weights are also trained by scratch pretraining.

Current Status

This repository is a rolling latest model repo that is continually overwritten with the newest public converted export. Training is still in progress.

  • Main model repo: LLM-OS-Models/KoHRM-Text-1.4B
  • Current public files: model.safetensors, config.json, tokenizer files, README.md
  • raw FSDP2 resume checkpoint: LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints
  • prepared data: LLM-OS-Models/KoHRM-Text-1.4B-prepared-data
  • Project code: https://github.com/LLM-OS-Models/KoHRM-text
  • Original HRM-Text code: https://github.com/sapientinc/HRM-Text
  • HRM-Text paper: https://arxiv.org/html/2605.20613
  • tokenizer repo: LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K

To test the latest public weights, download with revision="main". During training, whenever a new checkpoint is converted and uploaded at 10,000-step intervals, the same file names are refreshed with the latest EMA safetensors.

Training Method At A Glance

KoHRM-Text is most accurately viewed as scratch instruction pretraining.

It is neither ordinary raw-text causal LM pretraining nor merely a short SFT pass layered on a finished base model.

raw data -> tokenizer -> V1Dataset -> PrefixLM batches
         -> HRM H/L recurrence -> LM head -> response-only loss

The input context is handled as a PrefixLM prefix.

instruction / prefix: bidirectional attention, no loss
response:             causal attention, response-only CE loss

The architecture keeps the original HRM-Text recurrent design.

H module: slowly changing strategic state
L module: quickly changing execution state
schedule: H2L3 recurrent computation

For a full, plain-language explanation of the training method, architecture, PT/SFT difference, staged continuation, and checkpoint naming, refer to the project document.

MODEL_TRAINING_ARCHITECTURE_GUIDE_2026-05-28.md in https://github.com/LLM-OS-Models/KoHRM-text

Important Compatibility Note

The public repo currently contains the converted model weights and tokenizer, but it does not yet include the HrmTextForCausalLM implementation files for Hugging Face trust_remote_code.

What works right now:

  • Download the latest public weights
  • Load the tokenizer with tokenizers.Tokenizer.from_file("tokenizer.json")
  • Inspect config.json
  • Check model.safetensors integrity on CPU or Colab T4

What does not yet work in plain Transformers:

  • AutoModelForCausalLM.from_pretrained("LLM-OS-Models/KoHRM-Text-1.4B")
  • One-line text generation from this repo alone

The reason is that model_type: "hrm_text" is a custom HRM-Text architecture. Public generation requires an HrmTextForCausalLM remote-code wrapper to be added to this model repo.

Model Details

Field                Value
Model ID             LLM-OS-Models/KoHRM-Text-1.4B
Standard name        KoHRM-Text-1.4B
Training origin      scratch
Architecture family  HRM-Text PrefixLM
Architecture size    XL
Parameters           1,384,120,320
Context length       4,096 tokens
Training dtype       bfloat16
Public export dtype  bfloat16 EMA safetensors
Tokenizer            byte-level BPE, NFC normalization
Vocabulary size      131,072
Objective            PrefixLM response-only loss
Optimizer            Adam-atan2 from HRM-Text
EMA                  0.9999

Converted config highlights:

{
  "model_type": "hrm_text",
  "architectures": ["HrmTextForCausalLM"],
  "vocab_size": 131072,
  "hidden_size": 1536,
  "num_hidden_layers": 32,
  "num_attention_heads": 12,
  "max_position_embeddings": 4096,
  "prefix_lm": true
}

Compared With The HRM-Text Paper

The current run can take longer than the paper recipe because the setup is not identical.

  • The paper reference is 16 x H100; the current run is 8 x H200.
  • KoHRM uses a 131K tokenizer vocabulary, larger than the original 65K tokenizer.
  • The public KoHRM size is about 1.38B parameters.
  • The stable long-run batch was set to 180,224 tokens/step after OOM probing. Larger batches looked feasible at first but were less stable over long runs.
  • Korean, terminal, tool-call, legal, finance, wiki, and repeated HRM-cleaned stages were added.

This does not automatically guarantee higher scores on every benchmark. However, Korean tokenizer efficiency, Korean legal/finance/wiki coverage, terminal trajectories, tool-call formatting, and code-oriented behavior have a chance of improving over the original English/general checkpoint. Final claims must be confirmed by evaluation after the continuation and SFT finish.

Tokenizer

The tokenizer was built with Korean, English, code, shell/terminal text, and JSON/tool-call formats in mind. Frequently used chat/tool special tokens are kept as stable single tokens where possible.

Sample bucket                chars/token
Korean general text          2.60
Korean legal text            2.36
Korean terminal instruction  2.18
shell command                2.68
tool-call JSON               3.32
Python code                  3.37
English                      4.40

Formatting tokens:

<|im_start|>         instruction start
<|im_end|>           instruction end
<|box_end|>          response/end marker
<|object_ref_start|> direct condition
<|object_ref_end|>   chain-of-thought style condition
<|quad_start|>       noisy condition
<|quad_end|>         synthetic condition

Prompt format used by the project-side inference code:

<|im_start|><|object_ref_start|>YOUR_PROMPT_HERE<|im_end|>

Colab T4 Long Knowledge Generation Check

A ready-to-run Colab notebook is available in the project repo.

https://github.com/LLM-OS-Models/KoHRM-text/blob/main/notebooks/KoHRM_Text_1_4B_Colab_T4_Long_Knowledge_Probe.ipynb

The notebook downloads the latest public files on a Colab T4 and runs long generation prompts in the same style as the current pretraining data. Its purpose is to check directly the knowledge signal, Korean fluency, repetition, and runtime behavior of the public model.safetensors for pretraining-stage checkpoints.

This notebook is not a final chat/SFT benchmark. Because the public checkpoint has not yet been behavior-aligned with SFT/LoRA/RL, SFT-style tasks centered on format compliance are intentionally excluded.

In some Colab environments transformers can raise a torchvision::nms import error or fail to find the custom architecture, so this notebook does not use AutoTokenizer or AutoModelForCausalLM. Instead, it uses the following path.

  • tokenizers.Tokenizer.from_file("tokenizer.json")
  • safetensors.torch.load_file("model.safetensors")
  • kohrm_colab_generate.py, a direct implementation of the HRM-Text structure

!pip -q install -U huggingface_hub hf_transfer safetensors
!pip -q install --force-reinstall -q "tokenizers>=0.22.0,<0.23.1"

from pathlib import Path
import json
import importlib.util
import sys
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer

repo_id = "LLM-OS-Models/KoHRM-Text-1.4B"

repo_dir = Path(snapshot_download(
    repo_id,
    revision="main",
    allow_patterns=[
        "README.md",
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "model.safetensors",
        "kohrm_colab_generate.py",
    ],
))

print("Downloaded to:", repo_dir)
config = json.loads((repo_dir / "config.json").read_text())
print("model_type:", config["model_type"])
print("hidden_size:", config["hidden_size"])
print("vocab_size:", config["vocab_size"])
print("context:", config["max_position_embeddings"])

spec = importlib.util.spec_from_file_location(
    "kohrm_colab_generate",
    repo_dir / "kohrm_colab_generate.py",
)
kohrm = importlib.util.module_from_spec(spec)
sys.modules["kohrm_colab_generate"] = kohrm
spec.loader.exec_module(kohrm)

model, tokenizer, cfg = kohrm.load_kohrm(repo_dir, max_gpu_memory_gib=14.0)

settings = dict(
    max_seq_len=1536,
    temperature=0.65,
    top_p=0.92,
    repetition_penalty=1.05,
    no_repeat_ngram_size=0,
    condition="direct",
)

prompts = {
    "finance": "환율 변동이 개인 투자에 미치는 영향과 대비 전략은 무엇인가요?",
    "kowiki_style": """다음은 한국어 위키백과 문서 원문 일부입니다. 백과사전식 한국어, 고유명사, 날짜, 기술/사회/문화 지식을 그대로 학습하십시오.

[문서명]
훈민정음

[부분]
1/1""",
    "legal_style": """다음은 대한민국 법령/자치법규 원문 일부입니다. 법률 한국어, 조문 구조, 번호 체계, 기관명, 시행일자 표현을 그대로 학습하십시오.

[자료종류]
law

[문서명]
형법

[경로]
kr/형법/법률.md

[부분]
1/1""",
}

for name, prompt in prompts.items():
    print("=" * 80)
    print(name)
    output = kohrm.generate_from_loaded(
        model,
        tokenizer,
        cfg,
        prompt,
        max_new_tokens=384,
        min_new_tokens=160,
        **settings,
    )
    print(output)

Expected result:

  • model_type is hrm_text.
  • vocab_size is 131072.
  • The helper loads the 1.38B public model.safetensors export.
  • On Colab T4, generation runs with fp16 PyTorch scaled-dot-product attention.
  • The first run can take a few minutes because it downloads and loads the roughly 2.8 GiB weight file.
  • The current repo is a rolling pretraining checkpoint. Compare knowledge, writing style, and repetition against later checkpoints using the same long prompts.

The prompt format used by the helper matches upstream InferenceCheckpoint.tokenize_prompt().

<|im_start|><|object_ref_start|>PROMPT<|im_end|>

Plain AutoModelForCausalLM.generate() is not yet a supported path. Because this model is a custom hrm_text architecture, ordinary Transformers generation should be supported only after a trust_remote_code wrapper is added. To generate from the public model.safetensors today, use the notebook/helper above.

Internal Raw-Checkpoint Generation

For debugging on the training machine or evaluating an exactly recovered raw FSDP2 checkpoint, the upstream-style inference path is also kept.

  • simple_inference_engine.py
  • raw checkpoints from LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints
  • CUDA/FlashAttention-oriented execution

This path is closer to internal continuation/evaluation use. For the easiest check on Colab, the public model.safetensors helper above is the better choice.

Training Data

Prepared data is uploaded to the dataset repo below.

https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data

The training objective is PrefixLM response-only loss. Instruction/prompt tokens are treated as context, and loss is applied only to the response span.

Major prepared data groups:

Dataset group                          Tokens  Use
koterm_pretrain_mix_v1                 711.3M  stage0/stage0b
HRM cleaned fast-cap stage1/stage1b    14.55B  HRM-style instruction pretraining
HRM cleaned full/no-cap stage2         14.55B  completed continuation
HRM cleaned full/no-cap extra stage2b  14.55B  active continuation
Local terminal conversations           9.39B   terminal/code/tool-heavy continuation
Korean tool/legal/wiki/finance mix     3.02B   Korean domain/tool continuation
BCAI Finance Korean                    857.7M  Korean finance/domain data
Korean legal/admin task data           629.0M  Korean legal/admin data
Korean Wikipedia                       462.5M  Korean general text
ToolBench train tool-call data         127.0M  tool-call pretraining
SWE-ZERO + GLM reasoning subsets       251.2M  code/reasoning data

Evaluation-like data is excluded from training as far as it can be identified. Examples are ToolBench eval, Terminal Bench family evaluation data, and the benchmark-style chi-bench.

Training Run

The current run uses staged continuation.

stage0
-> stage0b
-> stage1
-> stage2
-> stage3
-> stage4
-> stage1b
-> stage2b
-> stage3b
-> stage4b
-> stage1c
-> stage2c
-> stage3c
-> stage4c

The checkpoint carries over model weights, optimizer state, EMA weights, and recurrent carry state. resume_step_offset and total_steps_override are used so that the learning-rate schedule does not reset at each stage and instead continues like one long pretraining run.

As of 2026-05-27, stage2b is in progress. The continuation watcher is scheduled to run stage3b -> stage4b -> stage1c -> stage2c -> stage3c -> stage4c afterwards. The handoff reads the actual epoch_1_info.json global_step of each stage before starting the next stage.

Intended Use

This checkpoint is suited to the following purposes.

  • continued pretraining experiments
  • Korean tokenizer and HRM-Text architecture experiments
  • terminal/tool-call/code pretraining research
  • checkpoint conversion and evaluation work

It is not yet a finished assistant model.

Limitations

  • This is an intermediate checkpoint, not a final aligned instruct model.
  • The full planned continuation has not finished yet.
  • Final SFT and safety tuning have not finished yet.
  • Public benchmark scores for the new checkpoint are not final yet.
  • Plain Transformers generation requires adding a custom hrm_text modeling wrapper or remote-code files.
  • Tool-call JSON validity and terminal action safety need separate evaluation before real use.

Citation

This work is based on the HRM-Text architecture and training stack.
