KoHRM-Text-1.4B FullSFT Top2 Terminal Tool Merge Epoch1

This is an experimental full-SFT checkpoint for the KoHRM-Text 1.4B PrefixLM runtime.

Base Model

Base model: LLM-OS-Models/KoHRM-Text-1.4B
Relation: full fine-tune (base_model_relation: finetune)
This repository contains full fine-tuned KoHRM-Text weights exported as model.safetensors.

Training resumed from the stage4d KoHRM checkpoint on a merged terminal/tool dataset built from the current top LFM2.5 terminal SFT runs. The goal is to move KoHRM from generic PrefixLM generation toward TB2-lite terminal next-action JSON outputs.

Training

Base model: LLM-OS-Models/KoHRM-Text-1.4B
Dataset: kohrm_sft_top2_terminal_tool_raw8192_v1
Context length: 8192
Approximate training tokens: 245M
Training type: full SFT, not LoRA
Epochs: 1
GPUs: 4 x H200
Global batch size: 90112 tokens
Learning rate: 2e-5
Export format: single-file model.safetensors plus tokenizer/config files
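
As a rough cross-check of the figures above, the sketch below derives the implied number of optimizer steps and sequences per batch. It assumes each step consumes exactly one full global batch and that batches are packed to the full context length; neither assumption is confirmed by this card, and the token count is itself approximate.

```python
import math

# Figures copied from the training summary above (the token count is approximate).
TRAINING_TOKENS = 245_000_000
GLOBAL_BATCH_TOKENS = 90_112
CONTEXT_LENGTH = 8_192

# Assumption: every optimizer step consumes exactly one full global batch.
optimizer_steps = math.ceil(TRAINING_TOKENS / GLOBAL_BATCH_TOKENS)

# Assumption: each batch is packed into full-length 8192-token sequences.
sequences_per_batch = GLOBAL_BATCH_TOKENS // CONTEXT_LENGTH

print(optimizer_steps)      # 2719
print(sequences_per_batch)  # 11
```

Under these assumptions the single epoch corresponds to roughly 2.7k optimizer steps, which is unrelated to the 303 replay steps reported in the evaluation.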

Evaluation

TB2-lite full replay evaluation completed on 2026-06-05 KST.

Checkpoint	Steps	Score	Cmd F1	Precision	Recall	First Cmd	Valid JSON
`303/303 full replay`	303	31.59	0.3159	0.3859	0.3415	24.8%	73.3%

The final score of 31.59 is 2.48 points above the best completed KoHRM LoRA result (29.11) and 0.15 points below LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT (31.74). The main gains over the LoRA run are higher command recall (0.3415 vs 0.2768) and more stable JSON output (73.3% vs 63.4% valid), obtained by updating the base weights directly with full SFT. The remaining gap to the stronger Qwen/LFM terminal SFT models is mostly in command coverage and first-action accuracy.
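
To make the metric columns concrete, here is a minimal sketch of how per-step command precision, recall, F1, and a valid-JSON rate could be computed. It is not the TB2-lite scorer: the `{"commands": [...]}` output schema, the set-based matching, the function names, and the choice to score invalid JSON as zero are all assumptions made for illustration.

```python
import json

def parse_commands(raw_output):
    """Return the command list from a model output, or None if it is unusable.

    The {"commands": [...]} schema is hypothetical, not the TB2-lite format.
    """
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not isinstance(obj.get("commands"), list):
        return None
    return [str(c) for c in obj["commands"]]

def step_prf(predicted, reference):
    """Set-based precision, recall and F1 for a single replay step."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def evaluate(outputs, references):
    """Average the per-step metrics; steps with invalid JSON score zero."""
    if not outputs or len(outputs) != len(references):
        raise ValueError("need one reference per output, and at least one step")
    rows, valid = [], 0
    for raw, ref in zip(outputs, references):
        cmds = parse_commands(raw)
        if cmds is None:
            rows.append((0.0, 0.0, 0.0))
            continue
        valid += 1
        rows.append(step_prf(cmds, ref))
    n = len(rows)
    precision, recall, f1 = (sum(col) / n for col in zip(*rows))
    return {"precision": precision, "recall": recall, "f1": f1,
            "valid_json": valid / n}

# Toy example with three steps; the second output is not valid JSON.
outputs = ['{"commands": ["ls", "cat a.txt"]}', "not json", '{"commands": ["pwd"]}']
references = [["ls"], ["make"], ["pwd"]]
metrics = evaluate(outputs, references)
print({k: round(v, 4) for k, v in metrics.items()})
# {'precision': 0.5, 'recall': 0.6667, 'f1': 0.5556, 'valid_json': 0.6667}
```

In the toy example the averaged F1 (0.5556) is lower than the harmonic mean of the averaged precision and recall (about 0.571). The reported row shows the same pattern: the harmonic mean of 0.3859 and 0.3415 is about 0.362, while the reported Cmd F1 is 0.3159. This would be consistent with per-step averaging, though the actual scorer is not shown here.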

For comparison, the KoHRM stage4d results before this full SFT, alongside the Ouro reference model, were:

Model	Score	Cmd F1	Precision	Recall	First Cmd	Valid JSON
`KoHRM-Text-1.4B-stage4d direct`	11.48	0.1148	0.1995	0.0961	5.9%	38.9%
`stage4d + terminal-tool-core-r64 LoRA`	29.11	0.2911	0.3988	0.2768	22.1%	63.4%
`LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT`	31.74	0.3174	0.4062	0.3410	24.8%	63.7%
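
The differences between this checkpoint and the rows above can be recomputed directly from the reported numbers. The snippet below is only arithmetic on values copied from the two tables; the short row labels are made up for readability.

```python
# (score, recall, valid JSON %) copied from the evaluation tables.
full_sft = (31.59, 0.3415, 73.3)
baselines = {
    "stage4d direct": (11.48, 0.0961, 38.9),
    "stage4d + LoRA r64": (29.11, 0.2768, 63.4),
    "Ouro-1.4B Terminal-SFT": (31.74, 0.3410, 63.7),
}

# Difference of the full-SFT checkpoint relative to each baseline row.
deltas = {
    name: tuple(ours - theirs for ours, theirs in zip(full_sft, row))
    for name, row in baselines.items()
}

for name, (d_score, d_recall, d_json) in deltas.items():
    print(f"{name}: score {d_score:+.2f}, recall {d_recall:+.4f}, "
          f"valid JSON {d_json:+.1f} pts")
```

On these numbers the full-SFT checkpoint trails the Ouro model by 0.15 score points with almost identical recall, while its valid-JSON rate is 9.6 points higher.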

This full-SFT checkpoint is the best completed KoHRM full-weight result in this repository at upload time. It should still be treated as experimental, because the local HRM-Text PrefixLM runtime is slower than the vLLM chat-model path used by most leaderboard entries.

Usage Notes

This is not a standard Hugging Face AutoModelForCausalLM export yet. It is a KoHRM/HRM-Text PrefixLM checkpoint and currently requires the local HRM-Text runtime.
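
Even without the HRM-Text runtime, the exported model.safetensors can be inspected, because the safetensors format begins with an 8-byte little-endian header length followed by a JSON header describing every tensor. The sketch below uses only the Python standard library; the helper names are made up for illustration, and the tensor names in this checkpoint are not documented here, so it makes no assumptions about them.

```python
import json
import math
import struct

def read_safetensors_header(path):
    """Return the per-tensor JSON header of a .safetensors file.

    Only the header is read, so no weights are loaded into memory.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header

def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    return sum(math.prod(info["shape"]) for info in header.values())

def dtypes(header):
    """Return the set of dtype strings used in the file, e.g. {'BF16'}."""
    return {info["dtype"] for info in header.values()}
```

A typical use would be `header = read_safetensors_header("/path/to/model.safetensors")` followed by `count_parameters(header)` and `dtypes(header)` to confirm the parameter count and tensor type before wiring the checkpoint into the runtime.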

The local evaluation path used for this export is:

python tb2_lite/scripts/replay_eval_hrm_text.py \
  --model /path/to/KoHRM-Text-1.4B-fullsft-top2-terminal-tool-merge-epoch1 \
  --model-short KoHRM-Text-1.4B-fullsft-top2-terminal-tool-merge-epoch1 \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir tb2_lite/results/kohrm_fullsft_top2_export \
  --local-hrm-export \
  --base-ckpt-path /path/to/KoHRM-Text-1.4B-fullsft-top2-terminal-tool-merge-gbs180k-4gpu \
  --max-model-len 4096 \
  --max-tokens 1024 \
  --batch-size 16

For now, use the repository README benchmark table as the source of truth for completed scores.
