KoHRM-Text-1.4B FullSFT Top2 Terminal Tool Merge Epoch1

This is an experimental full-SFT checkpoint for the KoHRM-Text 1.4B PrefixLM runtime.

Base Model

  • Base model: LLM-OS-Models/KoHRM-Text-1.4B
  • Relation: full fine-tune (base_model_relation: finetune)
  • This repository contains the fully fine-tuned KoHRM-Text weights, exported as model.safetensors.

This checkpoint is a full fine-tune of LLM-OS-Models/KoHRM-Text-1.4B. Training resumed from the stage4d KoHRM checkpoint on a merged terminal/tool dataset built from the current top LFM2.5 terminal SFT runs. The goal is to move KoHRM from generic PrefixLM generation toward TB2-lite terminal next-action JSON outputs.

Training

  • Base model: LLM-OS-Models/KoHRM-Text-1.4B
  • Dataset: kohrm_sft_top2_terminal_tool_raw8192_v1
  • Context length: 8192
  • Approximate training tokens: 245M
  • Training type: full SFT, not LoRA
  • Epochs: 1
  • GPUs: 4 x H200
  • Global batch size: 90112 tokens
  • Learning rate: 2e-5
  • Export format: single-file model.safetensors plus tokenizer/config files
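As a rough sanity check on the figures above, the token budget and global batch size imply an approximate optimizer step count. This is a back-of-the-envelope sketch only: it assumes every step consumes exactly the stated global batch and that the 245M figure is exact, neither of which the card confirms. The variable names are illustrative.

```python
import math

# Values taken from the Training list above.
TRAIN_TOKENS = 245_000_000      # "approximate training tokens: 245M"
GLOBAL_BATCH_TOKENS = 90_112    # tokens per optimizer step
CONTEXT_LENGTH = 8_192
NUM_GPUS = 4

# How many full-length sequences fit in one global batch.
full_sequences_per_step = GLOBAL_BATCH_TOKENS / CONTEXT_LENGTH

# Token share per GPU per step, assuming an even split (an assumption).
tokens_per_gpu_per_step = GLOBAL_BATCH_TOKENS // NUM_GPUS

# Approximate number of optimizer steps for one epoch.
approx_optimizer_steps = math.ceil(TRAIN_TOKENS / GLOBAL_BATCH_TOKENS)

print(f"full-length sequences per step: {full_sequences_per_step:g}")
print(f"tokens per GPU per step:        {tokens_per_gpu_per_step}")
print(f"approx. optimizer steps:        {approx_optimizer_steps}")
```

The per-GPU share is not a whole multiple of the context length, which suggests token-based batching or packing rather than a fixed number of full sequences per GPU, but that is an inference, not something stated in the card.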

Evaluation

TB2-lite full replay evaluation completed on 2026-06-05 KST.

| Checkpoint | Steps | Score | Cmd F1 | Precision | Recall | First Cmd | Valid JSON |
|---|---|---|---|---|---|---|---|
| 303/303 full replay | 303 | 31.59 | 0.3159 | 0.3859 | 0.3415 | 24.8% | 73.3% |

The final score of 31.59 is 2.48 points above the best completed KoHRM LoRA result (29.11) and 0.15 points below LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT (31.74). The main gain over the best LoRA run is higher command recall (0.2768 to 0.3415) and more stable JSON output (63.4% to 73.3% valid), at a slight cost in precision (0.3988 to 0.3859), after updating the base weights directly with full SFT. The remaining gap to the stronger Qwen/LFM terminal SFT models is mostly command coverage and first-action accuracy.

Reference results from before this full SFT (the earlier KoHRM stage4d runs, plus the Ouro comparison model) were:

| Model | Score | Cmd F1 | Precision | Recall | First Cmd | Valid JSON |
|---|---|---|---|---|---|---|
| KoHRM-Text-1.4B-stage4d direct | 11.48 | 0.1148 | 0.1995 | 0.0961 | 5.9% | 38.9% |
| stage4d + terminal-tool-core-r64 LoRA | 29.11 | 0.2911 | 0.3988 | 0.2768 | 22.1% | 63.4% |
| LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT | 31.74 | 0.3174 | 0.4062 | 0.3410 | 24.8% | 63.7% |
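The comparisons quoted in this section can be reproduced directly from the table values. The sketch below only re-derives the deltas and checks that, in these rows, the Score column equals Cmd F1 times 100. The dictionary keys are shorthand labels invented for this example, not identifiers from the evaluation code.

```python
# Values copied from the evaluation tables in this section.
results = {
    "fullsft":      {"score": 31.59, "cmd_f1": 0.3159, "recall": 0.3415, "valid_json": 73.3},
    "stage4d":      {"score": 11.48, "cmd_f1": 0.1148, "recall": 0.0961, "valid_json": 38.9},
    "stage4d_lora": {"score": 29.11, "cmd_f1": 0.2911, "recall": 0.2768, "valid_json": 63.4},
    "ouro":         {"score": 31.74, "cmd_f1": 0.3174, "recall": 0.3410, "valid_json": 63.7},
}


def delta(metric, a, b):
    """Difference in `metric` between run `a` and run `b`, rounded for display."""
    return round(results[a][metric] - results[b][metric], 4)


# In every row shown, Score is Cmd F1 expressed on a 0-100 scale.
score_matches_f1 = all(
    abs(row["score"] - 100 * row["cmd_f1"]) < 1e-6 for row in results.values()
)

print("Score == 100 * Cmd F1:", score_matches_f1)
print("full SFT vs LoRA (score):", delta("score", "fullsft", "stage4d_lora"))
print("full SFT vs Ouro (score):", delta("score", "fullsft", "ouro"))
print("full SFT vs LoRA (recall):", delta("recall", "fullsft", "stage4d_lora"))
print("full SFT vs LoRA (valid JSON, points):", delta("valid_json", "fullsft", "stage4d_lora"))
```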

This full-SFT checkpoint is the best completed KoHRM full-weight result in this repository at upload time, but it should still be treated as experimental because the local HRM-Text PrefixLM runtime is slower than the vLLM chat-model path used by most leaderboard entries.

Usage Notes

This is not a standard Hugging Face AutoModelForCausalLM export yet. It is a KoHRM/HRM-Text PrefixLM checkpoint and currently requires the local HRM-Text runtime.
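Although loading the model requires the local runtime, the exported model.safetensors file can still be inspected without it. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes) using the Python standard library. The function names are illustrative and are not part of the HRM-Text runtime.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the tensor table (name -> dtype/shape/offsets) of a safetensors file.

    Only the header is read; no tensor data is loaded into memory.
    """
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # The optional "__metadata__" entry is free-form and not a tensor.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total
```

For example, `header = read_safetensors_header("/path/to/model.safetensors")` followed by `count_parameters(header)` gives a quick parameter count, and `{info["dtype"] for info in header.values()}` lists the stored tensor dtypes.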

The local evaluation path used for this export is:

python tb2_lite/scripts/replay_eval_hrm_text.py \
  --model /path/to/KoHRM-Text-1.4B-fullsft-top2-terminal-tool-merge-epoch1 \
  --model-short KoHRM-Text-1.4B-fullsft-top2-terminal-tool-merge-epoch1 \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir tb2_lite/results/kohrm_fullsft_top2_export \
  --local-hrm-export \
  --base-ckpt-path /path/to/KoHRM-Text-1.4B-fullsft-top2-terminal-tool-merge-gbs180k-4gpu \
  --max-model-len 4096 \
  --max-tokens 1024 \
  --batch-size 16

For now, use the repository README benchmark table as the source of truth for completed scores.
