KoHRM-Text-1.4B FullSFT LFM25 Terminal ToolBench Epoch2

This is an experimental second-epoch full-SFT checkpoint for the KoHRM-Text 1.4B PrefixLM runtime.

It is a fine-tuned version of LLM-OS-Models/KoHRM-Text-1.4B. It continues from the Epoch1 LFM25/ToolBench full-SFT checkpoint and is intended to test whether an additional pass improves TB2-lite terminal next-action behavior.

Base Model

  • Base model: LLM-OS-Models/KoHRM-Text-1.4B
  • Relation: full fine-tune (base_model_relation: finetune)
  • Parent checkpoint: LLM-OS-Models/KoHRM-Text-1.4B-FullSFT-LFM25-Terminal-ToolBench-Epoch1
  • Export format: single-file model.safetensors plus tokenizer/config files
  • Runtime: local KoHRM/HRM-Text PrefixLM runtime
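
Because the export is a single model.safetensors file, its tensor inventory can be inspected without the KoHRM runtime. The sketch below relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names `read_safetensors_header` and `summarize` are illustrative and are not part of the KoHRM tooling.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Count tensors and parameters, skipping the optional __metadata__ entry."""
    n_tensors = 0
    n_params = 0
    for name, info in header.items():
        if name == "__metadata__":
            continue
        n_tensors += 1
        count = 1
        for dim in info["shape"]:
            count *= dim
        n_params += count
    return n_tensors, n_params
```

Calling `summarize(read_safetensors_header("/path/to/model.safetensors"))` on the exported file should give a quick check of tensor count and total parameter count before loading anything onto a GPU.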

Training

  • Dataset: kohrm_sft_lfm25_terminal_toolbench_full_v1
  • Source style: LFM2.5 terminal/tool successful data plus ToolBench terminal turns, reprocessed into KoHRM PrefixLM targets
  • Context length: 8192
  • Approximate training tokens per full pass: 1.51B
  • Training type: full SFT, not LoRA
  • Epochs: 2 total on this SFT dataset (epoch2 continues from epoch1)
  • Epoch2 GPUs: 8 x H200
  • Epoch2 global batch size: 180224 tokens
  • Learning rate: 2e-5
  • Epoch2 checkpoint: /home/work/.data/hrm_text_checkpoints/KoHRM-Text-1.4B-fullsft-lfm25-terminal-toolbench-epoch2-from-epoch1-gbs180k-8gpu/fsdp2_epoch_1
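
The figures above imply a rough optimizer-step count per pass. The following is a back-of-the-envelope sketch, assuming the global batch size is counted in tokens and that each optimizer step consumes exactly one global batch. The actual trainer may pack or pad sequences differently.

```python
import math

TOKENS_PER_PASS = 1.51e9        # approximate, from the list above
GLOBAL_BATCH_TOKENS = 180_224   # from the list above
CONTEXT_LENGTH = 8192
NUM_GPUS = 8

# Optimizer steps needed to see every token once (approximate).
steps_per_pass = math.ceil(TOKENS_PER_PASS / GLOBAL_BATCH_TOKENS)

# Number of full-length contexts that fit in one global batch.
sequences_per_step = GLOBAL_BATCH_TOKENS / CONTEXT_LENGTH

# Token budget each GPU handles per optimizer step.
tokens_per_gpu_per_step = GLOBAL_BATCH_TOKENS / NUM_GPUS

print(steps_per_pass, sequences_per_step, tokens_per_gpu_per_step)
```

Under these assumptions one pass is roughly 8.4k optimizer steps, and the global batch equals 22 full 8192-token contexts (22528 tokens per GPU per step).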

Evaluation

TB2-lite full replay evaluation completed on 2026-06-06 KST.

Result JSON:

tb2_lite/results/20260606T_kohrm_lfm25_epoch2_eval_sdpa8_b16/KoHRM-Text-1.4B-fullsft-lfm25-terminal-toolbench-epoch2-sdpa8-b16-merged.json

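| Checkpoint         | Steps   | Score | Cmd F1 | Precision | Recall | First Cmd | Valid JSON | Avg Pred Cmds | Status    |
|--------------------|---------|-------|--------|-----------|--------|-----------|------------|---------------|-----------|
| epoch1 full replay | 303/303 | 38.56 | 0.3856 | 0.4262    | 0.4341 | 37.0%     | 55.1%      | 27.33         | completed |
| epoch2 full replay | 303/303 | 45.90 | 0.4590 | 0.5031    | 0.5098 | 44.9%     | 68.3%      | 25.16         | completed |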

Score = 100 * avg_command_f1.
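
As an illustration of the formula, here is a minimal sketch of a per-step command F1 averaged into a score. It assumes commands are compared as exact strings after whitespace stripping, using set overlap, and that a step with no predicted and no reference commands counts as a perfect match. The actual TB2-lite matching rule may differ, and the function names are hypothetical.

```python
def command_f1(predicted, reference):
    """Return (precision, recall, f1) for one step using set overlap."""
    pred = {c.strip() for c in predicted if c.strip()}
    ref = {c.strip() for c in reference if c.strip()}
    if not pred and not ref:
        # Assumption: nothing expected and nothing predicted is a match.
        return 1.0, 1.0, 1.0
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score(steps):
    """Score = 100 * mean per-step F1 over (predicted, reference) pairs."""
    f1_values = [command_f1(pred, ref)[2] for pred, ref in steps]
    return 100.0 * sum(f1_values) / len(f1_values)


# Two toy steps: one over-predicts, the other under-predicts.
toy_steps = [
    (["ls"], ["ls", "cat a", "cat b", "pwd"]),
    (["ls", "cat a", "cat b", "pwd"], ["ls"]),
]
print(score(toy_steps))
```

In the toy example both mean precision and mean recall are 0.625 while mean F1 is 0.4. Per-step averaging of this kind is one way a reported Cmd F1 can sit below both the averaged precision and the averaged recall.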

Result Analysis

Epoch2 is a large gain over Epoch1 (+7.34 points, about 19% relative):

  • Score: 38.56 -> 45.90 (+7.34)
  • Precision: 0.4262 -> 0.5031 (+0.0769)
  • Recall: 0.4341 -> 0.5098 (+0.0757)
  • First command exact: 37.0% -> 44.9% (+7.9 points)
  • Valid JSON: 55.1% -> 68.3% (+13.2 points)

The improvement is not just JSON repair. Command precision, command recall, first-action selection, and JSON validity all improved together. This suggests the second pass moved the terminal next-action distribution itself closer to the TB2-lite references, rather than only fixing output formatting.
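
The epoch-over-epoch changes can be recomputed from the evaluation table. This is a small sketch; the dictionary keys are illustrative names and are not fields of the result JSON.

```python
# Values copied from the evaluation table; key names are hypothetical.
EPOCH1 = {"score": 38.56, "precision": 0.4262, "recall": 0.4341,
          "first_cmd_pct": 37.0, "valid_json_pct": 55.1}
EPOCH2 = {"score": 45.90, "precision": 0.5031, "recall": 0.5098,
          "first_cmd_pct": 44.9, "valid_json_pct": 68.3}


def deltas(before, after):
    """Return {metric: (absolute_change, relative_change)}."""
    return {
        key: (after[key] - before[key], (after[key] - before[key]) / before[key])
        for key in before
    }


for metric, (absolute, relative) in deltas(EPOCH1, EPOCH2).items():
    print(f"{metric}: {absolute:+.4f} ({relative:+.1%})")
```

Every metric moves in the same direction, which is the basis for reading the gain as more than a formatting fix.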

Compared with other local runs (TB2-lite score):

  • KoHRM-Text-1.4B direct base: 11.48
  • Best KoHRM LoRA: 29.11
  • KoHRM Top2 full SFT Epoch1: 31.59
  • KoHRM LFM25 full SFT Epoch1: 38.56
  • KoHRM LFM25 full SFT Epoch2: 45.90
  • Qwen3.5-2B fast-continue fullconv: 44.79
  • LFM2.5-8B-A1B ToolBench SFT Epoch2: 50.48
  • LFM2.5-8B-A1B ToolBench SFT Epoch1: 52.30

Strong areas in the Epoch2 run:

  • data_querying: 0.6881 average command F1
  • data_science: 0.4901
  • debugging: 0.4857
  • math: 0.4845
  • software_engineering: 0.4770
  • file_operations: 0.4710

Remaining weak areas:

  • swe: 0.3590
  • data_processing: 0.4017
  • dependency_management: 0.4025
  • security: 0.4220
  • model_training: 0.4283

The main remaining gaps to LFM2.5 are first-action accuracy and late-step command coverage. Epoch2 bucket F1 is 0.5458 for early steps, 0.4533 for mid steps, and 0.3910 for late steps. The model is much better than Epoch1, but it is still weaker on late repair/verification steps than on early exploration steps.
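
One plausible way to produce early/mid/late bucket averages is to split each trajectory into thirds by step position. The sketch below assumes that rule and a hypothetical record layout of `(step_index, num_steps, f1)`. The bucket boundaries actually used by the TB2-lite evaluator are not stated here and may differ.

```python
from collections import defaultdict


def bucket_for(step_index, num_steps):
    """Map a 0-based step index to early/mid/late by thirds of the trajectory."""
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    if 3 * step_index < num_steps:
        return "early"
    if 3 * step_index < 2 * num_steps:
        return "mid"
    return "late"


def bucket_means(records):
    """Average per-step F1 within each bucket."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for step_index, num_steps, f1 in records:
        bucket = bucket_for(step_index, num_steps)
        sums[bucket] += f1
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}


# Toy trajectory of three steps with decreasing F1.
toy_records = [(0, 3, 0.6), (1, 3, 0.4), (2, 3, 0.2)]
print(bucket_means(toy_records))
```

Grouping by relative position rather than absolute step number keeps short and long trajectories comparable within the same bucket.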

Usage

Use the local HRM-Text PrefixLM evaluator/runtime. Example evaluation command:

python tb2_lite/scripts/replay_eval_hrm_text.py \
  --model /path/to/KoHRM-Text-1.4B-fullsft-lfm25-terminal-toolbench-epoch2 \
  --model-short KoHRM-Text-1.4B-fullsft-lfm25-terminal-toolbench-epoch2 \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir tb2_lite/results/kohrm_lfm25_epoch2_eval \
  --local-hrm-export \
  --base-ckpt-path /path/to/KoHRM-Text-1.4B-fullsft-lfm25-terminal-toolbench-epoch2-from-epoch1-gbs180k-8gpu \
  --max-model-len 8192 \
  --max-tokens 1024 \
  --condition synth,cot \
  --batch-size 16

This is not a standard Hugging Face AutoModelForCausalLM chat-model export yet. It currently requires the local KoHRM/HRM-Text PrefixLM runtime.

For final leaderboard context and completed scores, use the root project README.md as the source of truth.
