license: apache-2.0
base_model: LLM-OS-Models/KoHRM-Text-1.4B
base_model_relation: quantized
library_name: pytorch
tags:
  - kohrm
  - hrm-text
  - cpu
  - int8
  - int4
  - korean
  - terminal

KoHRM-Text-1.4B CPU Runtime

This repository contains a CPU-oriented inference runtime for LLM-OS-Models/KoHRM-Text-1.4B.

It does not duplicate the original model weights. The runtime downloads the base model from Hugging Face and applies CPU quantization at load time.

KoHRM-Text CPU Runtime Pack

Date: 2026-06-09

Conclusion

LLM-OS-Models/KoHRM-Text-1.4B cannot currently be converted directly to GGUF.

The reason is that the model is not a standard Llama/Qwen/Gemma-family architecture; it uses the custom structure below.

model_type: hrm_text
architectures: HrmTextForCausalLM
H_cycles: 2
L_cycles: 3
prefix_lm: true

Running the llama.cpp converter on it directly fails at the following point.

ERROR:hf-to-gguf:Model HrmTextForCausalLM is not supported

So the practical CPU path for now is a dedicated PyTorch runtime, not GGUF.

Added files

HRM-Text/inference/kohrm_cpu_runtime.py
HRM-Text/inference/requirements-cpu.txt
HRM-Text/scripts/upload_kohrm_cpu_runtime_pack.py

This runtime reuses the direct safetensors loading path from the existing HRM-Text/notebooks/kohrm_colab_generate.py and adds CPU quantization and an H/L cycle override.

Usage

The recommended default is dynamic-int8.

cd /home/work/.projects/LLM-OS-Models/Terminal

CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
python HRM-Text/inference/kohrm_cpu_runtime.py \
  --model LLM-OS-Models/KoHRM-Text-1.4B \
  --quant dynamic-int8 \
  --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
  --max-new-tokens 128 \
  --max-seq-len 768 \
  --temperature 0

On a machine with 16GB of CPU RAM, try the modes in the following order.

1st choice: dynamic-int8
2nd choice: none
3rd choice: weight-int4

dynamic-int8 uses PyTorch CPU dynamic quantization. It generally gives the best balance between memory use and speed.
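
To make the idea concrete, here is a minimal pure-Python sketch of the arithmetic behind dynamic int8 quantization of one linear layer: weights are quantized once at load time, activations are quantized on every call, the dot product runs on integers, and the result is rescaled to float. The helper names (`quantize_symmetric`, `dynamic_int8_linear`) are illustrative and are not taken from kohrm_cpu_runtime.py. Using a symmetric scale for both weights and activations is a simplifying assumption; PyTorch's real kernels are vectorized and differ in detail.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers that share one scale (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dynamic_int8_linear(x, q_weight_rows, w_scale):
    """y = W @ x with pre-quantized weights and activations quantized per call."""
    q_x, x_scale = quantize_symmetric(x)            # "dynamic": done at run time
    out = []
    for q_row in q_weight_rows:
        acc = sum(qw * qx for qw, qx in zip(q_row, q_x))   # integer accumulate
        out.append(acc * w_scale * x_scale)                # rescale to float
    return out


# Weights are quantized once, when the model is loaded.
weight = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
flat = [w for row in weight for w in row]
q_flat, w_scale = quantize_symmetric(flat)
q_rows = [q_flat[i * 3:(i + 1) * 3] for i in range(2)]

x = [1.0, 2.0, -1.0]
y_int8 = dynamic_int8_linear(x, q_rows, w_scale)
y_fp32 = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
```

Comparing `y_int8` with `y_fp32` shows the small rounding error that is traded for int8 storage and integer arithmetic.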

weight-int4 is a hand-written, portable 4-bit weight-only fallback. It reduces memory, but every forward pass has to unpack and dequantize the weights, so it is very slow. Use it only when the model absolutely has to run in a small amount of memory.
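
The per-forward cost comes from the storage layout. Below is a minimal sketch, assuming a symmetric per-tensor scale with two signed 4-bit codes packed into each byte. The actual layout used by kohrm_cpu_runtime.py is not shown in this README, and `pack_int4` / `unpack_int4` are hypothetical names.

```python
def pack_int4(values):
    """Quantize floats to signed 4-bit codes in [-8, 7] and pack two per byte."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    if len(codes) % 2:                      # pad to an even number of codes
        codes.append(0)
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))   # two's-complement nibbles
    return bytes(packed), scale, len(values)


def unpack_int4(packed, scale, count):
    """Inverse of pack_int4; a weight-only runtime repeats this on every forward."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble   # sign-extend
            out.append(signed * scale)
    return out[:count]


weights = [0.7, -0.3, 0.12, -0.7, 0.0]
packed, scale, count = pack_int4(weights)
restored = unpack_int4(packed, scale, count)
```

Five floats end up in three bytes, and each restored value is within half a quantization step of the original. The unpack loop is the work that a weight-only int4 path has to redo before each matrix multiply.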

H/L cycle override

KoHRM applies the same H/L modules repeatedly. The default is H=2, L=3.

On CPU, you can reduce the number of cycles as shown below to increase speed.

CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
python HRM-Text/inference/kohrm_cpu_runtime.py \
  --model LLM-OS-Models/KoHRM-Text-1.4B \
  --quant dynamic-int8 \
  --h-cycles 1 \
  --l-cycles 1 \
  --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
  --max-new-tokens 128 \
  --max-seq-len 768 \
  --temperature 0

The trade-offs are clear.

  • H=2,L=3: the original quality path.
  • H=1,L=1: the CPU speed-first path.
  • Reducing the cycles can lower output quality.

Smoke test results
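
As a rough illustration of why fewer cycles means less compute, the sketch below counts module applications under one assumed schedule: each H cycle runs the L module `l_cycles` times and then the H module once. This nesting is an assumption based on the general hierarchical-recurrence pattern, not something this README confirms for KoHRM, and `count_module_calls` is a hypothetical helper.

```python
def count_module_calls(h_cycles, l_cycles):
    """Count H/L module applications for an assumed nested schedule."""
    calls = {"H": 0, "L": 0}
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            calls["L"] += 1      # low-level module refines its state
        calls["H"] += 1          # high-level module updates once per H cycle
    return calls


default = count_module_calls(2, 3)   # {'H': 2, 'L': 6}
fast = count_module_calls(1, 1)      # {'H': 1, 'L': 1}
```

Under this assumption the default setting performs eight module passes per forward and the H=1,L=1 setting performs two. The model was trained with the default schedule, so skipping passes is where the quality risk comes from.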

All runs use the same short prompt with max_new_tokens=4, max_seq_len=128, and OMP_NUM_THREADS=8.

none:
  elapsed: 1.48s
  speed:   2.69 tok/s
  cycles:  H=2, L=3

dynamic-int8:
  elapsed: 0.53s
  speed:   7.59 tok/s
  cycles:  H=2, L=3

dynamic-int8 + H=1,L=1:
  elapsed: 0.24s
  speed:   8.18 tok/s
  cycles:  H=1, L=1

weight-int4:
  elapsed: 23.25s
  speed:   0.17 tok/s
  cycles:  H=2, L=3

This is a short smoke test, so the absolute numbers are only indicative. The direction, however, is clear.

Practical use: dynamic-int8
Forced memory savings: weight-int4
Quality preservation: H=2,L=3
Speed first: H=1,L=1

Why GGUF is hard

A GGUF file is not just a container for weights. llama.cpp also has to know the forward pass of the architecture.

KoHRM is not a model that stacks ordinary Transformer blocks and runs each one once.

  • It has an H module and an L module.
  • They are applied recurrently, H_cycles and L_cycles times.
  • PrefixLM formatting and stop-token handling differ.
  • The KV cache structure also differs from that of a typical chat causal LM.
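
To illustrate the PrefixLM point, here is a generic sketch of a prefix-LM attention mask: prompt (prefix) positions attend to each other in both directions, while generated positions stay causal. This is the textbook pattern, shown only to indicate why a runtime built around a plain causal mask does not fit as-is. KoHRM's exact masking is not described in this README, and `prefix_lm_mask` is a hypothetical helper.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = i < prefix_len and j < prefix_len   # bidirectional block
            causal = j <= i                                  # standard causal rule
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


mask = prefix_lm_mask(seq_len=5, prefix_len=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a prefix of three tokens, the first three rows can see all three prefix positions, whereas a purely causal mask would let row 0 see only itself.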

So producing a proper GGUF requires the following work.

1. Add HRM_TEXT to llama.cpp's MODEL_ARCH
2. Implement the H/L recurrent forward
3. Implement gqkv gated attention
4. Handle the PrefixLM prompt/token boundary
5. Register the tokenizer pre-tokenizer hash
6. Write the quantized tensor name mapping
7. Run a llama-cli generation smoke test

This cannot be solved with a simple converter patch.

HF CPU pack

On HF, the CPU runtime pack is uploaded separately, without duplicating the weights.

Target repo:

LLM-OS-Models/KoHRM-Text-1.4B-CPU-Runtime

This repo contains only the following.

README.md
inference/kohrm_cpu_runtime.py
inference/requirements-cpu.txt
notebooks/kohrm_colab_generate.py

The weights are downloaded from the original repo at run time.

LLM-OS-Models/KoHRM-Text-1.4B

Because this targets a shared machine, the HF token is read from .env and is never printed.
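
A minimal sketch of that rule, assuming the token is stored under the key `HF_TOKEN` in a simple KEY=VALUE .env file. The key name, the parser, and the helper names are assumptions; the upload script's actual handling is not shown in this README.

```python
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value for `key` from a KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return value.strip().strip("'\"")   # drop optional surrounding quotes
    return None


def describe_token(token):
    """Log-safe summary: reports presence and length, never the token itself."""
    return "HF token: missing" if not token else f"HF token: set ({len(token)} chars)"
```

A script would call `read_env_value(".env", "HF_TOKEN")`, pass the result to the upload client, and log only `describe_token(...)`.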