---
license: apache-2.0
base_model: LLM-OS-Models/KoHRM-Text-1.4B
base_model_relation: quantized
library_name: pytorch
tags:
- kohrm
- hrm-text
- cpu
- int8
- int4
- korean
- terminal
---
# KoHRM-Text-1.4B CPU Runtime
This repository contains a CPU-oriented inference runtime for
`LLM-OS-Models/KoHRM-Text-1.4B`.
It does not duplicate the original model weights. The runtime downloads the
base model from Hugging Face and applies CPU quantization at load time.
# KoHRM-Text CPU Runtime Pack
Written: `2026-06-09`
## Conclusion
`LLM-OS-Models/KoHRM-Text-1.4B` cannot currently be converted directly to GGUF.
The reason is that the model does not belong to the standard Llama/Qwen/Gemma families but uses the custom architecture below.
```text
model_type: hrm_text
architectures: HrmTextForCausalLM
H_cycles: 2
L_cycles: 3
prefix_lm: true
```
Running the llama.cpp converter directly fails at this point.
```text
ERROR:hf-to-gguf:Model HrmTextForCausalLM is not supported
```
For now, the practical CPU path is therefore a dedicated PyTorch runtime rather than GGUF.
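As a rough illustration of why the converter stops, the check below compares a config's `architectures` entry against a small allow-list. This is only a sketch: `KNOWN_GGUF_ARCHITECTURES` and `needs_custom_runtime` are hypothetical names, the allow-list is a tiny illustrative subset rather than llama.cpp's real table, and the config is assumed to store `architectures` as a list.

```python
import json

# Illustrative subset only (assumption): a few architecture names that the
# llama.cpp converter handles. The real list is longer and changes over time.
KNOWN_GGUF_ARCHITECTURES = {
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "GemmaForCausalLM",
}


def needs_custom_runtime(config_text: str) -> bool:
    """Return True when none of the config's architectures is in the allow-list."""
    config = json.loads(config_text)
    architectures = config.get("architectures", [])
    return not any(arch in KNOWN_GGUF_ARCHITECTURES for arch in architectures)


# Field values taken from the KoHRM config shown earlier in this document.
kohrm_config = json.dumps({
    "model_type": "hrm_text",
    "architectures": ["HrmTextForCausalLM"],
    "H_cycles": 2,
    "L_cycles": 3,
    "prefix_lm": True,
})

print(needs_custom_runtime(kohrm_config))
```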
## Added files
```text
HRM-Text/inference/kohrm_cpu_runtime.py
HRM-Text/inference/requirements-cpu.txt
HRM-Text/scripts/upload_kohrm_cpu_runtime_pack.py
```
This runtime reuses the direct safetensors loading path from the existing `HRM-Text/notebooks/kohrm_colab_generate.py` and adds CPU quantization and an H/L cycle override.
## Usage
The recommended default is `dynamic-int8`.
```bash
cd /home/work/.projects/LLM-OS-Models/Terminal
CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
python HRM-Text/inference/kohrm_cpu_runtime.py \
--model LLM-OS-Models/KoHRM-Text-1.4B \
--quant dynamic-int8 \
--prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
--max-new-tokens 128 \
--max-seq-len 768 \
--temperature 0
```
In a 16GB CPU RAM environment, try the modes in this order.
```text
1st choice: dynamic-int8
2nd choice: none
3rd choice: weight-int4
```
`dynamic-int8` uses PyTorch CPU dynamic quantization. It generally gives the best balance of memory and speed.
`weight-int4` is a custom, portable 4-bit weight-only fallback. It reduces memory, but every forward pass has to unpack and dequantize the weights, so it is very slow. Use it only when the model absolutely must run in a small memory footprint.
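To make the cost of the 4-bit fallback concrete, here is a minimal pure-Python sketch of weight-only int4 storage. It is not the code in `kohrm_cpu_runtime.py`. The function names, the symmetric per-row scale, and the two-nibbles-per-byte layout are all assumptions chosen for illustration.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: returns (packed bytes, scale, count)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0  # map values into the signed range [-7, 7]
    # Shift to unsigned nibbles 1..15 so that each value fits in 4 bits.
    nibbles = [max(-7, min(7, round(w / scale))) + 8 for w in weights]
    if len(nibbles) % 2:
        nibbles.append(8)  # pad with an encoded zero
    packed = bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))
    return packed, scale, len(weights)


def dequantize_int4(packed, scale, count):
    """Unpack two nibbles per byte and rescale them to floats."""
    values = []
    for byte in packed:
        values.append(((byte >> 4) - 8) * scale)
        values.append(((byte & 0x0F) - 8) * scale)
    return values[:count]


def linear_int4(x, packed_rows, scales, in_features):
    """Toy linear layer: every call dequantizes each weight row first."""
    out = []
    for packed, scale in zip(packed_rows, scales):
        row = dequantize_int4(packed, scale, in_features)  # repeated per forward
        out.append(sum(a * b for a, b in zip(x, row)))
    return out


weights = [0.7, -0.3, 0.1, 0.0, -0.7]
packed, scale, count = quantize_int4(weights)
restored = dequantize_int4(packed, scale, count)
print(len(packed), "bytes for", count, "weights")
print(linear_int4([1.0] * count, [packed], [scale], count))
```

Five weights fit in three bytes, but `linear_int4` has to rebuild the float row on every call. That is the memory-for-speed trade described above.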
## H/L cycle override
KoHRM applies the same H and L modules repeatedly. The defaults are `H=2` and `L=3`.
On CPU, you can reduce the number of cycles to gain speed, as below.
```bash
CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
python HRM-Text/inference/kohrm_cpu_runtime.py \
--model LLM-OS-Models/KoHRM-Text-1.4B \
--quant dynamic-int8 \
--h-cycles 1 \
--l-cycles 1 \
--prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
--max-new-tokens 128 \
--max-seq-len 768 \
--temperature 0
```
The trade-off is straightforward.
- `H=2,L=3`: original quality path.
- `H=1,L=1`: CPU speed-first path.
- Reducing the cycles may lower output quality.
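The effect of the override on compute can be sketched by counting module applications. This document does not spell out the exact nesting, so the loop below assumes the common HRM pattern: the L module runs `L_cycles` times inside each H cycle, and the H module then updates once. `count_module_passes` is a hypothetical helper, not part of the runtime.

```python
def count_module_passes(h_cycles: int, l_cycles: int) -> dict:
    """Count module applications for one forward pass under the assumed nesting."""
    h_passes = 0
    l_passes = 0
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            l_passes += 1  # L module refines its state
        h_passes += 1      # H module updates once per outer cycle
    return {"H": h_passes, "L": l_passes, "total": h_passes + l_passes}


default = count_module_passes(2, 3)
fast = count_module_passes(1, 1)
print(default, fast, default["total"] / fast["total"])
```

Under that assumption the ratio is only an upper bound on the compute saved in the recurrent modules. Measured speed also depends on embedding, output projection, and other fixed overheads.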
## Smoke test results
All runs use the same short prompt, `max_new_tokens=4`, `max_seq_len=128`, and `OMP_NUM_THREADS=8`.
```text
none:
  elapsed: 1.48s
  speed: 2.69 tok/s
  cycles: H=2, L=3
dynamic-int8:
  elapsed: 0.53s
  speed: 7.59 tok/s
  cycles: H=2, L=3
dynamic-int8 + H=1,L=1:
  elapsed: 0.24s
  speed: 8.18 tok/s
  cycles: H=1, L=1
weight-int4:
  elapsed: 23.25s
  speed: 0.17 tok/s
  cycles: H=2, L=3
```
Because this is a short smoke test, treat the absolute numbers as a rough reference. The direction is clear, though.
```text
Everyday use: dynamic-int8
Forced memory saving: weight-int4
Quality preservation: H=2,L=3
Speed first: H=1,L=1
```
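For a quick comparison, the reported tok/s figures can be turned into ratios against the unquantized run. The snippet only rearranges the smoke test numbers above and adds no new measurements. The variable names are illustrative.

```python
# tok/s values copied from the smoke test results above.
reported_tok_per_s = {
    "none": 2.69,
    "dynamic-int8": 7.59,
    "dynamic-int8 + H=1,L=1": 8.18,
    "weight-int4": 0.17,
}

baseline = reported_tok_per_s["none"]
speedups = {name: speed / baseline for name, speed in reported_tok_per_s.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x vs none")
```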
## Why GGUF is hard
A GGUF file is not just a container for weights. llama.cpp also has to know the forward pass of the architecture.
KoHRM is not a model that stacks standard Transformer blocks and runs each one once.
- It has an H module and an L module.
- They are applied recurrently, `H_cycles` and `L_cycles` times.
- PrefixLM formatting and stop-token handling differ.
- The KV cache layout also differs from a typical chat causal LM.
Producing a proper GGUF therefore requires the following work.
```text
1. Add HRM_TEXT to the llama.cpp MODEL_ARCH
2. Implement the H/L recurrent forward
3. Implement gqkv gated attention
4. Handle the PrefixLM prompt/token boundary
5. Register the tokenizer pre-tokenizer hash
6. Write the quantized tensor name mapping
7. Run a llama-cli generation smoke test
```
This is not something a simple converter patch can solve.
## HF CPU pack
The CPU runtime pack is uploaded to HF separately, without re-uploading the weights.
Target repo:
```text
LLM-OS-Models/KoHRM-Text-1.4B-CPU-Runtime
```
This repo contains only the following.
```text
README.md
inference/kohrm_cpu_runtime.py
inference/requirements-cpu.txt
notebooks/kohrm_colab_generate.py
```
The weights are downloaded from the original repo at run time.
```text
LLM-OS-Models/KoHRM-Text-1.4B
```
On shared machines, the HF token is read from `.env` and is never printed.
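One way to follow that rule is sketched below, using only the standard library. `load_env_value` and `describe_token` are hypothetical helpers rather than functions from the upload script, and the `.env` file is assumed to contain plain `KEY=VALUE` lines with the token stored under `HF_TOKEN`.

```python
import os
import tempfile


def load_env_value(path: str, key: str):
    """Return the value for `key` from a KEY=VALUE file, or None if absent."""
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip().strip('"').strip("'")
    return None


def describe_token(token) -> str:
    """Log-safe description that never includes the token itself."""
    return "HF token: set" if token else "HF token: missing"


# Demo with a throwaway file and a fake value.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# local secrets\nHF_TOKEN=hf_fake_value\n")
    token = load_env_value(env_path, "HF_TOKEN")
    print(describe_token(token))
```

Logging only whether the token is set keeps the secret out of shell history and shared terminal output.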