---
license: apache-2.0
base_model: LLM-OS-Models/KoHRM-Text-1.4B
base_model_relation: quantized
library_name: pytorch
tags:
- kohrm
- hrm-text
- cpu
- int8
- int4
- korean
- terminal
---

# KoHRM-Text-1.4B CPU Runtime

This repository contains a CPU-oriented inference runtime for
`LLM-OS-Models/KoHRM-Text-1.4B`.

It does not duplicate the original model weights. The runtime downloads the
base model from Hugging Face and applies CPU quantization at load time.

# KoHRM-Text CPU Runtime Pack

Date written: `2026-06-09`

## Conclusion

`LLM-OS-Models/KoHRM-Text-1.4B` cannot currently be converted directly to GGUF.

The reason is that the model does not belong to the common Llama/Qwen/Gemma families; it uses its own custom architecture.

```text
model_type: hrm_text
architectures: HrmTextForCausalLM
H_cycles: 2
L_cycles: 3
prefix_lm: true
```

Running the llama.cpp converter on the model directly fails at the following point.

```text
ERROR:hf-to-gguf:Model HrmTextForCausalLM is not supported
```
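
The failure can be pictured as a lookup against a list of known architecture classes. The sketch below only illustrates that idea and is not llama.cpp's converter code: `SUPPORTED_ARCHITECTURES` and `find_unsupported` are hypothetical names, and the set holds a few Hugging Face class names picked as examples.

```python
# Hypothetical registry: a small illustrative subset, not llama.cpp's real list.
SUPPORTED_ARCHITECTURES = {"LlamaForCausalLM", "Qwen2ForCausalLM", "GemmaForCausalLM"}


def find_unsupported(config: dict) -> list:
    """Return the architectures named in `config` that the registry does not know."""
    return [a for a in config.get("architectures", []) if a not in SUPPORTED_ARCHITECTURES]


# Values copied from the config excerpt above.
kohrm_config = {
    "model_type": "hrm_text",
    "architectures": ["HrmTextForCausalLM"],
    "H_cycles": 2,
    "L_cycles": 3,
    "prefix_lm": True,
}

for name in find_unsupported(kohrm_config):
    print(f"Model {name} is not supported")
```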

For now, the realistic CPU path is therefore a dedicated PyTorch runtime rather than GGUF.

## Added files

```text
HRM-Text/inference/kohrm_cpu_runtime.py
HRM-Text/inference/requirements-cpu.txt
HRM-Text/scripts/upload_kohrm_cpu_runtime_pack.py
```

This runtime reuses the direct safetensors loading path from the existing `HRM-Text/notebooks/kohrm_colab_generate.py` and adds CPU quantization and an H/L cycle override.

## Usage

The recommended default is `dynamic-int8`.

```bash
cd /home/work/.projects/LLM-OS-Models/Terminal

# The prompt asks: "What is the command to list the files in the current directory on Linux?"
CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
python HRM-Text/inference/kohrm_cpu_runtime.py \
  --model LLM-OS-Models/KoHRM-Text-1.4B \
  --quant dynamic-int8 \
  --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
  --max-new-tokens 128 \
  --max-seq-len 768 \
  --temperature 0
```

On a machine with 16GB of CPU RAM, try the modes in the following order.

```text
1st choice: dynamic-int8
2nd choice: none
3rd choice: weight-int4
```

`dynamic-int8` uses PyTorch CPU dynamic quantization. It generally offers the best balance between memory and speed.

`weight-int4` is a hand-written, portable 4-bit weight-only fallback. It reduces memory, but every forward pass has to unpack and dequantize the weights, so it is very slow. Use it only when the model absolutely must run in a small memory footprint.
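
To see why `weight-int4` trades speed for memory, here is a minimal pure-Python sketch of 4-bit weight-only storage. It assumes symmetric quantization with a single scale per tensor and two codes packed per byte; the runtime's actual scheme (grouping, scales, packing order) is not described here and may differ, and `quantize_int4` and `dequantize_int4` are hypothetical names.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization (assumed scheme): floats -> packed nibbles + scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to integers in [-8, 7], then shift to unsigned codes in [0, 15].
    codes = [max(-8, min(7, round(w / scale))) + 8 for w in weights]
    if len(codes) % 2:
        codes.append(8)  # pad with the code that represents 0.0
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scale


def dequantize_int4(packed, scale, count):
    """Unpack two codes per byte and rescale them back to floats."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble
        codes.append(byte >> 4)    # high nibble
    return [(c - 8) * scale for c in codes[:count]]


weights = [0.7, -0.35, 0.1, 0.0, -0.7]
packed, scale = quantize_int4(weights)
restored = dequantize_int4(packed, scale, len(weights))
print(len(packed), "bytes for", len(weights), "weights")
print([round(w, 2) for w in restored])
```

Five weights fit in three bytes, but the weights are only usable after the unpack-and-rescale loop has run, which is the per-forward overhead described above.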

## H/L cycle override

KoHRM applies the same H/L modules repeatedly. The defaults are `H=2` and `L=3`.

On CPU, you can reduce the number of repetitions to gain speed, as shown below.

```bash
# The prompt asks: "What is the command to list the files in the current directory on Linux?"
CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
python HRM-Text/inference/kohrm_cpu_runtime.py \
  --model LLM-OS-Models/KoHRM-Text-1.4B \
  --quant dynamic-int8 \
  --h-cycles 1 \
  --l-cycles 1 \
  --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
  --max-new-tokens 128 \
  --max-seq-len 768 \
  --temperature 0
```

The trade-off is straightforward.

- `H=2,L=3`: the original-quality path.
- `H=1,L=1`: the CPU speed-first path.
- Reducing the cycles can lower output quality.
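
A rough way to see what the override changes is to count module applications per forward pass. The sketch assumes the usual hierarchical schedule, where the L module runs `L_cycles` times inside each of the `H_cycles` outer steps and the H module then updates once; the actual `HrmTextForCausalLM` forward pass may be organized differently, and `count_module_calls` is a hypothetical helper.

```python
def count_module_calls(h_cycles, l_cycles):
    """Count H and L module applications under the assumed nested schedule."""
    calls = {"L": 0, "H": 0}
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            calls["L"] += 1  # inner loop: the L module refines its state
        calls["H"] += 1      # outer loop: the H module updates once per cycle
    return calls


default_calls = count_module_calls(2, 3)
fast_calls = count_module_calls(1, 1)
print(default_calls, sum(default_calls.values()))  # {'L': 6, 'H': 2} 8
print(fast_calls, sum(fast_calls.values()))        # {'L': 1, 'H': 1} 2
```

Under this assumption the default setting applies the modules eight times per forward pass and the reduced setting twice, which is where both the speed gain and the quality risk come from.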

## Smoke test results

All runs use the same short prompt with `max_new_tokens=4`, `max_seq_len=128`, and `OMP_NUM_THREADS=8`.

```text
none:
  elapsed: 1.48s
  speed: 2.69 tok/s
  cycles: H=2, L=3

dynamic-int8:
  elapsed: 0.53s
  speed: 7.59 tok/s
  cycles: H=2, L=3

dynamic-int8 + H=1,L=1:
  elapsed: 0.24s
  speed: 8.18 tok/s
  cycles: H=1, L=1

weight-int4:
  elapsed: 23.25s
  speed: 0.17 tok/s
  cycles: H=2, L=3
```

Because this is a short smoke test, the absolute performance numbers are for reference only. The direction, however, is clear.

```text
Everyday use: dynamic-int8
Forced memory savings: weight-int4
Preserve quality: H=2,L=3
Speed first: H=1,L=1
```
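
As a sanity check on the figures above, multiplying elapsed time by reported speed gives the number of tokens each run implies. The snippet below only redoes that arithmetic on the published numbers; `implied_tokens` is a hypothetical helper.

```python
# (elapsed seconds, tokens per second) copied from the smoke test results.
results = {
    "none": (1.48, 2.69),
    "dynamic-int8": (0.53, 7.59),
    "dynamic-int8 + H=1,L=1": (0.24, 8.18),
    "weight-int4": (23.25, 0.17),
}


def implied_tokens(elapsed_s, tok_per_s):
    """Tokens generated, as implied by elapsed time and throughput."""
    return elapsed_s * tok_per_s


baseline_speed = results["none"][1]
for name, (elapsed, speed) in results.items():
    tokens = implied_tokens(elapsed, speed)
    ratio = speed / baseline_speed
    print(f"{name}: ~{tokens:.1f} tokens, {ratio:.2f}x the speed of none")
```

Three runs come out at about four tokens, matching `max_new_tokens=4`, while the `H=1,L=1` run comes out at about two, which suggests it stopped early. Its elapsed time is therefore not directly comparable with the others.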

## Why GGUF is hard

A GGUF file is not just a container for weights. llama.cpp also has to know the forward pass of the architecture in question.

KoHRM is not a model that runs each ordinary Transformer block once.

- It has an H module and an L module.
- It iterates them recurrently, `H_cycles` and `L_cycles` times.
- PrefixLM formatting and stop-token handling are different.
- The KV cache layout also differs from that of a typical chat causal LM.

Producing a proper GGUF therefore requires the following work.

```text
1. Add HRM_TEXT to llama.cpp MODEL_ARCH
2. Implement the H/L recurrent forward pass
3. Implement gqkv gated attention
4. Handle the PrefixLM prompt/token boundary
5. Register the tokenizer pre-tokenizer hash
6. Write the quantized tensor name mapping
7. Run a llama-cli generation smoke test
```

This is not something a simple converter patch can solve.
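
For the PrefixLM item in the list above, the following sketch shows the standard prefix-LM attention mask: prompt (prefix) positions are visible to every position, while generated positions are causal. Whether KoHRM's `prefix_lm: true` setting uses exactly this layout is an assumption, and `prefix_lm_mask` is a hypothetical helper.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Build a mask where mask[i][j] == 1 means position i may attend to position j."""
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            # Prefix tokens are always visible; other tokens only if not in the future.
            visible = j < prefix_len or j <= i
            row.append(1 if visible else 0)
        mask.append(row)
    return mask


for row in prefix_lm_mask(prefix_len=2, total_len=4):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

A purely causal model would have a zero at row 0, column 1. A runtime that assumes causal attention everywhere has to learn where the prefix ends, which is why the boundary handling appears as its own step.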

## HF CPU pack

The weights are not uploaded to HF a second time; the CPU runtime pack is published separately.

Target repo:

```text
LLM-OS-Models/KoHRM-Text-1.4B-CPU-Runtime
```

This repo contains only the following files.

```text
README.md
inference/kohrm_cpu_runtime.py
inference/requirements-cpu.txt
notebooks/kohrm_colab_generate.py
```

The weights are fetched from the original repo at run time.

```text
LLM-OS-Models/KoHRM-Text-1.4B
```

With shared machines in mind, the HF token is read from `.env` but never printed.
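
One way to follow this rule is sketched below: parse `.env` by hand and report only whether a token was found. The key name `HF_TOKEN`, the file layout, and the helpers `read_env_value` and `describe_token` are assumptions for illustration, not the runtime's actual code.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str) -> Optional[str]:
    """Return the value for `key` from a KEY=VALUE file, or None if it is absent."""
    if not env_path.exists():
        return None
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip("'\"")
    return None


def describe_token(token: Optional[str]) -> str:
    """Report only whether a token was found, never its contents."""
    return "HF token: loaded" if token else "HF token: not found"


# Demo with a throwaway file and a dummy value (HF_TOKEN is an assumed key name).
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# local secrets\nHF_TOKEN=hf_dummy_value\n", encoding="utf-8")
    token = read_env_value(env_file, "HF_TOKEN")
    missing = read_env_value(env_file, "OTHER_KEY")

print(describe_token(token))
print(describe_token(missing))
```

Logging goes through `describe_token`, so the secret itself never reaches the terminal or a shared log file.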