Add KoHRM CPU quantized runtime pack


Files changed (4)

README.md +199 -0
inference/kohrm_cpu_runtime.py +378 -0
inference/requirements-cpu.txt +4 -0
notebooks/kohrm_colab_generate.py +474 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+license: apache-2.0
+base_model: LLM-OS-Models/KoHRM-Text-1.4B
+base_model_relation: quantized
+library_name: pytorch
+tags:
+- kohrm
+- hrm-text
+- cpu
+- int8
+- int4
+- korean
+- terminal
+---
+# KoHRM-Text-1.4B CPU Runtime
+This repository contains a CPU-oriented inference runtime for
+`LLM-OS-Models/KoHRM-Text-1.4B`.
+It does not duplicate the original model weights. The runtime downloads the
+base model from Hugging Face and applies CPU quantization at load time.
+# KoHRM-Text CPU Runtime Pack
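+Written: `2026-06-09`
+## Conclusion
+`LLM-OS-Models/KoHRM-Text-1.4B` cannot currently be converted directly to GGUF.
+The reason is that the model is not a standard Llama/Qwen/Gemma-family architecture; it uses the dedicated structure below.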
+```text
+model_type: hrm_text
+architectures: HrmTextForCausalLM
+H_cycles: 2
+L_cycles: 3
+prefix_lm: true
+```
+Running the llama.cpp converter directly fails at the following point.
+```text
+ERROR:hf-to-gguf:Model HrmTextForCausalLM is not supported
+```
+The practical CPU path for now is therefore a dedicated PyTorch runtime, not GGUF.
+## Added files
+```text
+HRM-Text/inference/kohrm_cpu_runtime.py
+HRM-Text/inference/requirements-cpu.txt
+HRM-Text/scripts/upload_kohrm_cpu_runtime_pack.py
+```
+This runtime reuses the direct safetensors loading path from the existing `HRM-Text/notebooks/kohrm_colab_generate.py` and adds CPU quantization and an H/L cycle override.
+## Usage
+The recommended default is `dynamic-int8`.
+```bash
+cd /home/work/.projects/LLM-OS-Models/Terminal
+CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
+python HRM-Text/inference/kohrm_cpu_runtime.py \
+  --model LLM-OS-Models/KoHRM-Text-1.4B \
+  --quant dynamic-int8 \
+  --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
+  --max-new-tokens 128 \
+  --max-seq-len 768 \
+  --temperature 0
+```
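+On a machine with 16GB of CPU RAM, try the modes in this order.
+```text
+1st choice: dynamic-int8
+2nd choice: none
+3rd choice: weight-int4
+```
+`dynamic-int8` uses PyTorch CPU dynamic quantization. It generally gives the best balance of memory and speed.
+`weight-int4` is a hand-written, portable 4-bit weight-only fallback. It reduces memory, but every forward pass unpacks and dequantizes the weights, so it is very slow. Use it only when the model must fit in a small memory budget.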
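The unpack/dequantize cost of `weight-int4` is easier to see in a small sketch. The pure-Python version below mirrors the idea of symmetric per-group int4 with two nibbles per byte; it is an illustration only, not the runtime's implementation. The function names and the tiny `group_size=4` are assumptions made for readability (the runtime defaults to a group size of 128 and operates on tensors).

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group int4: returns (packed bytes, per-group scales)."""
    assert len(weights) % group_size == 0 and group_size % 2 == 0
    scales, nibbles = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Scale so the largest magnitude in the group maps to 7.
        scale = max(max(abs(w) for w in group), 1e-8) / 7.0
        scales.append(scale)
        for w in group:
            q = max(-8, min(7, round(w / scale)))
            nibbles.append(q & 0x0F)  # two's-complement nibble in 0..15
    # Pack two nibbles per byte: even index in the low half, odd in the high half.
    packed = bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))
    return packed, scales


def dequantize_int4(packed, scales, group_size=4):
    """Unpack nibbles, restore the sign, and multiply by the group scale."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            q = nibble - 16 if nibble >= 8 else nibble
            values.append(q * scales[len(values) // group_size])
    return values


weights = [0.7, -0.35, 0.1, 0.0, 2.0, -2.0, 1.0, 0.5]
packed, scales = quantize_int4(weights)
restored = dequantize_int4(packed, scales)
print(len(packed), "bytes for", len(weights), "weights")
```

Storage drops to half a byte per weight plus one scale per group, but the full dequantize step has to run before every matrix multiply, which is consistent with the slow `weight-int4` timings reported in this README.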
+## H/L cycle override
+KoHRM applies the same H/L modules repeatedly. The defaults are `H=2` and `L=3`.
+On CPU, you can reduce the cycle counts as shown below to increase speed.
+```bash
+CUDA_VISIBLE_DEVICES= OMP_NUM_THREADS=8 \
+python HRM-Text/inference/kohrm_cpu_runtime.py \
+  --model LLM-OS-Models/KoHRM-Text-1.4B \
+  --quant dynamic-int8 \
+  --h-cycles 1 \
+  --l-cycles 1 \
+  --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
+  --max-new-tokens 128 \
+  --max-seq-len 768 \
+  --temperature 0
+```
+Keep the following trade-off in mind.
+- `H=2,L=3`: the original quality path.
+- `H=1,L=1`: the CPU speed-first path.
+- Reducing the cycles may degrade quality.
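To make the cost of the cycle settings concrete, the sketch below counts how many times the H and L modules run per forward pass. It assumes the nested-loop structure used in `notebooks/kohrm_colab_generate.py` (the L module runs `L_cycles` times inside each of the `H_cycles` outer iterations, then the H module runs once). `count_module_passes` is a hypothetical helper, not part of the runtime.

```python
from typing import Dict


def count_module_passes(h_cycles: int, l_cycles: int) -> Dict[str, int]:
    """Count H- and L-module invocations for one forward pass."""
    l_passes = 0
    h_passes = 0
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            l_passes += 1  # L module refines z_l using the current z_h
        h_passes += 1      # H module updates z_h once per outer cycle
    return {"L": l_passes, "H": h_passes, "total": l_passes + h_passes}


default = count_module_passes(2, 3)  # H=2, L=3
fast = count_module_passes(1, 1)     # H=1, L=1
print(default, fast, default["total"] / fast["total"])
```

Under this assumption the default setting performs four times as many module passes as `H=1,L=1`. Measured speedups will likely be smaller, since the embedding, `lm_head`, and fixed per-call overhead do not shrink with the cycle count.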
+## Smoke test results
+All runs use the same short prompt with `max_new_tokens=4`, `max_seq_len=128`, and `OMP_NUM_THREADS=8`.
+```text
+none:
+  elapsed: 1.48s
+  speed:   2.69 tok/s
+  cycles:  H=2, L=3
+dynamic-int8:
+  elapsed: 0.53s
+  speed:   7.59 tok/s
+  cycles:  H=2, L=3
+dynamic-int8 + H=1,L=1:
+  elapsed: 0.24s
+  speed:   8.18 tok/s
+  cycles:  H=1, L=1
+weight-int4:
+  elapsed: 23.25s
+  speed:   0.17 tok/s
+  cycles:  H=2, L=3
+```
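+Because this is a short smoke test, treat the absolute numbers as indicative only. The direction, however, is clear.
+```text
+Everyday use: dynamic-int8
+Forced memory savings: weight-int4
+Preserve quality: H=2,L=3
+Prioritize speed: H=1,L=1
+```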
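The reported throughput figures can be compared against the unquantized baseline with a few lines of arithmetic. The sketch below only restates the tok/s values from the smoke test; `relative_speed` is a hypothetical helper introduced for illustration.

```python
from typing import Dict

# Reported smoke-test throughput in tokens per second (copied from the results above).
reported_tok_per_s = {
    "none": 2.69,
    "dynamic-int8": 7.59,
    "dynamic-int8 + H=1,L=1": 8.18,
    "weight-int4": 0.17,
}


def relative_speed(results: Dict[str, float], baseline: str = "none") -> Dict[str, float]:
    """Return each mode's throughput as a multiple of the baseline mode."""
    base = results[baseline]
    return {mode: speed / base for mode, speed in results.items()}


ratios = relative_speed(reported_tok_per_s)
for mode, ratio in ratios.items():
    print(f"{mode:>24}: {ratio:.2f}x")
```

With only four generated tokens per run, these ratios are rough; a longer generation would be needed before drawing firm conclusions.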
+## Why GGUF is difficult
+A GGUF file is not just a container for weights. llama.cpp also has to know the forward pass of the architecture.
+KoHRM is not a model that stacks standard Transformer blocks and runs each one once.
+- It has an H module and an L module.
+- It runs them recurrently, `H_cycles` and `L_cycles` times.
+- PrefixLM formatting and stop-token handling differ.
+- The KV cache layout also differs from an ordinary chat causal LM.
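The KV cache difference can be sketched numerically. In `notebooks/kohrm_colab_generate.py`, `init_cache` allocates a separate key/value cache for every recurrent pass: `H_cycles` passes for the H module and `H_cycles * L_cycles` passes for the L module, each with one cache per layer. The estimate below follows that layout; `kv_cache_bytes` is a hypothetical helper, and the layer, head, and dimension values are placeholders for illustration, not the real KoHRM configuration.

```python
def kv_cache_bytes(h_cycles, l_cycles, num_layers, num_heads, head_dim,
                   max_seq_len, batch_size=1, bytes_per_elem=4):
    """Estimate KV cache size when every recurrent pass keeps its own cache."""
    passes = h_cycles + h_cycles * l_cycles  # H passes + L passes
    # One k tensor and one v tensor per layer, each (batch, seq, heads, head_dim).
    per_layer = 2 * batch_size * max_seq_len * num_heads * head_dim
    return passes * num_layers * per_layer * bytes_per_elem


# Hypothetical shape values for illustration only; read the real ones from config.json.
example = dict(num_layers=8, num_heads=16, head_dim=64, max_seq_len=768)
full = kv_cache_bytes(2, 3, **example)  # float32 on CPU, so 4 bytes per element
fast = kv_cache_bytes(1, 1, **example)
print(f"H=2,L=3: {full / 2**20:.1f} MiB, H=1,L=1: {fast / 2**20:.1f} MiB")
```

A standard causal LM needs one cache per layer, so a llama.cpp port would have to multiply its cache allocation by the number of recurrent passes and index it by pass as well as by layer.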
+Producing a proper GGUF therefore requires the following work.
+```text
+1. Add HRM_TEXT to llama.cpp MODEL_ARCH
+2. Implement the H/L recurrent forward pass
+3. Implement gqkv gated attention
+4. Handle the PrefixLM prompt/token boundary
+5. Register the tokenizer pre-tokenizer hash
+6. Write the quantized tensor name mapping
+7. Run a llama-cli generation smoke test
+```
+This is not a problem that a simple converter patch can solve.
+## HF CPU pack
+The weights are not re-uploaded to HF; the CPU runtime pack is published separately.
+Target repo:
+```text
+LLM-OS-Models/KoHRM-Text-1.4B-CPU-Runtime
+```
+This repo contains only the following.
+```text
+README.md
+inference/kohrm_cpu_runtime.py
+inference/requirements-cpu.txt
+notebooks/kohrm_colab_generate.py
+```
+The weights are fetched from the original repo at run time.
+```text
+LLM-OS-Models/KoHRM-Text-1.4B
+```
+On shared machines, the HF token is read from `.env` and is never printed.
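The token handling can be sketched as follows. The parser follows the same line-by-line rules as `_read_dotenv_token` in `inference/kohrm_cpu_runtime.py` (skip comments, accept an optional `export ` prefix, strip quotes). `parse_dotenv_token` and `mask` are hypothetical names used here for illustration, and the sample token is a made-up placeholder.

```python
from typing import Optional

TOKEN_KEYS = {"HF_TOKEN", "HUGGINGFACE_TOKEN", "HUGGING_FACE_HUB_TOKEN"}


def parse_dotenv_token(text: str) -> Optional[str]:
    """Return the first HF token found in .env-style text, or None."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith("export "):
            key = key.split(None, 1)[1]
        if key in TOKEN_KEYS:
            token = value.strip().strip('"').strip("'")
            return token or None
    return None


def mask(token: Optional[str]) -> str:
    """Safe representation for logs: never reveals the token itself."""
    return "<unset>" if not token else f"<set, {len(token)} chars>"


sample = '# comment\nexport HF_TOKEN="hf_example_not_a_real_token"\n'
token = parse_dotenv_token(sample)
print(mask(token))
```

Logging only the masked form keeps the token out of shell history and shared log files while still confirming that one was found.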

inference/kohrm_cpu_runtime.py ADDED

	@@ -0,0 +1,378 @@

+"""CPU-oriented KoHRM-Text inference runtime.
+KoHRM-Text uses the custom ``hrm_text`` / ``HrmTextForCausalLM`` architecture,
+so it cannot currently be served by llama.cpp/GGUF or ordinary vLLM paths.
+This runtime wraps the existing safetensors loader and adds CPU-friendly
+quantization and cycle overrides.
+Recommended mode for normal CPU use:
+    python HRM-Text/inference/kohrm_cpu_runtime.py \
+      --model LLM-OS-Models/KoHRM-Text-1.4B \
+      --quant dynamic-int8 \
+      --prompt "리눅스에서 현재 디렉토리 파일 목록을 보는 명령어는?" \
+      --max-new-tokens 64
+Experimental memory-first mode:
+    python HRM-Text/inference/kohrm_cpu_runtime.py --quant weight-int4 ...
+"""
+from __future__ import annotations
+import argparse
+import gc
+import importlib.util
+import json
+import math
+import os
+import shutil
+import sys
+import time
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from huggingface_hub import snapshot_download
+REPO_ROOT = Path(__file__).resolve().parents[1]
+HELPER_PATH = REPO_ROOT / "notebooks" / "kohrm_colab_generate.py"
+DEFAULT_REPO_ID = "LLM-OS-Models/KoHRM-Text-1.4B"
+def _load_helper():
+    if not HELPER_PATH.exists():
+        raise FileNotFoundError(f"missing KoHRM helper: {HELPER_PATH}")
+    spec = importlib.util.spec_from_file_location("kohrm_colab_generate", HELPER_PATH)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"cannot import helper from {HELPER_PATH}")
+    module = importlib.util.module_from_spec(spec)
+    sys.modules.setdefault("kohrm_colab_generate", module)
+    spec.loader.exec_module(module)
+    return module
+def _read_dotenv_token() -> str | None:
+    """Read a local HF token without printing it or exporting it to shell logs."""
+    candidates = [
+        Path.cwd() / ".env",
+        REPO_ROOT.parent / ".env",
+        REPO_ROOT / ".env",
+        Path.home() / ".cache" / "huggingface" / "token",
+    ]
+    for path in candidates:
+        if not path.exists():
+            continue
+        if path.name == "token":
+            token = path.read_text(encoding="utf-8").strip()
+            return token or None
+        for raw in path.read_text(encoding="utf-8", errors="ignore").splitlines():
+            line = raw.strip()
+            if not line or line.startswith("#") or "=" not in line:
+                continue
+            key, value = line.split("=", 1)
+            key = key.strip()
+            if key.startswith("export "):
+                key = key.split(None, 1)[1]
+            if key in {"HF_TOKEN", "HUGGINGFACE_TOKEN", "HUGGING_FACE_HUB_TOKEN"}:
+                token = value.strip().strip('"').strip("'")
+                return token or None
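+    # Fall back to the environment, accepting the same key names as the .env parser.
+    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")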
+def resolve_model_dir(model: str, revision: str | None = None) -> Path:
+    path = Path(model).expanduser()
+    if path.exists():
+        return path
+    token = _read_dotenv_token()
+    return Path(
+        snapshot_download(
+            repo_id=model,
+            revision=revision,
+            allow_patterns=["config.json", "tokenizer.json", "tokenizer_config.json", "model.safetensors", "README.md"],
+            token=token,
+        )
+    )
+@dataclass
+class RuntimeStats:
+    prompt_tokens: int
+    generated_tokens: int
+    elapsed_s: float
+    tokens_per_s: float
+    quantization: str
+    h_cycles: int
+    l_cycles: int
+    dtype: str
+class WeightOnlyInt8Linear(nn.Module):
+    """Simple symmetric per-group int8 weight-only Linear.
+    This is a portability fallback, not an optimized kernel. It reduces resident
+    weight memory after conversion, but dequantizes on forward. For speed, prefer
+    PyTorch dynamic int8.
+    """
+    def __init__(self, qweight: torch.Tensor, scales: torch.Tensor, in_features: int, out_features: int, group_size: int) -> None:
+        super().__init__()
+        self.in_features = int(in_features)
+        self.out_features = int(out_features)
+        self.group_size = int(group_size)
+        self.register_buffer("qweight", qweight.contiguous())
+        self.register_buffer("scales", scales.contiguous())
+    @classmethod
+    def from_linear(cls, linear: nn.Linear, group_size: int = 128) -> "WeightOnlyInt8Linear":
+        weight = linear.weight.detach().to(dtype=torch.float32, device="cpu")
+        out_features, in_features = weight.shape
+        pad = (-in_features) % group_size
+        if pad:
+            weight = F.pad(weight, (0, pad))
+        grouped = weight.view(out_features, -1, group_size)
+        scales = grouped.abs().amax(dim=-1).clamp_min(1e-8) / 127.0
+        qweight = torch.round(grouped / scales.unsqueeze(-1)).clamp(-127, 127).to(torch.int8)
+        return cls(qweight=qweight, scales=scales.to(torch.float16), in_features=in_features, out_features=out_features, group_size=group_size)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        weight = (self.qweight.to(torch.float32) * self.scales.to(torch.float32).unsqueeze(-1)).view(self.out_features, -1)
+        weight = weight[:, : self.in_features].to(dtype=x.dtype)
+        return F.linear(x, weight)
+class WeightOnlyInt4Linear(nn.Module):
+    """Portable symmetric per-group int4 weight-only Linear.
+    Values are stored as packed signed nibbles. Forward unpacks and dequantizes
+    on CPU, so this is memory-first rather than speed-first.
+    """
+    def __init__(self, packed: torch.Tensor, scales: torch.Tensor, in_features: int, out_features: int, padded_features: int, group_size: int) -> None:
+        super().__init__()
+        self.in_features = int(in_features)
+        self.out_features = int(out_features)
+        self.padded_features = int(padded_features)
+        self.group_size = int(group_size)
+        self.register_buffer("packed", packed.contiguous())
+        self.register_buffer("scales", scales.contiguous())
+    @classmethod
+    def from_linear(cls, linear: nn.Linear, group_size: int = 128) -> "WeightOnlyInt4Linear":
+        weight = linear.weight.detach().to(dtype=torch.float32, device="cpu")
+        out_features, in_features = weight.shape
+        pad_group = (-in_features) % group_size
+        if pad_group:
+            weight = F.pad(weight, (0, pad_group))
+        if weight.shape[1] % 2:
+            weight = F.pad(weight, (0, 1))
+        padded_features = weight.shape[1]
+        grouped = weight.view(out_features, -1, group_size)
+        scales = grouped.abs().amax(dim=-1).clamp_min(1e-8) / 7.0
+        q = torch.round(grouped / scales.unsqueeze(-1)).clamp(-8, 7).to(torch.int16)
+        q = (q + 16).remainder(16).to(torch.uint8).view(out_features, padded_features)
+        low = q[:, 0::2]
+        high = q[:, 1::2] << 4
+        packed = low | high
+        return cls(
+            packed=packed,
+            scales=scales.to(torch.float16),
+            in_features=in_features,
+            out_features=out_features,
+            padded_features=padded_features,
+            group_size=group_size,
+        )
+    def _unpack(self) -> torch.Tensor:
+        low = self.packed & 0x0F
+        high = (self.packed >> 4) & 0x0F
+        q = torch.empty((self.out_features, self.packed.shape[1] * 2), dtype=torch.int16, device=self.packed.device)
+        q[:, 0::2] = low.to(torch.int16)
+        q[:, 1::2] = high.to(torch.int16)
+        q = torch.where(q >= 8, q - 16, q)
+        return q[:, : self.padded_features]
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        q = self._unpack().to(torch.float32)
+        weight = (q.view(self.out_features, -1, self.group_size) * self.scales.to(torch.float32).unsqueeze(-1)).view(self.out_features, -1)
+        weight = weight[:, : self.in_features].to(dtype=x.dtype)
+        return F.linear(x, weight)
+def _replace_linear_modules(module: nn.Module, *, quant: str, group_size: int, quantize_lm_head: bool, prefix: str = "") -> int:
+    replaced = 0
+    for name, child in list(module.named_children()):
+        child_prefix = f"{prefix}.{name}" if prefix else name
+        if isinstance(child, nn.Linear):
+            if child_prefix == "lm_head" and not quantize_lm_head:
+                continue
+            if child.bias is not None:
+                raise ValueError(f"bias is not supported by portable weight-only quantization: {child_prefix}")
+            if quant == "weight-int8":
+                new_child = WeightOnlyInt8Linear.from_linear(child, group_size=group_size)
+            elif quant == "weight-int4":
+                new_child = WeightOnlyInt4Linear.from_linear(child, group_size=group_size)
+            else:
+                raise ValueError(f"unsupported weight-only quantization: {quant}")
+            setattr(module, name, new_child)
+            replaced += 1
+        else:
+            replaced += _replace_linear_modules(child, quant=quant, group_size=group_size, quantize_lm_head=quantize_lm_head, prefix=child_prefix)
+    return replaced
+def apply_quantization(
+    model: nn.Module,
+    quant: str,
+    *,
+    group_size: int = 128,
+    quantize_lm_head: bool = False,
+) -> nn.Module:
+    if quant == "none":
+        return model
+    if quant == "dynamic-int8":
+        torch.backends.quantized.engine = "fbgemm"
+        return torch.ao.quantization.quantize_dynamic(model.cpu(), {nn.Linear}, dtype=torch.qint8, inplace=False)
+    if quant in {"weight-int8", "weight-int4"}:
+        replaced = _replace_linear_modules(model, quant=quant, group_size=group_size, quantize_lm_head=quantize_lm_head)
+        if replaced == 0:
+            raise RuntimeError("no Linear modules were replaced")
+        gc.collect()
+        return model.cpu().eval()
+    raise ValueError(f"unknown quantization mode: {quant}")
+def load_runtime(
+    model_dir: Path,
+    *,
+    quant: str,
+    h_cycles: int | None,
+    l_cycles: int | None,
+    group_size: int,
+    quantize_lm_head: bool,
+):
+    helper = _load_helper()
+    model, tokenizer, cfg = helper.load_kohrm(model_dir, device="cpu")
+    if h_cycles is not None:
+        cfg["H_cycles"] = int(h_cycles)
+        model.cfg["H_cycles"] = int(h_cycles)
+        model.model.cfg["H_cycles"] = int(h_cycles)
+    if l_cycles is not None:
+        cfg["L_cycles"] = int(l_cycles)
+        model.cfg["L_cycles"] = int(l_cycles)
+        model.model.cfg["L_cycles"] = int(l_cycles)
+    model = apply_quantization(model, quant, group_size=group_size, quantize_lm_head=quantize_lm_head)
+    return helper, model.eval(), tokenizer, cfg
+def generate(
+    model: nn.Module,
+    tokenizer: Any,
+    cfg: dict[str, Any],
+    helper: Any,
+    prompt: str,
+    *,
+    max_new_tokens: int,
+    min_new_tokens: int,
+    max_seq_len: int,
+    temperature: float,
+    top_p: float,
+    repetition_penalty: float,
+    no_repeat_ngram_size: int,
+    condition: str,
+) -> tuple[str, RuntimeStats]:
+    wrapped = helper.format_kohrm_prompt(prompt, condition=condition)
+    prompt_tokens = len(tokenizer.encode(wrapped, add_special_tokens=False).ids)
+    start = time.perf_counter()
+    output = helper.generate_from_loaded(
+        model,
+        tokenizer,
+        cfg,
+        prompt,
+        max_new_tokens=max_new_tokens,
+        min_new_tokens=min_new_tokens,
+        max_seq_len=max_seq_len,
+        temperature=temperature,
+        top_p=top_p,
+        repetition_penalty=repetition_penalty,
+        no_repeat_ngram_size=no_repeat_ngram_size,
+        condition=condition,
+    )
+    elapsed = time.perf_counter() - start
+    out_tokens = len(tokenizer.encode(output, add_special_tokens=False).ids) if output else 0
+    stats = RuntimeStats(
+        prompt_tokens=prompt_tokens,
+        generated_tokens=out_tokens,
+        elapsed_s=elapsed,
+        tokens_per_s=(out_tokens / elapsed if elapsed > 0 else math.nan),
+        quantization="",
+        h_cycles=int(cfg.get("H_cycles", 0)),
+        l_cycles=int(cfg.get("L_cycles", 0)),
+        dtype=str(next(model.parameters()).dtype) if any(True for _ in model.parameters()) else "unknown",
+    )
+    return output, stats
+def build_arg_parser() -> argparse.ArgumentParser:
+    ap = argparse.ArgumentParser(description="Run KoHRM-Text on CPU with optional quantization.")
+    ap.add_argument("--model", default=DEFAULT_REPO_ID, help="HF repo id or local directory containing KoHRM HF export files.")
+    ap.add_argument("--revision", default=None)
+    ap.add_argument("--prompt", required=True)
+    ap.add_argument("--quant", choices=["none", "dynamic-int8", "weight-int8", "weight-int4"], default="dynamic-int8")
+    ap.add_argument("--group-size", type=int, default=128)
+    ap.add_argument("--quantize-lm-head", action="store_true", help="Also quantize lm_head in portable weight-only modes. Saves memory but slows generation.")
+    ap.add_argument("--h-cycles", type=int, default=None, help="Override H_cycles. Lower values trade quality for CPU speed.")
+    ap.add_argument("--l-cycles", type=int, default=None, help="Override L_cycles. Lower values trade quality for CPU speed.")
+    ap.add_argument("--max-new-tokens", type=int, default=128)
+    ap.add_argument("--min-new-tokens", type=int, default=0)
+    ap.add_argument("--max-seq-len", type=int, default=768)
+    ap.add_argument("--temperature", type=float, default=0.0)
+    ap.add_argument("--top-p", type=float, default=0.9)
+    ap.add_argument("--repetition-penalty", type=float, default=1.05)
+    ap.add_argument("--no-repeat-ngram-size", type=int, default=0)
+    ap.add_argument("--condition", default="direct", choices=["direct", "cot", "noisy", "synth"])
+    ap.add_argument("--json-stats", action="store_true")
+    return ap
+def main() -> None:
+    args = build_arg_parser().parse_args()
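+    # Keep CPU execution predictable on shared machines. torch is already imported
+    # at this point, so setting the env var alone may be too late for the OpenMP
+    # runtime; cap the intra-op thread count explicitly as well.
+    if "OMP_NUM_THREADS" not in os.environ:
+        threads = min(8, os.cpu_count() or 8)
+        os.environ["OMP_NUM_THREADS"] = str(threads)
+        torch.set_num_threads(threads)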
+    model_dir = resolve_model_dir(args.model, revision=args.revision)
+    helper, model, tokenizer, cfg = load_runtime(
+        model_dir,
+        quant=args.quant,
+        h_cycles=args.h_cycles,
+        l_cycles=args.l_cycles,
+        group_size=args.group_size,
+        quantize_lm_head=args.quantize_lm_head,
+    )
+    output, stats = generate(
+        model,
+        tokenizer,
+        cfg,
+        helper,
+        args.prompt,
+        max_new_tokens=args.max_new_tokens,
+        min_new_tokens=args.min_new_tokens,
+        max_seq_len=args.max_seq_len,
+        temperature=args.temperature,
+        top_p=args.top_p,
+        repetition_penalty=args.repetition_penalty,
+        no_repeat_ngram_size=args.no_repeat_ngram_size,
+        condition=args.condition,
+    )
+    stats.quantization = args.quant
+    print(output)
+    if args.json_stats:
+        print(json.dumps(stats.__dict__, ensure_ascii=False), file=sys.stderr)
+if __name__ == "__main__":
+    main()

inference/requirements-cpu.txt ADDED

	@@ -0,0 +1,4 @@

+torch>=2.6
+safetensors>=0.4.5
+tokenizers>=0.20
+huggingface_hub>=0.28

notebooks/kohrm_colab_generate.py ADDED

	@@ -0,0 +1,474 @@

+"""Minimal KoHRM-Text generation runtime for Colab.
+This file intentionally avoids `transformers` and FlashAttention. It loads the
+public `model.safetensors` export and runs HRM-Text generation with PyTorch
+scaled-dot-product attention. It is built for long pretraining-checkpoint
+knowledge probes on Colab T4 and small CUDA machines.
+"""
+from __future__ import annotations
+import json
+import math
+import argparse
+from pathlib import Path
+from typing import Any
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from safetensors.torch import load_file
+from tokenizers import Tokenizer
+DEFAULT_CONDITION_TOKENS = {
+    "direct": "<|object_ref_start|>",
+    "cot": "<|object_ref_end|>",
+    "noisy": "<|quad_start|>",
+    "synth": "<|quad_end|>",
+}
+def _rms_norm(x: torch.Tensor, eps: float) -> torch.Tensor:
+    return F.rms_norm(x, (x.shape[-1],), eps=eps)
+def _rotate_half(x: torch.Tensor) -> torch.Tensor:
+    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def _rope_cos_sin(position_ids: torch.Tensor, head_dim: int, theta: float, dtype: torch.dtype) -> tuple[torch.Tensor, torch.Tensor]:
+    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2, device=position_ids.device, dtype=torch.float32) / head_dim))
+    freqs = torch.einsum("bt,d->btd", position_ids.to(torch.float32), inv_freq)
+    emb = torch.cat((freqs, freqs), dim=-1)
+    return emb.cos().to(dtype), emb.sin().to(dtype)
+def _apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
+    return ((x * cos.unsqueeze(-2)) + (_rotate_half(x) * sin.unsqueeze(-2))).to(x.dtype)
+class KoHRMAttention(nn.Module):
+    def __init__(self, hidden_size: int, num_heads: int, head_dim: int, device: str = "meta") -> None:
+        super().__init__()
+        self.num_heads = num_heads
+        self.head_dim = head_dim
+        self.gqkv_proj = nn.Linear(hidden_size, (4 * num_heads) * head_dim, bias=False, device=device)
+        self.o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False, device=device)
+    def forward(
+        self,
+        x: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+        cache: dict[str, torch.Tensor] | None,
+        cache_pos: int,
+    ) -> torch.Tensor:
+        bsz, seqlen, _ = x.shape
+        gqkv = self.gqkv_proj(x).view(bsz, seqlen, 4 * self.num_heads, self.head_dim)
+        gate, q, k, v = gqkv.split((self.num_heads, self.num_heads, self.num_heads, self.num_heads), dim=-2)
+        q = _apply_rope(q, cos, sin)
+        k = _apply_rope(k, cos, sin)
+        if cache is not None:
+            end = cache_pos + seqlen
+            cache["k"][:, cache_pos:end].copy_(k)
+            cache["v"][:, cache_pos:end].copy_(v)
+            k = cache["k"][:, :end]
+            v = cache["v"][:, :end]
+        q = q.transpose(1, 2)
+        k = k.transpose(1, 2)
+        v = v.transpose(1, 2)
+        y = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
+        y = y.transpose(1, 2)
+        y = (torch.sigmoid(gate) * y).reshape(bsz, seqlen, self.num_heads * self.head_dim)
+        return self.o_proj(y)
+class KoHRMMLP(nn.Module):
+    def __init__(self, hidden_size: int, intermediate_size: int, device: str = "meta") -> None:
+        super().__init__()
+        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False, device=device)
+        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False, device=device)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
+        return self.down_proj(F.silu(gate) * up)
+class KoHRMBlock(nn.Module):
+    def __init__(self, cfg: dict[str, Any], device: str = "meta") -> None:
+        super().__init__()
+        self.eps = float(cfg["rms_norm_eps"])
+        self.attn = KoHRMAttention(cfg["hidden_size"], cfg["num_attention_heads"], cfg["head_dim"], device=device)
+        self.mlp = KoHRMMLP(cfg["hidden_size"], cfg["intermediate_size"], device=device)
+    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, cache: dict[str, torch.Tensor] | None, cache_pos: int) -> torch.Tensor:
+        x = x + self.attn(_rms_norm(x, self.eps), cos, sin, cache, cache_pos)
+        x = x + self.mlp(_rms_norm(x, self.eps))
+        return x
+class KoHRMModule(nn.Module):
+    def __init__(self, cfg: dict[str, Any], num_layers: int, device: str = "meta") -> None:
+        super().__init__()
+        self.eps = float(cfg["rms_norm_eps"])
+        self.layers = nn.ModuleList([KoHRMBlock(cfg, device=device) for _ in range(num_layers)])
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        input_injection: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+        caches: list[dict[str, torch.Tensor]] | None,
+        cache_pos: int,
+    ) -> torch.Tensor:
+        x = hidden_states + input_injection
+        for idx, layer in enumerate(self.layers):
+            x = layer(x, cos, sin, None if caches is None else caches[idx], cache_pos)
+        return _rms_norm(x, self.eps)
+class KoHRMCore(nn.Module):
+    def __init__(self, cfg: dict[str, Any], num_layers: int, device: str = "meta") -> None:
+        super().__init__()
+        self.cfg = cfg
+        self.embedding_scale = float(cfg.get("embedding_scale", 1.0))
+        self.embed_tokens = nn.Embedding(cfg["vocab_size"], cfg["hidden_size"], device=device)
+        self.register_buffer("z_L_init", torch.empty(cfg["hidden_size"], device=device), persistent=True)
+        self.H_module = KoHRMModule(cfg, num_layers, device=device)
+        self.L_module = KoHRMModule(cfg, num_layers, device=device)
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        position_ids: torch.Tensor,
+        caches: dict[str, list[list[dict[str, torch.Tensor]]]] | None,
+        cache_pos: int,
+    ) -> torch.Tensor:
+        x = self.embedding_scale * self.embed_tokens(input_ids)
+        cos, sin = _rope_cos_sin(position_ids, self.cfg["head_dim"], float(self.cfg["rope_theta"]), x.dtype)
+        z_h = x
+        z_l = self.z_L_init.to(dtype=x.dtype).view(1, 1, -1).expand_as(x)
+        h_cycles, l_cycles = int(self.cfg["H_cycles"]), int(self.cfg["L_cycles"])
+        for h_idx in range(h_cycles):
+            for l_idx in range(l_cycles):
+                pass_idx = h_idx * l_cycles + l_idx
+                z_l = self.L_module(z_l, z_h, cos, sin, None if caches is None else caches["L"][pass_idx], cache_pos)
+            z_h = self.H_module(z_h, z_l, cos, sin, None if caches is None else caches["H"][h_idx], cache_pos)
+        return z_h
+class KoHRMTextForGeneration(nn.Module):
+    def __init__(self, cfg: dict[str, Any], num_layers: int, device: str = "meta") -> None:
+        super().__init__()
+        self.cfg = cfg
+        self.num_layers = num_layers
+        self.model = KoHRMCore(cfg, num_layers, device=device)
+        self.lm_head = nn.Linear(cfg["hidden_size"], cfg["vocab_size"], bias=False, device=device)
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        position_ids: torch.Tensor,
+        caches: dict[str, list[list[dict[str, torch.Tensor]]]] | None = None,
+        cache_pos: int = 0,
+    ) -> torch.Tensor:
+        hidden = self.model(input_ids, position_ids, caches, cache_pos)
+        return self.lm_head(hidden)
+    def init_cache(self, batch_size: int, max_seq_len: int, device: torch.device, dtype: torch.dtype) -> dict[str, list[list[dict[str, torch.Tensor]]]]:
+        heads, head_dim = int(self.cfg["num_attention_heads"]), int(self.cfg["head_dim"])
+        def one_layer() -> dict[str, torch.Tensor]:
+            shape = (batch_size, max_seq_len, heads, head_dim)
+            return {
+                "k": torch.empty(shape, device=device, dtype=dtype),
+                "v": torch.empty(shape, device=device, dtype=dtype),
+            }
+        def one_pass() -> list[dict[str, torch.Tensor]]:
+            return [one_layer() for _ in range(self.num_layers)]
+        return {
+            "H": [one_pass() for _ in range(int(self.cfg["H_cycles"]))],
+            "L": [one_pass() for _ in range(int(self.cfg["H_cycles"]) * int(self.cfg["L_cycles"]))],
+        }
+def _module_layer_count(state: dict[str, torch.Tensor], prefix: str) -> int:
+    layers = set()
+    marker = f"{prefix}.layers."
+    for key in state:
+        if key.startswith(marker):
+            layers.add(int(key[len(marker) :].split(".", 1)[0]))
+    return max(layers) + 1
+def load_kohrm(repo_dir: str | Path, device: str | None = None, max_gpu_memory_gib: float | None = None) -> tuple[KoHRMTextForGeneration, Tokenizer, dict[str, Any]]:
+    repo_dir = Path(repo_dir)
+    cfg = json.loads((repo_dir / "config.json").read_text())
+    tokenizer = Tokenizer.from_file(str(repo_dir / "tokenizer.json"))
+    state = load_file(str(repo_dir / "model.safetensors"), device="cpu")
+    num_layers = _module_layer_count(state, "model.H_module")
+    model = KoHRMTextForGeneration(cfg, num_layers=num_layers, device="meta")
+    model.load_state_dict(state, strict=True, assign=True)
+    del state
+    if device is None:
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+    target = torch.device(device)
+    dtype = torch.float16 if target.type == "cuda" else torch.float32
+    model = model.to(device=target, dtype=dtype).eval()
+    if target.type == "cuda":
+        torch.set_float32_matmul_precision("high")
+    if target.type == "cuda" and max_gpu_memory_gib is not None:
+        free, total = torch.cuda.mem_get_info()
+        print(f"GPU memory free/total GiB: {free / 2**30:.2f}/{total / 2**30:.2f}")
+    return model, tokenizer, cfg
+def condition_to_tokens(condition: str = "direct", mapping: dict[str, str] | None = None) -> str:
+    """Map upstream HRM-Text condition names to tokenizer control tokens."""
+    mapping = mapping or DEFAULT_CONDITION_TOKENS
+    pieces: list[str] = []
+    for raw_name in condition.split(","):
+        name = raw_name.strip()
+        if not name:
+            continue
+        if name not in mapping:
+            valid = ", ".join(sorted(mapping))
+            raise ValueError(f"Unknown condition {name!r}; expected one of: {valid}")
+        pieces.append(mapping[name])
+    if not pieces:
+        pieces.append(mapping["direct"])
+    return "".join(pieces)
+def format_kohrm_prompt(
+    prompt: str,
+    condition: str = "direct",
+    condition_token: str | None = None,
+) -> str:
+    """Format prompts like upstream InferenceCheckpoint.tokenize_prompt().
+    Upstream wraps prompts as:
+    `<boq><condition_tokens><instruction><eoq>`.
+    For answer-only generation use condition="direct", which maps to
+    `<|object_ref_start|>` in the KoHRM tokenizer. `condition_token` is kept
+    for backward compatibility and overrides `condition` when supplied.
+    """
+    if condition_token is None:
+        condition_token = condition_to_tokens(condition)
+    return f"<|im_start|>{condition_token}{prompt}<|im_end|>"
+def _apply_repetition_penalty(logits: torch.Tensor, seen_ids: list[int], penalty: float) -> torch.Tensor:
+    if penalty <= 1.0 or not seen_ids:
+        return logits
+    for token_id in set(seen_ids):
+        value = logits[..., token_id]
+        logits[..., token_id] = torch.where(value < 0, value * penalty, value / penalty)
+    return logits
+def _apply_no_repeat_ngram(logits: torch.Tensor, seen_ids: list[int], ngram_size: int) -> torch.Tensor:
+    if ngram_size <= 0 or len(seen_ids) < ngram_size - 1:
+        return logits
+    prefix = tuple(seen_ids[-(ngram_size - 1):])
+    blocked: set[int] = set()
+    for idx in range(len(seen_ids) - ngram_size + 1):
+        if tuple(seen_ids[idx:idx + ngram_size - 1]) == prefix:
+            blocked.add(seen_ids[idx + ngram_size - 1])
+    if blocked:
+        logits[..., list(blocked)] = -torch.inf
+    return logits
+def _sample_next(
+    logits: torch.Tensor,
+    temperature: float,
+    top_p: float,
+    seen_ids: list[int] | None = None,
+    repetition_penalty: float = 1.0,
+    no_repeat_ngram_size: int = 0,
+    blocked_ids: set[int] | None = None,
+) -> int:
+    logits = logits.float()
+    seen_ids = seen_ids or []
+    logits = _apply_repetition_penalty(logits, seen_ids, repetition_penalty)
+    logits = _apply_no_repeat_ngram(logits, seen_ids, no_repeat_ngram_size)
+    if blocked_ids:
+        logits[..., list(blocked_ids)] = -torch.inf
+    if temperature <= 0:
+        return int(torch.argmax(logits, dim=-1).item())
+    probs = torch.softmax(logits / temperature, dim=-1)
+    if top_p < 1.0:
+        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
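+        # Keep the smallest set whose cumulative mass reaches top_p, including the token that crosses the threshold.
+        keep = (torch.cumsum(sorted_probs, dim=-1) - sorted_probs) < top_p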
+        keep[..., 0] = True
+        sorted_probs = sorted_probs.masked_fill(~keep, 0)
+        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
+        next_sorted = torch.multinomial(sorted_probs, num_samples=1)
+        return int(sorted_idx.gather(-1, next_sorted).item())
+    return int(torch.multinomial(probs, num_samples=1).item())
+@torch.inference_mode()
+def generate_from_loaded(
+    model: KoHRMTextForGeneration,
+    tokenizer: Tokenizer,
+    cfg: dict[str, Any],
+    prompt: str,
+    *,
+    max_new_tokens: int = 64,
+    min_new_tokens: int = 0,
+    max_seq_len: int = 512,
+    temperature: float = 0.0,
+    top_p: float = 0.9,
+    repetition_penalty: float = 1.18,
+    no_repeat_ngram_size: int = 4,
+    condition: str = "direct",
+    condition_token: str | None = None,
+) -> str:
+    """Generate a completion with an already loaded model.
+    Prefills the cache with the wrapped prompt, then decodes one token at a
+    time until a stop token appears (once `min_new_tokens` is satisfied) or
+    `max_new_tokens` is reached. Requires
+    prompt_tokens + max_new_tokens + 1 <= max_seq_len.
+    """
+    dev = next(model.parameters()).device
+    dtype = next(model.parameters()).dtype
+    wrapped = format_kohrm_prompt(prompt, condition=condition, condition_token=condition_token)
+    input_ids = tokenizer.encode(wrapped, add_special_tokens=False).ids
+    if len(input_ids) + max_new_tokens + 1 > max_seq_len:
+        raise ValueError(
+            f"Prompt plus generation exceeds max_seq_len={max_seq_len}: "
+            f"prompt_tokens={len(input_ids)}, max_new_tokens={max_new_tokens}"
+        )
+    caches = model.init_cache(1, max_seq_len, dev, dtype)
+    ids = torch.tensor([input_ids], device=dev, dtype=torch.long)
+    pos = torch.arange(ids.shape[1], device=dev, dtype=torch.long).unsqueeze(0)
+    logits = model(ids, pos, caches=caches, cache_pos=0)[:, -1, :]
+    cache_pos = ids.shape[1]
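+    # Collect stop ids. A missing eos_token_id or an unknown special token is skipped
+    # instead of failing in int(None); eos_token_id may be a single id or a list of ids.
+    eos_cfg = cfg.get("eos_token_id")
+    stop_candidates = list(eos_cfg) if isinstance(eos_cfg, (list, tuple)) else [eos_cfg]
+    stop_candidates += [
+        tokenizer.token_to_id("<|im_end|>"),
+        tokenizer.token_to_id("<|box_end|>"),
+    ]
+    stop_ids = {int(x) for x in stop_candidates if x is not None}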
+    out_ids: list[int] = []
+    seen_ids = list(input_ids)
+    next_id = _sample_next(
+        logits,
+        temperature,
+        top_p,
+        seen_ids,
+        repetition_penalty,
+        no_repeat_ngram_size,
+        blocked_ids=stop_ids if min_new_tokens > 0 else None,
+    )
+    for _ in range(max_new_tokens):
+        if next_id in stop_ids and len(out_ids) >= min_new_tokens:
+            break
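+        out_ids.append(next_id)
+        seen_ids.append(next_id)
+        if len(out_ids) >= max_new_tokens:
+            break  # Token budget reached: skip a forward pass whose output would be discarded.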
+        token = torch.tensor([[next_id]], device=dev, dtype=torch.long)
+        pos = torch.tensor([[cache_pos]], device=dev, dtype=torch.long)
+        logits = model(token, pos, caches=caches, cache_pos=cache_pos)[:, -1, :]
+        cache_pos += 1
+        next_id = _sample_next(
+            logits,
+            temperature,
+            top_p,
+            seen_ids,
+            repetition_penalty,
+            no_repeat_ngram_size,
+            blocked_ids=stop_ids if len(out_ids) < min_new_tokens else None,
+        )
+    return tokenizer.decode(out_ids, skip_special_tokens=True).strip()
+@torch.inference_mode()
+def generate_text(
+    repo_dir: str | Path,
+    prompt: str,
+    *,
+    max_new_tokens: int = 64,
+    min_new_tokens: int = 0,
+    max_seq_len: int = 512,
+    temperature: float = 0.0,
+    top_p: float = 0.9,
+    repetition_penalty: float = 1.18,
+    no_repeat_ngram_size: int = 4,
+    condition: str = "direct",
+    condition_token: str | None = None,
+    device: str | None = None,
+) -> str:
+    """Load the model from `repo_dir` and generate a single completion.
+    The model is reloaded on every call. For several prompts, call
+    `load_kohrm()` once and pass the result to `generate_from_loaded()`.
+    """
+    model, tokenizer, cfg = load_kohrm(repo_dir, device=device, max_gpu_memory_gib=14.0)
+    return generate_from_loaded(
+        model,
+        tokenizer,
+        cfg,
+        prompt,
+        max_new_tokens=max_new_tokens,
+        min_new_tokens=min_new_tokens,
+        max_seq_len=max_seq_len,
+        temperature=temperature,
+        top_p=top_p,
+        repetition_penalty=repetition_penalty,
+        no_repeat_ngram_size=no_repeat_ngram_size,
+        condition=condition,
+        condition_token=condition_token,
+    )
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run a KoHRM-Text long generation probe without transformers.")
+    parser.add_argument("repo_dir", type=Path, help="Directory containing config.json, tokenizer.json, and model.safetensors")
+    parser.add_argument(
+        "--prompt",
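+        # Korean default prompt. In English: "The following is an excerpt from the original text of a
+        # Korean Wikipedia article. Learn the encyclopedic Korean, proper nouns, dates, and
+        # technical/social/cultural knowledge as-is. [Document title] Hunminjeongeum [Part] 1/1"
+        default=(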
+            "다음은 한국어 위키백과 문서 원문 일부입니다. 백과사전식 한국어, "
+            "고유명사, 날짜, 기술/사회/문화 지식을 그대로 학습하십시오.\n\n"
+            "[문서명]\n훈민정음\n\n[부분]\n1/1"
+        ),
+    )
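+    parser.add_argument("--max-new-tokens", type=int, default=384, help="Maximum number of tokens to generate.")
+    parser.add_argument("--min-new-tokens", type=int, default=160, help="Stop tokens are blocked until this many tokens have been generated.")
+    parser.add_argument("--max-seq-len", type=int, default=1536, help="Cache length; must be greater than prompt tokens plus --max-new-tokens.")
+    parser.add_argument("--temperature", type=float, default=0.65, help="Sampling temperature; 0 or below selects greedy decoding.")
+    parser.add_argument("--top-p", type=float, default=0.92, help="Nucleus sampling threshold; 1.0 disables it.")
+    parser.add_argument("--repetition-penalty", type=float, default=1.05, help="Values above 1.0 penalize already-seen tokens; 1.0 disables the penalty.")
+    parser.add_argument("--no-repeat-ngram-size", type=int, default=0, help="Block n-grams of this size from repeating; 0 disables the check.")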
+    parser.add_argument(
+        "--condition",
+        default="direct",
+        help="Comma-separated HRM-Text condition names: direct, cot, noisy, synth. Use direct for answer-only outputs.",
+    )
+    parser.add_argument(
+        "--condition-token",
+        default=None,
+        help="Optional raw condition token override. Normally use --condition direct instead.",
+    )
+    parser.add_argument("--device", default=None, help="Device string passed through to load_kohrm().")
+    args = parser.parse_args()
+    print(generate_text(
+        args.repo_dir,
+        args.prompt,
+        max_new_tokens=args.max_new_tokens,
+        min_new_tokens=args.min_new_tokens,
+        max_seq_len=args.max_seq_len,
+        temperature=args.temperature,
+        top_p=args.top_p,
+        repetition_penalty=args.repetition_penalty,
+        no_repeat_ngram_size=args.no_repeat_ngram_size,
+        condition=args.condition,
+        condition_token=args.condition_token,
+        device=args.device,
+    ))
+if __name__ == "__main__":
+    main()
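The no-repeat n-gram rule used during sampling can be illustrated without PyTorch. The sketch below is a standalone approximation, not the runtime's own code: `blocked_next_tokens` is a hypothetical helper name, and it returns the blocked token ids as a set instead of writing `-inf` into a logits tensor. It assumes that `ngram_size == 1` should block every token already seen.

```python
def blocked_next_tokens(seen_ids: list[int], ngram_size: int) -> set[int]:
    """Return token ids that would complete an n-gram already present in seen_ids.

    Hypothetical helper for illustration only; not part of the runtime.
    """
    if ngram_size <= 0 or len(seen_ids) < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # Use an explicit start index: seen_ids[-0:] would return the whole list.
    prefix = seen_ids[len(seen_ids) - prefix_len:]
    blocked: set[int] = set()
    for start in range(len(seen_ids) - ngram_size + 1):
        # An earlier occurrence of the current prefix was followed by this token,
        # so emitting it again would repeat a full n-gram.
        if seen_ids[start:start + prefix_len] == prefix:
            blocked.add(seen_ids[start + prefix_len])
    return blocked


history = [5, 6, 7, 8, 5, 6, 7]
print(blocked_next_tokens(history, 4))  # {8}: emitting 8 would repeat the 4-gram 5 6 7 8
print(blocked_next_tokens(history, 2))  # {8}: the earlier 7 was followed by 8
print(blocked_next_tokens(history, 1))  # every token seen so far
```

Because the history passed to the sampler includes the prompt tokens, n-grams from the prompt are blocked as well. For tasks that legitimately quote the prompt, a size of 0 (disabled) may be the safer choice.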