Space: build-small-hackathon/First-Principle-AI

owenisas committed on Jun 6

Commit a38bb98 (verified) · 1 Parent(s): 0546c37

Run GGUF through official llama.cpp CLI

Files changed (4)

README.md +7 -7
__pycache__/app.cpython-314.pyc +0 -0
app.py +111 -89
requirements.txt +0 -3

README.md CHANGED

@@ -28,7 +28,7 @@ license: mit
 First-Principle AI is a compact Gradio console for running and probing the
 `build-small-hackathon/phase-3-gguf` Q8 GGUF model through
-`llama-cpp-python`.
 The UI includes benchmark-style examples inspired by common LLM evaluation
 areas: math reasoning, commonsense, science QA, truthfulness, instruction
@@ -39,18 +39,18 @@ questions are original prompts, not copied benchmark items.
 - Model repo: `build-small-hackathon/phase-3-gguf`
 - Model file: `model-Q8_0.gguf`
-- Runtime: `llama-cpp-python`
 - Hardware target: ZeroGPU
 - Fallback behavior: visible runtime diagnostics instead of silent mock output
-- Model loading: runtime download/load through `llama-cpp-python`
 - Default llama.cpp settings: `n_ctx=4096`, `n_batch=512`, `n_ubatch=128`,
   memory-mapped weights, and CPU fallback if CUDA offload is unavailable
 ZeroGPU is a Gradio dynamic GPU runtime primarily documented around PyTorch
-workloads. This app targets ZeroGPU as requested, but it also reports whether
-the GGUF can actually load through llama.cpp on the current runtime. If the
-runtime does not expose enough memory or a compatible llama.cpp backend, the
-app returns a visible compatibility message.
 The model is intentionally not preloaded during the Space build because the Q8
 GGUF is 33.6 GB and can make build startup unreliable. The app resolves the Hub

 First-Principle AI is a compact Gradio console for running and probing the
 `build-small-hackathon/phase-3-gguf` Q8 GGUF model through
+the official `llama.cpp` Ubuntu CLI release.
 The UI includes benchmark-style examples inspired by common LLM evaluation
 areas: math reasoning, commonsense, science QA, truthfulness, instruction
 - Model repo: `build-small-hackathon/phase-3-gguf`
 - Model file: `model-Q8_0.gguf`
+- Runtime: official `llama.cpp` `llama-cli`
 - Hardware target: ZeroGPU
 - Fallback behavior: visible runtime diagnostics instead of silent mock output
+- Model loading: runtime download/load through `llama-cli`
 - Default llama.cpp settings: `n_ctx=4096`, `n_batch=512`, `n_ubatch=128`,
   memory-mapped weights, and CPU fallback if CUDA offload is unavailable
 ZeroGPU is a Gradio dynamic GPU runtime primarily documented around PyTorch
+workloads. This app targets ZeroGPU as requested, but it runs the GGUF through
+the official llama.cpp CLI path so it does not depend on a Python extension
+compile during the Space build. If the runtime does not expose enough memory or
+a compatible llama.cpp binary, the app returns a visible compatibility message.
 The model is intentionally not preloaded during the Space build because the Q8
 GGUF is 33.6 GB and can make build startup unreliable. The app resolves the Hub
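
Reviewer note: the README promises a visible compatibility message when the runtime does not expose enough memory. The check itself is not part of the hunks shown here, so the following is only a minimal sketch of what such a preflight could look like, assuming Linux `/proc/meminfo`-style input. The names `parse_mem_available_gb` and `preflight_message` are hypothetical and are not the app's `_meminfo_gb`; the 38 GB default mirrors `PHASE3_MIN_RAM_GB` in `app.py`.

```python
import re
from typing import Optional

# Default mirrors PHASE3_MIN_RAM_GB in app.py; everything else is illustrative.
MIN_RAM_GB = 38.0


def parse_mem_available_gb(meminfo_text: str) -> Optional[float]:
    """Return MemAvailable in GiB from /proc/meminfo-style text, or None if absent."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, flags=re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) / (1024 ** 2)  # kB -> GiB


def preflight_message(meminfo_text: str, min_ram_gb: float = MIN_RAM_GB) -> Optional[str]:
    """Return a human-readable warning, or None when the memory check passes."""
    available = parse_mem_available_gb(meminfo_text)
    if available is None:
        return "Could not determine available RAM; the model load may fail."
    if available < min_ram_gb:
        return (
            f"Only {available:.1f} GB RAM available; "
            f"at least {min_ram_gb:.0f} GB is expected for the Q8 GGUF."
        )
    return None
```

On a Space, the input would come from `Path("/proc/meminfo").read_text()`. Returning a message instead of raising keeps the UI responsive, which matches the README's stated preference for visible diagnostics over silent failure.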

__pycache__/app.cpython-314.pyc CHANGED

Binary files a/__pycache__/app.cpython-314.pyc and b/__pycache__/app.cpython-314.pyc differ

app.py CHANGED

@@ -5,7 +5,9 @@ import platform
 import re
 import threading
 import time
-import inspect
 from pathlib import Path
 from typing import Any
@@ -17,19 +19,15 @@ try:
 except Exception:  # pragma: no cover - the package exists on HF ZeroGPU runtimes
     spaces = None  # type: ignore[assignment]
-try:
-    from llama_cpp import Llama
-except Exception as exc:  # pragma: no cover - resolved in the Space runtime
-    Llama = None  # type: ignore[assignment]
-    LLAMA_IMPORT_ERROR = exc
-else:
-    LLAMA_IMPORT_ERROR = None
 MODEL_REPO = os.getenv("PHASE3_MODEL_REPO", "build-small-hackathon/phase-3-gguf")
 MODEL_FILE = os.getenv("PHASE3_MODEL_FILE", "model-Q8_0.gguf")
 MODEL_LABEL = "First-Principle AI"
 LOCAL_MODEL_PATH = Path("/Users/user/.lmstudio/models/owenisas/Phase-3-GGUF/model-Q8_0.gguf")
 MAX_CONTEXT = int(os.getenv("PHASE3_MAX_CONTEXT", "4096"))
 MIN_RAM_GB = float(os.getenv("PHASE3_MIN_RAM_GB", "38"))
 DISABLE_MODEL = os.getenv("PHASE3_DISABLE_MODEL", "").lower() in {"1", "true", "yes"}
@@ -42,10 +40,11 @@ USE_MMAP = os.getenv("PHASE3_USE_MMAP", "1").lower() not in {"0", "false", "no"}
 USE_MLOCK = os.getenv("PHASE3_USE_MLOCK", "").lower() in {"1", "true", "yes"}
 FLASH_ATTN = os.getenv("PHASE3_FLASH_ATTN", "").lower() in {"1", "true", "yes"}
 OFFLOAD_KQV = os.getenv("PHASE3_OFFLOAD_KQV", "1").lower() not in {"0", "false", "no"}
 MODEL_LOCK = threading.Lock()
-MODEL: Any | None = None
 MODEL_PATH: Path | None = None
 MODEL_ERROR: str | None = None
 MODEL_SETTINGS: dict[str, Any] = {}
@@ -98,6 +97,7 @@ def _safe_env_summary() -> dict[str, str]:
         "CUDA_VISIBLE_DEVICES",
         "PHASE3_MODEL_REPO",
         "PHASE3_MODEL_FILE",
         "PHASE3_MAX_CONTEXT",
         "PHASE3_DISABLE_MODEL",
         "PHASE3_USE_ZEROGPU",
@@ -143,29 +143,6 @@ def _find_model_path() -> Path:
     return Path(downloaded)
-def _llama_init_kwargs(path: Path, n_gpu_layers: int) -> dict[str, Any]:
-    requested = {
-        "model_path": str(path),
-        "n_ctx": MAX_CONTEXT,
-        "n_batch": N_BATCH,
-        "n_ubatch": N_UBATCH,
-        "n_threads": N_THREADS,
-        "n_threads_batch": N_THREADS_BATCH,
-        "n_gpu_layers": n_gpu_layers,
-        "use_mmap": USE_MMAP,
-        "use_mlock": USE_MLOCK,
-        "flash_attn": FLASH_ATTN,
-        "offload_kqv": OFFLOAD_KQV,
-        "logits_all": False,
-        "verbose": False,
-    }
-    try:
-        allowed = set(inspect.signature(Llama).parameters)
-    except Exception:
-        return requested
-    return {key: value for key, value in requested.items() if key in allowed}
 def _gpu_layers() -> int:
     if "PHASE3_N_GPU_LAYERS" in os.environ:
         return int(os.environ["PHASE3_N_GPU_LAYERS"])
@@ -174,20 +151,39 @@ def _gpu_layers() -> int:
     return 0
-def _load_model() -> Any:
-    global MODEL, MODEL_PATH, MODEL_ERROR, MODEL_SETTINGS
-    if MODEL is not None:
-        return MODEL
     if MODEL_ERROR is not None:
         raise RuntimeError(MODEL_ERROR)
-    if Llama is None:
-        MODEL_ERROR = f"llama-cpp-python is not importable: {LLAMA_IMPORT_ERROR}"
-        raise RuntimeError(MODEL_ERROR)
     with MODEL_LOCK:
-        if MODEL is not None:
-            return MODEL
         if MODEL_ERROR is not None:
             raise RuntimeError(MODEL_ERROR)
@@ -200,39 +196,24 @@ def _load_model() -> Any:
             raise RuntimeError(MODEL_ERROR)
         path = _find_model_path()
         MODEL_PATH = path
         n_gpu_layers = _gpu_layers()
-        load_kwargs = _llama_init_kwargs(path, n_gpu_layers)
-        try:
-            MODEL = Llama(**load_kwargs)
-        except Exception as exc:
-            if n_gpu_layers != 0:
-                fallback_kwargs = _llama_init_kwargs(path, 0)
-                try:
-                    MODEL = Llama(**fallback_kwargs)
-                    load_kwargs = fallback_kwargs
-                except Exception as fallback_exc:
-                    MODEL_ERROR = f"Model load failed with GPU offload and CPU fallback: {fallback_exc}"
-                    raise RuntimeError(MODEL_ERROR) from fallback_exc
-            else:
-                MODEL_ERROR = f"Model load failed: {exc}"
-                raise RuntimeError(MODEL_ERROR) from exc
         MODEL_SETTINGS = {
             "path": str(path),
-            "n_ctx": load_kwargs.get("n_ctx"),
-            "n_batch": load_kwargs.get("n_batch"),
-            "n_ubatch": load_kwargs.get("n_ubatch"),
-            "n_threads": load_kwargs.get("n_threads"),
-            "n_threads_batch": load_kwargs.get("n_threads_batch"),
-            "n_gpu_layers": load_kwargs.get("n_gpu_layers"),
-            "use_mmap": load_kwargs.get("use_mmap"),
-            "use_mlock": load_kwargs.get("use_mlock"),
-            "flash_attn": load_kwargs.get("flash_attn"),
-            "offload_kqv": load_kwargs.get("offload_kqv"),
         }
-        return MODEL
 def _format_prompt(system_prompt: str, history: list[dict[str, str]], message: str) -> str:
@@ -256,26 +237,66 @@ def _complete(
     top_p: float,
     repeat_penalty: float,
 ) -> tuple[str, dict[str, Any]]:
-    model = _load_model()
     started = time.time()
-    output = model(
         prompt,
-        max_tokens=int(max_tokens),
-        temperature=float(temperature),
-        top_p=float(top_p),
-        repeat_penalty=float(repeat_penalty),
-        stop=["<|im_end|>", "<|endoftext|>"],
-        echo=False,
     )
     elapsed = max(time.time() - started, 0.001)
-    text = output["choices"][0]["text"].strip()
-    usage = output.get("usage") or {}
-    completion_tokens = usage.get("completion_tokens") or max(1, len(text.split()))
     return text, {
         "elapsed": elapsed,
         "completion_tokens": completion_tokens,
         "tokens_per_second": completion_tokens / elapsed,
-        "usage": usage,
     }
@@ -283,11 +304,11 @@ def _status_markdown() -> str:
     total_gb, available_gb = _meminfo_gb()
     size = _repo_file_size()
     size_text = f"{size / (1024 ** 3):.1f} GB" if size else "unknown"
-    llama_state = "importable" if Llama is not None else f"missing ({LLAMA_IMPORT_ERROR})"
     spaces_state = "importable" if spaces is not None else "not importable"
-    model_state = "Loaded" if MODEL is not None else ("Error" if MODEL_ERROR else "Ready to load on first prompt")
     available_text = f"{available_gb:.1f} GB" if available_gb is not None else "unknown"
     path_text = f"`{MODEL_PATH}`" if MODEL_PATH else "not resolved yet"
     settings = MODEL_SETTINGS or {
         "n_ctx": MAX_CONTEXT,
         "n_batch": N_BATCH,
@@ -310,14 +331,15 @@ def _status_markdown() -> str:
 | --- | --- |
 | Model | `{MODEL_REPO}` |
 | File | `{MODEL_FILE}` ({size_text}) |
-| Runtime | `llama.cpp` {llama_state}; ZeroGPU helper {spaces_state} |
 | Available RAM | {available_text} |
 | CUDA devices | `{cuda_text}` |
 | Model path | {path_text} |
 | llama.cpp settings | `ctx={settings.get('n_ctx')}`, `batch={settings.get('n_batch')}`, `ubatch={settings.get('n_ubatch')}`, `threads={settings.get('n_threads')}`, `gpu_layers={settings.get('n_gpu_layers')}` |
 | Memory/options | `mmap={settings.get('use_mmap')}`, `mlock={settings.get('use_mlock')}`, `flash_attn={settings.get('flash_attn')}`, `offload_kqv={settings.get('offload_kqv')}` |
-The first prompt downloads and loads the 31 GB Q8 GGUF if it is not already cached. That first run can take several minutes; later runs reuse the in-process llama.cpp model.
 """
@@ -374,7 +396,7 @@ def respond(
             "Model load or inference failed.\n\n"
             f"{exc}\n\n"
             "The UI is live and the model artifact is published, but the runtime could not complete "
-            "a llama.cpp load/generation pass. Check the runtime status and Space logs before retrying."
         )
         meta = {"elapsed": 0.0, "completion_tokens": len(text.split()), "tokens_per_second": 0.0}
@@ -526,7 +548,7 @@ with gr.Blocks(title="First-Principle AI", fill_width=True) as demo:
               <p>A clean model-console interface for probing the Phase-3 Q8 GGUF with transparent runtime status.</p>
               <div class="phase-badge-row">
                 <span class="phase-badge"><strong>Model</strong> build-small-hackathon/phase-3-gguf</span>
-                <span class="phase-badge"><strong>Runtime</strong> llama.cpp via llama-cpp-python</span>
                 <span class="phase-badge"><strong>Mode</strong> real GGUF inference</span>
               </div>
             </div>

 import re
 import threading
 import time
+import subprocess
+import tarfile
+import urllib.request
 from pathlib import Path
 from typing import Any
 except Exception:  # pragma: no cover - the package exists on HF ZeroGPU runtimes
     spaces = None  # type: ignore[assignment]
 MODEL_REPO = os.getenv("PHASE3_MODEL_REPO", "build-small-hackathon/phase-3-gguf")
 MODEL_FILE = os.getenv("PHASE3_MODEL_FILE", "model-Q8_0.gguf")
 MODEL_LABEL = "First-Principle AI"
 LOCAL_MODEL_PATH = Path("/Users/user/.lmstudio/models/owenisas/Phase-3-GGUF/model-Q8_0.gguf")
+LLAMA_RELEASE = os.getenv("PHASE3_LLAMA_RELEASE", "b9360")
+LLAMA_URL = os.getenv(
+    "PHASE3_LLAMA_URL",
+    f"https://github.com/ggml-org/llama.cpp/releases/download/{LLAMA_RELEASE}/llama-{LLAMA_RELEASE}-bin-ubuntu-x64.tar.gz",
+)
 MAX_CONTEXT = int(os.getenv("PHASE3_MAX_CONTEXT", "4096"))
 MIN_RAM_GB = float(os.getenv("PHASE3_MIN_RAM_GB", "38"))
 DISABLE_MODEL = os.getenv("PHASE3_DISABLE_MODEL", "").lower() in {"1", "true", "yes"}
 USE_MLOCK = os.getenv("PHASE3_USE_MLOCK", "").lower() in {"1", "true", "yes"}
 FLASH_ATTN = os.getenv("PHASE3_FLASH_ATTN", "").lower() in {"1", "true", "yes"}
 OFFLOAD_KQV = os.getenv("PHASE3_OFFLOAD_KQV", "1").lower() not in {"0", "false", "no"}
+INFER_TIMEOUT = int(os.getenv("PHASE3_INFER_TIMEOUT", "900"))
 MODEL_LOCK = threading.Lock()
 MODEL_PATH: Path | None = None
+LLAMA_CLI_PATH: Path | None = None
 MODEL_ERROR: str | None = None
 MODEL_SETTINGS: dict[str, Any] = {}
         "CUDA_VISIBLE_DEVICES",
         "PHASE3_MODEL_REPO",
         "PHASE3_MODEL_FILE",
+        "PHASE3_LLAMA_RELEASE",
         "PHASE3_MAX_CONTEXT",
         "PHASE3_DISABLE_MODEL",
         "PHASE3_USE_ZEROGPU",
     return Path(downloaded)
 def _gpu_layers() -> int:
     if "PHASE3_N_GPU_LAYERS" in os.environ:
         return int(os.environ["PHASE3_N_GPU_LAYERS"])
     return 0
+def _ensure_llama_cli() -> Path:
+    global LLAMA_CLI_PATH
+    if LLAMA_CLI_PATH is not None and LLAMA_CLI_PATH.exists():
+        return LLAMA_CLI_PATH
+    root = Path(os.getenv("PHASE3_LLAMA_DIR", "/tmp/phase3-llama.cpp"))
+    release_dir = root / f"llama-{LLAMA_RELEASE}"
+    cli = release_dir / "llama-cli"
+    if cli.exists():
+        LLAMA_CLI_PATH = cli
+        return cli
+    root.mkdir(parents=True, exist_ok=True)
+    archive = root / f"llama-{LLAMA_RELEASE}-bin-ubuntu-x64.tar.gz"
+    if not archive.exists():
+        urllib.request.urlretrieve(LLAMA_URL, archive)
+    with tarfile.open(archive, "r:gz") as tar:
+        tar.extractall(root)
+    if not cli.exists():
+        raise RuntimeError(f"llama-cli was not found after extracting {LLAMA_URL}")
+    cli.chmod(0o755)
+    LLAMA_CLI_PATH = cli
+    return cli
+def _prepare_runtime() -> tuple[Path, Path]:
+    global MODEL_PATH, MODEL_ERROR, MODEL_SETTINGS
     if MODEL_ERROR is not None:
         raise RuntimeError(MODEL_ERROR)
     with MODEL_LOCK:
         if MODEL_ERROR is not None:
             raise RuntimeError(MODEL_ERROR)
             raise RuntimeError(MODEL_ERROR)
         path = _find_model_path()
+        cli = _ensure_llama_cli()
         MODEL_PATH = path
         n_gpu_layers = _gpu_layers()
         MODEL_SETTINGS = {
             "path": str(path),
+            "llama_cli": str(cli),
+            "n_ctx": MAX_CONTEXT,
+            "n_batch": N_BATCH,
+            "n_ubatch": N_UBATCH,
+            "n_threads": N_THREADS,
+            "n_threads_batch": N_THREADS_BATCH,
+            "n_gpu_layers": n_gpu_layers,
+            "use_mmap": USE_MMAP,
+            "use_mlock": USE_MLOCK,
+            "flash_attn": FLASH_ATTN,
+            "offload_kqv": OFFLOAD_KQV,
         }
+        return path, cli
 def _format_prompt(system_prompt: str, history: list[dict[str, str]], message: str) -> str:
     top_p: float,
     repeat_penalty: float,
 ) -> tuple[str, dict[str, Any]]:
+    model_path, llama_cli = _prepare_runtime()
     started = time.time()
+    cmd = [
+        str(llama_cli),
+        "-m",
+        str(model_path),
+        "-p",
         prompt,
+        "-n",
+        str(int(max_tokens)),
+        "-c",
+        str(MAX_CONTEXT),
+        "-t",
+        str(N_THREADS),
+        "-b",
+        str(N_BATCH),
+        "-ub",
+        str(N_UBATCH),
+        "--temp",
+        str(float(temperature)),
+        "--top-p",
+        str(float(top_p)),
+        "--repeat-penalty",
+        str(float(repeat_penalty)),
+        "--no-display-prompt",
+    ]
+    if _gpu_layers() != 0:
+        cmd.extend(["-ngl", str(_gpu_layers())])
+    if USE_MLOCK:
+        cmd.append("--mlock")
+    if not USE_MMAP:
+        cmd.append("--no-mmap")
+    if FLASH_ATTN:
+        cmd.append("-fa")
+    env = os.environ.copy()
+    binary_dir = str(llama_cli.parent)
+    env["LD_LIBRARY_PATH"] = f"{binary_dir}:{env.get('LD_LIBRARY_PATH', '')}"
+    proc = subprocess.run(
+        cmd,
+        cwd=binary_dir,
+        env=env,
+        text=True,
+        capture_output=True,
+        timeout=INFER_TIMEOUT,
     )
     elapsed = max(time.time() - started, 0.001)
+    if proc.returncode != 0:
+        stderr = proc.stderr.strip()
+        stdout = proc.stdout.strip()
+        detail = stderr or stdout or f"llama-cli exited with code {proc.returncode}"
+        raise RuntimeError(detail[-4000:])
+    text = proc.stdout.strip()
+    text = text.split("<|im_end|>", 1)[0].strip()
+    completion_tokens = max(1, len(text.split()))
     return text, {
         "elapsed": elapsed,
         "completion_tokens": completion_tokens,
         "tokens_per_second": completion_tokens / elapsed,
+        "usage": {},
     }
     total_gb, available_gb = _meminfo_gb()
     size = _repo_file_size()
     size_text = f"{size / (1024 ** 3):.1f} GB" if size else "unknown"
     spaces_state = "importable" if spaces is not None else "not importable"
+    model_state = "Ready" if MODEL_PATH is not None else ("Error" if MODEL_ERROR else "Ready to load on first prompt")
     available_text = f"{available_gb:.1f} GB" if available_gb is not None else "unknown"
     path_text = f"`{MODEL_PATH}`" if MODEL_PATH else "not resolved yet"
+    cli_text = f"`{LLAMA_CLI_PATH}`" if LLAMA_CLI_PATH else f"`{LLAMA_RELEASE}` not extracted yet"
     settings = MODEL_SETTINGS or {
         "n_ctx": MAX_CONTEXT,
         "n_batch": N_BATCH,
 | --- | --- |
 | Model | `{MODEL_REPO}` |
 | File | `{MODEL_FILE}` ({size_text}) |
+| Runtime | `llama.cpp` CLI `{LLAMA_RELEASE}`; ZeroGPU helper {spaces_state} |
 | Available RAM | {available_text} |
 | CUDA devices | `{cuda_text}` |
 | Model path | {path_text} |
+| llama-cli | {cli_text} |
 | llama.cpp settings | `ctx={settings.get('n_ctx')}`, `batch={settings.get('n_batch')}`, `ubatch={settings.get('n_ubatch')}`, `threads={settings.get('n_threads')}`, `gpu_layers={settings.get('n_gpu_layers')}` |
 | Memory/options | `mmap={settings.get('use_mmap')}`, `mlock={settings.get('use_mlock')}`, `flash_attn={settings.get('flash_attn')}`, `offload_kqv={settings.get('offload_kqv')}` |
+The first prompt downloads the 31 GB Q8 GGUF and the llama.cpp binary if they are not cached. Generation runs through `llama-cli`.
 """
             "Model load or inference failed.\n\n"
             f"{exc}\n\n"
             "The UI is live and the model artifact is published, but the runtime could not complete "
+            "a llama.cpp CLI generation pass. Check the runtime status and Space logs before retrying."
         )
         meta = {"elapsed": 0.0, "completion_tokens": len(text.split()), "tokens_per_second": 0.0}
               <p>A clean model-console interface for probing the Phase-3 Q8 GGUF with transparent runtime status.</p>
               <div class="phase-badge-row">
                 <span class="phase-badge"><strong>Model</strong> build-small-hackathon/phase-3-gguf</span>
+                <span class="phase-badge"><strong>Runtime</strong> llama.cpp CLI</span>
                 <span class="phase-badge"><strong>Mode</strong> real GGUF inference</span>
               </div>
             </div>
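
Reviewer note: the new `_complete` does two separable things, building an argv list for `llama-cli` and post-processing its stdout. The sketch below isolates both as pure functions so they can be tested without the binary or the model. It is an illustration under assumptions, not the committed code: `build_llama_cli_cmd` and `trim_at_stop` are hypothetical names, the `temperature` default is arbitrary, and only flags that already appear in this commit are used.

```python
from typing import Sequence

# Stop markers that the pre-commit llama-cpp-python call passed as `stop=`.
STOP_MARKERS = ("<|im_end|>", "<|endoftext|>")


def build_llama_cli_cmd(
    cli: str,
    model: str,
    prompt: str,
    *,
    max_tokens: int,
    n_ctx: int = 4096,
    temperature: float = 0.7,  # arbitrary default chosen for this sketch
    n_gpu_layers: int = 0,
    use_mmap: bool = True,
) -> list:
    """Assemble an argv list using only flags that appear in this commit."""
    cmd = [
        cli, "-m", model, "-p", prompt,
        "-n", str(int(max_tokens)),
        "-c", str(int(n_ctx)),
        "--temp", str(float(temperature)),
        "--no-display-prompt",
    ]
    if n_gpu_layers != 0:
        cmd += ["-ngl", str(n_gpu_layers)]
    if not use_mmap:
        cmd.append("--no-mmap")
    return cmd


def trim_at_stop(text: str, markers: Sequence[str] = STOP_MARKERS) -> str:
    """Cut the text at the earliest stop marker that occurs, if any."""
    cut = len(text)
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()
```

Passing the result to `subprocess.run` as a list, as the commit does, means the prompt is never interpreted by a shell. The committed `_complete` splits only on `<|im_end|>`, whereas the removed `llama-cpp-python` call also stopped on `<|endoftext|>`; the helper above handles both. Note also that `subprocess.run` raises `subprocess.TimeoutExpired` once `timeout` elapses, so callers relying on `INFER_TIMEOUT` need to catch it somewhere.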

requirements.txt CHANGED

@@ -1,6 +1,3 @@
---no-binary=llama-cpp-python
 gradio==6.14.0
 huggingface-hub==1.17.0
 spaces==0.50.4
-llama-cpp-python==0.3.24

 gradio==6.14.0
 huggingface-hub==1.17.0
 spaces==0.50.4