owenisas committed · Commit a38bb98 · verified · 1 Parent(s): 0546c37

Run GGUF through official llama.cpp CLI

Files changed (4)
  1. README.md +7 -7
  2. __pycache__/app.cpython-314.pyc +0 -0
  3. app.py +111 -89
  4. requirements.txt +0 -3
README.md CHANGED
@@ -28,7 +28,7 @@ license: mit
  
  First-Principle AI is a compact Gradio console for running and probing the
  `build-small-hackathon/phase-3-gguf` Q8 GGUF model through
- `llama-cpp-python`.
+ the official `llama.cpp` Ubuntu CLI release.
  
  The UI includes benchmark-style examples inspired by common LLM evaluation
  areas: math reasoning, commonsense, science QA, truthfulness, instruction
@@ -39,18 +39,18 @@ questions are original prompts, not copied benchmark items.
  
  - Model repo: `build-small-hackathon/phase-3-gguf`
  - Model file: `model-Q8_0.gguf`
- - Runtime: `llama-cpp-python`
+ - Runtime: official `llama.cpp` `llama-cli`
  - Hardware target: ZeroGPU
  - Fallback behavior: visible runtime diagnostics instead of silent mock output
- - Model loading: runtime download/load through `llama-cpp-python`
+ - Model loading: runtime download/load through `llama-cli`
  - Default llama.cpp settings: `n_ctx=4096`, `n_batch=512`, `n_ubatch=128`,
  memory-mapped weights, and CPU fallback if CUDA offload is unavailable
  
  ZeroGPU is a Gradio dynamic GPU runtime primarily documented around PyTorch
- workloads. This app targets ZeroGPU as requested, but it also reports whether
- the GGUF can actually load through llama.cpp on the current runtime. If the
- runtime does not expose enough memory or a compatible llama.cpp backend, the
- app returns a visible compatibility message.
+ workloads. This app targets ZeroGPU as requested, but it runs the GGUF through
+ the official llama.cpp CLI path so it does not depend on a Python extension
+ compile during the Space build. If the runtime does not expose enough memory or
+ a compatible llama.cpp binary, the app returns a visible compatibility message.
  
  The model is intentionally not preloaded during the Space build because the Q8
  GGUF is 33.6 GB and can make build startup unreliable. The app resolves the Hub
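
The default settings listed in the README map onto `llama-cli` flags. The sketch below shows one way to build that argument list as a pure, testable function. It uses only flags that appear in this commit's `app.py` changes (`-m`, `-p`, `-n`, `-c`, `-t`, `-b`, `-ub`, `-ngl`, `--mlock`, `--no-mmap`). The `CliSettings` class, the `build_llama_cli_args` name, and the thread count default are assumptions made for illustration, not part of the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CliSettings:
    """Hypothetical container for the README's default settings."""
    n_ctx: int = 4096
    n_batch: int = 512
    n_ubatch: int = 128
    n_threads: int = 4          # assumed value; the README does not state one
    n_gpu_layers: int = 0       # 0 means CPU only
    use_mmap: bool = True
    use_mlock: bool = False


def build_llama_cli_args(
    cli_path: str,
    model_path: str,
    prompt: str,
    max_tokens: int,
    settings: CliSettings,
) -> list[str]:
    """Return an argv list; every element is a string, so no shell quoting is needed."""
    args = [
        cli_path,
        "-m", model_path,
        "-p", prompt,
        "-n", str(int(max_tokens)),
        "-c", str(settings.n_ctx),
        "-t", str(settings.n_threads),
        "-b", str(settings.n_batch),
        "-ub", str(settings.n_ubatch),
    ]
    # Optional flags are appended only when they differ from the CPU/mmap defaults.
    if settings.n_gpu_layers != 0:
        args += ["-ngl", str(settings.n_gpu_layers)]
    if settings.use_mlock:
        args.append("--mlock")
    if not settings.use_mmap:
        args.append("--no-mmap")
    return args
```

Keeping the command construction separate from `subprocess.run` makes it possible to check the flag mapping without downloading the model or the binary.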
__pycache__/app.cpython-314.pyc CHANGED
Binary files a/__pycache__/app.cpython-314.pyc and b/__pycache__/app.cpython-314.pyc differ
 
app.py CHANGED
@@ -5,7 +5,9 @@ import platform
  import re
  import threading
  import time
- import inspect
+ import subprocess
+ import tarfile
+ import urllib.request
  from pathlib import Path
  from typing import Any
  
@@ -17,19 +19,15 @@ try:
  except Exception: # pragma: no cover - the package exists on HF ZeroGPU runtimes
      spaces = None # type: ignore[assignment]
  
- try:
-     from llama_cpp import Llama
- except Exception as exc: # pragma: no cover - resolved in the Space runtime
-     Llama = None # type: ignore[assignment]
-     LLAMA_IMPORT_ERROR = exc
- else:
-     LLAMA_IMPORT_ERROR = None
-
-
  MODEL_REPO = os.getenv("PHASE3_MODEL_REPO", "build-small-hackathon/phase-3-gguf")
  MODEL_FILE = os.getenv("PHASE3_MODEL_FILE", "model-Q8_0.gguf")
  MODEL_LABEL = "First-Principle AI"
  LOCAL_MODEL_PATH = Path("/Users/user/.lmstudio/models/owenisas/Phase-3-GGUF/model-Q8_0.gguf")
+ LLAMA_RELEASE = os.getenv("PHASE3_LLAMA_RELEASE", "b9360")
+ LLAMA_URL = os.getenv(
+     "PHASE3_LLAMA_URL",
+     f"https://github.com/ggml-org/llama.cpp/releases/download/{LLAMA_RELEASE}/llama-{LLAMA_RELEASE}-bin-ubuntu-x64.tar.gz",
+ )
  MAX_CONTEXT = int(os.getenv("PHASE3_MAX_CONTEXT", "4096"))
  MIN_RAM_GB = float(os.getenv("PHASE3_MIN_RAM_GB", "38"))
  DISABLE_MODEL = os.getenv("PHASE3_DISABLE_MODEL", "").lower() in {"1", "true", "yes"}
@@ -42,10 +40,11 @@ USE_MMAP = os.getenv("PHASE3_USE_MMAP", "1").lower() not in {"0", "false", "no"}
  USE_MLOCK = os.getenv("PHASE3_USE_MLOCK", "").lower() in {"1", "true", "yes"}
  FLASH_ATTN = os.getenv("PHASE3_FLASH_ATTN", "").lower() in {"1", "true", "yes"}
  OFFLOAD_KQV = os.getenv("PHASE3_OFFLOAD_KQV", "1").lower() not in {"0", "false", "no"}
+ INFER_TIMEOUT = int(os.getenv("PHASE3_INFER_TIMEOUT", "900"))
  
  MODEL_LOCK = threading.Lock()
- MODEL: Any | None = None
  MODEL_PATH: Path | None = None
+ LLAMA_CLI_PATH: Path | None = None
  MODEL_ERROR: str | None = None
  MODEL_SETTINGS: dict[str, Any] = {}
  
@@ -98,6 +97,7 @@ def _safe_env_summary() -> dict[str, str]:
          "CUDA_VISIBLE_DEVICES",
          "PHASE3_MODEL_REPO",
          "PHASE3_MODEL_FILE",
+         "PHASE3_LLAMA_RELEASE",
          "PHASE3_MAX_CONTEXT",
          "PHASE3_DISABLE_MODEL",
          "PHASE3_USE_ZEROGPU",
@@ -143,29 +143,6 @@ def _find_model_path() -> Path:
      return Path(downloaded)
  
  
- def _llama_init_kwargs(path: Path, n_gpu_layers: int) -> dict[str, Any]:
-     requested = {
-         "model_path": str(path),
-         "n_ctx": MAX_CONTEXT,
-         "n_batch": N_BATCH,
-         "n_ubatch": N_UBATCH,
-         "n_threads": N_THREADS,
-         "n_threads_batch": N_THREADS_BATCH,
-         "n_gpu_layers": n_gpu_layers,
-         "use_mmap": USE_MMAP,
-         "use_mlock": USE_MLOCK,
-         "flash_attn": FLASH_ATTN,
-         "offload_kqv": OFFLOAD_KQV,
-         "logits_all": False,
-         "verbose": False,
-     }
-     try:
-         allowed = set(inspect.signature(Llama).parameters)
-     except Exception:
-         return requested
-     return {key: value for key, value in requested.items() if key in allowed}
-
-
  def _gpu_layers() -> int:
      if "PHASE3_N_GPU_LAYERS" in os.environ:
          return int(os.environ["PHASE3_N_GPU_LAYERS"])
@@ -174,20 +151,39 @@ def _gpu_layers() -> int:
      return 0
  
  
- def _load_model() -> Any:
-     global MODEL, MODEL_PATH, MODEL_ERROR, MODEL_SETTINGS
+ def _ensure_llama_cli() -> Path:
+     global LLAMA_CLI_PATH
+
+     if LLAMA_CLI_PATH is not None and LLAMA_CLI_PATH.exists():
+         return LLAMA_CLI_PATH
+
+     root = Path(os.getenv("PHASE3_LLAMA_DIR", "/tmp/phase3-llama.cpp"))
+     release_dir = root / f"llama-{LLAMA_RELEASE}"
+     cli = release_dir / "llama-cli"
+     if cli.exists():
+         LLAMA_CLI_PATH = cli
+         return cli
+
+     root.mkdir(parents=True, exist_ok=True)
+     archive = root / f"llama-{LLAMA_RELEASE}-bin-ubuntu-x64.tar.gz"
+     if not archive.exists():
+         urllib.request.urlretrieve(LLAMA_URL, archive)
+     with tarfile.open(archive, "r:gz") as tar:
+         tar.extractall(root)
+     if not cli.exists():
+         raise RuntimeError(f"llama-cli was not found after extracting {LLAMA_URL}")
+     cli.chmod(0o755)
+     LLAMA_CLI_PATH = cli
+     return cli
+
+
+ def _prepare_runtime() -> tuple[Path, Path]:
+     global MODEL_PATH, MODEL_ERROR, MODEL_SETTINGS
  
-     if MODEL is not None:
-         return MODEL
      if MODEL_ERROR is not None:
          raise RuntimeError(MODEL_ERROR)
-     if Llama is None:
-         MODEL_ERROR = f"llama-cpp-python is not importable: {LLAMA_IMPORT_ERROR}"
-         raise RuntimeError(MODEL_ERROR)
  
      with MODEL_LOCK:
-         if MODEL is not None:
-             return MODEL
          if MODEL_ERROR is not None:
              raise RuntimeError(MODEL_ERROR)
  
@@ -200,39 +196,24 @@ def _load_model() -> Any:
              raise RuntimeError(MODEL_ERROR)
  
          path = _find_model_path()
+         cli = _ensure_llama_cli()
          MODEL_PATH = path
          n_gpu_layers = _gpu_layers()
-         load_kwargs = _llama_init_kwargs(path, n_gpu_layers)
-
-         try:
-             MODEL = Llama(**load_kwargs)
-         except Exception as exc:
-             if n_gpu_layers != 0:
-                 fallback_kwargs = _llama_init_kwargs(path, 0)
-                 try:
-                     MODEL = Llama(**fallback_kwargs)
-                     load_kwargs = fallback_kwargs
-                 except Exception as fallback_exc:
-                     MODEL_ERROR = f"Model load failed with GPU offload and CPU fallback: {fallback_exc}"
-                     raise RuntimeError(MODEL_ERROR) from fallback_exc
-             else:
-                 MODEL_ERROR = f"Model load failed: {exc}"
-                 raise RuntimeError(MODEL_ERROR) from exc
-
          MODEL_SETTINGS = {
              "path": str(path),
-             "n_ctx": load_kwargs.get("n_ctx"),
-             "n_batch": load_kwargs.get("n_batch"),
-             "n_ubatch": load_kwargs.get("n_ubatch"),
-             "n_threads": load_kwargs.get("n_threads"),
-             "n_threads_batch": load_kwargs.get("n_threads_batch"),
-             "n_gpu_layers": load_kwargs.get("n_gpu_layers"),
-             "use_mmap": load_kwargs.get("use_mmap"),
-             "use_mlock": load_kwargs.get("use_mlock"),
-             "flash_attn": load_kwargs.get("flash_attn"),
-             "offload_kqv": load_kwargs.get("offload_kqv"),
+             "llama_cli": str(cli),
+             "n_ctx": MAX_CONTEXT,
+             "n_batch": N_BATCH,
+             "n_ubatch": N_UBATCH,
+             "n_threads": N_THREADS,
+             "n_threads_batch": N_THREADS_BATCH,
+             "n_gpu_layers": n_gpu_layers,
+             "use_mmap": USE_MMAP,
+             "use_mlock": USE_MLOCK,
+             "flash_attn": FLASH_ATTN,
+             "offload_kqv": OFFLOAD_KQV,
          }
-         return MODEL
+         return path, cli
  
  
  def _format_prompt(system_prompt: str, history: list[dict[str, str]], message: str) -> str:
@@ -256,26 +237,66 @@ def _complete(
      top_p: float,
      repeat_penalty: float,
  ) -> tuple[str, dict[str, Any]]:
-     model = _load_model()
+     model_path, llama_cli = _prepare_runtime()
      started = time.time()
-     output = model(
+     cmd = [
+         str(llama_cli),
+         "-m",
+         str(model_path),
+         "-p",
          prompt,
-         max_tokens=int(max_tokens),
-         temperature=float(temperature),
-         top_p=float(top_p),
-         repeat_penalty=float(repeat_penalty),
-         stop=["<|im_end|>", "<|endoftext|>"],
-         echo=False,
+         "-n",
+         str(int(max_tokens)),
+         "-c",
+         str(MAX_CONTEXT),
+         "-t",
+         str(N_THREADS),
+         "-b",
+         str(N_BATCH),
+         "-ub",
+         str(N_UBATCH),
+         "--temp",
+         str(float(temperature)),
+         "--top-p",
+         str(float(top_p)),
+         "--repeat-penalty",
+         str(float(repeat_penalty)),
+         "--no-display-prompt",
+     ]
+     if _gpu_layers() != 0:
+         cmd.extend(["-ngl", str(_gpu_layers())])
+     if USE_MLOCK:
+         cmd.append("--mlock")
+     if not USE_MMAP:
+         cmd.append("--no-mmap")
+     if FLASH_ATTN:
+         cmd.append("-fa")
+
+     env = os.environ.copy()
+     binary_dir = str(llama_cli.parent)
+     env["LD_LIBRARY_PATH"] = f"{binary_dir}:{env.get('LD_LIBRARY_PATH', '')}"
+     proc = subprocess.run(
+         cmd,
+         cwd=binary_dir,
+         env=env,
+         text=True,
+         capture_output=True,
+         timeout=INFER_TIMEOUT,
      )
      elapsed = max(time.time() - started, 0.001)
-     text = output["choices"][0]["text"].strip()
-     usage = output.get("usage") or {}
-     completion_tokens = usage.get("completion_tokens") or max(1, len(text.split()))
+     if proc.returncode != 0:
+         stderr = proc.stderr.strip()
+         stdout = proc.stdout.strip()
+         detail = stderr or stdout or f"llama-cli exited with code {proc.returncode}"
+         raise RuntimeError(detail[-4000:])
+     text = proc.stdout.strip()
+     text = text.split("<|im_end|>", 1)[0].strip()
+     completion_tokens = max(1, len(text.split()))
      return text, {
          "elapsed": elapsed,
          "completion_tokens": completion_tokens,
          "tokens_per_second": completion_tokens / elapsed,
-         "usage": usage,
+         "usage": {},
      }
  
  
@@ -283,11 +304,11 @@ def _status_markdown() -> str:
      total_gb, available_gb = _meminfo_gb()
      size = _repo_file_size()
      size_text = f"{size / (1024 ** 3):.1f} GB" if size else "unknown"
-     llama_state = "importable" if Llama is not None else f"missing ({LLAMA_IMPORT_ERROR})"
      spaces_state = "importable" if spaces is not None else "not importable"
-     model_state = "Loaded" if MODEL is not None else ("Error" if MODEL_ERROR else "Ready to load on first prompt")
+     model_state = "Ready" if MODEL_PATH is not None else ("Error" if MODEL_ERROR else "Ready to load on first prompt")
      available_text = f"{available_gb:.1f} GB" if available_gb is not None else "unknown"
      path_text = f"`{MODEL_PATH}`" if MODEL_PATH else "not resolved yet"
+     cli_text = f"`{LLAMA_CLI_PATH}`" if LLAMA_CLI_PATH else f"`{LLAMA_RELEASE}` not extracted yet"
      settings = MODEL_SETTINGS or {
          "n_ctx": MAX_CONTEXT,
          "n_batch": N_BATCH,
@@ -310,14 +331,15 @@ def _status_markdown() -> str:
  | --- | --- |
  | Model | `{MODEL_REPO}` |
  | File | `{MODEL_FILE}` ({size_text}) |
- | Runtime | `llama.cpp` {llama_state}; ZeroGPU helper {spaces_state} |
+ | Runtime | `llama.cpp` CLI `{LLAMA_RELEASE}`; ZeroGPU helper {spaces_state} |
  | Available RAM | {available_text} |
  | CUDA devices | `{cuda_text}` |
  | Model path | {path_text} |
+ | llama-cli | {cli_text} |
  | llama.cpp settings | `ctx={settings.get('n_ctx')}`, `batch={settings.get('n_batch')}`, `ubatch={settings.get('n_ubatch')}`, `threads={settings.get('n_threads')}`, `gpu_layers={settings.get('n_gpu_layers')}` |
  | Memory/options | `mmap={settings.get('use_mmap')}`, `mlock={settings.get('use_mlock')}`, `flash_attn={settings.get('flash_attn')}`, `offload_kqv={settings.get('offload_kqv')}` |
  
- The first prompt downloads and loads the 31 GB Q8 GGUF if it is not already cached. That first run can take several minutes; later runs reuse the in-process llama.cpp model.
+ The first prompt downloads the 31 GB Q8 GGUF and the llama.cpp binary if they are not cached. Generation runs through `llama-cli`.
  """
  
  
@@ -374,7 +396,7 @@ def respond(
              "Model load or inference failed.\n\n"
              f"{exc}\n\n"
              "The UI is live and the model artifact is published, but the runtime could not complete "
-             "a llama.cpp load/generation pass. Check the runtime status and Space logs before retrying."
+             "a llama.cpp CLI generation pass. Check the runtime status and Space logs before retrying."
          )
          meta = {"elapsed": 0.0, "completion_tokens": len(text.split()), "tokens_per_second": 0.0}
  
@@ -526,7 +548,7 @@ with gr.Blocks(title="First-Principle AI", fill_width=True) as demo:
  <p>A clean model-console interface for probing the Phase-3 Q8 GGUF with transparent runtime status.</p>
  <div class="phase-badge-row">
  <span class="phase-badge"><strong>Model</strong> build-small-hackathon/phase-3-gguf</span>
- <span class="phase-badge"><strong>Runtime</strong> llama.cpp via llama-cpp-python</span>
+ <span class="phase-badge"><strong>Runtime</strong> llama.cpp CLI</span>
  <span class="phase-badge"><strong>Mode</strong> real GGUF inference</span>
  </div>
  </div>
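
Two details of the new CLI path may be worth hardening. First, `_ensure_llama_cli` expects the binary at `root / f"llama-{LLAMA_RELEASE}" / "llama-cli"`; whether the release archive really unpacks to that layout is not shown in this diff. Second, `subprocess.run(..., timeout=INFER_TIMEOUT)` raises `subprocess.TimeoutExpired` when the limit is hit, which is a different exception type from the `RuntimeError` used for the other failure paths. The sketch below is one possible way to handle both, under those assumptions; `find_binary` and `run_with_timeout` are hypothetical helper names, not functions from the commit.

```python
import subprocess
import sys
from pathlib import Path


def find_binary(root: Path, name: str = "llama-cli") -> Path:
    """Return the first regular file called `name` anywhere under `root`.

    Searching the tree avoids hard-coding the archive's directory layout.
    """
    for candidate in sorted(root.rglob(name)):
        if candidate.is_file():
            return candidate
    raise RuntimeError(f"{name} was not found under {root}")


def run_with_timeout(cmd: list[str], timeout_s: float) -> str:
    """Run `cmd` and return its stripped stdout.

    Non-zero exits and timeouts are both reported as RuntimeError so a caller
    only needs one error path.
    """
    try:
        proc = subprocess.run(cmd, text=True, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"command timed out after {timeout_s} s") from exc
    if proc.returncode != 0:
        detail = proc.stderr.strip() or proc.stdout.strip() or f"exit code {proc.returncode}"
        # Keep only the tail of long output, mirroring the diff's [-4000:] slice.
        raise RuntimeError(detail[-4000:])
    return proc.stdout.strip()


if __name__ == "__main__":
    # Uses the current Python interpreter as a stand-in for llama-cli.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30))
```

The helper is exercised here with the Python interpreter itself, so the behaviour can be checked without the llama.cpp binary or the model file.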
requirements.txt CHANGED
@@ -1,6 +1,3 @@
- --no-binary=llama-cpp-python
-
  gradio==6.14.0
  huggingface-hub==1.17.0
  spaces==0.50.4
- llama-cpp-python==0.3.24