---
title: "First-Principle AI"
emoji: "⚙️"
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: "6.14.0"
python_version: "3.12"
app_file: app.py
fullWidth: true
header: mini
short_description: "Phase-3 Q8 GGUF lab console with llama.cpp."
suggested_hardware: zero-a10g
models:
  - build-small-hackathon/phase-3-gguf
tags:
  - gradio
  - zerogpu
  - llama-cpp
  - gguf
  - chatbot
  - model-lab
  - build-small-hackathon
license: mit
---

# First-Principle AI

First-Principle AI is a compact Gradio console for running and probing the
`build-small-hackathon/phase-3-gguf` Q8 GGUF model through
the official `llama.cpp` Ubuntu `llama-server` release.

The UI includes benchmark-style examples inspired by common LLM evaluation
areas: math reasoning, commonsense, science QA, truthfulness, instruction
following, coding, logic, summarization, extraction, robustness, and
goal-binding prompts where the model must identify which real-world object
needs to move. The questions are original prompts, not copied benchmark items.

## Runtime Notes

- Model repo: `build-small-hackathon/phase-3-gguf`
- Model file: `model-Q8_0.gguf`
- Runtime: official `llama.cpp` `llama-server`
- Hardware target: ZeroGPU
- Fallback behavior: visible runtime diagnostics instead of silent mock output
- Model loading: runtime download/load through a persistent `llama-server`
- Default llama.cpp settings: `n_ctx=2048`, `n_batch=256`, `n_ubatch=64`,
  memory-mapped weights, no warmup, and CPU fallback if CUDA offload is unavailable
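
The defaults listed above could be passed to `llama-server` roughly as follows. This is a minimal sketch, not the code in `app.py`: the helper name `build_server_command`, the port, and the `999` layer count used to request full offload are assumptions for illustration.

```python
from typing import List


def build_server_command(
    model_path: str,
    binary: str = "llama-server",  # assumed to be on PATH
    port: int = 8080,              # hypothetical port, not from this README
    use_gpu: bool = True,
) -> List[str]:
    """Translate the default settings into llama-server flags."""
    cmd = [
        binary,
        "--model", model_path,
        "--ctx-size", "2048",     # n_ctx
        "--batch-size", "256",    # n_batch
        "--ubatch-size", "64",    # n_ubatch
        "--no-warmup",            # skip the warmup pass at startup
        "--host", "127.0.0.1",
        "--port", str(port),
    ]
    # Weights are memory-mapped by default, so no mmap flag is passed.
    # Request offload of all layers when CUDA is usable; otherwise stay on CPU.
    cmd += ["--n-gpu-layers", "999" if use_gpu else "0"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_server_command("model-Q8_0.gguf", use_gpu=False)))
```

Returning the command as a list keeps it safe to hand to `subprocess.Popen` without shell quoting.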

ZeroGPU is Hugging Face's dynamic GPU allocation runtime for Spaces, primarily
documented around PyTorch workloads. This app targets ZeroGPU as requested, but
it runs the GGUF through a prebuilt official llama.cpp `llama-server` binary, so
the Space build does not have to compile a Python extension. If the runtime does
not expose enough memory or a compatible llama.cpp binary, the app returns a
visible compatibility message.
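
One way to produce such a visible message is to collect every failed check before anything is loaded. The sketch below is an assumption about how this could look, not the app's actual logic; `compatibility_problems` and `compatibility_message` are hypothetical names, and the caller is expected to supply the available and required memory figures.

```python
import shutil
from typing import List, Optional


def compatibility_problems(
    binary: str,
    available_bytes: Optional[int],
    required_bytes: int,
) -> List[str]:
    """Return human-readable reasons the runtime cannot serve the model."""
    problems: List[str] = []
    # The server binary must be discoverable on PATH.
    if shutil.which(binary) is None:
        problems.append(f"{binary} was not found on PATH.")
    # Skip the memory check when the amount of free memory is unknown.
    if available_bytes is not None and available_bytes < required_bytes:
        problems.append(
            f"Only {available_bytes} bytes of memory available; "
            f"{required_bytes} required."
        )
    return problems


def compatibility_message(problems: List[str]) -> Optional[str]:
    """Format the problems for display, or return None if there are none."""
    if not problems:
        return None
    return "Runtime is not compatible:\n" + "\n".join(f"- {p}" for p in problems)
```

Showing the collected reasons in the UI keeps a failure distinguishable from a real model response.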

The model is intentionally not preloaded during the Space build because the Q8
GGUF is 33.6 GB and can make build startup unreliable. The app resolves the Hub
file at runtime after checking memory and runtime compatibility. The first
prompt may take several minutes while the model downloads and initializes;
subsequent prompts reuse the model already loaded by the persistent `llama-server`.
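
The "slow first prompt, fast later prompts" behavior can be sketched as a lazily started resource guarded by a lock. This is an illustrative pattern under stated assumptions, not the implementation in `app.py`; `LazyResource` is a hypothetical name, and the `start` callable stands in for whatever downloads the GGUF and launches the server.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Run an expensive start function once and reuse its result."""

    def __init__(self, start: Callable[[], T]) -> None:
        self._start = start
        self._lock = threading.Lock()
        self._value: Optional[T] = None

    def get(self) -> T:
        # The lock prevents two concurrent first requests from both
        # triggering the download and server launch.
        with self._lock:
            if self._value is None:
                self._value = self._start()
            return self._value


if __name__ == "__main__":
    starts = []
    server = LazyResource(lambda: starts.append("started") or "server-handle")
    server.get()  # first call pays the startup cost
    server.get()  # later calls reuse the same handle
    print(len(starts))
```

In a real app, `start` might call `huggingface_hub.hf_hub_download(repo_id=..., filename=...)` and then launch the server with `subprocess.Popen`; both steps are left out here so the sketch runs without network access.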

## Local Smoke Test

```bash
# This path is specific to the author's machine; adjust it to your local checkout.
cd /Users/user/Documents/Automation-agents/hf-spaces/phase-3-gguf-lab
PHASE3_DISABLE_MODEL=1 python app.py
```