cn0303 committed on
Commit
12d2e34
·
verified ·
1 Parent(s): c97ad08

Deploy FitCheck: engine + Nemotron model brick on ZeroGPU

README.md CHANGED
@@ -6,11 +6,23 @@ colorTo: green
6
  sdk: gradio
7
  sdk_version: 6.16.0
8
  app_file: app.py
 
9
  pinned: false
10
  license: mit
11
  short_description: Honest, plain answers about what AI your computer can run
 
 
12
  ---
13
 
 
 
 
 
 
 
 
 
 
14
  # FitCheck
15
 
16
  **What AI can your computer actually run?**
@@ -35,14 +47,21 @@ Built for the [Build Small hackathon](https://huggingface.co/build-small-hackath
35
 
36
  ## How it is built
37
 
38
- A hand-built HTML, CSS and JS frontend (no framework, no build step) served by
39
- Gradio server mode (`gr.Server`, which is a FastAPI app). The interface talks to
40
- a single connector, `POST /api/advise`.
41
 
42
- > Note: this is the UI view. The advice endpoint currently returns input-aware
43
- > placeholder data so the interface is complete and live. A deterministic
44
- > calculation engine and a small local model plug into the same `/api/advise`
45
- > contract next, with no frontend changes.
46
 
47
  ## Run it locally
48
 
@@ -53,4 +72,6 @@ pip install -r requirements.txt
53
  python app.py
54
  ```
55
 
56
- Then open http://127.0.0.1:7860/ (add `?go` for a sample result).
6
  sdk: gradio
7
  sdk_version: 6.16.0
8
  app_file: app.py
9
+ python_version: "3.12"
10
  pinned: false
11
  license: mit
12
  short_description: Honest, plain answers about what AI your computer can run
13
+ models:
14
+ - nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
15
  ---
16
 
17
+ <!--
18
+ ZeroGPU: select "ZeroGPU" hardware in the Space's Settings (the README can't
19
+ set it). The model brick (/gradio_api/call/ask) only loads the LLM when SPACE_ID is set, so
20
+ local `python app.py` stays instant and the chat uses a deterministic fallback.
21
+ Swap the model with no code change via the FITCHECK_MODEL Space secret/variable,
22
+ e.g. FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507 (clean Apache fallback).
23
+ -->
24
+
25
+
26
  # FitCheck
27
 
28
  **What AI can your computer actually run?**
 
47
 
48
  ## How it is built
49
 
50
+ Three bricks, starting with a hand-built HTML/CSS/JS frontend (no framework, no
51
+ build step), all served by Gradio server mode (`gr.Server`, which is a FastAPI app):
 
52
 
53
+ 1. **The frontend** (`static/`) gathers your setup in plain words.
54
+ 2. **The deterministic engine** (`engine/`) does the real memory arithmetic and
55
+ returns an honest verdict over `POST /api/advise`. No AI in the loop, so
56
+ every number is inspectable. (LLM goals run on the engine today; vision /
57
+ image / audio / data goals use a conservative placeholder until the engine
58
+ models those families.)
59
+ 3. **The model brick** (`model_brick.py`) is a small local LLM
60
+ (NVIDIA Nemotron 3 Nano 4B) that *explains* the engine's numbers in plain
61
+ words over `/gradio_api/call/ask`. It is a closed-context narrator: it never
62
+ invents a number, only re-voices the facts the engine produced. On a ZeroGPU
63
+ Space it runs on a GPU via `@spaces.GPU`; locally it degrades to a
64
+ deterministic explainer so the chat always answers.
65
 
66
  ## Run it locally
67
 
 
72
  python app.py
73
  ```
74
 
75
+ Then open http://127.0.0.1:7860/ (add `?go` for a sample result). Locally the
76
+ follow-up chat uses the deterministic explainer; the Nemotron model loads only
77
+ on a Space (when `SPACE_ID` is set).
app.py CHANGED
@@ -1,14 +1,17 @@
1
  """
2
  FitCheck — what AI can your computer actually run?
3
 
4
- This file is the UI BRICK's backend: a `gr.Server` (which IS a FastAPI app)
5
- that serves the hand-built frontend in static/ and exposes ONE connector
6
- endpoint, /api/advise.
7
-
8
- IMPORTANT: /api/advise currently returns input-aware *placeholder* results so
9
- the interface is complete and feels alive (the gauge moves, bands change). The
10
- real deterministic engine (a separate brick) will plug into this same contract
11
- later; the frontend won't need to change. Mock logic is fenced below.
 
 
 
12
  """
13
 
14
  import re
@@ -19,15 +22,19 @@ from fastapi.responses import FileResponse
19
  from fastapi.staticfiles import StaticFiles
20
  from pydantic import BaseModel
21
 
22
- CATALOGUE_VERSION = "2026-06-07"
 
 
 
23
  STATIC = Path(__file__).parent / "static"
24
 
25
  app = gr.Server()
26
 
27
 
28
  # ==========================================================================
29
- # PLACEHOLDER engine — to be replaced by the real calculation brick.
30
- # Numbers are plausible and conservative but are NOT the audited engine yet.
 
31
  # ==========================================================================
32
 
33
  _COLORS = {"model": "#818CF8", "chat": "#A5F3C4", "work": "#868E9C"}
@@ -259,7 +266,22 @@ class AdviseIn(BaseModel):
259
 
260
  @app.post("/api/advise")
261
  def api_advise(payload: AdviseIn):
262
- return advise_mock(payload.model_dump())
263
 
264
 
265
  app.mount("/static", StaticFiles(directory=STATIC), name="static")
 
1
  """
2
  FitCheck — what AI can your computer actually run?
3
 
4
+ This file wires three bricks together behind a `gr.Server` (which IS a FastAPI
5
+ app) that serves the hand-built frontend in static/:
6
+
7
+ - /api/advise : the honest verdict. For LLM goals it runs the REAL
8
+ deterministic engine (engine/, via ui_adapter). Vision /
9
+ image-gen / audio / data goals which the engine doesn't
10
+ model yet still use the input-aware placeholder below.
11
+ - /gradio_api/call/ask : the model brick (model_brick.ask), a small local LLM
12
+ that explains the engine's numbers in plain words. Exposed as
13
+ @app.api so it runs on Gradio's queue and gets a ZeroGPU
14
+ allocation; called from the browser via @gradio/client.
15
  """
16
 
17
  import re
 
22
  from fastapi.staticfiles import StaticFiles
23
  from pydantic import BaseModel
24
 
25
+ from engine import CATALOGUE_VERSION
26
+ from engine.ui_adapter import advise_for_ui, is_llm_usecase
27
+ from model_brick import ask as model_ask
28
+
29
  STATIC = Path(__file__).parent / "static"
30
 
31
  app = gr.Server()
32
 
33
 
34
  # ==========================================================================
35
+ # PLACEHOLDER engine — vision / image-gen / audio / data goals only.
36
+ # The real engine (engine/) covers LLM goals; these families aren't modelled
37
+ # there yet, so they keep these plausible, conservative placeholder numbers.
38
  # ==========================================================================
39
 
40
  _COLORS = {"model": "#818CF8", "chat": "#A5F3C4", "work": "#868E9C"}
 
266
 
267
  @app.post("/api/advise")
268
  def api_advise(payload: AdviseIn):
269
+ p = payload.model_dump()
270
+ # LLM goals -> the real, audited engine. Other families -> placeholder.
271
+ if is_llm_usecase(p.get("usecase", "chat")):
272
+ return advise_for_ui(p, CATALOGUE_VERSION)
273
+ return advise_mock(p)
274
+
275
+
276
+ @app.api(name="ask")
277
+ def api_ask(question: str, facts: str = "") -> dict:
278
+ """Plain-English follow-up, grounded in the facts /api/advise returned.
279
+
280
+ Exposed at /gradio_api/call/ask (NOT a plain POST) so it runs through
281
+ Gradio's queue and gets a ZeroGPU allocation. `facts` is the JSON string of
282
+ the last /api/advise result. Returns {headline, why, next_step}.
283
+ """
284
+ return model_ask(question, facts)
285
 
286
 
287
  app.mount("/static", StaticFiles(directory=STATIC), name="static")
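
Review note: the routing in `api_advise` can be exercised without the real engine. The snippet below is a minimal stand-in, not the project's code. It assumes `is_llm_usecase` is a plain membership check over the six use-case keys listed in `engine/catalogue.py`; the real `engine.ui_adapter` is not part of this diff, and `fake_engine`, `fake_mock`, `route` and the `"vision"` key are hypothetical names used only for illustration.

```python
# Hypothetical stand-ins: the real is_llm_usecase, advise_for_ui and
# advise_mock are not shown in this diff.
LLM_USECASES = {"chat", "writing", "coding", "agents", "rag", "finetune"}


def is_llm_usecase(usecase: str) -> bool:
    """Assumed behaviour: True for goals the deterministic engine models."""
    return usecase in LLM_USECASES


def fake_engine(payload: dict) -> dict:
    # Stands in for advise_for_ui(payload, CATALOGUE_VERSION).
    return {"source": "engine", "usecase": payload.get("usecase", "chat")}


def fake_mock(payload: dict) -> dict:
    # Stands in for advise_mock(payload).
    return {"source": "placeholder", "usecase": payload.get("usecase")}


def route(payload: dict) -> dict:
    """Mirror of api_advise: LLM goals go to the engine, the rest to the mock."""
    if is_llm_usecase(payload.get("usecase", "chat")):
        return fake_engine(payload)
    return fake_mock(payload)
```

A payload with no `usecase` key falls back to `"chat"` and therefore reaches the engine, which matches the default in the diff.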
engine/__init__.py ADDED
@@ -0,0 +1,20 @@
1
+ """
2
+ can-i-run-it — the honest local-AI hardware advisor.
3
+
4
+ The `engine` package is a *deterministic* compatibility engine. It does the
5
+ real arithmetic (memory budgets, what fits, honest trade-offs) with no AI in
6
+ the loop, so every number it produces can be inspected and trusted.
7
+
8
+ A small language model is layered on top *only* to chat and explain — it never
9
+ invents the facts. That separation is the whole point: the numbers are math,
10
+ the words are plain English.
11
+ """
12
+
13
+ from .advisor import advise
14
+ from .hardware import HardwareSpec
15
+
16
+ __all__ = ["advise", "HardwareSpec"]
17
+
18
+ # Bump this whenever the catalogue or rules change. Shown in the UI so people
19
+ # know how fresh the advice is — credibility over cleverness.
20
+ CATALOGUE_VERSION = "2026-06-07"
engine/advisor.py ADDED
@@ -0,0 +1,201 @@
1
+ """
2
+ The advisor: turn a machine + a goal into an honest verdict.
3
+
4
+ Output is organised into three plain bands, because that is what makes the
5
+ tool trustworthy instead of hypey:
6
+
7
+ - WORKS NOW : runs well, on the fast path, today.
8
+ - WORKS WITH COMPROMISES : it'll run, but slower or smaller than ideal.
9
+ - DON'T BOTHER : not realistic on this machine — said plainly.
10
+
11
+ No fake promises. If something doesn't fit, we say so and explain why.
12
+ """
13
+
14
+ from dataclasses import dataclass, field
15
+
16
+ from .catalogue import (
17
+ MODEL_CLASSES,
18
+ QUANT_TIERS,
19
+ RECOMMENDED_QUANT,
20
+ QUANT_BY_KEY,
21
+ MODEL_BY_KEY,
22
+ ModelClass,
23
+ QuantTier,
24
+ UseCase,
25
+ USE_CASE_BY_KEY,
26
+ )
27
+ from .estimator import MemoryEstimate, estimate_memory
28
+ from .hardware import HardwareSpec
29
+ from .runtimes import Runtime, pick_runtimes
30
+
31
+
32
+ # How much text (context) we assume per job, in tokens. Roughly 750 words per 1,000 tokens.
33
+ _CONTEXT_FOR_USE_CASE = {
34
+ "chat": 4096,
35
+ "writing": 4096,
36
+ "coding": 4096,
37
+ "agents": 4096,
38
+ "rag": 8192,
39
+ "finetune": 2048,
40
+ }
41
+
42
+ # We only ever fill a budget to this fraction — the rest is breathing room.
43
+ _SAFETY_FILL = 0.90
44
+
45
+ VERDICT_WORKS = "works_now"
46
+ VERDICT_COMPROMISE = "compromises"
47
+ VERDICT_NO = "dont_bother"
48
+
49
+
50
+ @dataclass
51
+ class ModelVerdict:
52
+ model: ModelClass
53
+ verdict: str # one of the VERDICT_* constants
54
+ quant: QuantTier # the quant we'd actually recommend
55
+ estimate: MemoryEstimate
56
+ full_quality_on_fast: bool # True if it runs on the GPU at fp16/near-full
57
+ notes: list[str] = field(default_factory=list)
58
+
59
+
60
+ @dataclass
61
+ class Advice:
62
+ spec: HardwareSpec
63
+ use_case: UseCase
64
+ context_tokens: int
65
+ verdicts: list[ModelVerdict] # one per model class, big→small order kept
66
+ headline: ModelVerdict | None # the single best pick for this goal
67
+ runtimes: list[Runtime]
68
+ meets_goal: bool # does the headline satisfy the use case?
69
+
70
+ @property
71
+ def works_now(self) -> list[ModelVerdict]:
72
+ return [v for v in self.verdicts if v.verdict == VERDICT_WORKS]
73
+
74
+ @property
75
+ def compromises(self) -> list[ModelVerdict]:
76
+ return [v for v in self.verdicts if v.verdict == VERDICT_COMPROMISE]
77
+
78
+ @property
79
+ def dont_bother(self) -> list[ModelVerdict]:
80
+ return [v for v in self.verdicts if v.verdict == VERDICT_NO]
81
+
82
+
83
+ def _evaluate_model(
84
+ model: ModelClass, spec: HardwareSpec, use_case: UseCase, context_tokens: int
85
+ ) -> ModelVerdict:
86
+ fast = spec.fast_budget_gb
87
+ total = spec.total_budget_gb
88
+ of = use_case.overhead_factor
89
+ q4_bpw = RECOMMENDED_QUANT.bits_per_weight # the 4-bit quality floor
90
+
91
+ # --- Fast path: best *quality* quant that fits on the GPU/shared mem ---
92
+ # We only call it "Works now" if it fits fast at 4-bit or better. Cramming
93
+ # a big model down to 2-bit just to claim it "fits" is exactly the kind of
94
+ # overpromise this tool refuses to make — that path becomes a compromise.
95
+ if spec.has_fast_path:
96
+ for q in QUANT_TIERS: # ordered best-quality -> smallest
97
+ if q.bits_per_weight < q4_bpw:
98
+ break # don't accept sub-4-bit as a clean "works now"
99
+ est = estimate_memory(model, q, context_tokens=context_tokens,
100
+ job_overhead_factor=of)
101
+ if est.total_gb <= fast * _SAFETY_FILL:
102
+ full_q = q.key in ("fp16", "Q8_0", "Q6_K")
103
+ notes = []
104
+ if q is not RECOMMENDED_QUANT and not full_q:
105
+ notes.append(f"Runs at {q.plain_name} — even a touch sharper than the usual 4-bit.")
106
+ return ModelVerdict(model, VERDICT_WORKS, q, est, full_q, notes)
107
+
108
+ # --- Compromise path: fits if we let it use ordinary RAM (slower) ------
109
+ # Prefer the everyday 4-bit; drop smaller only if needed.
110
+ for q in (RECOMMENDED_QUANT, QUANT_BY_KEY["Q3_K_M"], QUANT_BY_KEY["Q2_K"]):
111
+ est = estimate_memory(model, q, context_tokens=context_tokens,
112
+ job_overhead_factor=of)
113
+ if est.total_gb <= total * _SAFETY_FILL:
114
+ notes = []
115
+ if not spec.has_fast_path:
116
+ notes.append("Runs on the processor (no graphics card to speed it up) — expect slow replies.")
117
+ else:
118
+ notes.append("Too big to fit the graphics card on its own — part runs on slower memory, so replies come more slowly.")
119
+ if q is not RECOMMENDED_QUANT:
120
+ notes.append(f"Had to shrink it to {q.plain_name} to fit — some quality is lost.")
121
+ return ModelVerdict(model, VERDICT_COMPROMISE, q, est, False, notes)
122
+
123
+ # --- Doesn't fit even at the smallest setting --------------------------
124
+ est = estimate_memory(model, QUANT_BY_KEY["Q2_K"], context_tokens=context_tokens,
125
+ job_overhead_factor=of)
126
+ short_by = round(est.total_gb - total * _SAFETY_FILL, 1) # measured against the usable (safety-filled) budget, so it is never negative
127
+ notes = [f"Needs about {est.total_gb:g} GB even squeezed down — "
128
+ f"around {short_by:g} GB more than this machine can give it."]
129
+ return ModelVerdict(model, VERDICT_NO, QUANT_BY_KEY["Q2_K"], est, False, notes)
130
+
131
+
132
+ def _rank(model_key: str) -> int:
133
+ return next(i for i, m in enumerate(MODEL_CLASSES) if m.key == model_key)
134
+
135
+
136
+ def advise(spec: HardwareSpec, use_case_key: str = "chat") -> Advice:
137
+ """Produce full advice for a machine and a goal."""
138
+ use_case = USE_CASE_BY_KEY.get(use_case_key, USE_CASE_BY_KEY["chat"])
139
+ context_tokens = _CONTEXT_FOR_USE_CASE.get(use_case.key, 4096)
140
+
141
+ # Evaluate every size class, biggest first (so the table reads top-down).
142
+ verdicts = [
143
+ _evaluate_model(m, spec, use_case, context_tokens)
144
+ for m in reversed(MODEL_CLASSES)
145
+ ]
146
+
147
+ # --- Headline: the single "just use this" pick -----------------------
148
+ # Priorities, in order:
149
+ # 1. The biggest model that WORKS NOW (fast + good quality) and is at
150
+ # least big enough for the job. Fast-and-capable is the best answer.
151
+ # 2. If nothing fast is big enough, the best COMPROMISE that does the
152
+ # job — sized close to ideal, not needlessly oversized-and-slow.
153
+ # 3. Otherwise, the best we can honestly offer, flagged as below-par.
154
+ good_rank = _rank(use_case.good_class)
155
+ min_rank = _rank(use_case.min_class)
156
+
157
+ q4_bpw = RECOMMENDED_QUANT.bits_per_weight
158
+ works = [v for v in verdicts if v.verdict == VERDICT_WORKS]
159
+ comp = [v for v in verdicts if v.verdict == VERDICT_COMPROMISE]
160
+
161
+ def largest(vs):
162
+ return max(vs, key=lambda v: _rank(v.model.key))
163
+
164
+ def nearest_good(vs):
165
+ # Closest to the ideal size without overshooting into needless slowness.
166
+ below = [v for v in vs if _rank(v.model.key) <= good_rank]
167
+ return largest(below) if below else min(vs, key=lambda v: _rank(v.model.key))
168
+
169
+ def decent(vs):
170
+ # Don't headline a model that only fits at a desperate sub-4-bit squeeze
171
+ # if a cleaner option exists — quality matters more than size on the box.
172
+ return [v for v in vs if v.quant.bits_per_weight >= q4_bpw]
173
+
174
+ works_ok = [v for v in works if _rank(v.model.key) >= min_rank]
175
+ comp_ok = [v for v in comp if _rank(v.model.key) >= min_rank]
176
+
177
+ headline = None
178
+ meets_goal = False
179
+ if works_ok:
180
+ headline, meets_goal = largest(works_ok), True
181
+ elif comp_ok:
182
+ headline, meets_goal = nearest_good(decent(comp_ok) or comp_ok), True
183
+ elif works:
184
+ headline, meets_goal = largest(works), False
185
+ elif comp:
186
+ headline, meets_goal = nearest_good(decent(comp) or comp), False
187
+
188
+ if headline is not None and not meets_goal:
189
+ headline.notes.insert(
190
+ 0, f"This is the best this machine can do, but it's on the small "
191
+ f"side for {use_case.plain_name.lower()} — treat results as 'okay', not great.")
192
+
193
+ return Advice(
194
+ spec=spec,
195
+ use_case=use_case,
196
+ context_tokens=context_tokens,
197
+ verdicts=verdicts,
198
+ headline=headline,
199
+ runtimes=pick_runtimes(spec),
200
+ meets_goal=meets_goal,
201
+ )
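
Review note: the three bands in `_evaluate_model` come down to two threshold checks against safety-filled budgets. The sketch below is a deliberate simplification, assuming a single quant setting (no walk down the quant ladder) and treating "has a fast path" as `fast_gb > 0`, which is an assumption because `HardwareSpec` is not shown here. `band` and `SAFETY_FILL` are illustrative names.

```python
SAFETY_FILL = 0.90  # same fraction as _SAFETY_FILL in the diff


def band(need_gb: float, fast_gb: float, total_gb: float) -> str:
    """Simplified three-band verdict for one model at one quant setting."""
    # Works now: fits inside 90% of the fast (GPU / shared) budget.
    if fast_gb > 0 and need_gb <= fast_gb * SAFETY_FILL:
        return "works_now"
    # Compromise: fits only when slower ordinary memory is counted too.
    if need_gb <= total_gb * SAFETY_FILL:
        return "compromises"
    # Does not fit even with everything the machine has.
    return "dont_bother"
```

Note the edge case this exposes: a model needing 6.75 GB on a machine with a 7 GB total budget is still "dont_bother", because only 6.3 GB of that budget is treated as usable.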
engine/catalogue.py ADDED
@@ -0,0 +1,156 @@
1
+ """
2
+ Static catalogue: the frozen facts the advisor reasons over.
3
+
4
+ Everything here is build-time data — no network calls at runtime. That keeps
5
+ the tool fully offline-capable (the "Off the Grid" goal) and means the advice
6
+ can't silently drift when some external API changes.
7
+
8
+ Sources for the numbers (so anyone can check our work):
9
+ - bits-per-weight for GGUF quant families: llama.cpp / Hugging Face GGUF docs
10
+ - "~2 GB per 1B params at fp16": Hugging Face Transformers optimisation guide
11
+ - 8-bit ≈ 50% of fp16, 4-bit ≈ 25-30%: bitsandbytes docs
12
+ - architecture sizes (layers / hidden): typical published configs per size class
13
+ """
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+
18
+ # --------------------------------------------------------------------------
19
+ # Quantisation tiers
20
+ # --------------------------------------------------------------------------
21
+ # "Quantisation" = squashing the model's numbers into fewer bits so it takes
22
+ # less memory. Fewer bits = smaller + faster, but slightly less sharp.
23
+ # gb_per_billion is just bits_per_weight / 8 (bits -> bytes -> GB per 1B params).
24
+
25
+ @dataclass(frozen=True)
26
+ class QuantTier:
27
+ key: str
28
+ plain_name: str # what a normal person sees
29
+ bits_per_weight: float
30
+ blurb: str # one honest sentence about the trade-off
31
+ recommended: bool = False
32
+
33
+ @property
34
+ def gb_per_billion(self) -> float:
35
+ return self.bits_per_weight / 8.0
36
+
37
+
38
+ QUANT_TIERS: list[QuantTier] = [
39
+ QuantTier("fp16", "Full quality (fp16)", 16.0,
40
+ "The original, uncompressed model. Biggest and slowest to load."),
41
+ QuantTier("Q8_0", "Near-full (8-bit)", 8.5,
42
+ "Practically indistinguishable from full quality, about half the size."),
43
+ QuantTier("Q6_K", "High (6-bit)", 6.56,
44
+ "Very close to full quality, a bit smaller again."),
45
+ QuantTier("Q5_K_M", "Balanced+ (5-bit)", 5.67,
46
+ "A touch sharper than 4-bit for a little more memory."),
47
+ QuantTier("Q4_K_M", "Balanced (4-bit)", 4.83,
48
+ "The sweet spot most people use: small, fast, and still very good.",
49
+ recommended=True),
50
+ QuantTier("Q3_K_M", "Compact (3-bit)", 3.91,
51
+ "Smaller still, with a slight, usually-acceptable quality dip."),
52
+ QuantTier("Q2_K", "Tiny (2-bit)", 3.35,
53
+ "Last resort to make something fit — noticeably less reliable."),
54
+ ]
55
+
56
+ QUANT_BY_KEY = {q.key: q for q in QUANT_TIERS}
57
+ RECOMMENDED_QUANT = next(q for q in QUANT_TIERS if q.recommended)
58
+
59
+
60
+ # --------------------------------------------------------------------------
61
+ # Model size classes
62
+ # --------------------------------------------------------------------------
63
+ # We reason in *size classes* rather than individual models, because the
64
+ # memory maths is driven by parameter count + architecture shape. Each class
65
+ # carries an approximate architecture so we can estimate the KV cache (chat
66
+ # memory) honestly. Layers/hidden are conservative typicals, not exact.
67
+
68
+ @dataclass(frozen=True)
69
+ class ModelClass:
70
+ key: str
71
+ billions: float # parameter count in billions (representative)
72
+ plain_name: str
73
+ good_for: str # plain-English "what it's actually good at"
74
+ n_layers: int
75
+ hidden: int
76
+ # Example concrete models for the copy-paste commands (real, well-known).
77
+ example_label: str
78
+ ollama_tag: str # what you'd type after `ollama run`
79
+ gguf_repo: str # a real Hugging Face GGUF repo for llama.cpp
80
+
81
+
82
+ MODEL_CLASSES: list[ModelClass] = [
83
+ ModelClass("tiny", 1.0, "Tiny (around 1 billion)",
84
+ "Quick simple chat, basic questions, tidying text. Runs on almost anything.",
85
+ 24, 2048, "Llama 3.2 1B", "llama3.2:1b",
86
+ "bartowski/Llama-3.2-1B-Instruct-GGUF"),
87
+ ModelClass("small", 3.5, "Small (3-4 billion)",
88
+ "Surprisingly capable everyday chat, summarising, and light coding help.",
89
+ 28, 3072, "Llama 3.2 3B", "llama3.2:3b",
90
+ "bartowski/Llama-3.2-3B-Instruct-GGUF"),
91
+ ModelClass("medium", 8.0, "Medium (7-9 billion)",
92
+ "A solid all-rounder: good chat, real coding help, decent reasoning.",
93
+ 32, 4096, "Qwen2.5 7B", "qwen2.5:7b",
94
+ "bartowski/Qwen2.5-7B-Instruct-GGUF"),
95
+ ModelClass("large", 14.0, "Large (13-14 billion)",
96
+ "Noticeably smarter and more reliable. Wants a real graphics card.",
97
+ 40, 5120, "Qwen2.5 14B", "qwen2.5:14b",
98
+ "bartowski/Qwen2.5-14B-Instruct-GGUF"),
99
+ ModelClass("xlarge", 32.0, "Very large (30-34 billion)",
100
+ "Near-premium quality. Needs a strong GPU or a lot of memory.",
101
+ 48, 6656, "Qwen2.5 32B", "qwen2.5:32b",
102
+ "bartowski/Qwen2.5-32B-Instruct-GGUF"),
103
+ ModelClass("huge", 70.0, "Huge (70 billion)",
104
+ "Top-tier open quality. Serious hardware only.",
105
+ 80, 8192, "Llama 3.3 70B", "llama3.3:70b",
106
+ "bartowski/Llama-3.3-70B-Instruct-GGUF"),
107
+ ]
108
+
109
+ MODEL_BY_KEY = {m.key: m for m in MODEL_CLASSES}
110
+
111
+
112
+ # --------------------------------------------------------------------------
113
+ # Use cases (jobs people actually want done)
114
+ # --------------------------------------------------------------------------
115
+ # Each maps to a *minimum* sensible size and a *comfortable* size. We never
116
+ # pretend a job works on a model that's too small for it.
117
+
118
+ @dataclass(frozen=True)
119
+ class UseCase:
120
+ key: str
121
+ plain_name: str
122
+ description: str
123
+ min_class: str # smallest model that does an OK job
124
+ good_class: str # where it starts feeling genuinely useful
125
+ # Extra memory headroom multiplier for this job (RAG/agents need more
126
+ # context; fine-tuning needs much more). 1.0 = normal inference.
127
+ overhead_factor: float = 1.0
128
+ note: str = ""
129
+
130
+
131
+ USE_CASES: list[UseCase] = [
132
+ UseCase("chat", "Just chatting / asking questions",
133
+ "General conversation, explanations, everyday questions.",
134
+ "tiny", "small"),
135
+ UseCase("writing", "Writing & summarising",
136
+ "Drafting emails, rewriting, condensing long text.",
137
+ "small", "medium"),
138
+ UseCase("coding", "Coding help",
139
+ "Explaining code, writing functions, fixing bugs.",
140
+ "small", "medium",
141
+ note="Bigger models are much more reliable for code."),
142
+ UseCase("agents", "Tool use / agents",
143
+ "Letting the model call tools, search, or take steps for you.",
144
+ "medium", "medium", overhead_factor=1.15,
145
+ note="Needs steady instruction-following — go medium or larger."),
146
+ UseCase("rag", "Document Q&A (your own files)",
147
+ "Answering questions over your PDFs/notes (a.k.a. RAG).",
148
+ "small", "medium", overhead_factor=1.25,
149
+ note="Long documents use extra memory for context."),
150
+ UseCase("finetune", "Teaching it your own data (fine-tuning)",
151
+ "Training a small adapter (LoRA/QLoRA) on your examples.",
152
+ "small", "medium", overhead_factor=2.2,
153
+ note="Training needs roughly 2-3x the memory of just chatting."),
154
+ ]
155
+
156
+ USE_CASE_BY_KEY = {u.key: u for u in USE_CASES}
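
Review note: `gb_per_billion` makes the weights-only size a one-line calculation. The sketch below reuses four bits-per-weight values from `QUANT_TIERS`; the helper name `weights_gb` and the dictionary are illustrative, not part of the package.

```python
# Bits-per-weight values copied from QUANT_TIERS in the diff.
BITS_PER_WEIGHT = {"fp16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.83, "Q2_K": 3.35}


def weights_gb(billions: float, bits_per_weight: float) -> float:
    """GB for the weights alone: params (in billions) x bits / 8."""
    return billions * bits_per_weight / 8.0


if __name__ == "__main__":
    # For an 8B model the size in GB equals the bits-per-weight figure.
    for key, bpw in BITS_PER_WEIGHT.items():
        print(f"8B at {key}: {weights_gb(8.0, bpw):.2f} GB")
```

This covers the weights only; KV cache and runtime overhead are added separately by the estimator.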
engine/estimator.py ADDED
@@ -0,0 +1,86 @@
1
+ """
2
+ The memory maths — the honest heart of the tool.
3
+
4
+ We estimate how much memory a given model needs, broken into three parts:
5
+
6
+ 1. weights : the model itself = params x bits-per-weight / 8 (bits to bytes)
7
+ 2. kv_cache : the model's short-term "chat memory" — grows with how much
8
+ text it's holding (context). This is what people forget.
9
+ 3. overhead : runtime working space + a safety margin.
10
+
11
+ Every formula is spelled out and deliberately a little pessimistic. If we're
12
+ going to be wrong, we want to be wrong on the safe side.
13
+ """
14
+
15
+ from dataclasses import dataclass
16
+
17
+ from .catalogue import ModelClass, QuantTier
18
+
19
+
20
+ # Bytes per element in the KV cache. Modern runtimes can store it at 16-bit.
21
+ _KV_BYTES = 2
22
+
23
+ # Modern models share key/value heads (this is called "GQA"), which cuts the
24
+ # KV cache dramatically vs. older designs. ~0.30 is a conservative typical
25
+ # factor (i.e. we still assume KV is fairly chunky to stay safe).
26
+ _GQA_FACTOR = 0.30
27
+
28
+ # Flat runtime working space (program, buffers) in GB.
29
+ _RUNTIME_OVERHEAD_GB = 0.8
30
+
31
+
32
+ @dataclass
33
+ class MemoryEstimate:
34
+ weights_gb: float
35
+ kv_cache_gb: float
36
+ overhead_gb: float
37
+ context_tokens: int
38
+
39
+ @property
40
+ def total_gb(self) -> float:
41
+ return round(self.weights_gb + self.kv_cache_gb + self.overhead_gb, 2)
42
+
43
+
44
+ def estimate_memory(
45
+ model: ModelClass,
46
+ quant: QuantTier,
47
+ *,
48
+ context_tokens: int = 4096,
49
+ job_overhead_factor: float = 1.0,
50
+ ) -> MemoryEstimate:
51
+ """Estimate total memory (GB) to run `model` at `quant`.
52
+
53
+ context_tokens: how much text the model holds at once. 4096 (~3000 words)
54
+ is a sensible default for everyday use.
55
+ job_overhead_factor: extra multiplier for heavier jobs (RAG, agents,
56
+ fine-tuning) — see UseCase.overhead_factor.
57
+ """
58
+ # 1) Weights ---------------------------------------------------------
59
+ weights = model.billions * quant.gb_per_billion
60
+
61
+ # 2) KV cache --------------------------------------------------------
62
+ # bytes = 2(K and V) x layers x hidden x tokens x bytes_per_elem x gqa
63
+ kv_bytes = (
64
+ 2 * model.n_layers * model.hidden * context_tokens
65
+ * _KV_BYTES * _GQA_FACTOR
66
+ )
67
+ kv = kv_bytes / 1e9
68
+
69
+ # 3) Overhead --------------------------------------------------------
70
+ # A flat runtime cost, plus ~10% of the weights as working scratch,
71
+ # all scaled by how demanding the job is.
72
+ overhead = (_RUNTIME_OVERHEAD_GB + 0.10 * weights) * job_overhead_factor
73
+
74
+ # For training (factor of 2.0 or more) the extra footprint is charged to overhead
75
+ # and scales with the weights, because optimiser state and activations dwarf plain inference.
76
+ if job_overhead_factor >= 2.0:
77
+ weights *= 1.0 # weights themselves unchanged...
78
+ kv *= 1.0
79
+ overhead = (_RUNTIME_OVERHEAD_GB + weights * (job_overhead_factor - 1.0))
80
+
81
+ return MemoryEstimate(
82
+ weights_gb=round(weights, 2),
83
+ kv_cache_gb=round(kv, 2),
84
+ overhead_gb=round(overhead, 2),
85
+ context_tokens=context_tokens,
86
+ )
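
Review note: a worked example helps check the estimator by hand. The sketch below restates the formulas from `estimate_memory` as one standalone function, using the constants from the diff; the function name `estimate` and its flat argument list are illustrative, and it returns unrounded values where the original rounds each part to two decimals.

```python
KV_BYTES = 2               # 16-bit KV cache entries
GQA_FACTOR = 0.30          # same conservative factor as the diff
RUNTIME_OVERHEAD_GB = 0.8  # flat runtime working space


def estimate(billions, bits_per_weight, n_layers, hidden,
             context_tokens=4096, job_overhead_factor=1.0):
    """Return (weights_gb, kv_gb, overhead_gb) using the diff's formulas."""
    weights = billions * bits_per_weight / 8.0
    # 2 (K and V) x layers x hidden x tokens x bytes per element x GQA factor
    kv = 2 * n_layers * hidden * context_tokens * KV_BYTES * GQA_FACTOR / 1e9
    if job_overhead_factor >= 2.0:
        # Training branch: extra footprint scales with the weights.
        overhead = RUNTIME_OVERHEAD_GB + weights * (job_overhead_factor - 1.0)
    else:
        overhead = (RUNTIME_OVERHEAD_GB + 0.10 * weights) * job_overhead_factor
    return weights, kv, overhead


if __name__ == "__main__":
    # "medium" class (8B, 32 layers, hidden 4096) at Q4_K_M (4.83 bits/weight).
    w, kv, oh = estimate(8.0, 4.83, 32, 4096)
    print(f"weights {w:.2f} GB, kv {kv:.2f} GB, overhead {oh:.2f} GB")
```

For the medium class at 4-bit and 4096 tokens this comes to about 4.83 GB of weights, 0.64 GB of KV cache and 1.28 GB of overhead, roughly 6.8 GB in total.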
engine/explain.py ADDED
@@ -0,0 +1,148 @@
1
+ """
2
+ Putting it in plain words.
3
+
4
+ The advisor produces structured facts; this module turns them into sentences a
5
+ non-technical person actually understands, and into commands they can copy and
6
+ paste. No jargon survives here without being explained.
7
+ """
8
+
9
+ from .advisor import (
10
+ Advice,
11
+ ModelVerdict,
12
+ VERDICT_WORKS,
13
+ VERDICT_COMPROMISE,
14
+ VERDICT_NO,
15
+ )
16
+
17
+ VERDICT_EMOJI = {
18
+ VERDICT_WORKS: "🟢",
19
+ VERDICT_COMPROMISE: "🟡",
20
+ VERDICT_NO: "🔴",
21
+ }
22
+
23
+ VERDICT_WORD = {
24
+ VERDICT_WORKS: "Works now",
25
+ VERDICT_COMPROMISE: "Works, with compromises",
26
+ VERDICT_NO: "Don't bother",
27
+ }
28
+
29
+
30
+ def speed_hint(v: ModelVerdict, spec) -> str:
31
+ """A rough, honest feel for how fast replies will come."""
32
+ if v.verdict == VERDICT_NO:
33
+ return "—"
34
+ if v.verdict == VERDICT_COMPROMISE:
35
+ return "Slow — usable for short tasks, not snappy chat."
36
+ # Works now (fast path). Bigger models are still slower even on a GPU.
37
+ if v.model.billions <= 4:
38
+ return "Fast — replies feel instant."
39
+ if v.model.billions <= 14:
40
+ return "Comfortable — quick enough for live chat."
41
+ return "Steady — fine, just not instant on big answers."
42
+
43
+
44
+ # --------------------------------------------------------------------------
45
+ # Commands
46
+ # --------------------------------------------------------------------------
47
+
48
+ def ollama_command(v: ModelVerdict) -> str:
49
+ return f"ollama run {v.model.ollama_tag}"
50
+
51
+
52
+ def llamacpp_command(v: ModelVerdict) -> str:
53
+ # llama.cpp can pull a GGUF straight from Hugging Face by repo:quant.
54
+ return (f"llama-server -hf {v.model.gguf_repo}:{v.quant.key} "
55
+ f"-c {v.estimate.context_tokens}")
56
+
57
+
58
+ # --------------------------------------------------------------------------
59
+ # Headline summary, in human words
60
+ # --------------------------------------------------------------------------
61
+
62
+ def headline_text(advice: Advice) -> str:
63
+ spec = advice.spec
64
+ uc = advice.use_case
65
+ h = advice.headline
66
+
67
+ if h is None:
68
+ return (
69
+ f"**Honest answer: this machine can't comfortably run local AI "
70
+ f"for {uc.plain_name.lower()} yet.**\n\n"
71
+ f"Even the smallest models need more memory than the "
72
+ f"{spec.ram_gb:g} GB available here once everything else is "
73
+ f"running. That's not a failure — small computers just have small "
74
+ f"budgets. A free cloud option, or adding memory, would open this up."
75
+ )
76
+
77
+ m = h.model
78
+ q = h.quant
79
+ fast = "on the graphics card" if spec.has_fast_path and h.verdict == VERDICT_WORKS else "on the processor"
80
+
81
+ if h.verdict == VERDICT_WORKS:
82
+ lead = f"**Yes — you can run a {m.plain_name} model {fast}, today.**"
83
+ elif h.verdict == VERDICT_COMPROMISE:
84
+ lead = f"**Sort of — a {m.plain_name} model will run, but with trade-offs.**"
85
+ else:
86
+ lead = f"**Not really — even a {m.plain_name} model is a stretch here.**"
87
+
88
+ body = (
89
+ f"\n\nFor **{uc.plain_name.lower()}**, the sweet spot on your machine is a "
90
+ f"**{m.plain_name}** model at the **{q.plain_name}** setting. "
91
+ f"{m.good_for}\n\n"
92
+ f"That needs about **{h.estimate.total_gb:g} GB** of memory "
93
+ f"(model {h.estimate.weights_gb:g} GB + chat memory "
94
+ f"{h.estimate.kv_cache_gb:g} GB + working space {h.estimate.overhead_gb:g} GB), "
95
+ f"and you have roughly **{spec.fast_budget_gb:g} GB** fast / "
96
+ f"**{spec.total_budget_gb:g} GB** total to play with."
97
+ )
98
+
99
+ extra = ""
100
+ if uc.note:
101
+ extra += f"\n\n*Note for this job:* {uc.note}"
102
+ if h.notes:
103
+ extra += "\n\n" + "\n".join(f"- {n}" for n in h.notes)
104
+
105
+ return lead + body + extra
106
+
107
+
108
+ def jargon_glossary() -> str:
+     return (
+         "**Plain-English glossary**\n\n"
+         "- **Model** — the AI's 'brain'. Bigger = smarter but heavier.\n"
+         "- **Parameters (e.g. 7B)** — how big the brain is. 7B = 7 billion. "
+         "More = smarter and hungrier for memory.\n"
+         "- **Quantisation (4-bit, 8-bit)** — shrinking the model so it fits. "
+         "4-bit is the popular sweet spot: much smaller, barely-noticeable quality loss.\n"
+         "- **VRAM** — the fast memory on a graphics card. The single biggest "
+         "factor in what you can run quickly.\n"
+         "- **RAM** — your computer's normal memory. Models can use it too, but it's slower.\n"
+         "- **KV cache / 'chat memory'** — scratch space the model uses to "
+         "remember the current conversation. Longer chats use more.\n"
+         "- **GGUF** — a single-file model format made for running locally.\n"
+         "- **llama.cpp / Ollama** — the programs that actually run the model on your machine."
+     )
+
+
+ def how_to_find_specs(os_hint: str = "windows") -> str:
+     common = (
+         "**Not sure of your specs? Here's how to check:**\n\n"
+     )
+     if os_hint == "macos":
+         return common + (
+             "- Click the Apple menu (top-left) → **About This Mac**.\n"
+             "- It shows your chip (e.g. *Apple M2*) and **Memory** (e.g. *16 GB*).\n"
+             "- On a Mac, that one memory number is all you need — the graphics "
+             "share it."
+         )
+     if os_hint == "linux":
+         return common + (
+             "- RAM: run `free -h` in a terminal.\n"
+             "- Graphics card: run `nvidia-smi` (NVIDIA) or `lspci | grep VGA`.\n"
+         )
+     return common + (
+         "- **RAM:** press `Ctrl + Shift + Esc` → **Performance** tab → **Memory**.\n"
+         "- **Graphics card:** same window → **GPU**. The name is at the top "
+         "right (e.g. *NVIDIA RTX 3060*).\n"
+         "- No GPU section showing a real card? You likely have built-in "
+         "graphics — that's fine, just pick the 'built-in' option."
+     )
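
The memory sentence in `headline_text` leans on Python's `:g` format spec so that whole numbers print without a trailing `.0`. A minimal standalone sketch of that formatting follows. It assumes three plain floats in place of the engine's `estimate` object, sums them itself (the real total comes from the engine), and uses a hypothetical helper name, `memory_breakdown`, that is not part of this commit.

```python
def memory_breakdown(weights_gb: float, kv_cache_gb: float, overhead_gb: float) -> str:
    """Format a memory estimate in the headline's style, using the ':g' spec."""
    # Assumption: the total is a simple sum here; the engine supplies its own total.
    total_gb = round(weights_gb + kv_cache_gb + overhead_gb, 1)
    return (
        f"about {total_gb:g} GB "
        f"(model {weights_gb:g} GB + chat memory {kv_cache_gb:g} GB "
        f"+ working space {overhead_gb:g} GB)"
    )


# 4.0 prints as "4" and 0.5 stays "0.5" because ':g' drops insignificant zeros.
print(memory_breakdown(4.0, 1.0, 0.5))
```

The practical effect is that the user sees "4 GB" rather than "4.0 GB", which suits the plain-language tone of the rest of the text.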
engine/hardware.py ADDED
@@ -0,0 +1,188 @@
+ """
+ Describing a machine in terms the maths cares about.
+
+ The whole job here is to turn "I have a Windows laptop with an RTX 3060 and
+ 16 GB of RAM" into two numbers the advisor can reason about:
+
+   - fast_budget_gb : memory the model can use *on the fast path* (the GPU, or
+                      on Apple Silicon, the shared memory the GPU can borrow)
+   - total_budget_gb: the absolute most a model can use if we let it spill onto
+                      ordinary RAM (slower, but it runs)
+
+ Everything is deliberately conservative. We'd rather say "this might be tight"
+ and be wrong than promise something that then fails to load.
+ """
+
+ from dataclasses import dataclass
+
+
+ # --------------------------------------------------------------------------
+ # Common consumer GPUs -> VRAM (GB). So people pick a name, not a number.
+ # VRAM is baked into the label too, because some cards ship in two sizes.
+ # --------------------------------------------------------------------------
+ GPU_PRESETS: dict[str, float] = {
+     # NVIDIA RTX 50-series
+     "NVIDIA RTX 5090 (32 GB)": 32,
+     "NVIDIA RTX 5080 (16 GB)": 16,
+     "NVIDIA RTX 5070 Ti (16 GB)": 16,
+     "NVIDIA RTX 5070 (12 GB)": 12,
+     "NVIDIA RTX 5060 Ti (16 GB)": 16,
+     "NVIDIA RTX 5060 (8 GB)": 8,
+     # NVIDIA RTX 40-series
+     "NVIDIA RTX 4090 (24 GB)": 24,
+     "NVIDIA RTX 4080 (16 GB)": 16,
+     "NVIDIA RTX 4070 Ti (12 GB)": 12,
+     "NVIDIA RTX 4070 (12 GB)": 12,
+     "NVIDIA RTX 4060 Ti (16 GB)": 16,
+     "NVIDIA RTX 4060 (8 GB)": 8,
+     # NVIDIA RTX 30-series
+     "NVIDIA RTX 3090 (24 GB)": 24,
+     "NVIDIA RTX 3080 (10 GB)": 10,
+     "NVIDIA RTX 3070 (8 GB)": 8,
+     "NVIDIA RTX 3060 (12 GB)": 12,
+     "NVIDIA RTX 3050 (8 GB)": 8,
+     # Older / budget NVIDIA
+     "NVIDIA GTX 1660 (6 GB)": 6,
+     "NVIDIA GTX 1650 (4 GB)": 4,
+     # AMD
+     "AMD RX 7900 XTX (24 GB)": 24,
+     "AMD RX 7800 XT (16 GB)": 16,
+     "AMD RX 7600 (8 GB)": 8,
+     "AMD RX 6700 XT (12 GB)": 12,
+     # Laptop integrated (no real VRAM — uses shared system RAM)
+     "Intel built-in graphics (no separate card)": 0,
+     "AMD built-in graphics (no separate card)": 0,
+ }
+
+ # Apple Silicon: there's no separate VRAM. The GPU shares system memory, and
+ # macOS lets it borrow a large slice. We treat it specially below.
+ APPLE_CHIPS: dict[str, int] = {
+     "Apple M1 / M2 / M3 / M4 (base)": 8,  # default RAM if they don't know
+     "Apple M-series Pro": 16,
+     "Apple M-series Max": 32,
+     "Apple M-series Ultra": 64,
+ }
+
+
+ @dataclass
+ class HardwareSpec:
+     """A machine, described just enough to reason about it."""
+
+     os: str = "windows"            # windows | macos | linux
+     ram_gb: float = 16.0           # system RAM
+     gpu_vendor: str = "none"       # nvidia | amd | apple | intel | none
+     vram_gb: float = 0.0           # dedicated GPU memory (0 if shared/none)
+     is_apple_silicon: bool = False
+     gpu_label: str = "No dedicated graphics card"
+     form_factor: str = "laptop"    # laptop | desktop | mac | sbc
+
+     # -- derived memory budgets -------------------------------------------
+
+     @property
+     def fast_budget_gb(self) -> float:
+         """Memory available on the *fast* path (GPU / Apple shared memory)."""
+         if self.is_apple_silicon:
+             # macOS lets the GPU use a large fraction of unified memory.
+             # ~70% is a safe, widely-quoted working figure.
+             return round(self.ram_gb * 0.70, 1)
+         if self.gpu_vendor in ("nvidia", "amd") and self.vram_gb > 0:
+             # Leave headroom for the display, driver, and other apps.
+             return round(self.vram_gb * 0.85, 1)
+         # Integrated graphics / CPU-only: no meaningful fast path.
+         return 0.0
+
+     @property
+     def os_reserve_gb(self) -> float:
+         """RAM we set aside for the operating system + other open programs.
+
+         Windows idles heavy; a headless Raspberry Pi barely uses anything.
+         Being honest here matters: too small a reserve over-promises.
+         """
+         if self.os == "linux":
+             # Linux idles lighter: 1 GB on a single-board computer, 2 GB otherwise.
+             return 1.0 if self.form_factor == "sbc" else 2.0
+         return {
+             "sbc": 1.0,
+             "mac": 3.0,
+             "desktop": 3.0,
+             "laptop": 3.5,
+         }.get(self.form_factor, 3.0)
+
+     @property
+     def total_budget_gb(self) -> float:
+         """The most a model can use if it spills onto ordinary RAM (slower)."""
+         if self.is_apple_silicon:
+             return self.fast_budget_gb  # unified memory — same pool
+         # Dedicated VRAM (fully usable on the fast path) PLUS a conservative
+         # slice of system RAM for CPU offload, after reserving room for the OS.
+         ram_for_model = max(0.0, self.ram_gb - self.os_reserve_gb) * 0.9
+         return round(self.vram_gb + ram_for_model, 1)
+
+     @property
+     def has_fast_path(self) -> bool:
+         return self.fast_budget_gb >= 1.0
+
+
+ def build_spec(
+     *,
+     computer_kind: str,
+     ram_gb: float,
+     gpu_choice: str,
+     apple_chip: str | None = None,
+ ) -> HardwareSpec:
+     """Turn friendly UI selections into a HardwareSpec.
+
+     computer_kind: "Windows laptop/desktop", "Mac", "Linux PC",
+                    "Raspberry Pi / mini PC"
+     gpu_choice:    a key from GPU_PRESETS, or one of the "don't know" options.
+     """
+     kind = computer_kind.lower()
+
+     # ---- Mac / Apple Silicon -------------------------------------------
+     if "mac" in kind:
+         chip = apple_chip or "Apple M1 / M2 / M3 / M4 (base)"
+         return HardwareSpec(
+             os="macos",
+             ram_gb=ram_gb,
+             gpu_vendor="apple",
+             vram_gb=0.0,
+             is_apple_silicon=True,
+             gpu_label=f"{chip} (shares your {ram_gb:g} GB of memory)",
+             form_factor="mac",
+         )
+
+     # ---- Raspberry Pi / tiny single-board ------------------------------
+     if "raspberry" in kind or "mini" in kind or "sbc" in kind:
+         return HardwareSpec(
+             os="linux",
+             ram_gb=ram_gb,
+             gpu_vendor="none",
+             vram_gb=0.0,
+             gpu_label="No dedicated graphics card (tiny computer)",
+             form_factor="sbc",
+         )
+
+     # ---- Windows / Linux PC with a possible discrete GPU ---------------
+     os_name = "linux" if "linux" in kind else "windows"
+     form = "desktop" if "desktop" in kind else "laptop"
+
+     vram = GPU_PRESETS.get(gpu_choice, 0.0)
+     if "nvidia" in gpu_choice.lower():
+         vendor = "nvidia"
+     elif "amd" in gpu_choice.lower() and "built-in" not in gpu_choice.lower():
+         vendor = "amd"
+     elif "built-in" in gpu_choice.lower():
+         vendor = "intel" if "intel" in gpu_choice.lower() else "amd"
+     else:
+         vendor = "none"
+
+     label = gpu_choice if vram > 0 else "No dedicated graphics card (built-in graphics only)"
+
+     return HardwareSpec(
+         os=os_name,
+         ram_gb=ram_gb,
+         gpu_vendor=vendor,
+         vram_gb=vram,
+         is_apple_silicon=False,
+         gpu_label=label,
+         form_factor=form,
+     )
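
The budget arithmetic in `HardwareSpec` can be checked by hand. The sketch below restates the same factors (70% of unified memory on Apple Silicon, 85% of VRAM on the fast path, 90% of RAM left after the OS reserve) as a standalone function. The name `memory_budgets` is hypothetical, and it simplifies one thing: it treats any non-zero VRAM as a fast path, whereas the class also checks the GPU vendor.

```python
def memory_budgets(
    ram_gb: float,
    vram_gb: float,
    os_reserve_gb: float,
    is_apple_silicon: bool = False,
) -> tuple[float, float]:
    """Return (fast_budget_gb, total_budget_gb) using the factors from HardwareSpec."""
    if is_apple_silicon:
        # Unified memory: the fast pool and the total pool are the same thing.
        fast = round(ram_gb * 0.70, 1)
        return fast, fast
    # Assumption: any dedicated VRAM counts as a fast path (vendor check omitted).
    fast = round(vram_gb * 0.85, 1) if vram_gb > 0 else 0.0
    # System RAM left after the OS reserve, scaled down by a further 10%.
    ram_for_model = max(0.0, ram_gb - os_reserve_gb) * 0.9
    return fast, round(vram_gb + ram_for_model, 1)


# Linux desktop, 32 GB RAM, 24 GB card, 2 GB OS reserve.
print(memory_budgets(32, 24, 2.0))
# 16 GB Mac: one shared pool.
print(memory_budgets(16, 0, 3.0, is_apple_silicon=True))
# 4 GB single-board computer with a 1 GB reserve and no GPU.
print(memory_budgets(4, 0, 1.0))
```

Note that the total counts the card's full VRAM while the fast budget counts only 85% of it, so the two figures are not simply "fast plus RAM".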
engine/runtimes.py ADDED
@@ -0,0 +1,91 @@
+ """
+ Runtimes: the actual programs that run a model on your machine.
+
+ We deliberately keep this list short and well-supported. For each machine we
+ surface TWO paths:
+
+   - the easiest path : a friendly app a non-technical person can install and
+                        click (Ollama / LM Studio). This is the default.
+   - the power path   : llama.cpp with GGUF files — more control, and the
+                        tool the hackathon's "Llama Champion" goal rewards.
+
+ Plus platform-native options where they genuinely help (MLX on Apple,
+ OpenVINO on Intel, vLLM on big Linux GPU boxes).
+ """
+
+ from dataclasses import dataclass
+
+
+ @dataclass(frozen=True)
+ class Runtime:
+     key: str
+     name: str
+     plain_what: str   # what it is, in one friendly line
+     difficulty: str   # "Easiest" | "Moderate" | "Advanced"
+     install_hint: str
+     site: str
+
+
+ RUNTIMES: dict[str, Runtime] = {
+     "ollama": Runtime(
+         "ollama", "Ollama",
+         "A simple app. You type one line and it downloads and runs a model.",
+         "Easiest", "Download the installer from ollama.com", "https://ollama.com"),
+     "lmstudio": Runtime(
+         "lmstudio", "LM Studio",
+         "A point-and-click app with a chat window — no typing commands.",
+         "Easiest", "Download from lmstudio.ai", "https://lmstudio.ai"),
+     "llamacpp": Runtime(
+         "llamacpp", "llama.cpp",
+         "The lightweight engine under the hood. Runs GGUF model files directly.",
+         "Advanced", "Build from source or grab a release on GitHub",
+         "https://github.com/ggml-org/llama.cpp"),
+     "mlx": Runtime(
+         "mlx", "MLX",
+         "Apple's own framework, built for Mac chips and their shared memory.",
+         "Moderate", "pip install mlx-lm", "https://github.com/ml-explore/mlx"),
+     "openvino": Runtime(
+         "openvino", "OpenVINO",
+         "Intel's toolkit that squeezes good speed out of Intel chips and NPUs.",
+         "Moderate", "pip install optimum[openvino]",
+         "https://docs.openvino.ai"),
+     "vllm": Runtime(
+         "vllm", "vLLM",
+         "A heavy-duty server for big Linux machines with strong NVIDIA GPUs.",
+         "Advanced", "pip install vllm", "https://docs.vllm.ai"),
+ }
+
+
+ def pick_runtimes(spec) -> list[Runtime]:
+     """Choose the runtimes worth recommending for this machine, best-first.
+
+     `spec` is a HardwareSpec. The first entry is the friendly default; the
+     list always includes llama.cpp (the power / badge path) where it makes
+     sense, and a platform-native option when one clearly helps.
+     """
+     out: list[Runtime] = []
+
+     # Easiest path first — works almost everywhere and wraps llama.cpp anyway.
+     out.append(RUNTIMES["ollama"])
+     out.append(RUNTIMES["lmstudio"])
+
+     if spec.is_apple_silicon:
+         out.append(RUNTIMES["mlx"])
+         out.append(RUNTIMES["llamacpp"])
+     elif spec.gpu_vendor == "intel" or (spec.gpu_vendor == "none" and spec.os == "windows"):
+         # Intel-leaning / CPU machines benefit from OpenVINO.
+         out.append(RUNTIMES["openvino"])
+         out.append(RUNTIMES["llamacpp"])
+     else:
+         out.append(RUNTIMES["llamacpp"])
+         # Big Linux NVIDIA box → mention the server-grade option.
+         if spec.os == "linux" and spec.gpu_vendor == "nvidia" and spec.vram_gb >= 16:
+             out.append(RUNTIMES["vllm"])
+
+     # De-duplicate while preserving order.
+     seen, deduped = set(), []
+     for r in out:
+         if r.key not in seen:
+             seen.add(r.key)
+             deduped.append(r)
+     return deduped
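
The closing loop in `pick_runtimes` is an order-preserving de-duplication keyed on `Runtime.key`. As an illustration only, the same idea can be sketched with a dictionary, which keeps insertion order in Python 3.7 and later. The helper name `dedupe_by_key` is hypothetical and not part of this commit.

```python
from typing import Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")


def dedupe_by_key(items: Iterable[T], key: Callable[[T], Hashable]) -> list[T]:
    """Keep the first item seen for each key, in original order."""
    first_seen: dict[Hashable, T] = {}
    for item in items:
        # setdefault only stores the item if this key has not appeared yet.
        first_seen.setdefault(key(item), item)
    return list(first_seen.values())


# Plain strings stand in for Runtime objects; the key is the string itself.
print(dedupe_by_key(["ollama", "lmstudio", "ollama", "llamacpp"], key=lambda s: s))
```

Either form works; the explicit `seen` set in the commit is arguably easier to read for a list this short.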
engine/ui_adapter.py ADDED
@@ -0,0 +1,254 @@
+ """
+ Adapter: turn a frontend payload into the exact JSON the static/ frontend
+ renders, using the REAL deterministic engine (not the placeholder).
+
+ The frontend speaks one contract (verdicts ``great|tight|no``, an options list,
+ a gauge, tools, commands). The engine speaks another (``works_now|compromises|
+ dont_bother`` over ``ModelVerdict`` objects). This module is the seam between
+ them, so neither side has to know about the other.
+
+ Scope: the engine currently models the **LLM** family only (its model classes
+ are all text models). Vision / image-gen / audio / data goals still fall back to
+ the input-aware placeholder in ``app.py`` — that boundary is deliberate and
+ honest, not an oversight. ``is_llm_usecase`` below is the routing switch.
+ """
+
+ import re
+
+ from .advisor import (
+     advise,
+     VERDICT_WORKS,
+     VERDICT_COMPROMISE,
+     VERDICT_NO,
+ )
+ from .catalogue import MODEL_CLASSES
+ from .explain import speed_hint, ollama_command, llamacpp_command
+ from .hardware import HardwareSpec
+
+ # Bands: engine verdict -> the colour-key the frontend understands.
+ _VERDICT_UI = {
+     VERDICT_WORKS: "great",
+     VERDICT_COMPROMISE: "tight",
+     VERDICT_NO: "no",
+ }
+ _VERDICT_WORD = {"great": "Runs great", "tight": "Tight, but works", "no": "Won't fit"}
+
+ # Gauge breakdown colours (match the placeholder palette in app.py / style.css).
+ _C_MODEL = "#818CF8"  # the weights themselves
+ _C_WORK = "#868E9C"   # chat memory + working space
+
+ # Goals the engine can answer for real. Everything LLM-shaped maps onto a chat
+ # context; "translate"/"custom" are still language models, so they route here.
+ _LLM_USECASES = {
+     "chat", "writing", "coding", "agents", "rag", "finetune", "translate", "custom",
+ }
+ # The engine's own use-case keys. Frontend ids that aren't 1:1 get mapped.
+ _USECASE_ALIAS = {"translate": "chat", "custom": "chat"}
+
+
+ def is_llm_usecase(usecase: str) -> bool:
+     """True if the real engine should answer this goal (vs. the placeholder)."""
+     return usecase in _LLM_USECASES
+
+
+ # --------------------------------------------------------------------------
+ # Frontend payload -> HardwareSpec
+ # --------------------------------------------------------------------------
+
+ def _num_in(text: str) -> float:
+     """First '<number> GB' figure in a string, else 0."""
+     m = re.search(r"(\d+(?:\.\d+)?)\s*GB", text or "", re.I)
+     return float(m.group(1)) if m else 0.0
+
+
+ def spec_from_payload(p: dict) -> HardwareSpec:
+     """Build a HardwareSpec straight from the frontend's gather() payload.
+
+     We construct the spec directly rather than going through build_spec(),
+     because the frontend carries the vendor and a VRAM-bearing label already,
+     and an Advanced box can override VRAM outright.
+     """
+     computer = (p.get("computer") or "Windows laptop")
+     kind = computer.lower()
+     provider = (p.get("provider") or "none").lower()
+     ram = float(p.get("ram_gb") or 16)
+
+     # --- Apple Silicon: unified memory, no separate VRAM -------------------
+     if "mac" in kind or provider == "apple":
+         return HardwareSpec(
+             os="macos", ram_gb=ram, gpu_vendor="apple", vram_gb=0.0,
+             is_apple_silicon=True,
+             gpu_label=f"Apple Silicon (shares your {ram:g} GB of memory)",
+             form_factor="mac",
+         )
+
+     # --- Raspberry Pi / mini PC -------------------------------------------
+     if "raspberry" in kind or "mini" in kind:
+         return HardwareSpec(
+             os="linux", ram_gb=ram, gpu_vendor="none", vram_gb=0.0,
+             gpu_label="No dedicated graphics card (tiny computer)",
+             form_factor="sbc",
+         )
+
+     os_name = "linux" if "linux" in kind else "windows"
+     form = "desktop" if "desktop" in kind else "laptop"
+
+     # VRAM: Advanced override wins; else the picker label; else a paste guess.
+     vram = p.get("vram_gb")
+     if not vram:
+         vram = _num_in(p.get("gpu", "")) or _num_in(p.get("paste", ""))
+     vram = float(vram or 0)
+
+     if provider == "nvidia":
+         vendor = "nvidia"
+     elif provider == "amd":
+         vendor = "amd"
+     elif provider == "intel":
+         vendor = "intel"
+     else:
+         vendor = "none"  # "none" / "unsure": treat as no fast path
+         vram = 0.0
+
+     label = p.get("gpu") or "No dedicated graphics card (built-in graphics only)"
+     return HardwareSpec(
+         os=os_name, ram_gb=ram, gpu_vendor=vendor, vram_gb=vram,
+         is_apple_silicon=False, gpu_label=label, form_factor=form,
+     )
+
+
+ # --------------------------------------------------------------------------
+ # Advice -> frontend JSON
+ # --------------------------------------------------------------------------
+
+ def _where(spec: HardwareSpec, verdict: str) -> str:
+     if verdict == "great":
+         if spec.is_apple_silicon:
+             return "on your Mac"
+         if spec.has_fast_path:
+             return "on your graphics card"
+         return "on your computer"
+     if verdict == "tight":
+         return "using your computer's memory"
+     return ""
+
+
+ def advise_for_ui(payload: dict, catalogue_version: str) -> dict:
+     """Run the real engine and shape its output for static/app.js render()."""
+     usecase = _USECASE_ALIAS.get(payload.get("usecase", "chat"), payload.get("usecase", "chat"))
+     spec = spec_from_payload(payload)
+     adv = advise(spec, usecase)
+
+     fast = spec.fast_budget_gb
+     total = spec.total_budget_gb
+
+     # ---- Options table (already biggest -> smallest from the engine) -----
+     options = []
+     for v in adv.verdicts:
+         ui_v = _VERDICT_UI[v.verdict]
+         options.append({
+             "verdict": ui_v,
+             "model": v.model.plain_name,
+             "desc": v.model.good_for,
+             "setting": v.quant.plain_name,
+             "memory": "Too big" if v.verdict == VERDICT_NO else f"{v.estimate.total_gb:g} GB",
+             "feel": speed_hint(v, spec),
+         })
+
+     # ---- Headline ---------------------------------------------------------
+     h = adv.headline
+     hv = _VERDICT_UI[h.verdict] if h else "no"
+     where = _where(spec, hv)
+
+     if h and hv == "great":
+         headline = f"Yes, you can run a {h.model.plain_name} model {where}, today."
+     elif h and hv == "tight":
+         headline = f"Sort of. A {h.model.plain_name} model will run {where}, with trade-offs."
+     else:
+         headline = "This goal is a stretch on this machine. Here's the honest picture."
+
+     if h:
+         est = h.estimate
+         need_gb = est.total_gb
+         detail = (
+             f"For this goal, the sweet spot is a <b>{h.model.plain_name}</b> model "
+             f"at the <b>{h.quant.plain_name}</b> setting. {h.model.good_for} "
+             f"It needs about <b>{need_gb:g} GB</b> "
+             f"(model {est.weights_gb:g} GB + chat memory {est.kv_cache_gb:g} GB "
+             f"+ working space {est.overhead_gb:g} GB), and you have roughly "
+             f"<b>{fast:g} GB</b> fast / <b>{total:g} GB</b> total to work with."
+         )
+     else:
+         # Nothing fits even squeezed: be honest, show the shortfall.
+         smallest = adv.verdicts[-1]
+         need_gb = smallest.estimate.total_gb
+         detail = (
+             f"Even the smallest model here needs about <b>{need_gb:g} GB</b>, "
+             f"but this machine can offer only about <b>{total:g} GB</b> once the "
+             f"operating system has its share. That's not a failure — small "
+             f"computers just have small budgets. Adding memory, or a free cloud "
+             f"option, would open this up."
+         )
+
+     # Notes: use-case caveat + the headline's own honest footnotes.
+     note_bits = []
+     if adv.use_case.note:
+         note_bits.append(adv.use_case.note)
+     if h and h.notes:
+         note_bits.extend(h.notes)
+     note = " ".join(note_bits)
+
+     # ---- Gauge ------------------------------------------------------------
+     scale = max(total, need_gb, 1) * 1.05
+     if h:
+         model_part = round(h.estimate.weights_gb, 1)
+         work_part = round(need_gb - model_part, 1)
+     else:
+         model_part = round(need_gb * 0.8, 1)
+         work_part = round(need_gb * 0.2, 1)
+     gauge = {
+         "need_gb": f"{need_gb:g} GB needed",
+         "fast_gb": f"{fast:g} GB",
+         "total_gb": f"{total:g} GB",
+         "fill_pct": round(need_gb / scale * 100, 1),
+         "mark_pct": round(fast / scale * 100, 1),
+         "breakdown": [
+             {"label": f"Model {model_part:g} GB", "color": _C_MODEL},
+             {"label": f"Working space {work_part:g} GB", "color": _C_WORK},
+         ],
+     }
+
+     # ---- Tools (runtimes) -------------------------------------------------
+     tools = [{
+         "name": r.name, "what": r.plain_what,
+         "install": r.install_hint, "tag": r.difficulty,
+     } for r in adv.runtimes]
+
+     # ---- Commands ---------------------------------------------------------
+     cmd_intro = ("These get you a running model in minutes. Pick the easy one or "
+                  "the power one; they do the same job.")
+     if h:
+         commands = {"intro": cmd_intro, "items": [
+             {"label": "Easy way (Ollama)", "code": ollama_command(h)},
+             {"label": "Power way (llama.cpp)", "code": llamacpp_command(h)},
+         ]}
+     else:
+         tiny = MODEL_CLASSES[0]
+         commands = {"intro": cmd_intro, "items": [
+             {"label": "Smallest you could try (Ollama)", "code": f"ollama run {tiny.ollama_tag}"},
+         ]}
+
+     return {
+         "catalogue_version": catalogue_version,
+         "verdict": hv,
+         "verdict_word": _VERDICT_WORD[hv],
+         "headline": headline,
+         "detail": detail,
+         "note": note,
+         "gauge": gauge,
+         "options": options,
+         "tools": tools,
+         "commands": commands,
+         # Echoed back so the model brick can narrate the SAME numbers the UI shows.
+         "meets_goal": adv.meets_goal,
+         "use_case": adv.use_case.plain_name,
+     }
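
The gauge percentages in `advise_for_ui` are easiest to follow with numbers plugged in. The sketch below isolates just that arithmetic: the bar is scaled to the larger of the total budget and the requirement (never below 1 GB), with 5% headroom so the fill never touches the right edge. The function name `gauge_percentages` is hypothetical; the formula is the one from the adapter.

```python
def gauge_percentages(need_gb: float, fast_gb: float, total_gb: float) -> tuple[float, float]:
    """Return (fill_pct, mark_pct) for the memory gauge."""
    # Scale to whichever is larger, plus 5% headroom; the floor of 1 avoids
    # dividing by zero on a machine with no usable budget.
    scale = max(total_gb, need_gb, 1) * 1.05
    fill_pct = round(need_gb / scale * 100, 1)   # how much of the bar the model fills
    mark_pct = round(fast_gb / scale * 100, 1)   # where the "fast memory" marker sits
    return fill_pct, mark_pct


# A model needing 5.5 GB on a machine with 10.2 GB fast / 22 GB total.
print(gauge_percentages(5.5, 10.2, 22))
# A model needing 6.5 GB on a machine with no GPU and 4.9 GB total: the bar
# is scaled to the requirement, so the fill still stays under 100%.
print(gauge_percentages(6.5, 0.0, 4.9))
```

When the requirement exceeds the budget, the scale follows the requirement rather than the budget, so an over-budget model still renders as a nearly full bar instead of overflowing.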
model_brick.py ADDED
@@ -0,0 +1,314 @@
+ """
+ The model brick: a closed-context narrator.
+
+ It takes the deterministic engine's structured advice (the exact JSON the UI
+ already shows) plus a plain-English follow-up question, and re-voices those
+ facts simply. It NEVER invents numbers, models, or benchmarks — every figure it
+ states must already be in the facts. All arithmetic stays in engine/.
+
+ Serving (Hugging Face Spaces, ZeroGPU):
+   app.py exposes ask() via ``@app.api(name="ask")`` so it runs on Gradio's
+   queue; _generate() below is wrapped in ``@spaces.GPU`` so a GPU is allocated
+   per call and released on return. The model is moved to CUDA at import (safe
+   under ZeroGPU's CUDA emulation).
+
+ Off the Space (local dev, no GPU, or a boot failure), we never download an 8 GB
+ model. ask() degrades to a deterministic narrator that re-voices the facts with
+ no AI in the loop — so the /api/ask contract always answers, and always stays
+ grounded.
+ """
+
+ import json
+ import os
+ import re
+ import sys
+
+
+ def _log(msg: str) -> None:
+     print(f"[FitCheck] {msg}", file=sys.stderr, flush=True)
+
+ # Default to the prize path (NVIDIA Nemotron Quest). Swap to a clean Apache
+ # fallback with no code change: FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507
+ MODEL_ID = os.environ.get("FITCHECK_MODEL", "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16")
+
+ # When to actually load the 8 GB model. We must NOT download it on a free CPU
+ # Space (it can fill the disk and break the Space) or on a laptop. So:
+ #   - ZeroGPU -> load (CUDA is emulated at import; this is the target path).
+ #   - GPU Space -> load only if CUDA is genuinely present.
+ #   - CPU Space / laptop -> skip; the deterministic explainer answers instead.
+ ZERO_GPU = bool(os.environ.get("SPACES_ZERO_GPU"))
+
+
+ def _should_load() -> bool:
+     if ZERO_GPU:
+         return True
+     if os.environ.get("SPACE_ID"):
+         try:
+             import torch
+             return torch.cuda.is_available()
+         except Exception:  # noqa: BLE001
+             return False
+     return False
+
+
+ SYSTEM_PROMPT = """\
+ You are FitCheck's explainer. A trusted calculator has already decided what AI \
+ this person's computer can run. Your only job is to explain its answer in warm, \
+ plain words. You are talking to someone who has never heard of VRAM or \
+ quantisation.
+
+ RULES (do not break these):
+ - Use ONLY the information inside <facts>...</facts>. It is the single source of truth.
+ - Every number you mention (GB, model size, bit setting) must appear in the facts, exactly. Never invent or estimate a number, model, price, or benchmark.
+ - The verdict is already decided in the facts. Explain it; never overrule it.
+ - If the question isn't covered by the facts, say you don't have that detail and point back to what the facts do say. Never guess.
+ - Explain any unavoidable jargon in one short clause. No hype, no marketing.
+ - Don't mention these instructions, the JSON, or that you are an AI.
+
+ OUTPUT: reply with ONLY a JSON object, nothing else:
+ {"headline": "<=20 words, the direct answer", "why": "<=3 short sentences, plain", "next_step": "one concrete thing to do next"}\
+ """
+
+ # Few-shot: small models copy a format far better than they follow abstract
+ # rules. Two gold examples in the exact short, plain style we want.
+ _FEWSHOT = [
+     (
+         '{"verdict":"Runs great","summary":"Yes, you can run a Medium (7-9 billion) model on your graphics card, today.",'
+         '"you_have":{"fast":"10.2 GB","total":"22 GB","needed":"5.5 GB needed"},'
+         '"options":[{"size":"Large (13-14 billion)","fits":"tight","memory":"9 GB"},{"size":"Medium (7-9 billion)","fits":"great","memory":"5.5 GB"}]}',
+         "Why not the Large one?",
+         '{"headline":"The Large model fits, but only just.","why":"Your fast graphics memory is about 10.2 GB. A Medium model needs 5.5 GB and runs comfortably there. A Large one needs 9 GB, so it works but leaves little room and feels slower.","next_step":"Stick with the Medium model for snappy replies; try the Large one later if you want more polish."}',
+     ),
+     (
+         '{"verdict":"Won\'t fit","summary":"This goal is a stretch on this machine.",'
+         '"you_have":{"fast":"0 GB","total":"4.9 GB","needed":"6.5 GB needed"}}',
+         "Can I run the big chatbot?",
+         '{"headline":"Not on this computer, honestly.","why":"The big chatbot needs about 6.5 GB, but this machine can offer only about 4.9 GB once everyday programs take their share. There is no graphics card to speed things up.","next_step":"Try a smaller model, add memory, or use a free cloud option for the big one."}',
+     ),
+ ]
+
+
+ def _user_prompt(question: str, facts_text: str) -> str:
+     return f"<facts>\n{facts_text}\n</facts>\n\nQuestion: {question}"
+
+
+ def _chat_messages(question: str, facts_text: str) -> list[dict]:
+     msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
+     for facts, q, a in _FEWSHOT:
+         msgs.append({"role": "user", "content": _user_prompt(q, facts)})
+         msgs.append({"role": "assistant", "content": a})
+     msgs.append({"role": "user", "content": _user_prompt(question, facts_text)})
+     return msgs
+
+
+ # --------------------------------------------------------------------------
+ # Facts handling (shared by the model path and the fallback)
+ # --------------------------------------------------------------------------
+
+ def _strip_html(s: str) -> str:
+     return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", s or "")).strip()
+
+
+ def _parse_facts(facts) -> dict:
+     if isinstance(facts, dict):
+         return facts
+     if not facts:
+         return {}
+     try:
+         return json.loads(facts)
+     except (json.JSONDecodeError, TypeError):
+         return {}
+
+
+ def compact_facts(facts: dict) -> str:
+     """Flatten the advise() result into the small, flat JSON the model sees.
+
+     Flat JSON (not prose) makes grounding a near string-match and keeps the
+     prompt short. We pass only what a follow-up answer could need.
+     """
+     g = facts.get("gauge") or {}
+     compact = {
+         "verdict": facts.get("verdict_word") or facts.get("verdict"),
+         "summary": facts.get("headline"),
+         "explanation": _strip_html(facts.get("detail", "")),
+         "goal": facts.get("use_case"),
+         "you_have": {
+             "fast": g.get("fast_gb"),
+             "total": g.get("total_gb"),
+             "needed": g.get("need_gb"),
+         },
+         "options": [
+             {"size": o.get("model"), "fits": o.get("verdict"),
+              "memory": o.get("memory"), "setting": o.get("setting"),
+              "speed": o.get("feel")}
+             for o in (facts.get("options") or [])
+         ],
+         "how_to_run": [
+             {"label": c.get("label"), "command": c.get("code")}
+             for c in ((facts.get("commands") or {}).get("items") or [])
+         ],
+         "note": facts.get("note") or "",
+     }
+     # Drop empties so the model isn't tempted to fill nulls.
+     compact = {k: v for k, v in compact.items() if v not in (None, "", [], {})}
+     if "you_have" in compact:
+         compact["you_have"] = {k: v for k, v in compact["you_have"].items() if v}
+     return json.dumps(compact, ensure_ascii=False)
+
+
+ # --------------------------------------------------------------------------
+ # Faithfulness gate (also used by tests)
+ # --------------------------------------------------------------------------
+
+ # A "figure" = a number tied to a memory/size/quant unit — the kind a model
+ # could dangerously invent. Bare ordinals ("first", "3 steps") are ignored.
+ _FIGURE = re.compile(r"(\d+(?:\.\d+)?)\s*(gb|-?bit|billion|b)\b", re.I)
+
+
+ def leaked_figures(answer_text: str, facts_text: str) -> list[str]:
+     """Numbers-with-units in the answer that don't appear in the facts."""
+     facts_nums = set(re.findall(r"\d+(?:\.\d+)?", facts_text))
+     return [num for num, _unit in _FIGURE.findall(answer_text)
+             if num not in facts_nums]
173
+
174
+
175
+ def _answer_text(ans: dict) -> str:
176
+ return " ".join(str(ans.get(k, "")) for k in ("headline", "why", "next_step"))
177
+
178
+
179
+ def _parse_json_answer(raw: str) -> dict | None:
180
+ """Pull the first {...} object out of the model's text and validate shape."""
181
+ if not raw:
182
+ return None
183
+ m = re.search(r"\{.*\}", raw, re.DOTALL)
184
+ if not m:
185
+ return None
186
+ try:
187
+ obj = json.loads(m.group(0))
188
+ except json.JSONDecodeError:
189
+ return None
190
+ if not isinstance(obj, dict):
191
+ return None
192
+ out = {k: str(obj.get(k, "")).strip() for k in ("headline", "why", "next_step")}
193
+ return out if out["headline"] or out["why"] else None
194
+
195
+
196
+ # --------------------------------------------------------------------------
197
+ # Deterministic fallback narrator (no AI) — local dev + safety net
198
+ # --------------------------------------------------------------------------
199
+
200
+ def _fallback(question: str, facts: dict) -> dict:
201
+ headline = facts.get("headline") or "Here's the honest picture for your machine."
202
+ why = _strip_html(facts.get("detail", ""))
203
+ note = facts.get("note", "")
204
+ if note:
205
+ why = f"{why} {note}".strip()
206
+ items = (facts.get("commands") or {}).get("items") or []
207
+ if items and items[0].get("code"):
208
+ next_step = f"Start with: {items[0]['code']}"
209
+ else:
210
+ next_step = "Pick your hardware and goal above to see exact steps."
211
+ return {
212
+ "headline": headline,
213
+ "why": why or "Fill in your computer and goal above, then ask again.",
214
+ "next_step": next_step,
215
+ "fallback": True,
216
+ }
217
+
218
+
219
+ # --------------------------------------------------------------------------
220
+ # Model load (Space only) + public entry point
221
+ # --------------------------------------------------------------------------
222
+
223
+ _GENERATE = None # set to a @spaces.GPU-wrapped fn when the GPU stack imports
224
+ MODEL_READY = False # GPU stack imported; the model itself loads lazily (below)
225
+ LOAD_ERROR = ""
226
+
227
+ # Loaded on the FIRST /ask call, inside the GPU context — NOT at import. Loading
228
+ # the 8 GB model at import blocked the Space's boot health window and the process
229
+ # got killed (RUNTIME_ERROR with no traceback). Lazy loading lets the app launch
230
+ # instantly; the first question pays the one-time download/load cost, and ask()'s
231
+ # try/except falls back to the deterministic narrator if that first call is slow.
232
+ _state = {"tok": None, "model": None}
233
+
234
+ if _should_load():
235
+ try:
236
+ import spaces # noqa: E402
237
+ import torch # noqa: E402
238
+ from transformers import AutoModelForCausalLM, AutoTokenizer # noqa: E402
239
+
240
+ def _load():
241
+ # Prefer transformers' NATIVE NemotronH class (it guards the
242
+ # mamba-ssm import and falls back to a pure-PyTorch path, so it runs
243
+ # without the painful mamba-ssm CUDA build). Only if that's
244
+ # unavailable do we use NVIDIA's trust_remote_code file, which
245
+ # HARD-requires mamba-ssm.
246
+ try:
247
+ tok = AutoTokenizer.from_pretrained(MODEL_ID)
248
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16)
249
+ except Exception:
250
+ tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
251
+ model = AutoModelForCausalLM.from_pretrained(
252
+ MODEL_ID, dtype=torch.bfloat16, trust_remote_code=True)
253
+ _state["tok"] = tok
254
+ _state["model"] = model.to("cuda").eval()
255
+
256
+ @spaces.GPU(duration=120)
257
+ def _generate(question: str, facts_text: str) -> str:
258
+ if _state["model"] is None:
259
+ _load()
260
+ tok, model = _state["tok"], _state["model"]
261
+
262
+ msgs = _chat_messages(question, facts_text)
263
+ # return_dict=True -> a BatchEncoding (input_ids + attention_mask) we
264
+ # can unpack with **inputs. Passing the BatchEncoding positionally to
265
+ # generate() makes it do .shape on a dict -> AttributeError.
266
+ kw = dict(add_generation_prompt=True, return_tensors="pt", return_dict=True)
267
+ try:
268
+ inputs = tok.apply_chat_template(msgs, enable_thinking=False, **kw)
269
+ except TypeError:
270
+ inputs = tok.apply_chat_template(msgs, **kw)
271
+ inputs = inputs.to("cuda")
272
+ prompt_len = inputs["input_ids"].shape[1]
273
+ with torch.no_grad():
274
+ out = model.generate(
275
+ **inputs, max_new_tokens=320, do_sample=False,
276
+ pad_token_id=tok.eos_token_id,
277
+ )
278
+ return tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
279
+
280
+ _GENERATE = _generate
281
+ MODEL_READY = True
282
+ except Exception as e: # noqa: BLE001 — any failure → graceful fallback
283
+ LOAD_ERROR = repr(e)
284
+
285
+ _log(f"model brick: should_load={_should_load()} MODEL_READY={MODEL_READY} "
286
+ f"LOAD_ERROR={LOAD_ERROR or 'none'} MODEL_ID={MODEL_ID}")
287
+
288
+
289
+ def ask(question: str, facts: str = "") -> dict:
290
+ """Answer a follow-up question, grounded in the engine's facts.
291
+
292
+ Returns {"headline", "why", "next_step"}. Uses the model on a Space; falls
293
+ back to a deterministic, grounded narrator otherwise. If the model invents a
294
+ figure that isn't in the facts, we reject its answer and fall back too.
295
+ """
296
+ facts_dict = _parse_facts(facts)
297
+ facts_text = compact_facts(facts_dict)
298
+ question = (question or "").strip() or "What can I run?"
299
+
300
+ if _GENERATE is not None:
301
+ try:
302
+ raw = _GENERATE(question, facts_text)
303
+ ans = _parse_json_answer(raw)
304
+ leaked = leaked_figures(_answer_text(ans), facts_text) if ans else []
305
+ if ans and not leaked:
306
+ ans["fallback"] = False
307
+ return ans
308
+ _log(f"answer rejected (parsed={bool(ans)} leaked={leaked}); raw={raw[:600]!r}")
309
+ except Exception as e: # noqa: BLE001 — never 500 the user; degrade instead
310
+ import traceback
311
+ _log(f"model generate failed: {e!r}")
312
+ traceback.print_exc()
313
+
314
+ return _fallback(question, facts_dict)
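The faithfulness gate in the file above is easiest to follow with concrete inputs. The sketch below copies `_FIGURE` and `leaked_figures` from the diff and runs them on a facts string and three answers; the facts and answers are invented for illustration and are not output from the real engine or model.

```python
import json
import re

# Copied from the diff above: a number followed by a memory/size/quant unit.
_FIGURE = re.compile(r"(\d+(?:\.\d+)?)\s*(gb|-?bit|billion|b)\b", re.I)


def leaked_figures(answer_text: str, facts_text: str) -> list[str]:
    """Numbers-with-units in the answer that don't appear in the facts."""
    facts_nums = set(re.findall(r"\d+(?:\.\d+)?", facts_text))
    return [num for num, _unit in _FIGURE.findall(answer_text)
            if num not in facts_nums]


# Hypothetical facts, shaped loosely like compact_facts() output.
facts_text = json.dumps(
    {"you_have": {"fast": 8, "needed": 6}, "options": [{"setting": "4-bit"}]}
)

# Hypothetical model answers.
grounded = "It fits: you need 6 GB and have 8 GB, using the 4-bit setting."
invented = "You would need 12 GB for the 7B model."
unitless = "Follow the 3 steps above."

print(leaked_figures(grounded, facts_text))  # every figure is in the facts
print(leaked_figures(invented, facts_text))  # 12 and 7 are not in the facts
print(leaked_figures(unitless, facts_text))  # "3 steps" has no unit, so it is ignored
```

Two properties of the gate follow from the code: only the number is compared, not its unit, and the comparison is on the literal string, so an answer that writes `8.0 GB` when the facts say `8` would be rejected.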
requirements.txt CHANGED
@@ -1 +1,13 @@
1
- gradio==6.16.0
1
+ # FitCheck — UI brick + deterministic engine + model brick (ZeroGPU).
2
+ gradio==6.16.0 # gr.Server (FastAPI) + @app.api queue + ZeroGPU glue
3
+ spaces # @spaces.GPU — ZeroGPU allocation on Hugging Face
4
+ torch>=2.8.0 # ZeroGPU requirement (>=2.8)
5
+ transformers>=4.51.0 # >=4.51 for Qwen3; also runs Nemotron (trust_remote_code)
6
+ accelerate # device placement / efficient loading
7
+
8
+ # Optional speed-up for the Nemotron Mamba-2 kernels. Left unpinned because
9
+ # they compile against CUDA at build time and can fail; transformers falls back
10
+ # to a pure-PyTorch path without them. If the model fails to boot, the clean
11
+ # escape hatch is the env var: FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507
12
+ # mamba-ssm
13
+ # causal-conv1d
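The escape hatch mentioned in the comments above relies on two environment variables: `FITCHECK_MODEL` to swap the model and `SPACE_ID` to decide whether to load one at all. The definitions of `MODEL_ID` and `_should_load()` are not part of this section, so the following is only a minimal sketch of how such a lookup could work; `resolve_model_id`, `should_load`, and `DEFAULT_MODEL` are hypothetical names.

```python
import os

# Assumption: the default matches the model listed in the Space's README metadata.
DEFAULT_MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"


def resolve_model_id(env=None) -> str:
    """Return the FITCHECK_MODEL override if set and non-blank, else the default."""
    env = os.environ if env is None else env
    return (env.get("FITCHECK_MODEL") or "").strip() or DEFAULT_MODEL


def should_load(env=None) -> bool:
    """Load the model only on a Space, where SPACE_ID is set (hypothetical rule)."""
    env = os.environ if env is None else env
    return bool(env.get("SPACE_ID"))


# Passing a plain dict makes the behaviour easy to check without touching os.environ.
print(resolve_model_id({}))
print(resolve_model_id({"FITCHECK_MODEL": "Qwen/Qwen3-4B-Instruct-2507"}))
print(should_load({}), should_load({"SPACE_ID": "user/fitcheck"}))
```

Taking the environment as a parameter keeps both helpers testable, and local runs, where `SPACE_ID` is absent, never import the GPU stack.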
static/app.js CHANGED
@@ -74,6 +74,7 @@ const GPUS = {
74
 
75
  const $ = (s) => document.querySelector(s);
76
  const state = { computer: "Windows laptop", provider: "none", priority: "balanced", usecase: "chat", checked: false };
 
77
 
78
  // ---- Build the use-case picker -------------------------------------------
79
  function buildPicker() {
@@ -196,6 +197,7 @@ const VMAP = {
196
  };
197
 
198
  function render(d) {
 
199
  const v = VMAP[d.verdict] || VMAP.tight;
200
  const g = d.gauge || {};
201
  $("#cat-version").textContent = d.catalogue_version || "—";
@@ -255,6 +257,21 @@ function render(d) {
255
  ${cmds ? `<div class="section-title">Copy-paste to get started</div>
256
  <p class="cmd-intro">${d.commands.intro || ""}</p>
257
  <div class="cmd">${cmds}</div>` : ""}
258
  </div>`;
259
 
260
  hydrate($("#results"));
@@ -263,6 +280,81 @@ function render(d) {
263
  b.textContent = "Copied ✓"; b.classList.add("done");
264
  setTimeout(() => { b.textContent = "Copy"; b.classList.remove("done"); }, 1500);
265
  }));
266
  }
267
 
268
  // ---- Init -----------------------------------------------------------------
 
74
 
75
  const $ = (s) => document.querySelector(s);
76
  const state = { computer: "Windows laptop", provider: "none", priority: "balanced", usecase: "chat", checked: false };
77
+ let lastAdvice = null; // the most recent /api/advise result — facts the model explains
78
 
79
  // ---- Build the use-case picker -------------------------------------------
80
  function buildPicker() {
 
197
  };
198
 
199
  function render(d) {
200
+ lastAdvice = d;
201
  const v = VMAP[d.verdict] || VMAP.tight;
202
  const g = d.gauge || {};
203
  $("#cat-version").textContent = d.catalogue_version || "—";
 
257
  ${cmds ? `<div class="section-title">Copy-paste to get started</div>
258
  <p class="cmd-intro">${d.commands.intro || ""}</p>
259
  <div class="cmd">${cmds}</div>` : ""}
260
+
261
+ <div class="section-title">Ask a follow-up <span class="sub">explained in plain words, from the numbers above</span></div>
262
+ <div class="ask">
263
+ <div class="ask-row">
264
+ <input id="ask-input" type="text" autocomplete="off"
265
+ placeholder="e.g. Why not the bigger model? What does 4-bit mean?" />
266
+ <button id="ask-send" class="ask-btn" title="Ask"><span class="ic" data-ic="arrow"></span></button>
267
+ </div>
268
+ <div class="ask-chips">
269
+ <button class="ask-chip">Why this model?</button>
270
+ <button class="ask-chip">What does the setting mean?</button>
271
+ <button class="ask-chip">Will it feel fast?</button>
272
+ </div>
273
+ <div id="ask-answer" class="ask-answer" hidden></div>
274
+ </div>
275
  </div>`;
276
 
277
  hydrate($("#results"));
 
280
  b.textContent = "Copied ✓"; b.classList.add("done");
281
  setTimeout(() => { b.textContent = "Copy"; b.classList.remove("done"); }, 1500);
282
  }));
283
+ wireAsk();
284
+ }
285
+
286
+ // ---- Follow-up: the model brick (grounded explainer) ---------------------
287
+ function wireAsk() {
288
+ const input = $("#ask-input"), send = $("#ask-send");
289
+ if (!input || !send) return;
290
+ const go = () => askQuestion(input.value);
291
+ send.addEventListener("click", go);
292
+ input.addEventListener("keydown", e => { if (e.key === "Enter") go(); });
293
+ $("#results").querySelectorAll(".ask-chip").forEach(c =>
294
+ c.addEventListener("click", () => { input.value = c.textContent; askQuestion(c.textContent); }));
295
+ }
296
+
297
+ async function askQuestion(question) {
298
+ question = (question || "").trim();
299
+ const box = $("#ask-answer");
300
+ if (!question || !box) return;
301
+ box.hidden = false;
302
+ box.innerHTML = `<div class="ans-loading"><span class="spinner"></span>Thinking it through…</div>`;
303
+ try {
304
+ const a = await callAsk(question, JSON.stringify(lastAdvice || {}));
305
+ renderAnswer(box, a);
306
+ } catch (e) {
307
+ box.innerHTML = `<div class="ans-card"><p>Couldn't reach the explainer just now. The verdict and numbers above still stand.</p></div>`;
308
+ }
309
+ }
310
+
311
+ function renderAnswer(box, a) {
312
+ a = a || {};
313
+ const esc = (s) => String(s).replace(/[&<>"']/g, (c) => `&#${c.charCodeAt(0)};`); // model text is untrusted: escape before innerHTML
314
+ box.innerHTML = `
315
+ <div class="ans-card reveal">
316
+ ${a.headline ? `<h3>${esc(a.headline)}</h3>` : ""}
317
+ ${a.why ? `<p>${esc(a.why)}</p>` : ""}
318
+ ${a.next_step ? `<div class="ans-next"><span class="ic" data-ic="arrow"></span><span>${esc(a.next_step)}</span></div>` : ""}
319
+ ${a.fallback ? `<div class="ans-tag">Quick explainer (the AI model isn't loaded in this environment)</div>` : ""}
320
+ </div>`;
321
+ hydrate(box);
322
+ }
323
+
324
+ // On a ZeroGPU Space the JS client is REQUIRED (it forwards the HF iframe auth
325
+ // headers ZeroGPU needs). Locally / non-ZeroGPU we fall back to the raw
326
+ // two-step call, so the chat still works when the CDN is unreachable.
327
+ let _gradioClient = null;
328
+ async function getClient() {
329
+ if (_gradioClient) return _gradioClient;
330
+ const mod = await import("https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js");
331
+ const Client = mod.Client || mod.client;
332
+ _gradioClient = await Client.connect(window.location.origin);
333
+ return _gradioClient;
334
+ }
335
+
336
+ async function callAsk(question, facts) {
337
+ try {
338
+ const client = await getClient();
339
+ const r = await client.predict("/ask", { question, facts });
340
+ return Array.isArray(r.data) ? r.data[0] : r.data;
341
+ } catch (e) {
342
+ return await callAskRaw(question, facts);
343
+ }
344
+ }
345
+
346
+ async function callAskRaw(question, facts) {
347
+ const post = await fetch("/gradio_api/call/ask", {
348
+ method: "POST", headers: { "Content-Type": "application/json" },
349
+ body: JSON.stringify({ data: [question, facts] }),
350
+ });
351
+ const { event_id } = await post.json();
352
+ const res = await fetch(`/gradio_api/call/ask/${event_id}`);
353
+ const text = await res.text();
354
+ const lines = [...text.matchAll(/data:\s*(.+)/g)]; // SSE data frames
355
+ if (!lines.length) throw new Error("no data in stream");
356
+ const arr = JSON.parse(lines[lines.length - 1][1]); // last frame = result
357
+ return Array.isArray(arr) ? arr[0] : arr;
358
  }
359
 
360
  // ---- Init -----------------------------------------------------------------
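The raw fallback in `callAskRaw` reads the event stream as text and treats the last `data:` frame as the result. The same parsing step is sketched below in Python as a pure function. The sample stream is invented, and its exact shape (an `event:` line followed by a `data:` line holding a JSON array) is an assumption based on what the JavaScript expects, not a captured response. `last_sse_data` is a hypothetical name.

```python
import json
import re


def last_sse_data(text: str):
    """Return the payload of the last `data:` frame in an SSE response body.

    Mirrors callAskRaw: the last frame is the result, and a JSON array is
    unwrapped to its first element.
    """
    frames = re.findall(r"^data:\s*(.+)$", text, flags=re.MULTILINE)
    if not frames:
        raise ValueError("no data in stream")
    payload = json.loads(frames[-1])
    if isinstance(payload, list) and payload:
        return payload[0]
    return payload


# Hypothetical stream: an intermediate frame followed by the final result.
sample = (
    "event: heartbeat\n"
    "data: null\n"
    "\n"
    "event: complete\n"
    'data: [{"headline": "It fits", "fallback": true}]\n'
    "\n"
)

print(last_sse_data(sample))
```

Keeping the parser separate from the two HTTP calls means it can be checked against saved response bodies without a running server.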
static/style.css CHANGED
@@ -363,6 +363,54 @@ details.disc > summary:hover { color: var(--text-primary); }
363
  .copy-btn:hover { color: var(--text-primary); border-color: var(--border-hi); }
364
  .copy-btn.done { color: var(--ok); border-color: var(--ok); }
365
 
366
  /* Footer */
367
  .foot { text-align: center; color: var(--text-muted); font-size: 13px; margin-top: var(--s-7); line-height: 1.6; }
368
  .foot b { color: var(--text-secondary); }
 
363
  .copy-btn:hover { color: var(--text-primary); border-color: var(--border-hi); }
364
  .copy-btn.done { color: var(--ok); border-color: var(--ok); }
365
 
366
+ /* Ask a follow-up (the model brick) */
367
+ .ask-row { display: flex; gap: var(--s-2); }
368
+ .ask-row input {
369
+ flex: 1; background: var(--bg-inset); border: 1px solid var(--border);
370
+ border-radius: var(--r-md); padding: 12px 14px; font-size: 15px;
371
+ }
372
+ .ask-btn {
373
+ flex: none; width: 46px; border: none; border-radius: var(--r-md);
374
+ background: linear-gradient(135deg, var(--accent), var(--accent-strong));
375
+ color: #fff; display: grid; place-items: center;
376
+ transition: transform .15s, box-shadow .15s, filter .15s;
377
+ }
378
+ .ask-btn:hover { transform: translateY(-2px); box-shadow: var(--glow); filter: brightness(1.05); }
379
+ .ask-btn .ic { font-size: 18px; }
380
+ .ask-chips { display: flex; flex-wrap: wrap; gap: var(--s-2); margin-top: var(--s-3); }
381
+ .ask-chip {
382
+ background: var(--bg-inset); border: 1px solid var(--border);
383
+ color: var(--text-secondary); border-radius: var(--r-pill);
384
+ padding: 6px 13px; font-size: 13px; font-weight: 500;
385
+ transition: border-color .15s, color .15s, background .15s;
386
+ }
387
+ .ask-chip:hover { border-color: var(--accent); color: var(--text-primary); background: var(--accent-soft); }
388
+
389
+ .ask-answer { margin-top: var(--s-4); }
390
+ .ans-card {
391
+ background: var(--bg-inset); border: 1px solid var(--border);
392
+ border-left: 4px solid var(--accent); border-radius: var(--r-md);
393
+ padding: var(--s-4) var(--s-5);
394
+ }
395
+ .ans-card h3 { font-size: 17px; font-weight: 700; margin-bottom: var(--s-2); }
396
+ .ans-card p { color: var(--text-secondary); font-size: 14.5px; line-height: 1.6; }
397
+ .ans-next {
398
+ display: flex; align-items: center; gap: var(--s-2); margin-top: var(--s-3);
399
+ font-size: 14px; font-weight: 600; color: var(--accent);
400
+ }
401
+ .ans-next .ic { font-size: 15px; flex: none; }
402
+ .ans-tag {
403
+ margin-top: var(--s-3); padding-top: var(--s-3); border-top: 1px solid var(--border);
404
+ font-size: 12px; color: var(--text-muted);
405
+ }
406
+ .ans-loading { display: flex; align-items: center; gap: var(--s-2); color: var(--text-muted); font-size: 14px; padding: var(--s-2) 0; }
407
+ .spinner {
408
+ width: 15px; height: 15px; flex: none; border-radius: 50%;
409
+ border: 2px solid var(--border-hi); border-top-color: var(--accent);
410
+ animation: spin .7s linear infinite;
411
+ }
412
+ @keyframes spin { to { transform: rotate(360deg); } }
413
+
414
  /* Footer */
415
  .foot { text-align: center; color: var(--text-muted); font-size: 13px; margin-top: var(--s-7); line-height: 1.6; }
416
  .foot b { color: var(--text-secondary); }