Spaces: build-small-hackathon / FitCheck

cn0303 committed 1 day ago

Commit 12d2e34 (verified) · 1 Parent(s): c97ad08

Deploy FitCheck: engine + Nemotron model brick on ZeroGPU

Files changed (14)

README.md +29 -8
app.py +34 -12
engine/__init__.py +20 -0
engine/advisor.py +201 -0
engine/catalogue.py +156 -0
engine/estimator.py +86 -0
engine/explain.py +148 -0
engine/hardware.py +188 -0
engine/runtimes.py +91 -0
engine/ui_adapter.py +254 -0
model_brick.py +314 -0
requirements.txt +13 -1
static/app.js +92 -0
static/style.css +48 -0

README.md CHANGED

@@ -6,11 +6,23 @@ colorTo: green
 sdk: gradio
 sdk_version: 6.16.0
 app_file: app.py
 pinned: false
 license: mit
 short_description: Honest, plain answers about what AI your computer can run
 ---
 # FitCheck
 **What AI can your computer actually run?**
@@ -35,14 +47,21 @@ Built for the [Build Small hackathon](https://huggingface.co/build-small-hackath
 ## How it is built
-A hand-built HTML, CSS and JS frontend (no framework, no build step) served by
-Gradio server mode (`gr.Server`, which is a FastAPI app). The interface talks to
-a single connector, `POST /api/advise`.
-> Note: this is the UI view. The advice endpoint currently returns input-aware
-> placeholder data so the interface is complete and live. A deterministic
-> calculation engine and a small local model plug into the same `/api/advise`
-> contract next, with no frontend changes.
 ## Run it locally
@@ -53,4 +72,6 @@ pip install -r requirements.txt
 python app.py
 ```
-Then open http://127.0.0.1:7860/ (add `?go` for a sample result).

 sdk: gradio
 sdk_version: 6.16.0
 app_file: app.py
+python_version: "3.12"
 pinned: false
 license: mit
 short_description: Honest, plain answers about what AI your computer can run
+models:
+  - nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
 ---
+<!--
+ZeroGPU: select "ZeroGPU" hardware in the Space's Settings (the README can't
+set it). The model brick (/api/ask) only loads the LLM when SPACE_ID is set, so
+local `python app.py` stays instant and the chat uses a deterministic fallback.
+Swap the model with no code change via the FITCHECK_MODEL Space secret/variable,
+e.g. FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507 (clean Apache fallback).
+-->
 # FitCheck
 **What AI can your computer actually run?**
 ## How it is built
+Three bricks behind one hand-built HTML/CSS/JS frontend (no framework, no build
+step), served by Gradio server mode (`gr.Server`, which is a FastAPI app):
+1. **The frontend** (`static/`) gathers your setup in plain words.
+2. **The deterministic engine** (`engine/`) does the real memory arithmetic and
+   returns an honest verdict over `POST /api/advise`. No AI in the loop, so
+   every number is inspectable. (LLM goals run on the engine today; vision /
+   image / audio / data goals use a conservative placeholder until the engine
+   models those families.)
+3. **The model brick** (`model_brick.py`) is a small local LLM
+   (NVIDIA Nemotron 3 Nano 4B) that *explains* the engine's numbers in plain
+   words over `/gradio_api/call/ask`. It is a closed-context narrator: it never
+   invents a number, only re-voices the facts the engine produced. On a ZeroGPU
+   Space it runs on a GPU via `@spaces.GPU`; locally it degrades to a
+   deterministic explainer so the chat always answers.
 ## Run it locally
 python app.py
 ```
+Then open http://127.0.0.1:7860/ (add `?go` for a sample result). Locally the
+follow-up chat uses the deterministic explainer; the Nemotron model loads only
+on a Space (when `SPACE_ID` is set).
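
The "closed-context narrator" idea in the README can be sketched as a deterministic fallback that only re-voices facts it was handed. This is a minimal illustration, not the code in `model_brick.py`: the function name `fallback_explain` and the fact keys `verdict` and `total_gb` are hypothetical, and only the `{headline, why, next_step}` return shape comes from the commit.

```python
import json


def fallback_explain(question: str, facts: str = "") -> dict:
    """Deterministic stand-in for the model brick (hypothetical sketch).

    It never computes a number itself; it only repeats values found in `facts`.
    """
    try:
        data = json.loads(facts) if facts else {}
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    verdict = data.get("verdict")    # hypothetical key name
    total_gb = data.get("total_gb")  # hypothetical key name

    # Without usable facts there is nothing to re-voice, so say so plainly.
    if not isinstance(verdict, str) or not isinstance(total_gb, (int, float)):
        return {
            "headline": "I need a result to explain first.",
            "why": "No engine facts were supplied, so there is nothing to re-voice.",
            "next_step": "Run the check, then ask again.",
        }

    return {
        "headline": f"The engine's verdict is: {verdict}.",
        "why": f"It estimated about {total_gb:g} GB of memory for this setup.",
        "next_step": "Ask about any number above to see where it came from.",
    }


if __name__ == "__main__":
    sample = json.dumps({"verdict": "works_now", "total_gb": 6.75})
    print(fallback_explain("Why does it fit?", sample))
```

Because the answer is built only from the supplied JSON, the same question with the same facts always yields the same reply.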

app.py CHANGED

@@ -1,14 +1,17 @@
 """
 FitCheck — what AI can your computer actually run?
-This file is the UI BRICK's backend: a `gr.Server` (which IS a FastAPI app)
-that serves the hand-built frontend in static/ and exposes ONE connector
-endpoint, /api/advise.
-IMPORTANT: /api/advise currently returns input-aware *placeholder* results so
-the interface is complete and feels alive (the gauge moves, bands change). The
-real deterministic engine (a separate brick) will plug into this same contract
-later — the frontend won't need to change. Mock logic is fenced below.
 """
 import re
@@ -19,15 +22,19 @@ from fastapi.responses import FileResponse
 from fastapi.staticfiles import StaticFiles
 from pydantic import BaseModel
-CATALOGUE_VERSION = "2026-06-07"
 STATIC = Path(__file__).parent / "static"
 app = gr.Server()
 # ==========================================================================
-#  PLACEHOLDER engine — to be replaced by the real calculation brick.
-#  Numbers are plausible and conservative but are NOT the audited engine yet.
 # ==========================================================================
 _COLORS = {"model": "#818CF8", "chat": "#A5F3C4", "work": "#868E9C"}
@@ -259,7 +266,22 @@ class AdviseIn(BaseModel):
 @app.post("/api/advise")
 def api_advise(payload: AdviseIn):
-    return advise_mock(payload.model_dump())
 app.mount("/static", StaticFiles(directory=STATIC), name="static")

 """
 FitCheck — what AI can your computer actually run?
+This file wires three bricks together behind a `gr.Server` (which IS a FastAPI
+app) that serves the hand-built frontend in static/:
+  - /api/advise  : the honest verdict. For LLM goals it runs the REAL
+                   deterministic engine (engine/, via ui_adapter). Vision /
+                   image-gen / audio / data goals — which the engine doesn't
+                   model yet — still use the input-aware placeholder below.
+  - /gradio_api/call/ask : the model brick (model_brick.ask), a small local LLM
+                   that explains the engine's numbers in plain words. Exposed as
+                   @app.api so it runs on Gradio's queue and gets a ZeroGPU
+                   allocation; called from the browser via @gradio/client.
 """
 import re
 from fastapi.staticfiles import StaticFiles
 from pydantic import BaseModel
+from engine import CATALOGUE_VERSION
+from engine.ui_adapter import advise_for_ui, is_llm_usecase
+from model_brick import ask as model_ask
 STATIC = Path(__file__).parent / "static"
 app = gr.Server()
 # ==========================================================================
+#  PLACEHOLDER engine — vision / image-gen / audio / data goals only.
+#  The real engine (engine/) covers LLM goals; these families aren't modelled
+#  there yet, so they keep these plausible, conservative placeholder numbers.
 # ==========================================================================
 _COLORS = {"model": "#818CF8", "chat": "#A5F3C4", "work": "#868E9C"}
 @app.post("/api/advise")
 def api_advise(payload: AdviseIn):
+    p = payload.model_dump()
+    # LLM goals -> the real, audited engine. Other families -> placeholder.
+    if is_llm_usecase(p.get("usecase", "chat")):
+        return advise_for_ui(p, CATALOGUE_VERSION)
+    return advise_mock(p)
+@app.api(name="ask")
+def api_ask(question: str, facts: str = "") -> dict:
+    """Plain-English follow-up, grounded in the facts /api/advise returned.
+    Exposed at /gradio_api/call/ask (NOT a plain POST) so it runs through
+    Gradio's queue and gets a ZeroGPU allocation. `facts` is the JSON string of
+    the last /api/advise result. Returns {headline, why, next_step}.
+    """
+    return model_ask(question, facts)
 app.mount("/static", StaticFiles(directory=STATIC), name="static")
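
The routing in `api_advise` can be illustrated in isolation. The real `is_llm_usecase` lives in `engine/ui_adapter.py` and is not shown in this part of the diff, so the version below is an assumption: it treats the six use-case keys from `engine/catalogue.py` as the LLM goals. The `"vision"` key used in the example is also hypothetical.

```python
# Assumption: LLM goals are exactly the use-case keys defined in the catalogue.
LLM_USECASES = {"chat", "writing", "coding", "agents", "rag", "finetune"}


def is_llm_usecase(usecase: str) -> bool:
    """Stand-in for engine.ui_adapter.is_llm_usecase (actual logic not shown)."""
    return usecase in LLM_USECASES


def route(payload: dict) -> str:
    """Return which backend would answer this payload."""
    # A missing "usecase" falls back to "chat", as in api_advise.
    usecase = payload.get("usecase", "chat")
    return "engine" if is_llm_usecase(usecase) else "placeholder"


if __name__ == "__main__":
    print(route({"usecase": "rag"}))     # engine
    print(route({"usecase": "vision"}))  # placeholder
    print(route({}))                     # engine
```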

engine/__init__.py ADDED

	@@ -0,0 +1,20 @@

+"""
+can-i-run-it — the honest local-AI hardware advisor.
+The `engine` package is a *deterministic* compatibility engine. It does the
+real arithmetic (memory budgets, what fits, honest trade-offs) with no AI in
+the loop, so every number it produces can be inspected and trusted.
+A small language model is layered on top *only* to chat and explain — it never
+invents the facts. That separation is the whole point: the numbers are math,
+the words are plain English.
+"""
+from .advisor import advise
+from .hardware import HardwareSpec
+__all__ = ["advise", "HardwareSpec"]
+# Bump this whenever the catalogue or rules change. Shown in the UI so people
+# know how fresh the advice is — credibility over cleverness.
+CATALOGUE_VERSION = "2026-06-07"

engine/advisor.py ADDED

	@@ -0,0 +1,201 @@

+"""
+The advisor: turn a machine + a goal into an honest verdict.
+Output is organised into three plain bands, because that is what makes the
+tool trustworthy instead of hypey:
+  - WORKS NOW          : runs well, on the fast path, today.
+  - WORKS WITH COMPROMISES : it'll run, but slower or smaller than ideal.
+  - DON'T BOTHER       : not realistic on this machine — said plainly.
+No fake promises. If something doesn't fit, we say so and explain why.
+"""
+from dataclasses import dataclass, field
+from .catalogue import (
+    MODEL_CLASSES,
+    QUANT_TIERS,
+    RECOMMENDED_QUANT,
+    QUANT_BY_KEY,
+    MODEL_BY_KEY,
+    ModelClass,
+    QuantTier,
+    UseCase,
+    USE_CASE_BY_KEY,
+)
+from .estimator import MemoryEstimate, estimate_memory
+from .hardware import HardwareSpec
+from .runtimes import Runtime, pick_runtimes
+# How much text (context) we assume per job, in tokens. Roughly 750 words per 1000 tokens.
+_CONTEXT_FOR_USE_CASE = {
+    "chat": 4096,
+    "writing": 4096,
+    "coding": 4096,
+    "agents": 4096,
+    "rag": 8192,
+    "finetune": 2048,
+}
+# We only ever fill a budget to this fraction — the rest is breathing room.
+_SAFETY_FILL = 0.90
+VERDICT_WORKS = "works_now"
+VERDICT_COMPROMISE = "compromises"
+VERDICT_NO = "dont_bother"
+@dataclass
+class ModelVerdict:
+    model: ModelClass
+    verdict: str                 # one of the VERDICT_* constants
+    quant: QuantTier             # the quant we'd actually recommend
+    estimate: MemoryEstimate
+    full_quality_on_fast: bool   # True if it runs on the GPU at fp16/near-full
+    notes: list[str] = field(default_factory=list)
+@dataclass
+class Advice:
+    spec: HardwareSpec
+    use_case: UseCase
+    context_tokens: int
+    verdicts: list[ModelVerdict]            # one per model class, big→small order kept
+    headline: ModelVerdict | None           # the single best pick for this goal
+    runtimes: list[Runtime]
+    meets_goal: bool                         # does the headline satisfy the use case?
+    @property
+    def works_now(self) -> list[ModelVerdict]:
+        return [v for v in self.verdicts if v.verdict == VERDICT_WORKS]
+    @property
+    def compromises(self) -> list[ModelVerdict]:
+        return [v for v in self.verdicts if v.verdict == VERDICT_COMPROMISE]
+    @property
+    def dont_bother(self) -> list[ModelVerdict]:
+        return [v for v in self.verdicts if v.verdict == VERDICT_NO]
+def _evaluate_model(
+    model: ModelClass, spec: HardwareSpec, use_case: UseCase, context_tokens: int
+) -> ModelVerdict:
+    fast = spec.fast_budget_gb
+    total = spec.total_budget_gb
+    of = use_case.overhead_factor
+    q4_bpw = RECOMMENDED_QUANT.bits_per_weight  # the 4-bit quality floor
+    # --- Fast path: best *quality* quant that fits on the GPU/shared mem ---
+    # We only call it "Works now" if it fits fast at 4-bit or better. Cramming
+    # a big model down to 2-bit just to claim it "fits" is exactly the kind of
+    # overpromise this tool refuses to make — that path becomes a compromise.
+    if spec.has_fast_path:
+        for q in QUANT_TIERS:  # ordered best-quality -> smallest
+            if q.bits_per_weight < q4_bpw:
+                break  # don't accept sub-4-bit as a clean "works now"
+            est = estimate_memory(model, q, context_tokens=context_tokens,
+                                  job_overhead_factor=of)
+            if est.total_gb <= fast * _SAFETY_FILL:
+                full_q = q.key in ("fp16", "Q8_0", "Q6_K")
+                notes = []
+                if q is not RECOMMENDED_QUANT and not full_q:
+                    notes.append(f"Runs at {q.plain_name} — even a touch sharper than the usual 4-bit.")
+                return ModelVerdict(model, VERDICT_WORKS, q, est, full_q, notes)
+    # --- Compromise path: fits if we let it use ordinary RAM (slower) ------
+    # Prefer the everyday 4-bit; drop smaller only if needed.
+    for q in (RECOMMENDED_QUANT, QUANT_BY_KEY["Q3_K_M"], QUANT_BY_KEY["Q2_K"]):
+        est = estimate_memory(model, q, context_tokens=context_tokens,
+                              job_overhead_factor=of)
+        if est.total_gb <= total * _SAFETY_FILL:
+            notes = []
+            if not spec.has_fast_path:
+                notes.append("Runs on the processor (no graphics card to speed it up) — expect slow replies.")
+            else:
+                notes.append("Too big to fit the graphics card on its own — part runs on slower memory, so replies come more slowly.")
+            if q is not RECOMMENDED_QUANT:
+                notes.append(f"Had to shrink it to {q.plain_name} to fit — some quality is lost.")
+            return ModelVerdict(model, VERDICT_COMPROMISE, q, est, False, notes)
+    # --- Doesn't fit even at the smallest setting --------------------------
+    est = estimate_memory(model, QUANT_BY_KEY["Q2_K"], context_tokens=context_tokens,
+                          job_overhead_factor=of)
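+    # Measure the shortfall against the safe budget used by the fit test above,
+    # so the figure cannot come out negative when est sits between 90% and 100%.
+    short_by = round(est.total_gb - total * _SAFETY_FILL, 1)
+    notes = [f"Needs about {est.total_gb:g} GB even squeezed down — "
+             f"around {short_by:g} GB more than this machine can give it."]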
+    return ModelVerdict(model, VERDICT_NO, QUANT_BY_KEY["Q2_K"], est, False, notes)
+def _rank(model_key: str) -> int:
+    return next(i for i, m in enumerate(MODEL_CLASSES) if m.key == model_key)
+def advise(spec: HardwareSpec, use_case_key: str = "chat") -> Advice:
+    """Produce full advice for a machine and a goal."""
+    use_case = USE_CASE_BY_KEY.get(use_case_key, USE_CASE_BY_KEY["chat"])
+    context_tokens = _CONTEXT_FOR_USE_CASE.get(use_case.key, 4096)
+    # Evaluate every size class, biggest first (so the table reads top-down).
+    verdicts = [
+        _evaluate_model(m, spec, use_case, context_tokens)
+        for m in reversed(MODEL_CLASSES)
+    ]
+    # --- Headline: the single "just use this" pick -----------------------
+    # Priorities, in order:
+    #   1. The biggest model that WORKS NOW (fast + good quality) and is at
+    #      least big enough for the job. Fast-and-capable is the best answer.
+    #   2. If nothing fast is big enough, the best COMPROMISE that does the
+    #      job — sized close to ideal, not needlessly oversized-and-slow.
+    #   3. Otherwise, the best we can honestly offer, flagged as below-par.
+    good_rank = _rank(use_case.good_class)
+    min_rank = _rank(use_case.min_class)
+    q4_bpw = RECOMMENDED_QUANT.bits_per_weight
+    works = [v for v in verdicts if v.verdict == VERDICT_WORKS]
+    comp = [v for v in verdicts if v.verdict == VERDICT_COMPROMISE]
+    def largest(vs):
+        return max(vs, key=lambda v: _rank(v.model.key))
+    def nearest_good(vs):
+        # Closest to the ideal size without overshooting into needless slowness.
+        below = [v for v in vs if _rank(v.model.key) <= good_rank]
+        return largest(below) if below else min(vs, key=lambda v: _rank(v.model.key))
+    def decent(vs):
+        # Don't headline a model that only fits at a desperate sub-4-bit squeeze
+        # if a cleaner option exists — quality matters more than size on the box.
+        return [v for v in vs if v.quant.bits_per_weight >= q4_bpw]
+    works_ok = [v for v in works if _rank(v.model.key) >= min_rank]
+    comp_ok = [v for v in comp if _rank(v.model.key) >= min_rank]
+    headline = None
+    meets_goal = False
+    if works_ok:
+        headline, meets_goal = largest(works_ok), True
+    elif comp_ok:
+        headline, meets_goal = nearest_good(decent(comp_ok) or comp_ok), True
+    elif works:
+        headline, meets_goal = largest(works), False
+    elif comp:
+        headline, meets_goal = nearest_good(decent(comp) or comp), False
+    if headline is not None and not meets_goal:
+        headline.notes.insert(
+            0, f"This is the best this machine can do, but it's on the small "
+               f"side for {use_case.plain_name.lower()} — treat results as 'okay', not great.")
+    return Advice(
+        spec=spec,
+        use_case=use_case,
+        context_tokens=context_tokens,
+        verdicts=verdicts,
+        headline=headline,
+        runtimes=pick_runtimes(spec),
+        meets_goal=meets_goal,
+    )
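
The three bands come down to two threshold checks against the safety-filled budgets. The sketch below is a simplified reading of `_evaluate_model`: it takes a single memory estimate rather than stepping through quant tiers, and it treats `fast_budget_gb > 0` as a stand-in for `spec.has_fast_path`, which is an assumption. The budget figures in the example are made up.

```python
SAFETY_FILL = 0.90  # only ever fill a budget to 90%


def band(need_gb: float, fast_budget_gb: float, total_budget_gb: float) -> str:
    """Classify one memory estimate into the three verdict bands."""
    # Fast path: fits on the GPU (or shared memory) with headroom to spare.
    if fast_budget_gb > 0 and need_gb <= fast_budget_gb * SAFETY_FILL:
        return "works_now"
    # Compromise path: fits only if it may spill into ordinary RAM.
    if need_gb <= total_budget_gb * SAFETY_FILL:
        return "compromises"
    return "dont_bother"


if __name__ == "__main__":
    need = 6.75  # e.g. an 8B model at 4-bit with 4096 tokens of context
    print(band(need, fast_budget_gb=12.0, total_budget_gb=32.0))  # works_now
    print(band(need, fast_budget_gb=6.0, total_budget_gb=16.0))   # compromises
    print(band(need, fast_budget_gb=0.0, total_budget_gb=7.0))    # dont_bother
```

In the actual advisor the first check is repeated per quant tier from best quality down to 4-bit, and the second for 4-bit, 3-bit and 2-bit, so the band also decides which quant gets recommended.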

engine/catalogue.py ADDED

	@@ -0,0 +1,156 @@

+"""
+Static catalogue: the frozen facts the advisor reasons over.
+Everything here is build-time data — no network calls at runtime. That keeps
+the tool fully offline-capable (the "Off the Grid" goal) and means the advice
+can't silently drift when some external API changes.
+Sources for the numbers (so anyone can check our work):
+  - bits-per-weight for GGUF quant families: llama.cpp / Hugging Face GGUF docs
+  - "~2 GB per 1B params at fp16": Hugging Face Transformers optimisation guide
+  - 8-bit ≈ 50% of fp16, 4-bit ≈ 25-30%: bitsandbytes docs
+  - architecture sizes (layers / hidden): typical published configs per size class
+"""
+from dataclasses import dataclass, field
+# --------------------------------------------------------------------------
+# Quantisation tiers
+# --------------------------------------------------------------------------
+# "Quantisation" = squashing the model's numbers into fewer bits so it takes
+# less memory. Fewer bits = smaller + faster, but slightly less sharp.
+# gb_per_billion is just bits_per_weight / 8 (bits -> bytes -> GB per 1B params).
+@dataclass(frozen=True)
+class QuantTier:
+    key: str
+    plain_name: str          # what a normal person sees
+    bits_per_weight: float
+    blurb: str               # one honest sentence about the trade-off
+    recommended: bool = False
+    @property
+    def gb_per_billion(self) -> float:
+        return self.bits_per_weight / 8.0
+QUANT_TIERS: list[QuantTier] = [
+    QuantTier("fp16", "Full quality (fp16)", 16.0,
+              "The original, uncompressed model. Biggest and slowest to load."),
+    QuantTier("Q8_0", "Near-full (8-bit)", 8.5,
+              "Practically indistinguishable from full quality, about half the size."),
+    QuantTier("Q6_K", "High (6-bit)", 6.56,
+              "Very close to full quality, a bit smaller again."),
+    QuantTier("Q5_K_M", "Balanced+ (5-bit)", 5.67,
+              "A touch sharper than 4-bit for a little more memory."),
+    QuantTier("Q4_K_M", "Balanced (4-bit)", 4.83,
+              "The sweet spot most people use: small, fast, and still very good.",
+              recommended=True),
+    QuantTier("Q3_K_M", "Compact (3-bit)", 3.91,
+              "Smaller still, with a slight, usually-acceptable quality dip."),
+    QuantTier("Q2_K", "Tiny (2-bit)", 3.35,
+              "Last resort to make something fit — noticeably less reliable."),
+]
+QUANT_BY_KEY = {q.key: q for q in QUANT_TIERS}
+RECOMMENDED_QUANT = next(q for q in QUANT_TIERS if q.recommended)
+# --------------------------------------------------------------------------
+# Model size classes
+# --------------------------------------------------------------------------
+# We reason in *size classes* rather than individual models, because the
+# memory maths is driven by parameter count + architecture shape. Each class
+# carries an approximate architecture so we can estimate the KV cache (chat
+# memory) honestly. Layers/hidden are conservative typicals, not exact.
+@dataclass(frozen=True)
+class ModelClass:
+    key: str
+    billions: float          # parameter count in billions (representative)
+    plain_name: str
+    good_for: str            # plain-English "what it's actually good at"
+    n_layers: int
+    hidden: int
+    # Example concrete models for the copy-paste commands (real, well-known).
+    example_label: str
+    ollama_tag: str          # what you'd type after `ollama run`
+    gguf_repo: str           # a real Hugging Face GGUF repo for llama.cpp
+MODEL_CLASSES: list[ModelClass] = [
+    ModelClass("tiny", 1.0, "Tiny (around 1 billion)",
+               "Quick simple chat, basic questions, tidying text. Runs on almost anything.",
+               24, 2048, "Llama 3.2 1B", "llama3.2:1b",
+               "bartowski/Llama-3.2-1B-Instruct-GGUF"),
+    ModelClass("small", 3.5, "Small (3-4 billion)",
+               "Surprisingly capable everyday chat, summarising, and light coding help.",
+               28, 3072, "Llama 3.2 3B", "llama3.2:3b",
+               "bartowski/Llama-3.2-3B-Instruct-GGUF"),
+    ModelClass("medium", 8.0, "Medium (7-9 billion)",
+               "A solid all-rounder: good chat, real coding help, decent reasoning.",
+               32, 4096, "Qwen2.5 7B", "qwen2.5:7b",
+               "bartowski/Qwen2.5-7B-Instruct-GGUF"),
+    ModelClass("large", 14.0, "Large (13-14 billion)",
+               "Noticeably smarter and more reliable. Wants a real graphics card.",
+               40, 5120, "Qwen2.5 14B", "qwen2.5:14b",
+               "bartowski/Qwen2.5-14B-Instruct-GGUF"),
+    ModelClass("xlarge", 32.0, "Very large (30-34 billion)",
+               "Near-premium quality. Needs a strong GPU or a lot of memory.",
+               48, 6656, "Qwen2.5 32B", "qwen2.5:32b",
+               "bartowski/Qwen2.5-32B-Instruct-GGUF"),
+    ModelClass("huge", 70.0, "Huge (70 billion)",
+               "Top-tier open quality. Serious hardware only.",
+               80, 8192, "Llama 3.3 70B", "llama3.3:70b",
+               "bartowski/Llama-3.3-70B-Instruct-GGUF"),
+]
+MODEL_BY_KEY = {m.key: m for m in MODEL_CLASSES}
+# --------------------------------------------------------------------------
+# Use cases (jobs people actually want done)
+# --------------------------------------------------------------------------
+# Each maps to a *minimum* sensible size and a *comfortable* size. We never
+# pretend a job works on a model that's too small for it.
+@dataclass(frozen=True)
+class UseCase:
+    key: str
+    plain_name: str
+    description: str
+    min_class: str           # smallest model that does an OK job
+    good_class: str          # where it starts feeling genuinely useful
+    # Extra memory headroom multiplier for this job (RAG/agents need more
+    # context; fine-tuning needs much more). 1.0 = normal inference.
+    overhead_factor: float = 1.0
+    note: str = ""
+USE_CASES: list[UseCase] = [
+    UseCase("chat", "Just chatting / asking questions",
+            "General conversation, explanations, everyday questions.",
+            "tiny", "small"),
+    UseCase("writing", "Writing & summarising",
+            "Drafting emails, rewriting, condensing long text.",
+            "small", "medium"),
+    UseCase("coding", "Coding help",
+            "Explaining code, writing functions, fixing bugs.",
+            "small", "medium",
+            note="Bigger models are much more reliable for code."),
+    UseCase("agents", "Tool use / agents",
+            "Letting the model call tools, search, or take steps for you.",
+            "medium", "medium", overhead_factor=1.15,
+            note="Needs steady instruction-following — go medium or larger."),
+    UseCase("rag", "Document Q&A (your own files)",
+            "Answering questions over your PDFs/notes (a.k.a. RAG).",
+            "small", "medium", overhead_factor=1.25,
+            note="Long documents use extra memory for context."),
+    UseCase("finetune", "Teaching it your own data (fine-tuning)",
+            "Training a small adapter (LoRA/QLoRA) on your examples.",
+            "small", "medium", overhead_factor=2.2,
+            note="Training needs roughly 2-3x the memory of just chatting."),
+]
+USE_CASE_BY_KEY = {u.key: u for u in USE_CASES}
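
To see what the `gb_per_billion` rule means in practice, the following sketch applies `bits_per_weight / 8` to the catalogue's quant tiers. The bits-per-weight values are copied from `QUANT_TIERS`; the helper name `weights_gb` is introduced here for illustration only.

```python
# Bits per weight, copied from QUANT_TIERS in the catalogue.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.83,
    "Q3_K_M": 3.91,
    "Q2_K": 3.35,
}


def weights_gb(billions: float, bits_per_weight: float) -> float:
    """Weights-only size in GB: params (billions) x bits / 8 bits-per-byte."""
    return round(billions * bits_per_weight / 8.0, 2)


if __name__ == "__main__":
    # Weights of the "medium" class (8.0 billion parameters) at each tier.
    for key, bpw in BITS_PER_WEIGHT.items():
        print(f"{key:7s} {weights_gb(8.0, bpw):6.2f} GB")
```

For an 8-billion-parameter model the result in GB equals the bits-per-weight figure itself, which makes the table easy to sanity-check by eye. These are weights only; KV cache and overhead are added by the estimator.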

engine/estimator.py ADDED

	@@ -0,0 +1,86 @@

+"""
+The memory maths — the honest heart of the tool.
+We estimate how much memory a given model needs, broken into three parts:
+  1. weights      : the model itself = params x bits-per-weight
+  2. kv_cache     : the model's short-term "chat memory" — grows with how much
+                    text it's holding (context). This is what people forget.
+  3. overhead     : runtime working space + a safety margin.
+Every formula is spelled out and deliberately a little pessimistic. If we're
+going to be wrong, we want to be wrong on the safe side.
+"""
+from dataclasses import dataclass
+from .catalogue import ModelClass, QuantTier
+# Bytes per element in the KV cache. Modern runtimes can store it at 16-bit.
+_KV_BYTES = 2
+# Modern models share key/value heads (this is called "GQA"), which cuts the
+# KV cache dramatically vs. older designs. ~0.30 is a conservative typical
+# factor (i.e. we still assume KV is fairly chunky to stay safe).
+_GQA_FACTOR = 0.30
+# Flat runtime working space (program, buffers) in GB.
+_RUNTIME_OVERHEAD_GB = 0.8
+@dataclass
+class MemoryEstimate:
+    weights_gb: float
+    kv_cache_gb: float
+    overhead_gb: float
+    context_tokens: int
+    @property
+    def total_gb(self) -> float:
+        return round(self.weights_gb + self.kv_cache_gb + self.overhead_gb, 2)
+def estimate_memory(
+    model: ModelClass,
+    quant: QuantTier,
+    *,
+    context_tokens: int = 4096,
+    job_overhead_factor: float = 1.0,
+) -> MemoryEstimate:
+    """Estimate total memory (GB) to run `model` at `quant`.
+    context_tokens: how much text the model holds at once. 4096 (~3000 words)
+        is a sensible default for everyday use.
+    job_overhead_factor: extra multiplier for heavier jobs (RAG, agents,
+        fine-tuning) — see UseCase.overhead_factor.
+    """
+    # 1) Weights ---------------------------------------------------------
+    weights = model.billions * quant.gb_per_billion
+    # 2) KV cache --------------------------------------------------------
+    # bytes = 2(K and V) x layers x hidden x tokens x bytes_per_elem x gqa
+    kv_bytes = (
+        2 * model.n_layers * model.hidden * context_tokens
+        * _KV_BYTES * _GQA_FACTOR
+    )
+    kv = kv_bytes / 1e9
+    # 3) Overhead --------------------------------------------------------
+    # A flat runtime cost, plus ~10% of the weights as working scratch,
+    # all scaled by how demanding the job is.
+    overhead = (_RUNTIME_OVERHEAD_GB + 0.10 * weights) * job_overhead_factor
+    # For training (factor well above 1) optimiser state and activations dwarf
+    # plain inference, so the overhead is recomputed as (factor - 1) x the
+    # weights. The weights and KV cache themselves are unchanged.
+    if job_overhead_factor >= 2.0:
+        overhead = _RUNTIME_OVERHEAD_GB + weights * (job_overhead_factor - 1.0)
+    return MemoryEstimate(
+        weights_gb=round(weights, 2),
+        kv_cache_gb=round(kv, 2),
+        overhead_gb=round(overhead, 2),
+        context_tokens=context_tokens,
+    )
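
A worked example may help check the arithmetic. The sketch below re-implements the inference-only path (overhead factor 1.0) with plain arguments instead of the `ModelClass` and `QuantTier` dataclasses, using the same constants as the file. The function name `estimate` and the dict return shape are choices made here for illustration.

```python
KV_BYTES = 2               # 16-bit KV cache elements
GQA_FACTOR = 0.30          # conservative shared-head reduction
RUNTIME_OVERHEAD_GB = 0.8  # flat runtime working space


def estimate(billions, bits_per_weight, n_layers, hidden, context_tokens=4096):
    """Inference-only memory estimate, mirroring estimate_memory at factor 1.0."""
    weights = billions * bits_per_weight / 8.0
    # K and V tensors, per layer, per hidden unit, per token held in context.
    kv = 2 * n_layers * hidden * context_tokens * KV_BYTES * GQA_FACTOR / 1e9
    overhead = RUNTIME_OVERHEAD_GB + 0.10 * weights
    parts = {
        "weights_gb": round(weights, 2),
        "kv_cache_gb": round(kv, 2),
        "overhead_gb": round(overhead, 2),
    }
    # Like MemoryEstimate.total_gb, the total sums the already-rounded parts.
    parts["total_gb"] = round(sum(parts.values()), 2)
    return parts


if __name__ == "__main__":
    # "medium" class: 8.0B params, 32 layers, hidden 4096, at Q4_K_M (4.83 bpw).
    print(estimate(8.0, 4.83, 32, 4096))
    # weights 4.83 + KV cache 0.64 + overhead 1.28 = 6.75 GB total
```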

engine/explain.py ADDED

	@@ -0,0 +1,148 @@

+"""
+Putting it in plain words.
+The advisor produces structured facts; this module turns them into sentences a
+non-technical person actually understands, and into commands they can copy and
+paste. No jargon survives here without being explained.
+"""
+from .advisor import (
+    Advice,
+    ModelVerdict,
+    VERDICT_WORKS,
+    VERDICT_COMPROMISE,
+    VERDICT_NO,
+)
+VERDICT_EMOJI = {
+    VERDICT_WORKS: "🟢",
+    VERDICT_COMPROMISE: "🟡",
+    VERDICT_NO: "🔴",
+}
+VERDICT_WORD = {
+    VERDICT_WORKS: "Works now",
+    VERDICT_COMPROMISE: "Works, with compromises",
+    VERDICT_NO: "Don't bother",
+}
+def speed_hint(v: ModelVerdict, spec) -> str:
+    """A rough, honest feel for how fast replies will come."""
+    if v.verdict == VERDICT_NO:
+        return "—"
+    if v.verdict == VERDICT_COMPROMISE:
+        return "Slow — usable for short tasks, not snappy chat."
+    # Works now (fast path). Bigger models are still slower even on a GPU.
+    if v.model.billions <= 4:
+        return "Fast — replies feel instant."
+    if v.model.billions <= 14:
+        return "Comfortable — quick enough for live chat."
+    return "Steady — fine, just not instant on big answers."
+# --------------------------------------------------------------------------
+# Commands
+# --------------------------------------------------------------------------
+def ollama_command(v: ModelVerdict) -> str:
+    return f"ollama run {v.model.ollama_tag}"
+def llamacpp_command(v: ModelVerdict) -> str:
+    # llama.cpp can pull a GGUF straight from Hugging Face by repo:quant.
+    return (f"llama-server -hf {v.model.gguf_repo}:{v.quant.key} "
+            f"-c {v.estimate.context_tokens}")
+# --------------------------------------------------------------------------
+# Headline summary, in human words
+# --------------------------------------------------------------------------
+def headline_text(advice: Advice) -> str:
+    spec = advice.spec
+    uc = advice.use_case
+    h = advice.headline
+    if h is None:
+        return (
+            f"**Honest answer: this machine can't comfortably run local AI "
+            f"for {uc.plain_name.lower()} yet.**\n\n"
+            f"Even the smallest models need more memory than the "
+            f"{spec.ram_gb:g} GB available here once everything else is "
+            f"running. That's not a failure — small computers just have small "
+            f"budgets. A free cloud option, or adding memory, would open this up."
+        )
+    m = h.model
+    q = h.quant
+    fast = "on the graphics card" if spec.has_fast_path and h.verdict == VERDICT_WORKS else "on the processor"
+    if h.verdict == VERDICT_WORKS:
+        lead = f"**Yes — you can run a {m.plain_name} model {fast}, today.**"
+    elif h.verdict == VERDICT_COMPROMISE:
+        lead = f"**Sort of — a {m.plain_name} model will run, but with trade-offs.**"
+    else:
+        lead = f"**Not really — even a {m.plain_name} model is a stretch here.**"
+    body = (
+        f"\n\nFor **{uc.plain_name.lower()}**, the sweet spot on your machine is a "
+        f"**{m.plain_name}** model at the **{q.plain_name}** setting. "
+        f"{m.good_for}\n\n"
+        f"That needs about **{h.estimate.total_gb:g} GB** of memory "
+        f"(model {h.estimate.weights_gb:g} GB + chat memory "
+        f"{h.estimate.kv_cache_gb:g} GB + working space {h.estimate.overhead_gb:g} GB), "
+        f"and you have roughly **{spec.fast_budget_gb:g} GB** fast / "
+        f"**{spec.total_budget_gb:g} GB** total to play with."
+    )
+    extra = ""
+    if uc.note:
+        extra += f"\n\n*Note for this job:* {uc.note}"
+    if h.notes:
+        extra += "\n\n" + "\n".join(f"- {n}" for n in h.notes)
+    return lead + body + extra
+def jargon_glossary() -> str:
+    return (
+        "**Plain-English glossary**\n\n"
+        "- **Model** — the AI's 'brain'. Bigger = smarter but heavier.\n"
+        "- **Parameters (e.g. 7B)** — how big the brain is. 7B = 7 billion. "
+        "More = smarter and hungrier for memory.\n"
+        "- **Quantisation (4-bit, 8-bit)** — shrinking the model so it fits. "
+        "4-bit is the popular sweet spot: much smaller, barely-noticeable quality loss.\n"
+        "- **VRAM** — the fast memory on a graphics card. The single biggest "
+        "factor in what you can run quickly.\n"
+        "- **RAM** — your computer's normal memory. Models can use it too, but it's slower.\n"
+        "- **KV cache / 'chat memory'** — scratch space the model uses to "
+        "remember the current conversation. Longer chats use more.\n"
+        "- **GGUF** — a single-file model format made for running locally.\n"
+        "- **llama.cpp / Ollama** — the programs that actually run the model on your machine."
+    )
+def how_to_find_specs(os_hint: str = "windows") -> str:
+    common = (
+        "**Not sure of your specs? Here's how to check:**\n\n"
+    )
+    if os_hint == "macos":
+        return common + (
+            "- Click the Apple menu (top-left) → **About This Mac**.\n"
+            "- It shows your chip (e.g. *Apple M2*) and **Memory** (e.g. *16 GB*).\n"
+            "- On a Mac, that one memory number is all you need — the graphics "
+            "share it."
+        )
+    if os_hint == "linux":
+        return common + (
+            "- RAM: run `free -h` in a terminal.\n"
+            "- Graphics card: run `nvidia-smi` (NVIDIA) or `lspci | grep VGA`.\n"
+        )
+    return common + (
+        "- **RAM:** press `Ctrl + Shift + Esc` → **Performance** tab → **Memory**.\n"
+        "- **Graphics card:** same window → **GPU**. The name is at the top "
+        "right (e.g. *NVIDIA RTX 3060*).\n"
+        "- No GPU section showing a real card? You likely have built-in "
+        "graphics — that's fine, just pick the 'built-in' option."
+    )
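The glossary's claim that quantisation shrinks a model can be made concrete with back-of-envelope arithmetic. The sketch below is only an illustration, not the code in `engine/estimator.py` (which is not shown in this part of the diff). `approx_weights_gb` is a hypothetical helper that assumes each parameter takes exactly `bits / 8` bytes and that 1 GB is 1e9 bytes; it ignores chat memory and working space.

```python
def approx_weights_gb(params_billion: float, bits: int) -> float:
    """Rough size of the weights alone: parameters x bytes per parameter.

    Assumption for illustration: no per-block quantisation metadata is
    counted, so real model files come out somewhat larger than this.
    """
    bytes_per_param = bits / 8
    return round(params_billion * bytes_per_param, 2)


# A 7B model at three common settings.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: about {approx_weights_gb(7, bits):g} GB")
```

Halving the bit width halves the weights, which is why the 4-bit setting is the one that usually brings a mid-sized model within reach of a consumer graphics card.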

engine/hardware.py ADDED

	@@ -0,0 +1,188 @@

+"""
+Describing a machine in terms the maths cares about.
+The whole job here is to turn "I have a Windows laptop with an RTX 3060 and
+16 GB of RAM" into two numbers the advisor can reason about:
+  - fast_budget_gb : memory the model can use *on the fast path* (the GPU, or
+                     on Apple Silicon, the shared memory the GPU can borrow)
+  - total_budget_gb: the absolute most a model can use if we let it spill onto
+                     ordinary RAM (slower, but it runs)
+Everything is deliberately conservative. We'd rather say "this might be tight"
+and be wrong than promise something that then fails to load.
+"""
+from dataclasses import dataclass
+# --------------------------------------------------------------------------
+# Common consumer GPUs -> VRAM (GB). So people pick a name, not a number.
+# VRAM is baked into the label too, because some cards ship in two sizes.
+# --------------------------------------------------------------------------
+GPU_PRESETS: dict[str, float] = {
+    # NVIDIA RTX 50-series
+    "NVIDIA RTX 5090 (32 GB)": 32,
+    "NVIDIA RTX 5080 (16 GB)": 16,
+    "NVIDIA RTX 5070 Ti (16 GB)": 16,
+    "NVIDIA RTX 5070 (12 GB)": 12,
+    "NVIDIA RTX 5060 Ti (16 GB)": 16,
+    "NVIDIA RTX 5060 (8 GB)": 8,
+    # NVIDIA RTX 40-series
+    "NVIDIA RTX 4090 (24 GB)": 24,
+    "NVIDIA RTX 4080 (16 GB)": 16,
+    "NVIDIA RTX 4070 Ti (12 GB)": 12,
+    "NVIDIA RTX 4070 (12 GB)": 12,
+    "NVIDIA RTX 4060 Ti (16 GB)": 16,
+    "NVIDIA RTX 4060 (8 GB)": 8,
+    # NVIDIA RTX 30-series
+    "NVIDIA RTX 3090 (24 GB)": 24,
+    "NVIDIA RTX 3080 (10 GB)": 10,
+    "NVIDIA RTX 3070 (8 GB)": 8,
+    "NVIDIA RTX 3060 (12 GB)": 12,
+    "NVIDIA RTX 3050 (8 GB)": 8,
+    # Older / budget NVIDIA
+    "NVIDIA GTX 1660 (6 GB)": 6,
+    "NVIDIA GTX 1650 (4 GB)": 4,
+    # AMD
+    "AMD RX 7900 XTX (24 GB)": 24,
+    "AMD RX 7800 XT (16 GB)": 16,
+    "AMD RX 7600 (8 GB)": 8,
+    "AMD RX 6700 XT (12 GB)": 12,
+    # Laptop integrated (no real VRAM — uses shared system RAM)
+    "Intel built-in graphics (no separate card)": 0,
+    "AMD built-in graphics (no separate card)": 0,
+}
+# Apple Silicon: there's no separate VRAM. The GPU shares system memory, and
+# macOS lets it borrow a large slice. We treat it specially below.
+APPLE_CHIPS: dict[str, int] = {
+    "Apple M1 / M2 / M3 / M4 (base)": 8,    # default RAM if they don't know
+    "Apple M-series Pro": 16,
+    "Apple M-series Max": 32,
+    "Apple M-series Ultra": 64,
+}
+@dataclass
+class HardwareSpec:
+    """A machine, described just enough to reason about it."""
+    os: str = "windows"               # windows | macos | linux
+    ram_gb: float = 16.0              # system RAM
+    gpu_vendor: str = "none"          # nvidia | amd | apple | intel | none
+    vram_gb: float = 0.0              # dedicated GPU memory (0 if shared/none)
+    is_apple_silicon: bool = False
+    gpu_label: str = "No dedicated graphics card"
+    form_factor: str = "laptop"       # laptop | desktop | mac | sbc
+    # -- derived memory budgets -------------------------------------------
+    @property
+    def fast_budget_gb(self) -> float:
+        """Memory available on the *fast* path (GPU / Apple shared memory)."""
+        if self.is_apple_silicon:
+            # macOS lets the GPU use a large fraction of unified memory.
+            # ~70% is a safe, widely-quoted working figure.
+            return round(self.ram_gb * 0.70, 1)
+        if self.gpu_vendor in ("nvidia", "amd") and self.vram_gb > 0:
+            # Leave headroom for the display, driver, and other apps.
+            return round(self.vram_gb * 0.85, 1)
+        # Integrated graphics / CPU-only: no meaningful fast path.
+        return 0.0
+    @property
+    def os_reserve_gb(self) -> float:
+        """RAM we set aside for the operating system + other open programs.
+        Windows idles heavy; a headless Raspberry Pi barely uses anything.
+        Being honest here matters: too small a reserve over-promises.
+        """
+        if self.os == "linux":
+            # Linux gets a smaller reserve; only the SBC case is special.
+            return {"sbc": 1.0}.get(self.form_factor, 2.0)
+        return {
+            "sbc": 1.0,
+            "mac": 3.0,
+            "desktop": 3.0,
+            "laptop": 3.5,
+        }.get(self.form_factor, 3.0)
+    @property
+    def total_budget_gb(self) -> float:
+        """The most a model can use if it spills onto ordinary RAM (slower)."""
+        if self.is_apple_silicon:
+            return self.fast_budget_gb  # unified memory — same pool
+        # Dedicated VRAM (fully usable on the fast path) PLUS a conservative
+        # slice of system RAM for CPU offload, after reserving room for the OS.
+        ram_for_model = max(0.0, self.ram_gb - self.os_reserve_gb) * 0.9
+        return round(self.vram_gb + ram_for_model, 1)
+    @property
+    def has_fast_path(self) -> bool:
+        return self.fast_budget_gb >= 1.0
+def build_spec(
+    *,
+    computer_kind: str,
+    ram_gb: float,
+    gpu_choice: str,
+    apple_chip: str | None = None,
+) -> HardwareSpec:
+    """Turn friendly UI selections into a HardwareSpec.
+    computer_kind: "Windows laptop/desktop", "Mac", "Linux PC",
+                   "Raspberry Pi / mini PC"
+    gpu_choice:    a key from GPU_PRESETS, or one of the "don't know" options.
+    """
+    kind = computer_kind.lower()
+    # ---- Mac / Apple Silicon -------------------------------------------
+    if "mac" in kind:
+        chip = apple_chip or "Apple M1 / M2 / M3 / M4 (base)"
+        return HardwareSpec(
+            os="macos",
+            ram_gb=ram_gb,
+            gpu_vendor="apple",
+            vram_gb=0.0,
+            is_apple_silicon=True,
+            gpu_label=f"{chip} (shares your {ram_gb:g} GB of memory)",
+            form_factor="mac",
+        )
+    # ---- Raspberry Pi / tiny single-board ------------------------------
+    if "raspberry" in kind or "mini" in kind or "sbc" in kind:
+        return HardwareSpec(
+            os="linux",
+            ram_gb=ram_gb,
+            gpu_vendor="none",
+            vram_gb=0.0,
+            gpu_label="No dedicated graphics card (tiny computer)",
+            form_factor="sbc",
+        )
+    # ---- Windows / Linux PC with a possible discrete GPU ---------------
+    os_name = "linux" if "linux" in kind else "windows"
+    form = "desktop" if "desktop" in kind else "laptop"
+    vram = GPU_PRESETS.get(gpu_choice, 0.0)
+    if "nvidia" in gpu_choice.lower():
+        vendor = "nvidia"
+    elif "amd" in gpu_choice.lower() and "built-in" not in gpu_choice.lower():
+        vendor = "amd"
+    elif "built-in" in gpu_choice.lower():
+        vendor = "intel" if "intel" in gpu_choice.lower() else "amd"
+    else:
+        vendor = "none"
+    label = gpu_choice if vram > 0 else "No dedicated graphics card (built-in graphics only)"
+    return HardwareSpec(
+        os=os_name,
+        ram_gb=ram_gb,
+        gpu_vendor=vendor,
+        vram_gb=vram,
+        is_apple_silicon=False,
+        gpu_label=label,
+        form_factor=form,
+    )
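To see how the two budgets in `HardwareSpec` work out for a real machine, the arithmetic can be restated as a standalone sketch. `budgets` is a hypothetical function written for this example; it reuses the constants from the diff (0.85 of VRAM, 0.70 of unified memory, 0.9 of RAM after the OS reserve) and assumes any non-zero VRAM belongs to an NVIDIA or AMD card, which the real property checks explicitly.

```python
def budgets(ram_gb: float, vram_gb: float, os_reserve_gb: float,
            apple_silicon: bool = False) -> tuple[float, float]:
    """Return (fast_budget_gb, total_budget_gb) using the diff's constants."""
    if apple_silicon:
        fast = round(ram_gb * 0.70, 1)   # unified memory: one shared pool
        return fast, fast
    # Assumption: non-zero VRAM means a discrete NVIDIA/AMD card.
    fast = round(vram_gb * 0.85, 1) if vram_gb > 0 else 0.0
    ram_for_model = max(0.0, ram_gb - os_reserve_gb) * 0.9
    return fast, round(vram_gb + ram_for_model, 1)


# Windows laptop, RTX 3060 (12 GB), 16 GB RAM, 3.5 GB laptop reserve.
print(budgets(ram_gb=16, vram_gb=12, os_reserve_gb=3.5))
# 16 GB Mac: the fast and total budgets are the same pool.
print(budgets(ram_gb=16, vram_gb=0, os_reserve_gb=3.0, apple_silicon=True))
```

Note that the total budget counts the full VRAM figure while the fast budget keeps 15% back, so the gap between the two numbers is slightly smaller than the usable system RAM.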

engine/runtimes.py ADDED

	@@ -0,0 +1,91 @@

+"""
+Runtimes: the actual programs that run a model on your machine.
+We deliberately keep this list short and well-supported. For each machine we
+surface TWO paths:
+  - the easiest path   : a friendly app a non-technical person can install and
+                         click (Ollama / LM Studio). This is the default.
+  - the power path     : llama.cpp with GGUF files — more control, and the
+                         tool the hackathon's "Llama Champion" goal rewards.
+Plus platform-native options where they genuinely help (MLX on Apple,
+OpenVINO on Intel, vLLM on big Linux GPU boxes).
+"""
+from dataclasses import dataclass
+@dataclass(frozen=True)
+class Runtime:
+    key: str
+    name: str
+    plain_what: str          # what it is, in one friendly line
+    difficulty: str          # "Easiest" | "Moderate" | "Advanced"
+    install_hint: str
+    site: str
+RUNTIMES: dict[str, Runtime] = {
+    "ollama": Runtime(
+        "ollama", "Ollama",
+        "A simple app. You type one line and it downloads and runs a model.",
+        "Easiest", "Download the installer from ollama.com", "https://ollama.com"),
+    "lmstudio": Runtime(
+        "lmstudio", "LM Studio",
+        "A point-and-click app with a chat window — no typing commands.",
+        "Easiest", "Download from lmstudio.ai", "https://lmstudio.ai"),
+    "llamacpp": Runtime(
+        "llamacpp", "llama.cpp",
+        "The lightweight engine under the hood. Runs GGUF model files directly.",
+        "Advanced", "Build from source or grab a release on GitHub",
+        "https://github.com/ggml-org/llama.cpp"),
+    "mlx": Runtime(
+        "mlx", "MLX",
+        "Apple's own framework, built for Mac chips and their shared memory.",
+        "Moderate", "pip install mlx-lm", "https://github.com/ml-explore/mlx"),
+    "openvino": Runtime(
+        "openvino", "OpenVINO",
+        "Intel's toolkit that squeezes good speed out of Intel chips and NPUs.",
+        "Moderate", "pip install optimum[openvino]",
+        "https://docs.openvino.ai"),
+    "vllm": Runtime(
+        "vllm", "vLLM",
+        "A heavy-duty server for big Linux machines with strong NVIDIA GPUs.",
+        "Advanced", "pip install vllm", "https://docs.vllm.ai"),
+}
+def pick_runtimes(spec) -> list[Runtime]:
+    """Choose the runtimes worth recommending for this machine, best-first.
+    `spec` is a HardwareSpec. The first entry is the friendly default; the
+    list always includes llama.cpp (the power / badge path) where it makes
+    sense, and a platform-native option when one clearly helps.
+    """
+    out: list[Runtime] = []
+    # Easiest path first — works almost everywhere and wraps llama.cpp anyway.
+    out.append(RUNTIMES["ollama"])
+    out.append(RUNTIMES["lmstudio"])
+    if spec.is_apple_silicon:
+        out.append(RUNTIMES["mlx"])
+        out.append(RUNTIMES["llamacpp"])
+    elif spec.gpu_vendor == "intel" or (spec.gpu_vendor == "none" and spec.os == "windows"):
+        # Intel-leaning / CPU machines benefit from OpenVINO.
+        out.append(RUNTIMES["openvino"])
+        out.append(RUNTIMES["llamacpp"])
+    else:
+        out.append(RUNTIMES["llamacpp"])
+        # Big Linux NVIDIA box → mention the server-grade option.
+        if spec.os == "linux" and spec.gpu_vendor == "nvidia" and spec.vram_gb >= 16:
+            out.append(RUNTIMES["vllm"])
+    # De-duplicate while preserving order.
+    seen, deduped = set(), []
+    for r in out:
+        if r.key not in seen:
+            seen.add(r.key)
+            deduped.append(r)
+    return deduped
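The ordering rules in `pick_runtimes()` are easiest to check against a couple of concrete machines. The sketch below is a condensed restatement that returns runtime keys only; `runtime_keys` and the `SimpleNamespace` stand-ins for `HardwareSpec` are hypothetical and exist just for this illustration.

```python
from types import SimpleNamespace


def runtime_keys(spec) -> list[str]:
    """Condensed restatement of pick_runtimes(), returning keys only."""
    keys = ["ollama", "lmstudio"]        # friendly defaults always come first
    if spec.is_apple_silicon:
        keys += ["mlx", "llamacpp"]
    elif spec.gpu_vendor == "intel" or (
        spec.gpu_vendor == "none" and spec.os == "windows"
    ):
        keys += ["openvino", "llamacpp"]
    else:
        keys.append("llamacpp")
        # Server-grade option only for large Linux NVIDIA machines.
        if spec.os == "linux" and spec.gpu_vendor == "nvidia" and spec.vram_gb >= 16:
            keys.append("vllm")
    return keys


mac = SimpleNamespace(is_apple_silicon=True, gpu_vendor="apple", os="macos", vram_gb=0.0)
linux_box = SimpleNamespace(is_apple_silicon=False, gpu_vendor="nvidia", os="linux", vram_gb=24.0)
print(runtime_keys(mac))
print(runtime_keys(linux_box))
```

Because each branch appends distinct keys, no path in this restatement produces a duplicate; the de-duplication pass in the real function is a safety net for future edits.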

engine/ui_adapter.py ADDED

	@@ -0,0 +1,254 @@

+"""
+Adapter: turn a frontend payload into the exact JSON the static/ frontend
+renders, using the REAL deterministic engine (not the placeholder).
+The frontend speaks one contract (verdicts ``great|tight|no``, an options list,
+a gauge, tools, commands). The engine speaks another (``works_now|compromises|
+dont_bother`` over ``ModelVerdict`` objects). This module is the seam between
+them, so neither side has to know about the other.
+Scope: the engine currently models the **LLM** family only (its model classes
+are all text models). Vision / image-gen / audio / data goals still fall back to
+the input-aware placeholder in ``app.py`` — that boundary is deliberate and
+honest, not an oversight. ``is_llm_usecase`` below is the routing switch.
+"""
+import re
+from .advisor import (
+    advise,
+    VERDICT_WORKS,
+    VERDICT_COMPROMISE,
+    VERDICT_NO,
+)
+from .catalogue import MODEL_CLASSES
+from .explain import speed_hint, ollama_command, llamacpp_command
+from .hardware import HardwareSpec
+# Bands: engine verdict -> the colour-key the frontend understands.
+_VERDICT_UI = {
+    VERDICT_WORKS: "great",
+    VERDICT_COMPROMISE: "tight",
+    VERDICT_NO: "no",
+}
+_VERDICT_WORD = {"great": "Runs great", "tight": "Tight, but works", "no": "Won't fit"}
+# Gauge breakdown colours (match the placeholder palette in app.py / style.css).
+_C_MODEL = "#818CF8"   # the weights themselves
+_C_WORK = "#868E9C"    # chat memory + working space
+# Goals the engine can answer for real. Everything LLM-shaped maps onto a chat
+# context; "translate"/"custom" are still language models, so they route here.
+_LLM_USECASES = {
+    "chat", "writing", "coding", "agents", "rag", "finetune", "translate", "custom",
+}
+# The engine's own use-case keys. Frontend ids that aren't 1:1 get mapped.
+_USECASE_ALIAS = {"translate": "chat", "custom": "chat"}
+def is_llm_usecase(usecase: str) -> bool:
+    """True if the real engine should answer this goal (vs. the placeholder)."""
+    return usecase in _LLM_USECASES
+# --------------------------------------------------------------------------
+# Frontend payload -> HardwareSpec
+# --------------------------------------------------------------------------
+def _num_in(text: str) -> float:
+    """First '<number> GB' figure in a string, else 0."""
+    m = re.search(r"(\d+(?:\.\d+)?)\s*GB", text or "", re.I)
+    return float(m.group(1)) if m else 0.0
+def spec_from_payload(p: dict) -> HardwareSpec:
+    """Build a HardwareSpec straight from the frontend's gather() payload.
+    We construct the spec directly rather than going through build_spec(),
+    because the frontend carries the vendor and a VRAM-bearing label already,
+    and an Advanced box can override VRAM outright.
+    """
+    computer = (p.get("computer") or "Windows laptop")
+    kind = computer.lower()
+    provider = (p.get("provider") or "none").lower()
+    ram = float(p.get("ram_gb") or 16)
+    # --- Apple Silicon: unified memory, no separate VRAM -------------------
+    if "mac" in kind or provider == "apple":
+        return HardwareSpec(
+            os="macos", ram_gb=ram, gpu_vendor="apple", vram_gb=0.0,
+            is_apple_silicon=True,
+            gpu_label=f"Apple Silicon (shares your {ram:g} GB of memory)",
+            form_factor="mac",
+        )
+    # --- Raspberry Pi / mini PC -------------------------------------------
+    if "raspberry" in kind or "mini" in kind:
+        return HardwareSpec(
+            os="linux", ram_gb=ram, gpu_vendor="none", vram_gb=0.0,
+            gpu_label="No dedicated graphics card (tiny computer)",
+            form_factor="sbc",
+        )
+    os_name = "linux" if "linux" in kind else "windows"
+    form = "desktop" if "desktop" in kind else "laptop"
+    # VRAM: Advanced override wins; else the picker label; else a paste guess.
+    vram = p.get("vram_gb")
+    if not vram:
+        vram = _num_in(p.get("gpu", "")) or _num_in(p.get("paste", ""))
+    vram = float(vram or 0)
+    if provider == "nvidia":
+        vendor = "nvidia"
+    elif provider == "amd":
+        vendor = "amd"
+    elif provider == "intel":
+        vendor = "intel"
+    else:
+        vendor = "none"          # "none" / "unsure": treat as no fast path
+        vram = 0.0
+    label = p.get("gpu") or "No dedicated graphics card (built-in graphics only)"
+    return HardwareSpec(
+        os=os_name, ram_gb=ram, gpu_vendor=vendor, vram_gb=vram,
+        is_apple_silicon=False, gpu_label=label, form_factor=form,
+    )
+# --------------------------------------------------------------------------
+# Advice -> frontend JSON
+# --------------------------------------------------------------------------
+def _where(spec: HardwareSpec, verdict: str) -> str:
+    if verdict == "great":
+        if spec.is_apple_silicon:
+            return "on your Mac"
+        if spec.has_fast_path:
+            return "on your graphics card"
+        return "on your computer"
+    if verdict == "tight":
+        return "using your computer's memory"
+    return ""
+def advise_for_ui(payload: dict, catalogue_version: str) -> dict:
+    """Run the real engine and shape its output for static/app.js render()."""
+    # Treat a missing or null "usecase" as chat, then map frontend-only ids.
+    raw_usecase = payload.get("usecase") or "chat"
+    usecase = _USECASE_ALIAS.get(raw_usecase, raw_usecase)
+    spec = spec_from_payload(payload)
+    adv = advise(spec, usecase)
+    fast = spec.fast_budget_gb
+    total = spec.total_budget_gb
+    # ---- Options table (already biggest -> smallest from the engine) -----
+    options = []
+    for v in adv.verdicts:
+        ui_v = _VERDICT_UI[v.verdict]
+        options.append({
+            "verdict": ui_v,
+            "model": v.model.plain_name,
+            "desc": v.model.good_for,
+            "setting": v.quant.plain_name,
+            "memory": "Too big" if v.verdict == VERDICT_NO else f"{v.estimate.total_gb:g} GB",
+            "feel": speed_hint(v, spec),
+        })
+    # ---- Headline ---------------------------------------------------------
+    h = adv.headline
+    hv = _VERDICT_UI[h.verdict] if h else "no"
+    where = _where(spec, hv)
+    if h and hv == "great":
+        headline = f"Yes, you can run a {h.model.plain_name} model {where}, today."
+    elif h and hv == "tight":
+        headline = f"Sort of. A {h.model.plain_name} model will run {where}, with trade-offs."
+    else:
+        headline = "This goal is a stretch on this machine. Here's the honest picture."
+    if h:
+        est = h.estimate
+        need_gb = est.total_gb
+        detail = (
+            f"For this goal, the sweet spot is a <b>{h.model.plain_name}</b> model "
+            f"at the <b>{h.quant.plain_name}</b> setting. {h.model.good_for} "
+            f"It needs about <b>{need_gb:g} GB</b> "
+            f"(model {est.weights_gb:g} GB + chat memory {est.kv_cache_gb:g} GB "
+            f"+ working space {est.overhead_gb:g} GB), and you have roughly "
+            f"<b>{fast:g} GB</b> fast / <b>{total:g} GB</b> total to work with."
+        )
+    else:
+        # Nothing fits even squeezed: be honest, show the shortfall.
+        smallest = adv.verdicts[-1]
+        need_gb = smallest.estimate.total_gb
+        detail = (
+            f"Even the smallest model here needs about <b>{need_gb:g} GB</b>, "
+            f"but this machine can offer only about <b>{total:g} GB</b> once the "
+            f"operating system has its share. That's not a failure — small "
+            f"computers just have small budgets. Adding memory, or a free cloud "
+            f"option, would open this up."
+        )
+    # Notes: use-case caveat + the headline's own honest footnotes.
+    note_bits = []
+    if adv.use_case.note:
+        note_bits.append(adv.use_case.note)
+    if h and h.notes:
+        note_bits.extend(h.notes)
+    note = "  ".join(note_bits)
+    # ---- Gauge ------------------------------------------------------------
+    scale = max(total, need_gb, 1) * 1.05
+    if h:
+        model_part = round(h.estimate.weights_gb, 1)
+        work_part = round(need_gb - model_part, 1)
+    else:
+        model_part = round(need_gb * 0.8, 1)
+        work_part = round(need_gb * 0.2, 1)
+    gauge = {
+        "need_gb": f"{need_gb:g} GB needed",
+        "fast_gb": f"{fast:g} GB",
+        "total_gb": f"{total:g} GB",
+        "fill_pct": round(need_gb / scale * 100, 1),
+        "mark_pct": round(fast / scale * 100, 1),
+        "breakdown": [
+            {"label": f"Model {model_part:g} GB", "color": _C_MODEL},
+            {"label": f"Working space {work_part:g} GB", "color": _C_WORK},
+        ],
+    }
+    # ---- Tools (runtimes) -------------------------------------------------
+    tools = [{
+        "name": r.name, "what": r.plain_what,
+        "install": r.install_hint, "tag": r.difficulty,
+    } for r in adv.runtimes]
+    # ---- Commands ---------------------------------------------------------
+    cmd_intro = ("These get you a running model in minutes. Pick the easy one or "
+                 "the power one; they do the same job.")
+    if h:
+        commands = {"intro": cmd_intro, "items": [
+            {"label": "Easy way (Ollama)", "code": ollama_command(h)},
+            {"label": "Power way (llama.cpp)", "code": llamacpp_command(h)},
+        ]}
+    else:
+        tiny = MODEL_CLASSES[0]
+        commands = {"intro": cmd_intro, "items": [
+            {"label": "Smallest you could try (Ollama)", "code": f"ollama run {tiny.ollama_tag}"},
+        ]}
+    return {
+        "catalogue_version": catalogue_version,
+        "verdict": hv,
+        "verdict_word": _VERDICT_WORD[hv],
+        "headline": headline,
+        "detail": detail,
+        "note": note,
+        "gauge": gauge,
+        "options": options,
+        "tools": tools,
+        "commands": commands,
+        # Echoed back so the model brick can narrate the SAME numbers the UI shows.
+        "meets_goal": adv.meets_goal,
+        "use_case": adv.use_case.plain_name,
+    }
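The gauge percentages in `advise_for_ui()` follow from one scaling rule: the bar is 5% longer than the largest of the total budget, the memory needed, and 1 GB. A minimal sketch of that rule, with `gauge_percentages` as a hypothetical helper name and the sample figures borrowed from the few-shot examples in `model_brick.py`:

```python
def gauge_percentages(need_gb: float, fast_gb: float, total_gb: float) -> tuple[float, float]:
    """Return (fill_pct, mark_pct) on a bar scaled 5% past the largest value."""
    scale = max(total_gb, need_gb, 1) * 1.05
    fill_pct = round(need_gb / scale * 100, 1)   # how much of the bar the model fills
    mark_pct = round(fast_gb / scale * 100, 1)   # where the fast-memory marker sits
    return fill_pct, mark_pct


# A model that fits comfortably on the graphics card.
print(gauge_percentages(need_gb=5.5, fast_gb=10.2, total_gb=22))
# A model that needs more than the machine can offer.
print(gauge_percentages(need_gb=6.5, fast_gb=0, total_gb=4.9))
```

Scaling by the larger of need and total keeps the fill below 100% even when the model does not fit, so an oversized model shows as a nearly full bar rather than overflowing it.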

model_brick.py ADDED

	@@ -0,0 +1,314 @@

+"""
+The model brick: a closed-context narrator.
+It takes the deterministic engine's structured advice (the exact JSON the UI
+already shows) plus a plain-English follow-up question, and re-voices those
+facts simply. It NEVER invents numbers, models, or benchmarks — every figure it
+states must already be in the facts. All arithmetic stays in engine/.
+Serving (Hugging Face Spaces, ZeroGPU):
+    app.py exposes ask() via ``@app.api(name="ask")`` so it runs on Gradio's
+    queue; _generate() below is wrapped in ``@spaces.GPU`` so a GPU is allocated
+    per call and released on return. The model loads lazily on the first call,
+    inside that GPU context, not at import (see the note above ``_state``).
+Off the Space (local dev, no GPU, or a boot failure), we never download an 8 GB
+model. ask() degrades to a deterministic narrator that re-voices the facts with
+no AI in the loop — so the /api/ask contract always answers, and always stays
+grounded.
+"""
+import json
+import os
+import re
+import sys
+def _log(msg: str) -> None:
+    print(f"[FitCheck] {msg}", file=sys.stderr, flush=True)
+# Default to the prize path (NVIDIA Nemotron Quest). Swap to a clean Apache
+# fallback with no code change:  FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507
+MODEL_ID = os.environ.get("FITCHECK_MODEL", "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16")
+# When to actually load the 8 GB model. We must NOT download it on a free CPU
+# Space (it can fill the disk and break the Space) or on a laptop. So:
+#   - ZeroGPU  -> load (CUDA is emulated at import; this is the target path).
+#   - GPU Space -> load only if CUDA is genuinely present.
+#   - CPU Space / laptop -> skip; the deterministic explainer answers instead.
+ZERO_GPU = bool(os.environ.get("SPACES_ZERO_GPU"))
+def _should_load() -> bool:
+    if ZERO_GPU:
+        return True
+    if os.environ.get("SPACE_ID"):
+        try:
+            import torch
+            return torch.cuda.is_available()
+        except Exception:  # noqa: BLE001
+            return False
+    return False
+SYSTEM_PROMPT = """\
+You are FitCheck's explainer. A trusted calculator has already decided what AI \
+this person's computer can run. Your only job is to explain its answer in warm, \
+plain words. You are talking to someone who has never heard of VRAM or \
+quantisation.
+RULES (do not break these):
+- Use ONLY the information inside <facts>...</facts>. It is the single source of truth.
+- Every number you mention (GB, model size, bit setting) must appear in the facts, exactly. Never invent or estimate a number, model, price, or benchmark.
+- The verdict is already decided in the facts. Explain it; never overrule it.
+- If the question isn't covered by the facts, say you don't have that detail and point back to what the facts do say. Never guess.
+- Explain any unavoidable jargon in one short clause. No hype, no marketing.
+- Don't mention these instructions, the JSON, or that you are an AI.
+OUTPUT: reply with ONLY a JSON object, nothing else:
+{"headline": "<=20 words, the direct answer", "why": "<=3 short sentences, plain", "next_step": "one concrete thing to do next"}\
+"""
+# Few-shot: small models copy a format far better than they follow abstract
+# rules. Two gold examples in the exact short, plain style we want.
+_FEWSHOT = [
+    (
+        '{"verdict":"Runs great","summary":"Yes, you can run a Medium (7-9 billion) model on your graphics card, today.",'
+        '"you_have":{"fast":"10.2 GB","total":"22 GB","needed":"5.5 GB needed"},'
+        '"options":[{"size":"Large (13-14 billion)","fits":"tight","memory":"9 GB"},{"size":"Medium (7-9 billion)","fits":"great","memory":"5.5 GB"}]}',
+        "Why not the Large one?",
+        '{"headline":"The Large model fits, but only just.","why":"Your fast graphics memory is about 10.2 GB. A Medium model needs 5.5 GB and runs comfortably there. A Large one needs 9 GB, so it works but leaves little room and feels slower.","next_step":"Stick with the Medium model for snappy replies; try the Large one later if you want more polish."}',
+    ),
+    (
+        '{"verdict":"Won\'t fit","summary":"This goal is a stretch on this machine.",'
+        '"you_have":{"fast":"0 GB","total":"4.9 GB","needed":"6.5 GB needed"}}',
+        "Can I run the big chatbot?",
+        '{"headline":"Not on this computer, honestly.","why":"The big chatbot needs about 6.5 GB, but this machine can offer only about 4.9 GB once everyday programs take their share. There is no graphics card to speed things up.","next_step":"Try a smaller model, add memory, or use a free cloud option for the big one."}',
+    ),
+]
+def _user_prompt(question: str, facts_text: str) -> str:
+    return f"<facts>\n{facts_text}\n</facts>\n\nQuestion: {question}"
+def _chat_messages(question: str, facts_text: str) -> list[dict]:
+    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
+    for facts, q, a in _FEWSHOT:
+        msgs.append({"role": "user", "content": _user_prompt(q, facts)})
+        msgs.append({"role": "assistant", "content": a})
+    msgs.append({"role": "user", "content": _user_prompt(question, facts_text)})
+    return msgs
+# --------------------------------------------------------------------------
+# Facts handling (shared by the model path and the fallback)
+# --------------------------------------------------------------------------
+def _strip_html(s: str) -> str:
+    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", "", s or "")).strip()
+def _parse_facts(facts) -> dict:
+    if isinstance(facts, dict):
+        return facts
+    if not facts:
+        return {}
+    try:
+        return json.loads(facts)
+    except (json.JSONDecodeError, TypeError):
+        return {}
+def compact_facts(facts: dict) -> str:
+    """Flatten the advise() result into the small, flat JSON the model sees.
+    Flat JSON (not prose) makes grounding a near string-match and keeps the
+    prompt short. We pass only what a follow-up answer could need.
+    """
+    g = facts.get("gauge") or {}
+    compact = {
+        "verdict": facts.get("verdict_word") or facts.get("verdict"),
+        "summary": facts.get("headline"),
+        "explanation": _strip_html(facts.get("detail", "")),
+        "goal": facts.get("use_case"),
+        "you_have": {
+            "fast": g.get("fast_gb"),
+            "total": g.get("total_gb"),
+            "needed": g.get("need_gb"),
+        },
+        "options": [
+            {"size": o.get("model"), "fits": o.get("verdict"),
+             "memory": o.get("memory"), "setting": o.get("setting"),
+             "speed": o.get("feel")}
+            for o in (facts.get("options") or [])
+        ],
+        "how_to_run": [
+            {"label": c.get("label"), "command": c.get("code")}
+            for c in ((facts.get("commands") or {}).get("items") or [])
+        ],
+        "note": facts.get("note") or "",
+    }
+    # Drop empties so the model isn't tempted to fill nulls.
+    compact = {k: v for k, v in compact.items() if v not in (None, "", [], {})}
+    if "you_have" in compact:
+        compact["you_have"] = {k: v for k, v in compact["you_have"].items() if v}
+    return json.dumps(compact, ensure_ascii=False)
+# --------------------------------------------------------------------------
+# Faithfulness gate (also used by tests)
+# --------------------------------------------------------------------------
+# A "figure" = a number tied to a memory/size/quant unit — the kind a model
+# could dangerously invent. Bare counts and ordinals ("3 steps", "first") are ignored.
+_FIGURE = re.compile(r"(\d+(?:\.\d+)?)\s*(gb|-?bit|billion|b)\b", re.I)
+def leaked_figures(answer_text: str, facts_text: str) -> list[str]:
+    """Numbers-with-units in the answer that don't appear in the facts."""
+    facts_nums = set(re.findall(r"\d+(?:\.\d+)?", facts_text))
+    return [num for num, _unit in _FIGURE.findall(answer_text)
+            if num not in facts_nums]
+def _answer_text(ans: dict) -> str:
+    return " ".join(str(ans.get(k, "")) for k in ("headline", "why", "next_step"))
+def _parse_json_answer(raw: str) -> dict | None:
+    """Pull the first {...} object out of the model's text and validate shape."""
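+    if not raw:
+        return None
+    # raw_decode parses exactly one JSON value starting at the first "{" and
+    # ignores trailing chatter, which a greedy regex would swallow and fail on.
+    start = raw.find("{")
+    if start < 0:
+        return None
+    try:
+        obj, _end = json.JSONDecoder().raw_decode(raw, start)
+    except json.JSONDecodeError:
+        return None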
+    if not isinstance(obj, dict):
+        return None
+    out = {k: str(obj.get(k, "")).strip() for k in ("headline", "why", "next_step")}
+    return out if out["headline"] or out["why"] else None
+# --------------------------------------------------------------------------
+# Deterministic fallback narrator (no AI) — local dev + safety net
+# --------------------------------------------------------------------------
+def _fallback(question: str, facts: dict) -> dict:
+    headline = facts.get("headline") or "Here's the honest picture for your machine."
+    why = _strip_html(facts.get("detail", ""))
+    note = facts.get("note", "")
+    if note:
+        why = f"{why} {note}".strip()
+    items = (facts.get("commands") or {}).get("items") or []
+    if items:
+        next_step = f"Start with: {items[0]['code']}"
+    else:
+        next_step = "Pick your hardware and goal above to see exact steps."
+    return {
+        "headline": headline,
+        "why": why or "Fill in your computer and goal above, then ask again.",
+        "next_step": next_step,
+        "fallback": True,
+    }
+# --------------------------------------------------------------------------
+# Model load (Space only) + public entry point
+# --------------------------------------------------------------------------
+_GENERATE = None       # set to a @spaces.GPU-wrapped fn when the GPU stack imports
+MODEL_READY = False    # GPU stack imported; the model itself loads lazily (below)
+LOAD_ERROR = ""
+# Loaded on the FIRST /ask call, inside the GPU context — NOT at import. Loading
+# the 8 GB model at import blocked the Space's boot health window and the process
+# got killed (RUNTIME_ERROR with no traceback). Lazy loading lets the app launch
+# instantly; the first question pays the one-time download/load cost, and ask()'s
+# try/except falls back to the deterministic narrator if that first call is slow.
+_state = {"tok": None, "model": None}
+if _should_load():
+    try:
+        import spaces  # noqa: E402
+        import torch  # noqa: E402
+        from transformers import AutoModelForCausalLM, AutoTokenizer  # noqa: E402
+        def _load():
+            # Prefer transformers' NATIVE NemotronH class (it guards the
+            # mamba-ssm import and falls back to a pure-PyTorch path, so it runs
+            # without the painful mamba-ssm CUDA build). Only if that's
+            # unavailable do we use NVIDIA's trust_remote_code file, which
+            # HARD-requires mamba-ssm.
+            try:
+                tok = AutoTokenizer.from_pretrained(MODEL_ID)
+                model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16)
+            except Exception:
+                tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+                model = AutoModelForCausalLM.from_pretrained(
+                    MODEL_ID, dtype=torch.bfloat16, trust_remote_code=True)
+            _state["tok"] = tok
+            _state["model"] = model.to("cuda").eval()
+        @spaces.GPU(duration=120)
+        def _generate(question: str, facts_text: str) -> str:
+            if _state["model"] is None:
+                _load()
+            tok, model = _state["tok"], _state["model"]
+            msgs = _chat_messages(question, facts_text)
+            # return_dict=True -> a BatchEncoding (input_ids + attention_mask) we
+            # can unpack with **inputs. Passing the BatchEncoding positionally to
+            # generate() makes it do .shape on a dict -> AttributeError.
+            kw = dict(add_generation_prompt=True, return_tensors="pt", return_dict=True)
+            try:
+                inputs = tok.apply_chat_template(msgs, enable_thinking=False, **kw)
+            except TypeError:
+                inputs = tok.apply_chat_template(msgs, **kw)
+            inputs = inputs.to("cuda")
+            prompt_len = inputs["input_ids"].shape[1]
+            with torch.no_grad():
+                out = model.generate(
+                    **inputs, max_new_tokens=320, do_sample=False,
+                    pad_token_id=tok.eos_token_id,
+                )
+            return tok.decode(out[0][prompt_len:], skip_special_tokens=True).strip()
+        _GENERATE = _generate
+        MODEL_READY = True
+    except Exception as e:  # noqa: BLE001 — any failure → graceful fallback
+        LOAD_ERROR = repr(e)
+_log(f"model brick: should_load={_should_load()} MODEL_READY={MODEL_READY} "
+     f"LOAD_ERROR={LOAD_ERROR or 'none'} MODEL_ID={MODEL_ID}")
+def ask(question: str, facts: str = "") -> dict:
+    """Answer a follow-up question, grounded in the engine's facts.
+    Returns {"headline", "why", "next_step", "fallback"}. Uses the model on a
+    Space; falls back to a deterministic, grounded narrator otherwise. If the
+    model's output can't be parsed, or it invents a figure that isn't in the
+    facts, we reject its answer and fall back too.
+    """
+    facts_dict = _parse_facts(facts)
+    facts_text = compact_facts(facts_dict)
+    question = (question or "").strip() or "What can I run?"
+    if _GENERATE is not None:
+        try:
+            raw = _GENERATE(question, facts_text)
+            ans = _parse_json_answer(raw)
+            if ans and not leaked_figures(_answer_text(ans), facts_text):
+                ans["fallback"] = False
+                return ans
+            leaked = leaked_figures(_answer_text(ans), facts_text) if ans else "n/a"
+            _log(f"answer rejected (parsed={bool(ans)} leaked={leaked}); raw={raw[:600]!r}")
+        except Exception as e:  # noqa: BLE001 — never 500 the user; degrade instead
+            import traceback
+            _log(f"model generate failed: {e!r}")
+            traceback.print_exc()
+    return _fallback(question, facts_dict)
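The reject-and-fall-back rule in `ask()` depends on `leaked_figures`, which is defined earlier in `model_brick.py` and not shown in this hunk. As a rough illustration of the idea only, the sketch below uses a hypothetical stand-in, `find_ungrounded_numbers`, which assumes that "invented figure" means a numeric token that appears in the answer but nowhere in the facts text. The real function's rules may differ.

```python
import re

# Hypothetical stand-in for leaked_figures(); not the code from model_brick.py.
# Assumption: a "figure" is any integer or decimal number in the text.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def find_ungrounded_numbers(answer: str, facts_text: str) -> list[str]:
    """Return numbers found in the answer that never occur in the facts."""
    allowed = set(_NUMBER.findall(facts_text))
    leaked: list[str] = []
    for number in _NUMBER.findall(answer):
        if number not in allowed and number not in leaked:
            leaked.append(number)
    return leaked


def accept_or_fallback(answer: str, facts_text: str, fallback: str) -> str:
    """Keep the model's answer only if every number in it is grounded."""
    return fallback if find_ungrounded_numbers(answer, facts_text) else answer
```

With facts such as `"8 GB VRAM"`, an answer claiming `"It needs 16 GB"` would be rejected under this assumption, while an answer that only repeats `8` would pass.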

requirements.txt CHANGED Viewed

@@ -1 +1,13 @@
-gradio==6.16.0

+# FitCheck — UI brick + deterministic engine + model brick (ZeroGPU).
+gradio==6.16.0            # gr.Server (FastAPI) + @app.api queue + ZeroGPU glue
+spaces                    # @spaces.GPU — ZeroGPU allocation on Hugging Face
+torch>=2.8.0              # ZeroGPU requirement (>=2.8)
+transformers>=4.51.0      # >=4.51 for Qwen3; also runs Nemotron (trust_remote_code)
+accelerate                # device placement / efficient loading
+# Optional speed-up for the Nemotron Mamba-2 kernels. Left unpinned because
+# they compile against CUDA at build time and can fail; transformers falls back
+# to a pure-PyTorch path without them. If the model fails to boot, the clean
+# escape hatch is the env var:  FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507
+# mamba-ssm
+# causal-conv1d
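The `FITCHECK_MODEL` escape hatch mentioned in the comment above implies that the model ID is read from the environment with a built-in default. How `model_brick.py` actually does this is not visible in this part of the diff, so the following is a minimal sketch under that assumption. `resolve_model_id` and `DEFAULT_MODEL` are hypothetical names, and the default is taken from the `models:` entry in the README front matter.

```python
import os

# Assumption: the default mirrors the model listed in the Space's README.
DEFAULT_MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"


def resolve_model_id(env=None) -> str:
    """Pick the model ID, letting FITCHECK_MODEL override the default."""
    env = os.environ if env is None else env
    # Treat an unset, empty, or whitespace-only variable as "use the default".
    override = (env.get("FITCHECK_MODEL") or "").strip()
    return override or DEFAULT_MODEL
```

Setting `FITCHECK_MODEL=Qwen/Qwen3-4B-Instruct-2507` in the Space's variables would then switch models without a code change.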

static/app.js CHANGED Viewed

@@ -74,6 +74,7 @@ const GPUS = {
 const $ = (s) => document.querySelector(s);
 const state = { computer: "Windows laptop", provider: "none", priority: "balanced", usecase: "chat", checked: false };
 // ---- Build the use-case picker -------------------------------------------
 function buildPicker() {
@@ -196,6 +197,7 @@ const VMAP = {
 };
 function render(d) {
   const v = VMAP[d.verdict] || VMAP.tight;
   const g = d.gauge || {};
   $("#cat-version").textContent = d.catalogue_version || "—";
@@ -255,6 +257,21 @@ function render(d) {
       ${cmds ? `<div class="section-title">Copy-paste to get started</div>
       <p class="cmd-intro">${d.commands.intro || ""}</p>
       <div class="cmd">${cmds}</div>` : ""}
     </div>`;
   hydrate($("#results"));
@@ -263,6 +280,81 @@ function render(d) {
     b.textContent = "Copied ✓"; b.classList.add("done");
     setTimeout(() => { b.textContent = "Copy"; b.classList.remove("done"); }, 1500);
   }));
 }
 // ---- Init -----------------------------------------------------------------

 const $ = (s) => document.querySelector(s);
 const state = { computer: "Windows laptop", provider: "none", priority: "balanced", usecase: "chat", checked: false };
+let lastAdvice = null;   // the most recent /api/advise result — facts the model explains
 // ---- Build the use-case picker -------------------------------------------
 function buildPicker() {
 };
 function render(d) {
+  lastAdvice = d;
   const v = VMAP[d.verdict] || VMAP.tight;
   const g = d.gauge || {};
   $("#cat-version").textContent = d.catalogue_version || "—";
       ${cmds ? `<div class="section-title">Copy-paste to get started</div>
       <p class="cmd-intro">${d.commands.intro || ""}</p>
       <div class="cmd">${cmds}</div>` : ""}
+      <div class="section-title">Ask a follow-up <span class="sub">explained in plain words, from the numbers above</span></div>
+      <div class="ask">
+        <div class="ask-row">
+          <input id="ask-input" type="text" autocomplete="off"
+                 placeholder="e.g. Why not the bigger model? What does 4-bit mean?" />
+          <button id="ask-send" class="ask-btn" title="Ask"><span class="ic" data-ic="arrow"></span></button>
+        </div>
+        <div class="ask-chips">
+          <button class="ask-chip">Why this model?</button>
+          <button class="ask-chip">What does the setting mean?</button>
+          <button class="ask-chip">Will it feel fast?</button>
+        </div>
+        <div id="ask-answer" class="ask-answer" hidden></div>
+      </div>
     </div>`;
   hydrate($("#results"));
     b.textContent = "Copied ✓"; b.classList.add("done");
     setTimeout(() => { b.textContent = "Copy"; b.classList.remove("done"); }, 1500);
   }));
+  wireAsk();
+}
+// ---- Follow-up: the model brick (grounded explainer) ---------------------
+function wireAsk() {
+  const input = $("#ask-input"), send = $("#ask-send");
+  if (!input || !send) return;
+  const go = () => askQuestion(input.value);
+  send.addEventListener("click", go);
+  input.addEventListener("keydown", e => { if (e.key === "Enter") go(); });
+  $("#results").querySelectorAll(".ask-chip").forEach(c =>
+    c.addEventListener("click", () => { input.value = c.textContent; askQuestion(c.textContent); }));
+}
+async function askQuestion(question) {
+  question = (question || "").trim();
+  const box = $("#ask-answer");
+  if (!question || !box) return;
+  box.hidden = false;
+  box.innerHTML = `<div class="ans-loading"><span class="spinner"></span>Thinking it through…</div>`;
+  try {
+    const a = await callAsk(question, JSON.stringify(lastAdvice || {}));
+    renderAnswer(box, a);
+  } catch (e) {
+    box.innerHTML = `<div class="ans-card"><p>Couldn't reach the explainer just now. The verdict and numbers above still stand.</p></div>`;
+  }
+}
+function renderAnswer(box, a) {
+  a = a || {};
+  const tag = a.fallback ? `<div class="ans-tag">Quick explainer (the AI model wasn't used for this answer)</div>` : "";
+  box.innerHTML = `
+    <div class="ans-card reveal">
+      ${a.headline ? `<h3>${a.headline}</h3>` : ""}
+      ${a.why ? `<p>${a.why}</p>` : ""}
+      ${a.next_step ? `<div class="ans-next"><span class="ic" data-ic="arrow"></span><span>${a.next_step}</span></div>` : ""}
+      ${tag}
+    </div>`;
+  hydrate(box);
+}
+// On a ZeroGPU Space the JS client is REQUIRED (it forwards the HF iframe auth
+// headers ZeroGPU needs). Locally / non-ZeroGPU, or when the CDN import fails,
+// we fall back to the raw two-step call so the chat still works offline.
+let _gradioClient = null;
+async function getClient() {
+  if (_gradioClient) return _gradioClient;
+  const mod = await import("https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js");
+  const Client = mod.Client || mod.client;
+  _gradioClient = await Client.connect(window.location.origin);
+  return _gradioClient;
+}
+async function callAsk(question, facts) {
+  try {
+    const client = await getClient();
+    const r = await client.predict("/ask", { question, facts });
+    return Array.isArray(r.data) ? r.data[0] : r.data;
+  } catch (e) {
+    return await callAskRaw(question, facts);
+  }
+}
+async function callAskRaw(question, facts) {
+  const post = await fetch("/gradio_api/call/ask", {
+    method: "POST", headers: { "Content-Type": "application/json" },
+    body: JSON.stringify({ data: [question, facts] }),
+  });
+  const { event_id } = await post.json();
+  const res = await fetch(`/gradio_api/call/ask/${event_id}`);
+  const text = await res.text();
+  const lines = [...text.matchAll(/data:\s*(.+)/g)];   // SSE data frames
+  if (!lines.length) throw new Error("no data in stream");
+  const arr = JSON.parse(lines[lines.length - 1][1]);  // last frame = result
+  return Array.isArray(arr) ? arr[0] : arr;
 }
 // ---- Init -----------------------------------------------------------------
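`callAskRaw` above makes a two-step call: the POST returns an `event_id`, and the follow-up GET returns a server-sent-events body whose last `data:` line carries the result. The same parsing step can be sketched in Python, which is handy for checking the endpoint from a script. `last_sse_data` is a hypothetical helper name, and the stream shape is assumed to match what the JavaScript handles.

```python
import json


def last_sse_data(stream_text: str):
    """Decode the JSON payload of the last `data:` line in an SSE body."""
    payloads = [
        line[len("data:"):].strip()
        for line in stream_text.splitlines()
        if line.startswith("data:")
    ]
    payloads = [p for p in payloads if p]  # ignore empty data lines
    if not payloads:
        raise ValueError("no data in stream")
    result = json.loads(payloads[-1])  # last frame = result
    # Like the JS version, unwrap a one-element result list.
    if isinstance(result, list) and result:
        return result[0]
    return result
```

Parsing line by line instead of with a single regular expression keeps a `data:` prefix from matching across a line break, which is the one place the two approaches can differ.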

static/style.css CHANGED Viewed

@@ -363,6 +363,54 @@ details.disc > summary:hover { color: var(--text-primary); }
 .copy-btn:hover { color: var(--text-primary); border-color: var(--border-hi); }
 .copy-btn.done { color: var(--ok); border-color: var(--ok); }
 /* Footer */
 .foot { text-align: center; color: var(--text-muted); font-size: 13px; margin-top: var(--s-7); line-height: 1.6; }
 .foot b { color: var(--text-secondary); }

 .copy-btn:hover { color: var(--text-primary); border-color: var(--border-hi); }
 .copy-btn.done { color: var(--ok); border-color: var(--ok); }
+/* Ask a follow-up (the model brick) */
+.ask-row { display: flex; gap: var(--s-2); }
+.ask-row input {
+  flex: 1; background: var(--bg-inset); border: 1px solid var(--border);
+  border-radius: var(--r-md); padding: 12px 14px; font-size: 15px;
+}
+.ask-btn {
+  flex: none; width: 46px; border: none; border-radius: var(--r-md);
+  background: linear-gradient(135deg, var(--accent), var(--accent-strong));
+  color: #fff; display: grid; place-items: center;
+  transition: transform .15s, box-shadow .15s, filter .15s;
+}
+.ask-btn:hover { transform: translateY(-2px); box-shadow: var(--glow); filter: brightness(1.05); }
+.ask-btn .ic { font-size: 18px; }
+.ask-chips { display: flex; flex-wrap: wrap; gap: var(--s-2); margin-top: var(--s-3); }
+.ask-chip {
+  background: var(--bg-inset); border: 1px solid var(--border);
+  color: var(--text-secondary); border-radius: var(--r-pill);
+  padding: 6px 13px; font-size: 13px; font-weight: 500;
+  transition: border-color .15s, color .15s, background .15s;
+}
+.ask-chip:hover { border-color: var(--accent); color: var(--text-primary); background: var(--accent-soft); }
+.ask-answer { margin-top: var(--s-4); }
+.ans-card {
+  background: var(--bg-inset); border: 1px solid var(--border);
+  border-left: 4px solid var(--accent); border-radius: var(--r-md);
+  padding: var(--s-4) var(--s-5);
+}
+.ans-card h3 { font-size: 17px; font-weight: 700; margin-bottom: var(--s-2); }
+.ans-card p { color: var(--text-secondary); font-size: 14.5px; line-height: 1.6; }
+.ans-next {
+  display: flex; align-items: center; gap: var(--s-2); margin-top: var(--s-3);
+  font-size: 14px; font-weight: 600; color: var(--accent);
+}
+.ans-next .ic { font-size: 15px; flex: none; }
+.ans-tag {
+  margin-top: var(--s-3); padding-top: var(--s-3); border-top: 1px solid var(--border);
+  font-size: 12px; color: var(--text-muted);
+}
+.ans-loading { display: flex; align-items: center; gap: var(--s-2); color: var(--text-muted); font-size: 14px; padding: var(--s-2) 0; }
+.spinner {
+  width: 15px; height: 15px; flex: none; border-radius: 50%;
+  border: 2px solid var(--border-hi); border-top-color: var(--accent);
+  animation: spin .7s linear infinite;
+}
+@keyframes spin { to { transform: rotate(360deg); } }
 /* Footer */
 .foot { text-align: center; color: var(--text-muted); font-size: 13px; margin-top: var(--s-7); line-height: 1.6; }
 .foot b { color: var(--text-secondary); }