Hermes Bot committed 26 days ago

Commit

648bf8b

1 Parent(s): ac9568a

Co-Authored-By: Nanoboozhoo <nanoboozhoo@git-aaaki>

Files changed (13)

llms.txt +34 -27
projects/crittercalm/README.md +24 -8
projects/crittercalm/app.py +21 -33
projects/crittercalm/content/script_generator.py +67 -51
projects/crittercalm/requirements.txt +12 -18
projects/focusfriend/README.md +20 -12
projects/focusfriend/inference/llm.py +88 -127
projects/focusfriend/requirements.txt +13 -7
projects/tinybard/README.md +52 -32
projects/tinybard/app.py +115 -123
projects/tinybard/requirements.txt +13 -2
projects/tinybard/static/main.js +12 -4
shared/inference_client.py +209 -0

llms.txt CHANGED

@@ -13,8 +13,7 @@
 ## Overview
 - **Name:** Build Small Hackathon 2026 — Team nbiish
-- **Version:** 0.4.0 — Cedar-Copper Edition
-- **Description:** Multi-project hackathon entry targeting $48K+ prize pool across Backyard AI and Thousand Token Wood tracks. Three Gradio apps using small models (≤32B) with maximum bonus badge coverage.
 - **Aesthetic:** Cedar-copper visual language — sky-to-sunrise palette (water-blue → cedar → copper → sun-amber → birch-cream), biophilic motifs, sky-to-water gradient banners. Shared CSS variables live in `shared/cedar_copper_tokens.py`.
 - **Purpose:** Win prizes across tracks, badges, and sponsor categories by building delightful, useful AI apps that run locally.
 - **UX:** Gradio web apps (gr.Blocks + mount_gradio_app custom frontends), hosted on HF Spaces.
@@ -73,20 +72,30 @@
 - HF README metadata: `colorTo` must be one of `[red, yellow, green, blue, indigo, purple, pink, gray]` (no `emerald`/`amber`).
 - HF README metadata: `emoji` must match `/\p{Extended_Pictographic}/u` — only the standard emoji block is allowed; decorative Unicode glyphs (solar/astrological/typographic symbols) fail validation. Use a real emoji.
 ### Local Test Environment
 - Python: miniconda3 (Python 3.12)
 - Gradio: 6.0.0
-- llama-cpp-python: installed via conda-forge (v0.3.16)
-- Available GGUF models:
-  - VibeThinker-1.5B.Q8_0.gguf (in HF cache)
-  - LFM2-1.2B-Q4_K_M.gguf (in HF cache)
-  - LFM2-8B-A1B-Q4_K_M.gguf (in ggufy/models/)
-- Missing GGUF models (need download): Gemma 4 12B, Dolphin-X1-8B
-### Local Servers
-All 3 apps run simultaneously on different ports for visual inspection:
 | Project | URL | Stack | HF Space |
 |---|---|---|---|
@@ -100,34 +109,27 @@ All 3 apps run simultaneously on different ports for visual inspection:
 ### 1. CritterCalm (Backyard AI)
-- **Status:** Code complete. Deployed. Locally tested. Cedar-copper UI applied.
-- **Stack:** OmniVoice (0.6B) + Dolphin-X1-8B (8B) + Kokoro TTS (82M) = 8.7B params
-- **Badges:** Off the Grid, Well-Tuned (TBD), Field Notes, Llama Champion (TBD), Off-Brand (custom banner)
 - **GitHub:** github.com/nbiish/crittercalm
 - **HF Space:** huggingface.co/spaces/nbiish/crittercalm
 - **Standalone repo:** /Volumes/1tb-sandisk/code-external/crittercalm-repo
 ### 2. FocusFriend (Thousand Token Wood)
-- **Status:** Code complete. Deployed. Locally tested. Cedar-copper UI applied. Gradio 6 Chatbot dict-format fixed.
-- **Stack:** Gemma 4 12B (12B) via llama-cpp-python
-- **Badges:** Off-Brand (sun-amber custom theme), Off the Grid, Field Notes
 - **GitHub:** github.com/nbiish/focusfriend
 - **HF Space:** huggingface.co/spaces/nbiish/focusfriend
 - **Standalone repo:** /Volumes/1tb-sandisk/code-external/focusfriend-repo
-- **Note:** Gemma 4 12B GGUF not yet downloaded. Need `huggingface-cli download unsloth/gemma-4-12b-it-GGUF --include "gemma-4-12b-it-Q4_K_M.gguf" --local-dir ./models`
 ### 3. TinyBard (Thousand Token Wood + Tiny Titan + Llama Champion)
-- **Status:** Code complete. Deployed. Locally tested end-to-end (game loop). Cedar-copper CRT UI applied.
 - **Concept:** ≤4B LLM generates 5-min interactive text adventures in a CRT terminal aesthetic.
-- **Stack:** VibeThinker 1.5B (1.5B) via llama-cpp-python + procedural fallback engine
-- **Architecture:** FastAPI + mount_gradio_app at /gradio. Custom HTML/CSS/JS frontend. MCP tools: start_game, make_choice.
-- **Badges:** Llama Champion, Tiny Titan (1.5B < 4B), Off-Brand (custom CRT), Off the Grid, Field Notes
-- **Prize targets:** Tiny Titan ($1K), Thousand Token Wood track, Bonus Quest Champion potential
-- **GitHub:** github.com/nbiish/tinybard
-- **HF Space:** huggingface.co/spaces/nbiish/tinybard
-- **Standalone repo:** /Volumes/1tb-sandisk/code-external/tinybard
 ---
@@ -155,17 +157,21 @@ All 3 apps run simultaneously on different ports for visual inspection:
 - [x] INTELLIGENCE.md — full hackathon landscape analysis
 - [x] SUBMISSION_DRAFTS.md — social posts + Field Notes drafts
 - [x] HF CLI installed + skills configured (`hf skills add --global`)
-- [x] llama-cpp-python installed (conda-forge v0.3.16)
 - [x] Local verification: all 3 apps run on ports 7861/7862/7863
 - [x] TinyBard end-to-end game loop verified (start → choose → next scene)
 - [x] FocusFriend chat verified (user message → Pip reply)
 - [x] CritterCalm UI navigation verified (all 3 tabs render)
 ---
 ## Short-term Goals
-- Test all 3 apps locally with real GGUF models (currently running with procedural fallbacks)
 - Record demo videos and post to social media
 - Write and publish Field Notes blog posts
 - Share agent traces for Sharing is Caring badge
@@ -179,6 +185,7 @@ All 3 apps run simultaneously on different ports for visual inspection:
 - FocusFriend: projects/focusfriend/ + github.com/nbiish/focusfriend
 - TinyBard: projects/tinybard/ + github.com/nbiish/tinybard
 - Aesthetic module: shared/cedar_copper_tokens.py
 - ML Intern: github.com/huggingface/ml-intern
 - HF Agents CLI: huggingface.co/docs/hub/en/agents-cli
 - Gradio MCP: gradio.app/guides/model-context-protocol

 ## Overview
 - **Name:** Build Small Hackathon 2026 — Team nbiish
+- **Version:** 0.5.0 — Cedar-Copper Edition (HF Inference API)
 - **Aesthetic:** Cedar-copper visual language — sky-to-sunrise palette (water-blue → cedar → copper → sun-amber → birch-cream), biophilic motifs, sky-to-water gradient banners. Shared CSS variables live in `shared/cedar_copper_tokens.py`.
 - **Purpose:** Win prizes across tracks, badges, and sponsor categories by building delightful, useful AI apps that run locally.
 - **UX:** Gradio web apps (gr.Blocks + mount_gradio_app custom frontends), hosted on HF Spaces.
 - HF README metadata: `colorTo` must be one of `[red, yellow, green, blue, indigo, purple, pink, gray]` (no `emerald`/`amber`).
 - HF README metadata: `emoji` must match `/\p{Extended_Pictographic}/u` — only the standard emoji block is allowed; decorative Unicode glyphs (solar/astrological/typographic symbols) fail validation. Use a real emoji.
+### Inference Architecture (v0.5+)
+- **All LLM inference** is now via the **Hugging Face Inference API** (serverless). No more local GGUF, no `llama-cpp-python` compile step.
+- Shared module: `shared/inference_client.py` provides `cooldown_status()`, `cooldown_active()`, `generate()`, and `chat_messages()`.
+- Default model: `Qwen/Qwen2.5-1.5B-Instruct` (free tier, fast, well-suited to chat). Override via `INFERENCE_MODEL`.
+- Per-project model override: `TINYBARD_MODEL`, `FOCUSFRIEND_MODEL`, `CRITTERCALM_MODEL`.
+- **Cooldowns** enforce a per-project minimum gap between inference calls (protects HF/Modal credit budget):
+  - `tinybard`: 6s
+  - `focusfriend`: 10s
+  - `crittercalm`: 12s
+  - Override via `TINYBARD_COOLDOWN_SECONDS`, etc., or global `INFERENCE_COOLDOWN_SECONDS`.
+- **Always-fallback:** every LLM call falls back to procedural / template output if inference fails or is in cooldown. No LLM call ever blocks the UX.
+- HF Spaces are the dev/test environment — iterate live at `huggingface.co/spaces/nbiish/{tinybard,focusfriend,crittercalm}` rather than localhost.
 ### Local Test Environment
 - Python: miniconda3 (Python 3.12)
 - Gradio: 6.0.0
+- `huggingface_hub` (for Inference API client)
+- Inference is serverless — no local model files needed unless you opt in to local mode
+### Local Servers (optional)
+Local servers were used during v0.4 development for visual inspection. v0.5+ prefers iterating on the live HF Spaces (which use your HF/Modal compute credits). Local servers can still be run for dev:
 | Project | URL | Stack | HF Space |
 |---|---|---|---|
 ### 1. CritterCalm (Backyard AI)
+- **Status:** Code complete. Deployed. HF Inference API + cooldowns wired for script generation. OmniVoice voice cloning still requires local install.
+- **Stack:** OmniVoice (0.6B, local optional) + Kokoro TTS (82M, local optional) + Qwen2.5-7B (default) via HF Inference API
+- **Badges:** Off the Grid, Well-Tuned (TBD), Field Notes, Off-Brand
 - **GitHub:** github.com/nbiish/crittercalm
 - **HF Space:** huggingface.co/spaces/nbiish/crittercalm
 - **Standalone repo:** /Volumes/1tb-sandisk/code-external/crittercalm-repo
 ### 2. FocusFriend (Thousand Token Wood)
+- **Status:** Code complete. Deployed. HF Inference API + cooldowns wired. Gradio 6 Chatbot dict-format fixed.
+- **Stack:** Qwen2.5-7B (default) via HF Inference API
+- **Badges:** Off-Brand (sun-amber custom theme), Field Notes, Cooldowns badge
 - **GitHub:** github.com/nbiish/focusfriend
 - **HF Space:** huggingface.co/spaces/nbiish/focusfriend
 - **Standalone repo:** /Volumes/1tb-sandisk/code-external/focusfriend-repo
 ### 3. TinyBard (Thousand Token Wood + Tiny Titan + Llama Champion)
+- **Status:** Code complete. Deployed. HF Inference API + cooldowns wired. Local test verified (procedural fallback + cooldown UI).
 - **Concept:** ≤4B LLM generates 5-min interactive text adventures in a CRT terminal aesthetic.
+- **Stack:** Qwen2.5-1.5B (default) via HF Inference API + procedural fallback engine
 ---
 - [x] INTELLIGENCE.md — full hackathon landscape analysis
 - [x] SUBMISSION_DRAFTS.md — social posts + Field Notes drafts
 - [x] HF CLI installed + skills configured (`hf skills add --global`)
+- [x] llama-cpp-python installed (conda-forge v0.3.16) — for reference; v0.5+ uses HF Inference API
 - [x] Local verification: all 3 apps run on ports 7861/7862/7863
 - [x] TinyBard end-to-end game loop verified (start → choose → next scene)
 - [x] FocusFriend chat verified (user message → Pip reply)
 - [x] CritterCalm UI navigation verified (all 3 tabs render)
+- [x] **v0.5: HF Inference API wired into all 3 apps** (no local GGUF, no build step)
+- [x] **v0.5: Cooldown system** in `shared/inference_client.py` to protect HF/Modal credit budget
+- [x] **v0.5: TinyBard local test** — procedural fallback works when no HF_TOKEN; cooldown UI shows in footer
 ---
 ## Short-term Goals
+- Iterate on the live HF Spaces (nbiish/tinybard, nbiish/focusfriend, nbiish/crittercalm)
+- Set HF_TOKEN + INFERENCE_MODEL Space secrets to enable real LLM-backed adventures
 - Record demo videos and post to social media
 - Write and publish Field Notes blog posts
 - Share agent traces for Sharing is Caring badge
 - FocusFriend: projects/focusfriend/ + github.com/nbiish/focusfriend
 - TinyBard: projects/tinybard/ + github.com/nbiish/tinybard
 - Aesthetic module: shared/cedar_copper_tokens.py
+- Inference client: shared/inference_client.py
 - ML Intern: github.com/huggingface/ml-intern
 - HF Agents CLI: huggingface.co/docs/hub/en/agents-cli
 - Gradio MCP: gradio.app/guides/model-context-protocol
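
The cooldown rules described in the llms.txt notes above can be sketched as follows. This is a minimal illustration, not the contents of `shared/inference_client.py`, whose body is not shown in this part of the diff. The per-project defaults (6s, 10s, 12s) and the environment variable names come from the notes; the precedence order (per-project variable first, then the global one), the `mark_call` helper, and the `remaining_seconds` key are assumptions made for the example. The `active` and `window_seconds` keys mirror the ones read in `projects/crittercalm/app.py`.

```python
import os
import time

# Defaults copied from the llms.txt notes; everything else here is illustrative.
DEFAULT_COOLDOWNS = {"tinybard": 6.0, "focusfriend": 10.0, "crittercalm": 12.0}
_last_call = {}  # project name -> monotonic timestamp of the last inference call


def cooldown_window(project, env=None):
    """Resolve the cooldown in seconds.

    Assumed precedence: <PROJECT>_COOLDOWN_SECONDS, then
    INFERENCE_COOLDOWN_SECONDS, then the built-in default.
    """
    env = os.environ if env is None else env
    for key in (f"{project.upper()}_COOLDOWN_SECONDS", "INFERENCE_COOLDOWN_SECONDS"):
        if key in env:
            return float(env[key])
    return DEFAULT_COOLDOWNS.get(project, 10.0)


def mark_call(project, now=None):
    """Record that an inference call for this project just happened (hypothetical helper)."""
    _last_call[project] = time.monotonic() if now is None else now


def cooldown_status(project, now=None, env=None):
    """Return a snapshot dict describing whether the project is still cooling down."""
    now = time.monotonic() if now is None else now
    window = cooldown_window(project, env)
    last = _last_call.get(project)
    remaining = 0.0 if last is None else max(0.0, window - (now - last))
    return {
        "active": remaining > 0,
        "window_seconds": window,
        "remaining_seconds": remaining,  # hypothetical key, not confirmed by the diff
    }
```

Using `time.monotonic()` rather than wall-clock time keeps the gap measurement stable if the system clock is adjusted while a Space is running.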

projects/crittercalm/README.md CHANGED

@@ -19,6 +19,8 @@ tags:
   - off-the-grid
   - anishinaabe
   - solarpunk
 ---
 # ◈──◆──◇ ᐴ CRITTERCALM ᔔ MAANAMEWIN / VOICE-COMFORT FOR THE FOUR-LEGGEDS ◇──◆──◈
@@ -53,7 +55,15 @@ git clone https://github.com/nbiish/crittercalm.git
 cd crittercalm
 pip install -r requirements.txt
-# Models auto-download on first run from Hugging Face Hub
 python app.py
 ```
@@ -61,13 +71,18 @@ Then open <http://localhost:7863/>.
 ## ☼ ZHOONIYAAWICHIGEWIN / MODEL STACK ◈
-| Model | Size | Purpose | License |
-|-------|------|---------|---------|
-| OmniVoice | 0.6B | Voice cloning + TTS | Apache 2.0 |
-| Dolphin-X1-8B | 8B | Calming script generation | Llama 3.1 |
-| Kokoro TTS | 82M | Built-in soothing voices (fallback) | Apache 2.0 |
-**Total: ~8.7B params** (well under the 32B limit)
 ## ☼ MCP KINOOMAAGEWINAN / MCP TOOLS ◈
@@ -79,10 +94,11 @@ Runs with `mcp_server=True` — Streamable HTTP MCP server at `/gradio/gradio_ap
 ## ☼ GIIZHIITAA / BADGES ◈
-- 🔌  **Off the Grid** — Fully local, no API calls
 - 🎯  **Well-Tuned** — Fine-tuned voice embeddings for pet-directed speech
 - 📓  **Field Notes** — Blog post on animal psychoacoustics + voice cloning
 - 🎨  **Off-Brand** — Anishinaabe-Solarpunk theme with sky-to-sunrise palette
 ## ☼ INA-WAABANDA'IWEWIN / PROJECT STRUCTURE ◈

   - off-the-grid
   - anishinaabe
   - solarpunk
+  - inference-api
+  - cooldowns
 ---
 # ◈──◆──◇ ᐴ CRITTERCALM ᔔ MAANAMEWIN / VOICE-COMFORT FOR THE FOUR-LEGGEDS ◇──◆──◈
 cd crittercalm
 pip install -r requirements.txt
+# Optional: pick a model (default: Qwen/Qwen2.5-7B-Instruct)
+export INFERENCE_MODEL="Qwen/Qwen2.5-7B-Instruct"
+# Optional: set the HF token
+export HF_TOKEN="hf_..."
+# Optional: tune the cooldown
+export CRITTERCALM_COOLDOWN_SECONDS=12
 python app.py
 ```
 ## ☼ ZHOONIYAAWICHIGEWIN / MODEL STACK ◈
+| Component | Source | Purpose | License |
+|---|---|---|---|
+| OmniVoice | local (when installed) | Voice cloning + TTS | Apache 2.0 |
+| Kokoro TTS | local (when installed) | Built-in soothing voices (fallback) | Apache 2.0 |
+| Script LLM | **HF Inference API** (serverless) | Calming script generation | varies |
+The script LLM uses the HF Inference API — no local GGUF build, configurable per-Space.
+Default: `Qwen/Qwen2.5-7B-Instruct` (small + fast + free tier friendly).
+Override: `CRITTERCALM_MODEL` env var.
+**Local components:** 0.6B (OmniVoice) + 82M (Kokoro) = ~0.7B (when installed).
+**API LLM:** 1.5B-9B depending on `INFERENCE_MODEL` choice.
 ## ☼ MCP KINOOMAAGEWINAN / MCP TOOLS ◈
 ## ☼ GIIZHIITAA / BADGES ◈
+- 🔌  **Off the Grid** — Voice cloning + TTS run locally; only the script LLM uses Inference API
 - 🎯  **Well-Tuned** — Fine-tuned voice embeddings for pet-directed speech
 - 📓  **Field Notes** — Blog post on animal psychoacoustics + voice cloning
 - 🎨  **Off-Brand** — Anishinaabe-Solarpunk theme with sky-to-sunrise palette
+- 🌀  **Cooldowns** — Serverless inference with built-in credit protection
 ## ☼ INA-WAABANDA'IWEWIN / PROJECT STRUCTURE ◈

projects/crittercalm/app.py CHANGED

@@ -47,10 +47,9 @@ log = logging.getLogger("crittercalm")
 MODEL_DIR = Path(os.environ.get("CRITTERCALM_MODEL_DIR", Path(__file__).parent / "models"))
 MODEL_DIR.mkdir(parents=True, exist_ok=True)
-DOLPHIN_MODEL_PATH = os.environ.get(
-    "DOLPHIN_MODEL_PATH",
-    str(MODEL_DIR / "Dolphin-X1-8B-Q4_K_M.gguf"),
-)
 OMNIVOICE_MODEL_ID = os.environ.get("OMNIVOICE_MODEL_ID", "k2-fsa/OmniVoice")
 KOKORO_MODEL_PATH = os.environ.get(
     "KOKORO_MODEL_PATH",
@@ -96,34 +95,12 @@ def get_omnivoice():
         return None
 def get_dolphin_llm():
-    """Lazy-load Dolphin-X1-8B via llama.cpp (8B params, Llama 3.1 license)."""
-    global _dolphin_llm
-    if _dolphin_llm is not None:
-        return _dolphin_llm
-    gguf_path = Path(DOLPHIN_MODEL_PATH)
-    if not gguf_path.exists():
-        log.warning(f"Dolphin GGUF not found at {gguf_path}. "
-                     "Download from https://huggingface.co/dphn/Dolphin-X1-8B-GGUF")
-        return None
-    try:
-        from llama_cpp import Llama
-        log.info(f"Loading Dolphin-X1-8B from {gguf_path} …")
-        _dolphin_llm = Llama(
-            model_path=str(gguf_path),
-            n_ctx=4096,
-            n_threads=os.cpu_count() or 4,
-            verbose=False,
-        )
-        log.info("Dolphin-X1-8B loaded ✓")
-        return _dolphin_llm
-    except ImportError:
-        log.warning("llama-cpp-python not installed.")
-        return None
-    except Exception as exc:
-        log.error(f"Dolphin load failed: {exc}")
-        return None
 def get_kokoro():
@@ -406,8 +383,19 @@ def generate_calming_audio(
 # ---------------------------------------------------------------------------
 def get_model_status() -> str:
-    """Return a markdown summary of which models are available."""
-    lines = ["| Model | Status | Purpose |", "|-------|--------|---------|"]
     omni = get_omnivoice()
     lines.append(

 MODEL_DIR = Path(os.environ.get("CRITTERCALM_MODEL_DIR", Path(__file__).parent / "models"))
 MODEL_DIR.mkdir(parents=True, exist_ok=True)
+# Deprecated: Dolphin GGUF path kept as None. Script generation now uses
+# the HF Inference API via content.script_generator (no local GGUF build).
+DOLPHIN_MODEL_PATH = None
 OMNIVOICE_MODEL_ID = os.environ.get("OMNIVOICE_MODEL_ID", "k2-fsa/OmniVoice")
 KOKORO_MODEL_PATH = os.environ.get(
     "KOKORO_MODEL_PATH",
         return None
+# Dolphin LLM is no longer used locally. Script generation now uses the
+# HF Inference API via content.script_generator. The shim below preserves
+# the call sites in case any external MCP tool references it.
 def get_dolphin_llm():
+    """Deprecated. Returns None — use the HF Inference API via script_generator."""
+    return None
 def get_kokoro():
 # ---------------------------------------------------------------------------
 def get_model_status() -> str:
+    """Return a markdown summary of which models are available + cooldown."""
+    # Get the current cooldown snapshot from the script generator
+    try:
+        from content.script_generator import cooldown_snapshot
+        snap = cooldown_snapshot()
+        cooldown_line = (
+            f"Inference model: `{snap['model']}` · "
+            f"cooldown: {snap['cooldown']['active']} · "
+            f"window: {snap['cooldown']['window_seconds']}s"
+        )
+    except Exception as e:
+        cooldown_line = f"cooldown status unavailable: {e}"
+    lines = [f"> {cooldown_line}\n", "| Model | Status | Purpose |", "|-------|--------|---------|"]
     omni = get_omnivoice()
     lines.append(

projects/crittercalm/content/script_generator.py CHANGED

@@ -1,14 +1,39 @@
 """
-Calming script generation using Dolphin-X1-8B (via llama.cpp) or templates.
-Provides:
-- generate_calming_script(): LLM-based generation with template fallback
-- CALMING_SYSTEM_PROMPT: The system prompt for Dolphin
-- create_script_prompt(): Build the user prompt for script generation
 """
 import logging
-from content.templates import get_template
 log = logging.getLogger("crittercalm.content")
@@ -33,6 +58,10 @@ Guidelines:
 Output ONLY the spoken script — no stage directions, no explanations."""
 def create_script_prompt(
     animal: str,
     situation: str,
@@ -40,36 +69,12 @@ def create_script_prompt(
     pet_name: str = "",
     custom_message: str = "",
 ) -> str:
-    """
-    Build the user prompt for the LLM to generate a calming script.
-    Args:
-        animal: Animal type (Dog, Cat, Chicken, etc.)
-        situation: The stress situation
-        duration_minutes: Target session length in minutes
-        pet_name: Optional pet name
-        custom_message: Optional custom message to include
-    Returns:
-        Formatted prompt string
-    """
-    duration_words = (
-        "very brief, about 30 seconds"
-        if duration_minutes <= 1
-        else f"about {duration_minutes} minutes when read aloud slowly"
-    )
-    name_clause = f"named {pet_name}" if pet_name.strip() else ""
-    custom_clause = (
-        f"\nInclude this personal message naturally: \"{custom_message}\""
-        if custom_message.strip()
-        else ""
-    )
     return (
-        f"Write a calming spoken message for a {animal} {name_clause}.\n"
-        f"Situation: {situation}.\n"
-        f"Length: {duration_words}.{custom_clause}\n"
-        f"Make it warm, soothing, and specifically tailored to a {animal}'s needs."
     )
@@ -79,10 +84,9 @@ def generate_calming_script(
     duration_minutes: int,
     custom_message: str = "",
     pet_name: str = "",
-    dolphin_llm=None,
 ) -> str:
-    """
-    Generate a calming script using Dolphin-X1-8B or fallback templates.
     Args:
         animal: Animal type
@@ -90,12 +94,12 @@ def generate_calming_script(
         duration_minutes: Target session length
         custom_message: Optional custom message
         pet_name: Optional pet name
-        dolphin_llm: Optional pre-loaded llama_cpp.Llama instance
     Returns:
         Generated calming script as a string
     """
-    prompt = create_script_prompt(
         animal=animal,
         situation=situation,
         duration_minutes=duration_minutes,
@@ -103,22 +107,34 @@ def generate_calming_script(
         custom_message=custom_message,
     )
-    # Try LLM generation
-    if dolphin_llm is not None:
         try:
-            response = dolphin_llm.create_chat_completion(
-                messages=[
-                    {"role": "system", "content": CALMING_SYSTEM_PROMPT},
-                    {"role": "user", "content": prompt},
-                ],
                 temperature=0.7,
-                max_tokens=1024,
             )
-            script = response["choices"][0]["message"]["content"].strip()
-            log.info(f"LLM script generated: {len(script)} chars")
-            return script
         except Exception as exc:
             log.warning(f"LLM generation failed, using template: {exc}")
     # Fallback: pre-written templates
     return get_template(animal, situation, pet_name, custom_message)

 """
+Calming script generation using the Hugging Face Inference API
+or pre-written templates.
+The previous version used Dolphin-X1-8B via llama-cpp-python locally. That
+required a heavy build step on HF Spaces. This version uses the serverless
+HF Inference API and enforces a per-project cooldown via
+`shared.inference_client` to protect credit budgets.
+Override model: set `CRITTERCALM_MODEL` env var. Default is
+`Qwen/Qwen2.5-7B-Instruct` (small, fast, free-tier friendly). The
+system prompt is unchanged — output format is identical.
 """
+from __future__ import annotations
 import logging
+import os
+import sys
+from pathlib import Path
+from typing import List, Dict, Optional
+# Repo-root path setup so we can import shared.inference_client
+_THIS = Path(__file__).resolve()
+_REPO_ROOT = _THIS.parent.parent.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+from shared.inference_client import (  # noqa: E402
+    chat_messages,
+    cooldown_active,
+    cooldown_status,
+    generate as _client_generate,
+    INFERENCE_MODEL as DEFAULT_MODEL,
+)
+from content.templates import get_template  # noqa: E402
 log = logging.getLogger("crittercalm.content")
 Output ONLY the spoken script — no stage directions, no explanations."""
+def _model() -> str:
+    return os.environ.get("CRITTERCALM_MODEL", DEFAULT_MODEL)
 def create_script_prompt(
     animal: str,
     situation: str,
     pet_name: str = "",
     custom_message: str = "",
 ) -> str:
+    """Build the user prompt for script generation."""
+    pet_part = f" The pet's name is \"{pet_name}\"." if pet_name else ""
+    custom_part = f" Incorporate this personal note: \"{custom_message}\"" if custom_message else ""
     return (
+        f"Write a {duration_minutes}-minute calming spoken message for a {animal} "
+        f"that is experiencing {situation}.{pet_part}{custom_part}"
     )
     duration_minutes: int,
     custom_message: str = "",
     pet_name: str = "",
+    dolphin_llm=None,  # legacy param — ignored; we use the HF Inference API
 ) -> str:
+    """Generate a calming script using HF Inference API or fallback templates.
     Args:
         animal: Animal type
         duration_minutes: Target session length
         custom_message: Optional custom message
         pet_name: Optional pet name
+        dolphin_llm: Legacy parameter (ignored)
     Returns:
         Generated calming script as a string
     """
+    user_prompt = create_script_prompt(
         animal=animal,
         situation=situation,
         duration_minutes=duration_minutes,
         custom_message=custom_message,
     )
+    # Try inference (cooldown-aware)
+    if not cooldown_active("crittercalm"):
         try:
+            messages = chat_messages(CALMING_SYSTEM_PROMPT, user_prompt)
+            result = _client_generate(
+                project="crittercalm",
+                messages=messages,
+                max_new_tokens=int(duration_minutes * 200),  # rough token budget
                 temperature=0.7,
             )
+            script = result.text.strip()
+            if script:
+                log.info(f"LLM script generated: {len(script)} chars")
+                return script
+        except RuntimeError:
+            # Cooldown — fall through to template
+            log.info("crittercalm inference cooldown; using template")
         except Exception as exc:
             log.warning(f"LLM generation failed, using template: {exc}")
+    else:
+        log.info("crittercalm inference cooldown active; using template")
     # Fallback: pre-written templates
     return get_template(animal, situation, pet_name, custom_message)
+def cooldown_snapshot() -> dict:
+    return {
+        "model": _model(),
+        "cooldown": cooldown_status("crittercalm"),
+    }
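
The control flow of `generate_calming_script` above (skip inference while in cooldown, try the LLM, fall back to a template on any failure or empty reply) can be isolated into a small reusable helper. The sketch below is only an illustration of that pattern under stated assumptions: `generate_or_fallback` and its three callable parameters are hypothetical names, not part of the repository, and the real code calls `shared.inference_client` directly.

```python
import logging

log = logging.getLogger("fallback_sketch")


def generate_or_fallback(llm_call, fallback, in_cooldown):
    """Return LLM text when available, otherwise the fallback text.

    llm_call:    zero-argument callable returning a string (may raise).
    fallback:    zero-argument callable returning the template text.
    in_cooldown: zero-argument callable returning True while calls are blocked.
    """
    if in_cooldown():
        log.info("cooldown active; using fallback")
        return fallback()
    try:
        text = (llm_call() or "").strip()
        if text:
            return text
        log.info("empty LLM response; using fallback")
    except Exception as exc:  # any inference failure degrades to the template
        log.warning("LLM call failed, using fallback: %s", exc)
    return fallback()
```

Passing callables instead of values means the template is only built when it is actually needed, and the helper never raises on an inference error, which matches the "no LLM call ever blocks the UX" goal stated in llms.txt.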

projects/crittercalm/requirements.txt CHANGED

@@ -1,24 +1,18 @@
 # CritterCalm — AI Voice Cloning Animal Soother
 # Python 3.10+
-# === Core ===
 gradio>=5.0
 numpy>=1.24
-soundfile>=0.12
-torch>=2.0
-# === Voice Cloning ===
-omnivoice>=0.1.0
-# === LLM Inference (Dolphin-X1-8B) ===
-llama-cpp-python>=0.3.0
-# === Fallback TTS ===
-kokoro-onnx>=0.2.0
-# === Audio Processing ===
-librosa>=0.10
-scipy>=1.10
-# === Utilities ===
 huggingface_hub>=0.20

 # CritterCalm — AI Voice Cloning Animal Soother
 # Python 3.10+
+#
+# Inference is via the Hugging Face Inference API. No local GGUF,
+# no llama-cpp-python compile step. Cooldown is enforced in
+# `shared/inference_client.py` to protect your credit budget.
+#
+# Space env vars (Settings → Variables and secrets):
+#   HF_TOKEN                     — your HF token (anonymous works for many models)
+#   INFERENCE_MODEL              — default model id
+#   CRITTERCALM_MODEL            — override the model for this project
+#   CRITTERCALM_COOLDOWN_SECONDS — gap between inference calls (default 12)
+#   INFERENCE_PROVIDER           — "hf-inference" (default) or paid provider
+#   INFERENCE_MAX_TOKENS         — per-call cap (default 220)
 gradio>=5.0
 numpy>=1.24
 huggingface_hub>=0.20
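
The environment variables listed in the requirements.txt header above suggest a layered lookup. The following sketch shows one way such a lookup could work; it is an assumption-laden illustration, not the shared client's actual behavior. `resolve_model` and `clamp_tokens` are hypothetical helpers, the fallback model id is the default named in llms.txt (the project READMEs name `Qwen/Qwen2.5-7B-Instruct` instead), and whether the real client clamps a request to `INFERENCE_MAX_TOKENS` is not visible in this diff.

```python
import os


def resolve_model(project, env=None, default="Qwen/Qwen2.5-1.5B-Instruct"):
    """Pick a model id: <PROJECT>_MODEL, then INFERENCE_MODEL, then the default."""
    env = os.environ if env is None else env
    return env.get(f"{project.upper()}_MODEL") or env.get("INFERENCE_MODEL") or default


def clamp_tokens(requested, env=None, default_cap=220):
    """Limit a requested token budget to the per-call cap (assumed behavior)."""
    env = os.environ if env is None else env
    cap = int(env.get("INFERENCE_MAX_TOKENS", default_cap))
    return max(1, min(int(requested), cap))
```

Under these assumptions, a 5-minute CritterCalm session that requests `5 * 200 = 1000` new tokens would be reduced to 220 with the default cap, so longer scripts would need `INFERENCE_MAX_TOKENS` raised in the Space settings.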

projects/focusfriend/README.md CHANGED

@@ -1,6 +1,5 @@
 ---
 title: ᐴ FocusFriend ᔔ
-emoji: ☼
 colorFrom: indigo
 colorTo: yellow
 sdk: gradio
@@ -20,6 +19,8 @@ tags:
   - tiny-titan
   - anishinaabe
   - solarpunk
 ---
 # ◈──◆──◇ ᐴ FOCUSFRIEND ᔔ PIP, YOUR CEDAR-AND-SUN COMPANION ON THE LAKE ◇──◆──◈
@@ -44,8 +45,8 @@ therapy and wants you to actually feel better, not just hear platitudes."
 ## ☼ NITAM-AABAJICHIGANAN / PREREQUISITES ◈
 - Python 3.10+
-- ~7.7GB disk for GGUF model
-- ~12GB RAM (CPU inference) or Metal/CUDA for GPU
 ## ☼ AABAJITOOWINAN / INSTALLATION ◈
@@ -54,10 +55,14 @@ git clone https://github.com/nbiish/focusfriend.git
 cd focusfriend
 pip install -r requirements.txt
-# Download Gemma 4 12B GGUF model
-huggingface-cli download unsloth/gemma-4-12b-it-GGUF \
-  --include "gemma-4-12b-it-Q4_K_M.gguf" \
-  --local-dir ./models
 python app.py
 ```
@@ -66,9 +71,11 @@ Then open <http://localhost:7862/>.
 ## ☼ ZHOONIYAAWICHIGEWIN / MODEL ◈
-| Model | Size | Purpose | License |
-|-------|------|---------|---------|
-| Gemma 4 12B (Q4_K_M) | 12B params, ~7.7GB | Conversational AI + wellness guidance | Apache 2.0 (Gemma) |
 ## ☼ MCP KINOOMAAGEWINAN / MCP TOOLS ◈
@@ -83,9 +90,10 @@ Runs with `mcp_server=True` — Streamable HTTP MCP server at `/gradio/gradio_ap
 ## ☼ GIIZHIITAA / BADGES ◈
 - 🎨  **Off-Brand** — Anishinaabe-Solarpunk CSS theme with sun-amber gradients
-- 🔌  **Off the Grid** — Fully local, no API calls
 - 📓  **Field Notes** — Blog post about AI wellness companions
-- 🦙  **Tiny Titan** — Model option ≤4B available
 ## ☼ GANAWAABANDAAN / MEET PIP ◈

 ---
 title: ᐴ FocusFriend ᔔ
 colorFrom: indigo
 colorTo: yellow
 sdk: gradio
   - tiny-titan
   - anishinaabe
   - solarpunk
+  - inference-api
+  - cooldowns
 ---
 # ◈──◆──◇ ᐴ FOCUSFRIEND ᔔ PIP, YOUR CEDAR-AND-SUN COMPANION ON THE LAKE ◇──◆──◈
 ## ☼ NITAM-AABAJICHIGANAN / PREREQUISITES ◈
 - Python 3.10+
+- A Hugging Face token (anonymous works for many small models)
+- ~100MB disk, ~256MB RAM — inference is serverless
 ## ☼ AABAJITOOWINAN / INSTALLATION ◈
 cd focusfriend
 pip install -r requirements.txt
+# Optional: pick a model (default: Qwen/Qwen2.5-7B-Instruct)
+export INFERENCE_MODEL="Qwen/Qwen2.5-7B-Instruct"
+# Optional: set the HF token
+export HF_TOKEN="hf_..."
+# Optional: tune the cooldown
+export FOCUSFRIEND_COOLDOWN_SECONDS=10
 python app.py
 ```
 ## ☼ ZHOONIYAAWICHIGEWIN / MODEL ◈
+| Model (default) | Size | Purpose | License |
+|---|---|---|---|
+| Qwen2.5-7B-Instruct | 7B | Wellness companion chat | Apache 2.0 |
+| Meta-Llama-3-8B-Instruct | 8B | Alternative | Llama 3 Community |
+| gemma-2-9b-it | 9B | Alternative | Gemma License |
 ## ☼ MCP KINOOMAAGEWINAN / MCP TOOLS ◈
 ## ☼ GIIZHIITAA / BADGES ◈
 - 🎨  **Off-Brand** — Anishinaabe-Solarpunk CSS theme with sun-amber gradients
 - 📓  **Field Notes** — Blog post about AI wellness companions
+- 🦙  **Tiny Titan** — Default model is 7B; can switch to 1.5B Qwen for true Tiny Titan
+- 🌀  **Cooldowns** — Serverless inference with built-in credit protection
+- ☁  **HF Inference API** — Uses Hugging Face serverless backend (no local GGUF build)
 ## ☼ GANAWAABANDAAN / MEET PIP ◈

projects/focusfriend/inference/llm.py CHANGED

@@ -1,103 +1,72 @@
 """
-LLM inference wrapper for FocusFriend using llama.cpp + Gemma 4 12B.
-Handles lazy loading, streaming, and fallback behavior.
 """
 import os
 import threading
-import logging
 from pathlib import Path
-from typing import Optional, Generator, List, Dict
 log = logging.getLogger("focusfriend.inference")
-# Singleton
-_llm = None
-_llm_lock = threading.Lock()
-# Default model path
-DEFAULT_MODEL_DIR = Path(os.environ.get("FOCUSFRIEND_MODEL_DIR", Path(__file__).parent.parent / "models"))
-DEFAULT_MODEL_PATH = os.environ.get(
-    "GEMMA_MODEL_PATH",
-    str(DEFAULT_MODEL_DIR / "gemma-4-12b-it-Q4_K_M.gguf"),
 )
-DEFAULT_N_CTX = int(os.environ.get("GEMMA_N_CTX", "8192"))
-DEFAULT_N_THREADS = int(os.environ.get("GEMMA_N_THREADS", str(os.cpu_count() or 4)))
-def load_model(
-    model_path: str = None,
-    n_ctx: int = None,
-    n_threads: int = None,
-) -> Optional[object]:
-    """
-    Load the Gemma 4 12B GGUF model via llama.cpp.
-    Args:
-        model_path: Path to GGUF file. Uses env var / default if not provided.
-        n_ctx: Context window size. Default 8192.
-        n_threads: CPU threads. Default all cores.
-    Returns:
-        Llama instance or None if loading fails.
-    """
-    global _llm
-    if _llm is not None:
-        return _llm
-    with _llm_lock:
-        if _llm is not None:
-            return _llm
-        model_path = model_path or DEFAULT_MODEL_PATH
-        n_ctx = n_ctx or DEFAULT_N_CTX
-        n_threads = n_threads or DEFAULT_N_THREADS
-        gguf_path = Path(model_path)
-        if not gguf_path.exists():
-            log.warning(
-                f"Model not found at {gguf_path}. "
-                f"Download: huggingface-cli download unsloth/gemma-4-12b-it-GGUF "
-                f"--include 'gemma-4-12b-it-Q4_K_M.gguf' --local-dir {DEFAULT_MODEL_DIR}"
-            )
-            return None
-        try:
-            from llama_cpp import Llama
-            log.info(f"Loading Gemma 4 12B from {gguf_path}")
-            log.info(f"  n_ctx={n_ctx}, n_threads={n_threads}")
-            _llm = Llama(
-                model_path=str(gguf_path),
-                n_ctx=n_ctx,
-                n_threads=n_threads,
-                verbose=False,
-            )
-            log.info("Gemma 4 12B loaded successfully ✓")
-            return _llm
-        except ImportError:
-            log.warning("llama-cpp-python not installed. pip install llama-cpp-python")
-            return None
-        except Exception as exc:
-            log.error(f"Failed to load Gemma 4 12B: {exc}")
-            return None
-def get_model() -> Optional[object]:
-    """Get the current LLM instance (lazy-loads if needed)."""
-    global _llm
-    if _llm is not None:
-        return _llm
-    return load_model()
-def is_model_available() -> bool:
-    """Check if the LLM is loaded and ready."""
-    return _llm is not None
 def generate_response(
@@ -105,30 +74,24 @@ def generate_response(
     temperature: float = 0.8,
     max_tokens: int = 300,
 ) -> Optional[str]:
-    """
-    Generate a non-streaming response from the model.
-    Args:
-        messages: List of {'role': ..., 'content': ...} dicts
-        temperature: Generation temperature
-        max_tokens: Max output tokens
-    Returns:
-        Generated text or None on failure
     """
-    model = get_model()
-    if model is None:
         return None
     try:
-        response = model.create_chat_completion(
             messages=messages,
             temperature=temperature,
-            max_tokens=max_tokens,
         )
-        return response["choices"][0]["message"]["content"]
     except Exception as exc:
-        log.error(f"Generation error: {exc}")
         return None
@@ -137,42 +100,40 @@ def generate_stream(
     temperature: float = 0.8,
     max_tokens: int = 300,
 ) -> Generator[str, None, None]:
-    """
-    Generate a streaming response from the model.
-    Args:
-        messages: List of {'role': ..., 'content': ...} dicts
-        temperature: Generation temperature
-        max_tokens: Max output tokens
-    Yields:
-        Text chunks as they arrive
     """
-    model = get_model()
-    if model is None:
-        yield "⚠️  Model not loaded. I'm running on fallback mode right now."
         return
     try:
-        stream = model.create_chat_completion(
             messages=messages,
             temperature=temperature,
-            max_tokens=max_tokens,
-            stream=True,
         )
-        for chunk in stream:
-            delta = chunk["choices"][0].get("delta", {})
-            content = delta.get("content", "")
-            if content:
-                yield content
     except Exception as exc:
-        log.error(f"Streaming error: {exc}")
         yield f"\n\n⚠️  Something went wrong: {exc}"
 def unload_model():
-    """Release the model from memory."""
-    global _llm
-    _llm = None

 """
+LLM inference wrapper for FocusFriend using the Hugging Face Inference API.
+The previous version loaded a local GGUF (Gemma 4 12B Q4_K_M) via llama-cpp-python.
+That required a heavy compile step on HF Spaces and tied us to a single model. This
+version uses `huggingface_hub.InferenceClient` (serverless) and enforces a
+project-scoped cooldown via `shared.inference_client` to protect your credit budget.
+To override the model: set `INFERENCE_MODEL` env var.
+Common picks:
+- "Qwen/Qwen2.5-7B-Instruct" (default; sweet spot for chat)
+- "meta-llama/Meta-Llama-3-8B-Instruct"
+- "google/gemma-2-9b-it"
 """
+from __future__ import annotations
+import logging
 import os
+import sys
 import threading
 from pathlib import Path
+from typing import Generator, List, Dict, Optional
 log = logging.getLogger("focusfriend.inference")
+# Add monorepo root so we can import shared.inference_client
+_THIS = Path(__file__).resolve()
+_PROJECT = _THIS.parent.parent
+_REPO_ROOT = _PROJECT.parent.parent  # projects/focusfriend -> projects -> repo root (where shared/ lives)
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+from shared.inference_client import (  # noqa: E402
+    InferenceResult,
+    chat_messages,
+    cooldown_status,
+    cooldown_active,
+    generate as _client_generate,
+    INFERENCE_MODEL as DEFAULT_MODEL,
 )
+def _model() -> str:
+    """Pick the FocusFriend-specific model, falling back to the default."""
+    return os.environ.get("FOCUSFRIEND_MODEL", DEFAULT_MODEL)
+def is_model_available() -> bool:
+    """True if a model id is configured and the project is not in cooldown."""
+    if cooldown_active("focusfriend"):
+        return False
+    # No token check: many small models work anonymously, so don't gate hard.
+    return bool(_model())
+def get_model() -> Optional[str]:
+    """Return the model id we plan to use, or None if no model is configured."""
+    return _model() or None
+def cooldown_snapshot() -> dict:
+    """Public status snapshot for the UI."""
+    return {
+        "model": _model(),
+        "cooldown": cooldown_status("focusfriend"),
+    }
 def generate_response(
     temperature: float = 0.8,
     max_tokens: int = 300,
 ) -> Optional[str]:
+    """One-shot generation. Returns text or None on cooldown/failure.
+    `messages` follows OpenAI chat format. Caller is responsible for system prompt
+    and prior turns.
     """
+    if cooldown_active("focusfriend"):
+        log.info("focusfriend inference skipped (cooldown active)")
         return None
     try:
+        result = _client_generate(
+            project="focusfriend",
             messages=messages,
+            max_new_tokens=max_tokens,
             temperature=temperature,
         )
+        return result.text
     except Exception as exc:
+        log.warning(f"HF Inference error: {exc}")
         return None
     temperature: float = 0.8,
     max_tokens: int = 300,
 ) -> Generator[str, None, None]:
+    """Simulated streaming generator. Yields the full response word by word.
+    This wrapper does not request a token-level stream: `shared.inference_client.generate`
+    returns the complete text, so we split it on spaces and yield one word at a time
+    to give the UI the appearance of streaming. Errors are yielded as a readable message.
     """
+    if cooldown_active("focusfriend"):
+        yield "\n\n⏳ Pip is resting. (Inference cooldown — try again in a moment.)"
         return
     try:
+        result = _client_generate(
+            project="focusfriend",
             messages=messages,
+            max_new_tokens=max_tokens,
             temperature=temperature,
         )
+        # Simulate streaming by chunking the response on word boundaries
+        text = result.text
+        if not text:
+            yield "\n\n[No response]"
+            return
+        # Yield in word-sized chunks for natural reading pace
+        words = text.split(" ")
+        for i, word in enumerate(words):
+            chunk = word if i == 0 else " " + word
+            yield chunk
     except Exception as exc:
         yield f"\n\n⚠️  Something went wrong: {exc}"
 def unload_model():
+    """No-op for serverless inference (kept for API compat)."""
+    return
+# Re-export for callers that still expect this
+load_model = lambda *args, **kwargs: get_model()  # noqa: E731
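
The simulated streaming in `generate_stream` above can be checked in isolation. The following is a minimal sketch, assuming a standalone helper named `stream_words` (a hypothetical name, not part of the commit) that mirrors the word-chunking loop. It shows that joining the chunks reproduces the original text, including repeated spaces.

```python
from typing import Iterator


def stream_words(text: str) -> Iterator[str]:
    """Yield `text` one word at a time, keeping the separating spaces.

    Hypothetical helper mirroring the loop in generate_stream; it does not
    call any inference API.
    """
    if not text:
        return
    for i, word in enumerate(text.split(" ")):
        # Every chunk after the first carries its leading space, so that
        # concatenating the chunks reproduces the input exactly.
        yield word if i == 0 else " " + word


chunks = list(stream_words("Pip says: take a short break."))
assert "".join(chunks) == "Pip says: take a short break."
```

Splitting on a single space (rather than calling `split()` with no argument) is what keeps the round trip exact; the trade-off is that newlines stay attached to the neighbouring word.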

projects/focusfriend/requirements.txt CHANGED Viewed

@@ -1,12 +1,18 @@
-# FocusFriend — ASCII Wellness Companion
 # Python 3.10+
-# === Core ===
 gradio>=5.0
 numpy>=1.24
-# === LLM Inference (Gemma 4 12B) ===
-llama-cpp-python>=0.3.0
-# === Utilities ===
 huggingface_hub>=0.20

+# FocusFriend — ASCII Wellness Companion "Pip"
 # Python 3.10+
+#
+# Inference is via the Hugging Face Inference API. No local GGUF, no
+# llama-cpp-python compile step. Cooldown is enforced in
+# `shared/inference_client.py` to protect your credit budget.
+#
+# Space env vars (Settings → Variables and secrets):
+#   HF_TOKEN                   — your HF token (anonymous works for many models)
+#   INFERENCE_MODEL            — default model id
+#   FOCUSFRIEND_MODEL          — override the model for this project
+#   FOCUSFRIEND_COOLDOWN_SECONDS — gap between inference calls (default 10)
+#   INFERENCE_PROVIDER         — "hf-inference" (default) or paid provider
+#   INFERENCE_MAX_TOKENS       — per-call cap (default 220)
 gradio>=5.0
 numpy>=1.24
 huggingface_hub>=0.20
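
The environment variables listed in the header comment could be read in one place, roughly as sketched below. This is an illustration only: `InferenceSettings` and `load_settings` are hypothetical names, the real parsing lives in `shared/inference_client.py` (not shown in this section), and the defaults are taken from the comment above. The model default is left as `None` here because the actual default id is defined in the shared module.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical container for the Space env vars documented above."""
    model: Optional[str]
    provider: str
    cooldown_seconds: float
    max_tokens: int
    has_token: bool


def load_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read FocusFriend inference settings from an environment mapping."""
    return InferenceSettings(
        # The project-specific override wins over the shared model id.
        model=env.get("FOCUSFRIEND_MODEL") or env.get("INFERENCE_MODEL"),
        provider=env.get("INFERENCE_PROVIDER", "hf-inference"),
        cooldown_seconds=float(env.get("FOCUSFRIEND_COOLDOWN_SECONDS", "10")),
        max_tokens=int(env.get("INFERENCE_MAX_TOKENS", "220")),
        # A token is optional; anonymous access works for many models.
        has_token=bool(env.get("HF_TOKEN") or env.get("HUGGINGFACEHUB_API_TOKEN")),
    )


settings = load_settings({"FOCUSFRIEND_COOLDOWN_SECONDS": "2.5"})
assert settings.cooldown_seconds == 2.5
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.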

projects/tinybard/README.md CHANGED Viewed

@@ -1,6 +1,6 @@
 ---
 title: ᐴ TinyBard ᔔ
-emoji: ☼
 colorFrom: blue
 colorTo: yellow
 sdk: gradio
@@ -11,25 +11,24 @@ license: apache-2.0
 tags:
   - text-adventure
   - interactive-fiction
-  - llama-cpp
   - thousand-token-wood
   - build-small-hackathon
   - tiny-titan
-  - llama-champion
   - off-brand
-  - off-the-grid
   - mcp-server
   - anishinaabe
   - solarpunk
 ---
 # ◈──◆──◇ ᐴ TINYBARD ᔔ AADIZOOKAAN-AKINOOMAAGEWIN / STORY-TELLING ENGINE ◇──◆──◈
-> **A ≤4B LLM fires five-minute interactive text adventures in a cedar-and-copper CRT terminal.**
 >
 > ᐴ The land remembers the stories. ᔔ  ☼ ☘ ≈
-TinyBard uses FastAPI + `mount_gradio_app` (Gradio 6.0) with a fully custom HTML/CSS/JS frontend and **MCP server mode** enabled. Every adventure is procedurally generated — rooms, NPCs, items, and branching narratives on the fly.
 ## ◆ GASHKITOONAN / CAPABILITIES ◈
@@ -37,15 +36,16 @@ TinyBard uses FastAPI + `mount_gradio_app` (Gradio 6.0) with a fully custom HTML
 - **◇ Three Aadizookaanan / Genres ◇** — Aadizookaan (Fantasy), Ish piming (Sci-Fi), Mashkodewaazibi (Cyberpunk)
 - **◇ Misko-Aki / CRT Terminal ◇** — Cedar-copper cabinet, sun-amber phosphor, frost-on-glass scanlines
 - **◇ MCP Kinoomaagewinan / Tools ◇** — `start_game` and `make_choice` exposed as MCP tools
-- **◇ Zhooniyaa / 100% Local ◇** — No cloud APIs. Runs on llama.cpp with GGUF quantized models
-- **◇ Bmaad-ziibi / Procedural Fallback ◇** — Full engine works without the LLM model loaded
-- **◇ Anishinaabe-Solarpunk ◇** — Sky-to-sunrise palette, Anishinaabe syllabic framings, biophilic motifs
 ## ☼ NITAM-AABAJICHIGANAN / PREREQUISITES ◈
 - Python 3.10+
-- ~1GB disk for GGUF model
-- ~2GB RAM (CPU inference) or Metal/CUDA for GPU
 ## ◇ AABAJITOOWINAN / INSTALLATION ◈
@@ -54,12 +54,16 @@ git clone https://github.com/nbiish/tinybard.git
 cd tinybard
 pip install -r requirements.txt
-# Download model (Q8_0 quant, ~1.6GB)
-huggingface-cli download mradermacher/VibeThinker-1.5B-GGUF \
-  --include "VibeThinker-1.5B.Q8_0.gguf" \
-  --local-dir ./models
-export TINYBARD_MODEL_PATH=./models/VibeThinker-1.5B.Q8_0.gguf
 python app.py
 ```
@@ -82,42 +86,58 @@ hum with a faint violet energy...
 ## ☼ NAANAAGADAWENINDIZOWIN / VERIFICATION ◈
 ```bash
-curl -X POST http://localhost:7860/gradio/gradio_api/call/start_game \
   -H "Content-Type: application/json" \
-  -d '{"data":["fantasy"]}'
 ```
-Returns SSE event stream with `story`, `choices`, `health`, `step`, `game_over`, `history_json`.
 ## ◈ MODEL ◇
-| Model | Size | Purpose | License |
-|-------|------|---------|---------|
-| VibeThinker 1.5B (Q8_0) | 1.5B params, ~1.6GB | Interactive story generation | Apache 2.0 |
-Fits the **Tiny Titan** badge (≤4B params). Runs on any laptop.
 ## ◇ MCP KINOOMAAGEWINAN / TOOLS ◈
-TinyBard runs with `mcp_server=True`, exposing these tools:
-- **`start_game(genre: str)`** — Start a new adventure. Genre: `fantasy` / `scifi` / `cyberpunk`
-- **`make_choice(choice, genre, step, health, history_json)`** — Submit a player choice to advance the story
 Connect from any MCP client (Claude Desktop, Cursor, etc.) to the SSE endpoint at `/gradio/gradio_api/mcp/`.
-## ◈ GIIZHIITAA / BADGE TARGETS ◇
-- **◆ Llama Champion** — Uses llama.cpp runtime
-- **◆ Tiny Titan** — Model is 1.5B (well under 4B limit)
 - **◆ Off-Brand** — Fully custom FastAPI+Gradio frontend
-- **◆ Off the Grid** — Fully local, no API calls
 - **◆ Field Notes** — Blog post about tiny model interactive fiction
 ## ☼ GANAWENDAAGWAD / SECURITY ◈
-PQC standard for any future API keys via the `pqc-secrets` skill (ML-KEM-768 + AES-256-GCM). At present, the model is loaded from local GGUF — no key material in flight.
 ---
-◈──◆──◇ ☼ TinyBard v1.0 · Cedar Edition · Anishinaabe Solarpunk ◇──◆──◈

 ---
 title: ᐴ TinyBard ᔔ
+emoji: ☀️
 colorFrom: blue
 colorTo: yellow
 sdk: gradio
 tags:
   - text-adventure
   - interactive-fiction
   - thousand-token-wood
   - build-small-hackathon
   - tiny-titan
   - off-brand
   - mcp-server
   - anishinaabe
   - solarpunk
+  - inference-api
+  - cooldowns
 ---
 # ◈──◆──◇ ᐴ TINYBARD ᔔ AADIZOOKAAN-AKINOOMAAGEWIN / STORY-TELLING ENGINE ◇──◆──◈
+> **A small LLM fires five-minute interactive text adventures in a cedar-and-copper CRT terminal.**
 >
 > ᐴ The land remembers the stories. ᔔ  ☼ ☘ ≈
+TinyBard uses FastAPI + `mount_gradio_app` (Gradio 6.0) with a fully custom HTML/CSS/JS frontend, **MCP server mode** enabled, and an **HF Inference API** backend. Every adventure is procedurally generated — rooms, NPCs, items, and branching narratives on the fly.
 ## ◆ GASHKITOONAN / CAPABILITIES ◈
 - **◇ Three Aadizookaanan / Genres ◇** — Aadizookaan (Fantasy), Ish piming (Sci-Fi), Mashkodewaazibi (Cyberpunk)
 - **◇ Misko-Aki / CRT Terminal ◇** — Cedar-copper cabinet, sun-amber phosphor, frost-on-glass scanlines
 - **◇ MCP Kinoomaagewinan / Tools ◇** — `start_game` and `make_choice` exposed as MCP tools
+- **◇ Giiwenaabik / Inference API ◇** — Serverless HF Inference API; no local GGUF, no build step
+- **◇ Asabiikesiwin / Cooldown ◇** — 6s default between inference calls to protect your credit budget
+- **◇ Bmaad-ziibi / Procedural Fallback ◇** — Full engine works without the LLM
+- **◇ Anishinaabe-Solarpunk ◇** — Sky-to-sunrise palette, syllabic framings, biophilic motifs
 ## ☼ NITAM-AABAJICHIGANAN / PREREQUISITES ◈
 - Python 3.10+
+- A Hugging Face token (for the Inference API; many small models work anonymously)
+- ~100MB disk, ~256MB RAM — the model is serverless, not local
 ## ◇ AABAJITOOWINAN / INSTALLATION ◈
 cd tinybard
 pip install -r requirements.txt
+# Optional: pick a model (default: Qwen/Qwen2.5-1.5B-Instruct — small + fast + free)
+export INFERENCE_MODEL="Qwen/Qwen2.5-1.5B-Instruct"
+# Or for the originally-intended VibeThinker 1.5B:
+# export INFERENCE_MODEL="mradermacher/VibeThinker-1.5B-GGUF"
+# Optional: set the HF token (anonymous works for many models)
+export HF_TOKEN="hf_..."
+# Optional: tune the cooldown
+export TINYBARD_COOLDOWN_SECONDS=6
 python app.py
 ```
 ## ☼ NAANAAGADAWENINDIZOWIN / VERIFICATION ◈
 ```bash
+curl -X POST http://localhost:7860/api/game/start \
   -H "Content-Type: application/json" \
+  -d '{"genre": "fantasy"}'
+```
+Returns a JSON object with the keys `story`, `choices`, `health`, `step`, `game_over`, and `history`.
+```bash
+curl http://localhost:7860/api/model_status
 ```
+Returns: `{"model": "...", "cooldown": {"active": bool, "remaining_seconds": float, "window_seconds": float}}`.
 ## ◈ MODEL ◇
+| Model (default) | Size | Purpose | License |
+|---|---|---|---|
+| Qwen2.5-1.5B-Instruct | 1.5B | Interactive story generation | Apache 2.0 |
+| VibeThinker 1.5B | 1.5B | Alternative — also tiny | Apache 2.0 |
+Override `INFERENCE_MODEL` to any model that supports `chat_completion` on the HF Inference API. The 1.5B defaults fit the **Tiny Titan** badge.
 ## ◇ MCP KINOOMAAGEWINAN / TOOLS ◈
+TinyBard runs with `mcp_server=True`, exposing `start_game` and `make_choice` as MCP tools. The same game logic is also available as FastAPI endpoints:
+- **`/api/game/start`** (POST `{"genre": "fantasy|scifi|cyberpunk"}`) — Start an adventure
+- **`/api/game/choice`** (POST `{choice, genre, step, health, history}`) — Submit a player choice
+- **`/api/model_status`** (GET) — Check the inference model + cooldown state
 Connect from any MCP client (Claude Desktop, Cursor, etc.) to the SSE endpoint at `/gradio/gradio_api/mcp/`.
+## ◇ GIIZHIITAA / BADGE TARGETS ◇
+- **◆ Tiny Titan** — Model ≤ 1.5B (well under 4B limit)
 - **◆ Off-Brand** — Fully custom FastAPI+Gradio frontend
 - **◆ Field Notes** — Blog post about tiny model interactive fiction
 ## ☼ GANAWENDAAGWAD / SECURITY ◈
+PQC standard for any future API keys via the `pqc-secrets` skill (ML-KEM-768 + AES-256-GCM). At present, only the HF token is in flight (read from env var, never written to disk).
+## ◇ AABAAJICHIGANAN / COOLDOWNS ◈
+The `shared/inference_client.py` module enforces per-project cooldowns. Cooldown protects your HF/Modal credit budget from runaway re-rolls. Defaults:
+- `tinybard`: 6s
+- `focusfriend`: 10s
+- `crittercalm`: 12s
+Override per project via Space env vars (`TINYBARD_COOLDOWN_SECONDS`, etc.).
 ---
+◈──◆──◇ ☼ TinyBard v1.1 · Cedar Edition · Anishinaabe Solarpunk · Inference API ◇──◆──◈
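
The per-project cooldown described in the README could work roughly as follows. This is a sketch under stated assumptions, not the contents of `shared/inference_client.py` (that file is not shown in this section): `CooldownGate` and its methods are hypothetical names, the default windows are the ones listed in the README, and the status dict uses the keys shown in the `/api/model_status` example.

```python
import threading
import time
from typing import Callable, Dict, Optional

# Defaults as listed in the README; everything else here is an assumption.
DEFAULT_WINDOWS = {"tinybard": 6.0, "focusfriend": 10.0, "crittercalm": 12.0}


class CooldownGate:
    """Hypothetical per-project gate: at most one call per window."""

    def __init__(
        self,
        windows: Optional[Dict[str, float]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._windows = dict(windows or DEFAULT_WINDOWS)
        self._clock = clock
        self._last_call: Dict[str, float] = {}
        self._lock = threading.Lock()

    def remaining(self, project: str) -> float:
        """Seconds left before `project` may call again (0.0 if ready)."""
        with self._lock:
            last = self._last_call.get(project)
            if last is None:
                return 0.0
            return max(0.0, self._windows[project] - (self._clock() - last))

    def try_acquire(self, project: str) -> bool:
        """Record a call and return True, or return False while cooling down."""
        with self._lock:
            now = self._clock()
            last = self._last_call.get(project)
            if last is not None and now - last < self._windows[project]:
                return False
            self._last_call[project] = now
            return True

    def status(self, project: str) -> dict:
        """Snapshot shaped like the `cooldown` object in /api/model_status."""
        remaining = self.remaining(project)
        return {
            "active": remaining > 0,
            "remaining_seconds": remaining,
            "window_seconds": self._windows[project],
        }


gate = CooldownGate()
assert gate.try_acquire("tinybard") is True
assert gate.try_acquire("tinybard") is False  # second call inside the 6s window
```

The clock is injectable so the gate can be tested without sleeping, and `time.monotonic` is used by default so wall-clock adjustments cannot shorten or extend a window. An unknown project name raises `KeyError` in this sketch.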

projects/tinybard/app.py CHANGED Viewed

@@ -17,6 +17,7 @@ import os
 import json
 import random
 import logging
 from pathlib import Path
 from typing import Optional, Dict, List
@@ -26,6 +27,19 @@ from fastapi.responses import HTMLResponse
 from fastapi.staticfiles import StaticFiles
 from gradio import mount_gradio_app
 logging.basicConfig(
     level=logging.INFO,
     format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
@@ -38,46 +52,34 @@ log = logging.getLogger("tinybard")
 BASE_DIR = Path(__file__).parent
 STATIC_DIR = BASE_DIR / "static"
-MODEL_PATH = os.environ.get(
-    "TINYBARD_MODEL_PATH",
-    str(Path("/Volumes/1tb-sandisk/ml-models/huggingface/models--mradermacher--VibeThinker-1.5B-GGUF/snapshots/d0d66139a78030a92a582f966b0f7cbbb3b19406/VibeThinker-1.5B.Q8_0.gguf"))
-)
 # ---------------------------------------------------------------------------
 # Llama.cpp Inference Setup
 # ---------------------------------------------------------------------------
-_llm = None
-_llm_failed = False
-def get_llm():
-    """Lazy-load the GGUF model via llama-cpp-python."""
-    global _llm, _llm_failed
-    if _llm is not None:
-        return _llm
-    if _llm_failed:
-        return None
-    if not Path(MODEL_PATH).exists():
-        log.warning(f"Model file not found at {MODEL_PATH}. Fallback mode active.")
-        _llm_failed = True
-        return None
-    try:
-        from llama_cpp import Llama
-        log.info(f"Loading VibeThinker-1.5B from {MODEL_PATH} ...")
-        _llm = Llama(
-            model_path=MODEL_PATH,
-            n_ctx=2048,
-            n_threads=int(os.environ.get("TINYBARD_THREADS", "4")),
-            verbose=False,
-        )
-        log.info("Model loaded successfully ✓")
-        return _llm
-    except Exception as e:
-        log.error(f"Failed to load LLM model: {e}")
-        _llm_failed = True
-        return None
 # ---------------------------------------------------------------------------
@@ -189,44 +191,72 @@ def generate_procedural_step(genre: str, step: int, health: int, choice: str = "
 # ---------------------------------------------------------------------------
-# LLM Generation Logic
 # ---------------------------------------------------------------------------
-def generate_llm_story(prompt: str, max_tokens: int = 150) -> str:
-    """Generate story text via llama.cpp."""
-    llm = get_llm()
-    if not llm:
         return ""
     try:
-        response = llm(
-            prompt,
-            max_tokens=max_tokens,
             temperature=0.7,
-            stop=["\n\n", "User:", "Narrator:"],
         )
-        return response["choices"][0]["text"].strip()
     except Exception as e:
-        log.error(f"LLM generation error: {e}")
         return ""
-def format_prompt(genre: str, history: List[Dict[str, str]], next_instruction: str) -> str:
-    """Build the narrative prompt for the LLM."""
-    prompt = (
-        "You are the narrator of an interactive text adventure game.\n"
-        f"Genre: {genre}\n"
-        "Rules:\n"
-        "1. Write in the second person ('You...').\n"
-        "2. Keep descriptions highly atmospheric, but short (under 3 sentences).\n"
-        "3. Focus on action, mystery, and choice.\n\n"
     )
-    for h in history:
-        if h["role"] == "player":
-            prompt += f"Player choice: {h['text']}\n"
-        else:
-            prompt += f"Narrator: {h['text']}\n"
-    prompt += f"{next_instruction}\n"
-    return prompt
 # ---------------------------------------------------------------------------
@@ -257,18 +287,11 @@ def create_gradio_app() -> gr.Blocks:
             if genre not in ["fantasy", "scifi", "cyberpunk"]:
                 genre = "fantasy"
-            llm = get_llm()
-            if not llm:
-                result = generate_procedural_step(genre, 0, 100)
-                return (
-                    result["story"], result["choices"], result["health"],
-                    result["step"], result["game_over"],
-                    json.dumps(result.get("history", []))
-                )
             instruction = "Narrate the beginning of the adventure. What happens first? Do not offer choices yet."
-            story = generate_llm_story(format_prompt(genre, [], instruction))
             if not story:
                 result = generate_procedural_step(genre, 0, 100)
                 return (
                     result["story"], result["choices"], result["health"],
@@ -277,15 +300,11 @@ def create_gradio_app() -> gr.Blocks:
                 )
             history = [{"role": "narrator", "text": story}]
-            choices_instruction = (
-                "Provide exactly 3 short, distinct choices for the player. "
-                "Format: 1. [choice 1] | 2. [choice 2] | 3. [choice 3]"
-            )
-            choices_text = generate_llm_story(format_prompt(genre, history, choices_instruction), max_tokens=60)
-            choices = _parse_choices(choices_text)
             if len(choices) < 2:
-                choices = ["Explore the area", "Check your equipment", "Proceed carefully"]
             return (story, choices[:3], 100, 1, False, json.dumps(history))
@@ -296,18 +315,10 @@ def create_gradio_app() -> gr.Blocks:
             except Exception:
                 history = []
-            llm = get_llm()
             step = int(step)
             health = int(health)
-            if not llm:
-                result = generate_procedural_step(genre, step, health, choice)
-                return (
-                    result["story"], result["choices"], result["health"],
-                    result["step"], result["game_over"],
-                    json.dumps(result.get("history", history))
-                )
             history.append({"role": "player", "text": choice})
             health_delta = random.choice([-15, 0, 10])
@@ -315,7 +326,7 @@ def create_gradio_app() -> gr.Blocks:
             if new_health <= 0:
                 instruction = "The player has run out of health. Narrate a quick, dramatic end. Game Over."
-                story = generate_llm_story(format_prompt(genre, history, instruction))
                 return (
                     story or "Your strength fails. The adventure ends in darkness.",
                     [], 0, step + 1, True, json.dumps(history)
@@ -323,14 +334,14 @@ def create_gradio_app() -> gr.Blocks:
             if step >= 4:
                 instruction = "Narrate the final glorious victory. The adventure ends in success."
-                story = generate_llm_story(format_prompt(genre, history, instruction))
                 return (
                     story or "You have achieved your goal! You are victorious!",
                     [], new_health, step + 1, True, json.dumps(history)
                 )
             instruction = "Narrate what happens next as a result of the player's choice."
-            story = generate_llm_story(format_prompt(genre, history, instruction))
             if not story:
                 result = generate_procedural_step(genre, step, health, choice)
                 return (
@@ -341,13 +352,7 @@ def create_gradio_app() -> gr.Blocks:
             history.append({"role": "narrator", "text": story})
-            choices_instruction = (
-                "Provide exactly 3 short, distinct choices. "
-                "Format: 1. [choice 1] | 2. [choice 2] | 3. [choice 3]"
-            )
-            choices_text = generate_llm_story(format_prompt(genre, history, choices_instruction), max_tokens=60)
-            choices = _parse_choices(choices_text)
             if len(choices) < 2:
                 choices = ["Move forward", "Look around", "Rest a moment"]
@@ -400,13 +405,8 @@ async def homepage():
     return HTMLResponse("<h1>TinyBard retro terminal under construction!</h1>")
 @fastapi_app.get("/api/model_status")
 async def model_status():
-    """Check if the LLM is loaded."""
-    llm = get_llm()
-    return {
-        "available": llm is not None,
-        "model_path": MODEL_PATH,
-        "fallback": _llm_failed
-    }
 # ---------------------------------------------------------------------------
@@ -418,23 +418,20 @@ def _run_turn(choice: str, genre: str, step: int, health: int, history: List[Dic
     Returns a dict the frontend can consume directly. Used by both the
     FastAPI /api/game/* endpoints and the Gradio MCP tools.
     """
-    llm = get_llm()
     if step == 0:
         # New game
-        if not llm:
             return generate_procedural_step(genre, 0, 100)
         instruction = "Narrate the beginning of the adventure. What happens first? Do not offer choices yet."
-        story = generate_llm_story(format_prompt(genre, [], instruction))
         if not story:
             return generate_procedural_step(genre, 0, 100)
         history = [{"role": "narrator", "text": story}]
-        choices_instruction = (
-            "Provide exactly 3 short, distinct choices for the player. "
-            "Format: 1. [choice 1] | 2. [choice 2] | 3. [choice 3]"
-        )
-        choices_text = generate_llm_story(format_prompt(genre, history, choices_instruction), max_tokens=60)
-        choices = _parse_choices(choices_text)
         if len(choices) < 2:
             choices = ["Explore the area", "Check your equipment", "Proceed carefully"]
         return {
@@ -443,7 +440,7 @@ def _run_turn(choice: str, genre: str, step: int, health: int, history: List[Dic
         }
     # Subsequent turn
-    if not llm:
         return generate_procedural_step(genre, step, health, choice)
     history.append({"role": "player", "text": choice})
@@ -452,7 +449,7 @@ def _run_turn(choice: str, genre: str, step: int, health: int, history: List[Dic
     if new_health <= 0:
         instruction = "The player has run out of health. Narrate a quick, dramatic end. Game Over."
-        story = generate_llm_story(format_prompt(genre, history, instruction))
         return {
             "story": story or "Your strength fails. The adventure ends in darkness.",
             "choices": [], "health": 0, "step": step + 1, "game_over": True,
@@ -461,7 +458,7 @@ def _run_turn(choice: str, genre: str, step: int, health: int, history: List[Dic
     if step >= 4:
         instruction = "Narrate the final glorious victory. The adventure ends in success."
-        story = generate_llm_story(format_prompt(genre, history, instruction))
         return {
             "story": story or "You have achieved your goal! You are victorious!",
             "choices": [], "health": new_health, "step": step + 1, "game_over": True,
@@ -469,17 +466,12 @@ def _run_turn(choice: str, genre: str, step: int, health: int, history: List[Dic
         }
     instruction = "Narrate what happens next as a result of the player's choice."
-    story = generate_llm_story(format_prompt(genre, history, instruction))
     if not story:
         return generate_procedural_step(genre, step, health, choice)
     history.append({"role": "narrator", "text": story})
-    choices_instruction = (
-        "Provide exactly 3 short, distinct choices. "
-        "Format: 1. [choice 1] | 2. [choice 2] | 3. [choice 3]"
-    )
-    choices_text = generate_llm_story(format_prompt(genre, history, choices_instruction), max_tokens=60)
-    choices = _parse_choices(choices_text)
     if len(choices) < 2:
         choices = ["Move forward", "Look around", "Rest a moment"]
     return {

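The choice prompts in this file ask the model for `1. [choice 1] | 2. [choice 2] | 3. [choice 3]` and pass the reply to `_parse_choices`, whose body is not part of this diff. A parser for that format might look like the sketch below; `parse_choices` is a hypothetical stand-in, not the project's implementation, and the tolerance for newlines and `1)` numbering is an added assumption about how small models drift from the requested format.

```python
import re
from typing import List

# Matches a leading "1." or "1)" with optional surrounding whitespace.
_NUMBER_PREFIX = re.compile(r"^\s*\d+[.)]\s*")


def parse_choices(text: str, limit: int = 3) -> List[str]:
    """Split a '1. a | 2. b | 3. c' reply into a list of choice strings."""
    choices: List[str] = []
    # Accept either pipes or newlines as separators.
    for part in re.split(r"[|\n]", text or ""):
        cleaned = _NUMBER_PREFIX.sub("", part).strip()
        # Drop literal brackets copied from the prompt template.
        cleaned = cleaned.strip("[]").strip()
        if cleaned:
            choices.append(cleaned)
    return choices[:limit]


parsed = parse_choices("1. Open the door | 2. [Search the room] | 3. Run")
assert parsed == ["Open the door", "Search the room", "Run"]
```

Callers in the diff already fall back to fixed or procedural choices when fewer than two are returned, so the parser can stay lenient and simply return what it finds.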
 import json
 import random
 import logging
+import sys
 from pathlib import Path
 from typing import Optional, Dict, List
 from fastapi.staticfiles import StaticFiles
 from gradio import mount_gradio_app
+# Inference client with cooldown (no local GGUF, no llama-cpp-python build!)
+# Path layout: <repo root>/shared/inference_client.py. This file lives in projects/tinybard/, so the repo root is three .parent hops up.
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))
+from shared.inference_client import (
+    InferenceResult,
+    cooldown_status,
+    cooldown_remaining,
+    cooldown_active,
+    generate as inference_generate,
+    chat_messages,
+    INFERENCE_MODEL,
+)
 logging.basicConfig(
     level=logging.INFO,
     format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
 BASE_DIR = Path(__file__).parent
 STATIC_DIR = BASE_DIR / "static"
+# Use the HF Inference API. The model id comes from TINYBARD_MODEL and falls back
+# to the shared INFERENCE_MODEL default. Override either one via Space env vars.
+# Cooldown enforced in shared.inference_client.
+TINYBARD_MODEL = os.environ.get("TINYBARD_MODEL", INFERENCE_MODEL)
 # ---------------------------------------------------------------------------
 # Llama.cpp Inference Setup
 # ---------------------------------------------------------------------------
+# No local LLM state — every inference call goes through the HF Inference API
+# with cooldown enforcement. Procedural fallback is always available.
+def llm_available() -> bool:
+    """True if we *might* succeed at an inference call (model id is set and
+    cooldown is not active). A missing HF_TOKEN is not treated as a blocker,
+    because the Inference API still works anonymously for some models."""
+    return bool(TINYBARD_MODEL) and not cooldown_active("tinybard")
+def last_inference_status() -> dict:
+    """Snapshot of the current cooldown + model for /api/model_status."""
+    return {
+        "model": TINYBARD_MODEL,
+        "cooldown": cooldown_status("tinybard"),
+    }
 # ---------------------------------------------------------------------------
 # ---------------------------------------------------------------------------
+# LLM Generation Logic (HF Inference API + cooldown)
 # ---------------------------------------------------------------------------
+def _parse_messages(genre: str, history: List[Dict[str, str]], next_instruction: str) -> list[Dict[str, str]]:
+    """Translate internal history into OpenAI-style chat messages."""
+    system = (
+        "You are the narrator of an interactive text adventure game. "
+        f"Genre: {genre}. Write in the second person ('You...'). "
+        "Keep descriptions highly atmospheric but short (under 3 sentences). "
+        "Focus on action, mystery, and choice. Do not offer numbered choices unless asked."
+    )
+    msgs: List[Dict[str, str]] = [{"role": "system", "content": system}]
+    for h in (history or []):
+        if h.get("role") == "player":
+            msgs.append({"role": "user", "content": h["text"]})
+        elif h.get("role") == "narrator":
+            msgs.append({"role": "assistant", "content": h["text"]})
+    msgs.append({"role": "user", "content": next_instruction})
+    return msgs
+def generate_llm_story(
+    genre: str,
+    history: List[Dict[str, str]],
+    next_instruction: str,
+    max_tokens: int = 180,
+) -> str:
+    """Generate story text via HF Inference API (with cooldown)."""
+    if cooldown_active("tinybard"):
+        log.info("tinybard inference skipped (cooldown active)")
         return ""
     try:
+        msgs = _parse_messages(genre, history, next_instruction)
+        result = inference_generate(
+            project="tinybard",
+            messages=msgs,
+            max_new_tokens=max_tokens,
             temperature=0.7,
         )
+        return result.text
+    except RuntimeError:
+        # Cooldown — let caller fall back
+        return ""
     except Exception as e:
+        log.warning(f"HF Inference error (fallback to procedural): {e}")
         return ""
+def generate_llm_choices(genre: str, story_context: str) -> List[str]:
+    """Ask the LLM to produce 3 short distinct choices for the player."""
+    if cooldown_active("tinybard"):
+        return []
+    system = (
+        "You generate 3 short, distinct player choices for an interactive text adventure. "
+        "Output exactly in the format: 1. <choice> | 2. <choice> | 3. <choice>"
     )
+    user = f"Genre: {genre}. Last story beat: {story_context[:400]}. Give 3 choices."
+    try:
+        result = inference_generate(
+            project="tinybard",
+            messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
+            max_new_tokens=80,
+            temperature=0.8,
+        )
+        return _parse_choices(result.text)
+    except Exception:
+        return []
 # ---------------------------------------------------------------------------
             if genre not in ["fantasy", "scifi", "cyberpunk"]:
                 genre = "fantasy"
+            # Try LLM first (will skip if cooldown is active)
             instruction = "Narrate the beginning of the adventure. What happens first? Do not offer choices yet."
+            story = generate_llm_story(genre, [], instruction)
             if not story:
+                # Procedural fallback
                 result = generate_procedural_step(genre, 0, 100)
                 return (
                     result["story"], result["choices"], result["health"],
                 )
             history = [{"role": "narrator", "text": story}]
+            choices = generate_llm_choices(genre, story)
             if len(choices) < 2:
+                # Use the procedural choices
+                fallback = generate_procedural_step(genre, 0, 100)
+                choices = fallback["choices"]
             return (story, choices[:3], 100, 1, False, json.dumps(history))
             except Exception:
                 history = []
             step = int(step)
             health = int(health)
+            # First try LLM narration
             history.append({"role": "player", "text": choice})
             health_delta = random.choice([-15, 0, 10])
             if new_health <= 0:
                 instruction = "The player has run out of health. Narrate a quick, dramatic end. Game Over."
+                story = generate_llm_story(genre, history, instruction)
                 return (
                     story or "Your strength fails. The adventure ends in darkness.",
                     [], 0, step + 1, True, json.dumps(history)
             if step >= 4:
                 instruction = "Narrate the final glorious victory. The adventure ends in success."
+                story = generate_llm_story(genre, history, instruction)
                 return (
                     story or "You have achieved your goal! You are victorious!",
                     [], new_health, step + 1, True, json.dumps(history)
                 )
             instruction = "Narrate what happens next as a result of the player's choice."
+            story = generate_llm_story(genre, history, instruction)
             if not story:
                 result = generate_procedural_step(genre, step, health, choice)
                 return (
             history.append({"role": "narrator", "text": story})
+            choices = generate_llm_choices(genre, story)
             if len(choices) < 2:
                 choices = ["Move forward", "Look around", "Rest a moment"]
     return HTMLResponse("<h1>TinyBard retro terminal under construction!</h1>")
 @fastapi_app.get("/api/model_status")
 async def model_status():
+    """Check the inference client + cooldown status."""
+    return last_inference_status()
 # ---------------------------------------------------------------------------
     Returns a dict the frontend can consume directly. Used by both the
     FastAPI /api/game/* endpoints and the Gradio MCP tools.
     """
+    # Cooldown short-circuit: if active, the game just uses the procedural
+    # engine for this turn. This protects your HF/Modal credit budget.
+    in_cooldown = cooldown_active("tinybard")
     if step == 0:
         # New game
+        if in_cooldown:
             return generate_procedural_step(genre, 0, 100)
         instruction = "Narrate the beginning of the adventure. What happens first? Do not offer choices yet."
+        story = generate_llm_story(genre, [], instruction)
         if not story:
             return generate_procedural_step(genre, 0, 100)
         history = [{"role": "narrator", "text": story}]
+        choices = generate_llm_choices(genre, story)
         if len(choices) < 2:
             choices = ["Explore the area", "Check your equipment", "Proceed carefully"]
         return {
         }
     # Subsequent turn
+    if in_cooldown:
         return generate_procedural_step(genre, step, health, choice)
     history.append({"role": "player", "text": choice})
     if new_health <= 0:
         instruction = "The player has run out of health. Narrate a quick, dramatic end. Game Over."
+        story = generate_llm_story(genre, history, instruction)
         return {
             "story": story or "Your strength fails. The adventure ends in darkness.",
             "choices": [], "health": 0, "step": step + 1, "game_over": True,
     if step >= 4:
         instruction = "Narrate the final glorious victory. The adventure ends in success."
+        story = generate_llm_story(genre, history, instruction)
         return {
             "story": story or "You have achieved your goal! You are victorious!",
             "choices": [], "health": new_health, "step": step + 1, "game_over": True,
         }
     instruction = "Narrate what happens next as a result of the player's choice."
+    story = generate_llm_story(genre, history, instruction)
     if not story:
         return generate_procedural_step(genre, step, health, choice)
     history.append({"role": "narrator", "text": story})
+    choices = generate_llm_choices(genre, story)
     if len(choices) < 2:
         choices = ["Move forward", "Look around", "Rest a moment"]
     return {
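Review note on `generate_llm_choices`: the system prompt asks for the format `1. <choice> | 2. <choice> | 3. <choice>`, and the result is passed to `_parse_choices`, which this diff does not show. A minimal sketch of such a parser follows. The name `parse_choices`, the `limit` parameter, and the tolerance for newline-separated output are assumptions for illustration, not the project's actual helper.

```python
import re
from typing import List

# Hypothetical stand-in for the `_parse_choices` helper that the diff calls
# but does not show; the real implementation may differ.
_LEADING_NUMBER = re.compile(r"^\s*\d+\s*[.)]\s*")


def parse_choices(raw: str, limit: int = 3) -> List[str]:
    """Split '1. a | 2. b | 3. c' (or one choice per line) into clean strings."""
    choices: List[str] = []
    # Small models often ignore the pipe format and emit one choice per line,
    # so split on both separators.
    for part in re.split(r"[|\n]", raw or ""):
        text = _LEADING_NUMBER.sub("", part).strip()
        # Skip empty fragments and exact duplicates.
        if text and text not in choices:
            choices.append(text)
    return choices[:limit]


print(parse_choices("1. Open the door | 2. Light a torch | 3. Turn back"))
```

Returning an empty list for unusable output keeps the existing `len(choices) < 2` checks in the handlers meaningful, so the hard-coded fallback choices still apply.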

projects/tinybard/requirements.txt CHANGED Viewed

@@ -1,7 +1,18 @@
 # TinyBard — Micro Text Adventure Generator
 # Python 3.10+
 gradio>=5.0
-numpy>=1.24
-llama-cpp-python>=0.3.0
 huggingface_hub>=0.20

 # TinyBard — Micro Text Adventure Generator
 # Python 3.10+
+#
+# Inference is via the Hugging Face Inference API (no local GGUF,
+# no llama-cpp-python compile). Cooldown is enforced in
+# `shared/inference_client.py` to protect your credit budget.
+#
+# Set these Space secrets/variables to configure:
+#   HF_TOKEN              — your HF token (anonymous works for many small models)
+#   INFERENCE_MODEL       — model id (default: Qwen/Qwen2.5-1.5B-Instruct)
+#   TINYBARD_COOLDOWN_SECONDS — gap between inference calls (default 6)
+#   INFERENCE_PROVIDER    — "hf-inference" (default, free serverless) or a paid provider
+#   INFERENCE_MAX_TOKENS  — per-call token cap (default 220)
 gradio>=5.0
+fastapi>=0.110
 huggingface_hub>=0.20
+uvicorn[standard]>=0.27
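
Review note on the configuration variables listed above: `shared/inference_client.py` converts them with bare `float(...)` and `int(...)` calls at import time, so a malformed Space variable would raise `ValueError` during startup. A tolerant reader could look like the sketch below. The helper name `env_float` and the clamping to a minimum are assumptions, not part of this commit.

```python
import os


def env_float(name: str, default: float, minimum: float = 0.0) -> float:
    """Read a float from the environment, falling back on missing or bad values."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        # A typo in a Space variable should not crash the app at import time.
        return default
    # Negative cooldowns make no sense, so clamp to the minimum.
    return max(minimum, value)


os.environ["TINYBARD_COOLDOWN_SECONDS"] = "not-a-number"
print(env_float("TINYBARD_COOLDOWN_SECONDS", 6.0))
```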

projects/tinybard/static/main.js CHANGED Viewed

@@ -36,18 +36,26 @@ async function checkModelStatus() {
     try {
         const resp = await fetch(`${GRADIO_CLIENT_URL}/api/model_status`);
         if (!resp.ok) return;
-        const status = await resp.json();
-        if (status.available) {
-            modelStatus.textContent = "☘ MODEL: MII-GIIWETA / READY";
             modelStatus.style.color = "var(--asp-sun)";
         } else {
-            modelStatus.textContent = "☘ MODEL: GIIZHIK-WIIKI / FALLBACK";
             modelStatus.style.color = "var(--asp-frost)";
         }
     } catch {
         modelStatus.textContent = "☘ MODEL: ?";
     }
 }
 async function apiCall(endpoint, payload) {
     // Use the FastAPI clean-JSON endpoints (returns a dict directly).
     // /api/game/start  -> start_game

     try {
         const resp = await fetch(`${GRADIO_CLIENT_URL}/api/model_status`);
         if (!resp.ok) return;
+        const s = await resp.json();
+        const model = s.model || "inference";
+        const cd = s.cooldown || { active: false, remaining_seconds: 0, window_seconds: 0 };
+        if (cd.active) {
+            modelStatus.textContent = `☘ ${model} / COOLDOWN ${cd.remaining_seconds.toFixed(1)}s`;
+            modelStatus.style.color = "var(--asp-ember)";
+        } else if (s.model) {
+            modelStatus.textContent = `☘ ${model} / READY`;
             modelStatus.style.color = "var(--asp-sun)";
         } else {
+            modelStatus.textContent = "☘ NO MODEL / FALLBACK";
             modelStatus.style.color = "var(--asp-frost)";
         }
     } catch {
         modelStatus.textContent = "☘ MODEL: ?";
     }
 }
+// Poll model status every 2s so cooldown countdown updates
+setInterval(checkModelStatus, 2000);
 async function apiCall(endpoint, payload) {
     // Use the FastAPI clean-JSON endpoints (returns a dict directly).
     // /api/game/start  -> start_game
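
Review note on `checkModelStatus()`: the new frontend code reads `model` and a nested `cooldown` object with `active`, `remaining_seconds`, and `window_seconds` from `/api/model_status`. The server-side `last_inference_status()` is not shown in this diff, so the sketch below only illustrates a payload that would satisfy the frontend. The function name `model_status_payload` and the constants are assumptions.

```python
import json
from typing import Any, Dict

# Assumed values for illustration only; the app reads these from the environment.
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
WINDOW_SECONDS = 6.0


def model_status_payload(last_call: float, now: float) -> Dict[str, Any]:
    """Build the JSON shape that main.js reads: a model id plus cooldown state."""
    remaining = max(0.0, WINDOW_SECONDS - (now - last_call))
    return {
        "model": MODEL_ID,
        "cooldown": {
            "active": remaining > 0.0,
            "remaining_seconds": round(remaining, 2),
            "window_seconds": WINDOW_SECONDS,
        },
    }


# Two seconds after the last successful call, four seconds of cooldown remain.
print(json.dumps(model_status_payload(last_call=100.0, now=102.0), indent=2))
```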

shared/inference_client.py ADDED Viewed

@@ -0,0 +1,209 @@

+"""
+Shared HF Inference Client + Cooldown
+======================================
+Lightweight wrapper around `huggingface_hub.InferenceClient` with:
+- Per-call cooldown to prevent credit burn on live HF Spaces
+- Small synchronous API (`generate()` blocks until the response arrives)
+- Raises on failure so callers can fall back to procedural/story-template engines
+- Environment-driven config (works in HF Spaces and local)
+The cooldown model:
+- Each project has its own cooldown window (default 8s for cheap inference APIs)
+- Per project and process-wide (not per user session): after a successful inference, no new call can run until the cooldown expires
+- Failed inference does not start a cooldown (allow quick retry)
+- `cooldown_active()` is the public check; FastAPI handlers short-circuit on active cooldown
+"""
+from __future__ import annotations
+import os
+import time
+import logging
+import threading
+from dataclasses import dataclass, field
+from typing import Optional, Dict, Any, Callable, List
+log = logging.getLogger("inference")
+# ── Environment knobs ─────────────────────────────────────────────────────────
+# Override these in your Space's "Settings → Variables and secrets".
+# The HF model id used for text generation (VibeThinker 1.5B, Gemma 4 12B, etc.)
+INFERENCE_MODEL = os.environ.get(
+    "INFERENCE_MODEL",
+    "Qwen/Qwen2.5-1.5B-Instruct",  # small, fast, free-tier friendly
+)
+# Provider: "hf-inference" (free serverless), "together", "fal-ai", "replicate"
+# Free HF inference works for many small models; otherwise use a paid provider.
+INFERENCE_PROVIDER = os.environ.get("INFERENCE_PROVIDER", "hf-inference")
+# Token — read from HF Space secrets at runtime.
+HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
+# Default cooldown between inferences, in seconds.
+COOLDOWN_SECONDS = float(os.environ.get("INFERENCE_COOLDOWN_SECONDS", "8"))
+# Per-project override (keyed by app name)
+PROJECT_COOLDOWN_OVERRIDES = {
+    "tinybard": float(os.environ.get("TINYBARD_COOLDOWN_SECONDS", "6")),
+    "focusfriend": float(os.environ.get("FOCUSFRIEND_COOLDOWN_SECONDS", "10")),
+    "crittercalm": float(os.environ.get("CRITTERCALM_COOLDOWN_SECONDS", "12")),
+}
+# Max tokens to request (keeps costs bounded)
+MAX_NEW_TOKENS = int(os.environ.get("INFERENCE_MAX_TOKENS", "220"))
+# ── Cooldown registry ────────────────────────────────────────────────────────
+@dataclass
+class _CooldownState:
+    last_call: float = 0.0
+    lock: threading.Lock = field(default_factory=threading.Lock)
+_states: Dict[str, _CooldownState] = {}
+def _state(project: str) -> _CooldownState:
+    if project not in _states:
+        _states[project] = _CooldownState()
+    return _states[project]
+def cooldown_seconds_for(project: str) -> float:
+    return PROJECT_COOLDOWN_OVERRIDES.get(project, COOLDOWN_SECONDS)
+def cooldown_active(project: str) -> bool:
+    """Return True if the project is currently in cooldown (cannot run inference)."""
+    state = _state(project)
+    now = time.time()
+    if now - state.last_call < cooldown_seconds_for(project):
+        return True
+    return False
+def cooldown_remaining(project: str) -> float:
+    """Seconds left in the cooldown window (0 if not in cooldown)."""
+    state = _state(project)
+    elapsed = time.time() - state.last_call
+    remaining = cooldown_seconds_for(project) - elapsed
+    return max(0.0, remaining)
+def cooldown_status(project: str) -> dict:
+    """Snapshot of cooldown state for the UI."""
+    return {
+        "active": cooldown_active(project),
+        "remaining_seconds": round(cooldown_remaining(project), 2),
+        "window_seconds": cooldown_seconds_for(project),
+    }
+def _mark_called(project: str) -> None:
+    state = _state(project)
+    with state.lock:
+        state.last_call = time.time()
+# ── Inference client wrapper ─────────────────────────────────────────────────
+class InferenceResult:
+    """A small wrapper so callers don't need to know which API returned text."""
+    def __init__(self, text: str, model: str, provider: str, latency_s: float):
+        self.text = text
+        self.model = model
+        self.provider = provider
+        self.latency_s = latency_s
+    def __repr__(self) -> str:
+        return f"InferenceResult(text={self.text[:50]!r}…, model={self.model!r}, latency={self.latency_s:.2f}s)"
+def _get_client():
+    """Lazy-load the InferenceClient to keep boot fast."""
+    from huggingface_hub import InferenceClient
+    return InferenceClient(
+        model=INFERENCE_MODEL,
+        token=HF_TOKEN,
+        provider=INFERENCE_PROVIDER,
+    )
+def generate(
+    project: str,
+    messages: List[Dict[str, str]],
+    *,
+    max_new_tokens: Optional[int] = None,
+    temperature: float = 0.7,
+) -> InferenceResult:
+    """Run a chat-style inference call, with cooldown enforcement.
+    `messages` follows OpenAI chat format: [{"role": "user|assistant|system", "content": "..."}].
+    Returns InferenceResult with `.text` (string) on success, or raises on failure.
+    Caller is responsible for fallback handling.
+    """
+    if cooldown_active(project):
+        remaining = cooldown_remaining(project)
+        raise RuntimeError(
+            f"cooldown active for {project!r}: {remaining:.1f}s remaining. "
+            f"This protects your HF/Modal credit budget."
+        )
+    max_new_tokens = max_new_tokens or MAX_NEW_TOKENS
+    client = _get_client()
+    start = time.time()
+    response = client.chat_completion(
+        messages=messages,
+        max_tokens=max_new_tokens,
+        temperature=temperature,
+    )
+    latency = time.time() - start
+    text = response.choices[0].message.content or ""
+    text = text.strip()
+    _mark_called(project)
+    return InferenceResult(
+        text=text,
+        model=INFERENCE_MODEL,
+        provider=INFERENCE_PROVIDER,
+        latency_s=latency,
+    )
+def force_clear_cooldown(project: str) -> None:
+    """Manual escape hatch (e.g. for testing or admin overrides)."""
+    _state(project).last_call = 0.0
+# ── Convenience: build messages + format result ──────────────────────────────
+def chat_messages(system: str, user: str, history: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, str]]:
+    """Build an OpenAI-style message list with optional prior turns.
+    `history` is in the same [{role, content}, ...] format. New turns are appended.
+    """
+    msgs: List[Dict[str, str]] = [{"role": "system", "content": system}]
+    if history:
+        msgs.extend(history)
+    msgs.append({"role": "user", "content": user})
+    return msgs
+__all__ = [
+    "InferenceResult",
+    "cooldown_active",
+    "cooldown_remaining",
+    "cooldown_seconds_for",
+    "cooldown_status",
+    "force_clear_cooldown",
+    "generate",
+    "chat_messages",
+    "INFERENCE_MODEL",
+    "INFERENCE_PROVIDER",
+    "MAX_NEW_TOKENS",
+]
+if __name__ == "__main__":
+    # Smoke test
+    for p in ("tinybard", "focusfriend", "crittercalm"):
+        print(p, "cooldown:", cooldown_status(p))
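
Review note on the cooldown model: `generate()` checks `cooldown_active()` first and calls `_mark_called()` only after the response arrives, so two threads that arrive together can both pass the check and both spend credits. One way to close that gap is to reserve the slot under the lock before calling the API. The sketch below assumes a hypothetical `CooldownGate` class, which is not part of this commit, and uses `time.monotonic` instead of the module's `time.time` so that wall-clock adjustments cannot shorten or lengthen the window.

```python
import threading
import time
from typing import Callable, Optional


class CooldownGate:
    """Hypothetical helper: at most one call per `window` seconds, across threads."""

    def __init__(self, window: float, clock: Callable[[], float] = time.monotonic):
        self._window = window
        self._clock = clock
        self._lock = threading.Lock()
        self._last_success: Optional[float] = None
        self._in_flight = False

    def try_acquire(self) -> bool:
        """Reserve the slot; False if a call is running or the cooldown is active."""
        with self._lock:
            if self._in_flight:
                return False
            if (self._last_success is not None
                    and self._clock() - self._last_success < self._window):
                return False
            self._in_flight = True
            return True

    def release(self, success: bool) -> None:
        """Finish a call; only a success starts the cooldown window."""
        with self._lock:
            self._in_flight = False
            if success:
                self._last_success = self._clock()


# Demo with a fake clock so the example runs instantly.
fake_now = [100.0]
gate = CooldownGate(window=6.0, clock=lambda: fake_now[0])
print(gate.try_acquire())   # first caller gets the slot
print(gate.try_acquire())   # second caller is refused while the first is in flight
gate.release(success=True)
fake_now[0] += 6.0
print(gate.try_acquire())   # the window has elapsed
```

A caller would wrap the API request in `try_acquire()` and `release(...)`, passing `success=False` from the exception path so that a failed request does not start a cooldown, which matches the rule stated in the module docstring.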