| --- |
| title: OffGridSchedula |
| emoji: 🗓️ |
| colorFrom: indigo |
| colorTo: purple |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| license: apache-2.0 |
| short_description: Local-first chat-to-calendar agent (Gemma-4 E4B + MiniCPM) |
| tags: |
| - track:backyard |
| - sponsor:openbmb |
| - sponsor:modal |
| - achievement:offgrid |
| - achievement:welltuned |
| - achievement:offbrand |
| - achievement:llama |
| - achievement:sharing |
| - achievement:fieldnotes |
| models: |
| - build-small-hackathon/gemma-4-cal-gguf |
| - openbmb/MiniCPM5-1B-GGUF |
| demo_video: |
| - https://youtu.be/m-o0u9X3tI4 |
| social_posts: |
| - https://x.com/nate_mauer/status/2065973341651882386 |
| - https://x.com/nate_mauer/status/2064920352845709419 |
| - https://x.com/nate_mauer/status/2065661878441750916 |
| - https://www.linkedin.com/feed/update/urn:li:ugcPost:7471440639969132545 |
| blog_post: |
| - https://huggingface.co/blog/build-small-hackathon/offgridschedula |
| made_by: |
| - ParetoOptimal - a.k.a., Nate Mauer |
| --- |
| |
| # 🗓️ Message Scheduling Agent |
|
|
| **OffGridSchedula turns a pasted chat (or a flyer screenshot) into calendar events, catches conflicts, and drafts the reply, right from your phone: no app, no account,
| no setup. iOS allows neither background iMessage access nor a persistent on-device LLM server, so there's no autonomous on-device agent to install; instead,
| a foreground Shortcut ([docs/automations.md](./docs/automations.md)) hands a thread or screenshot to the agent in two taps (optionally using a remote model via `INFERENCE_BASE_URL`).**
|
|
| The model runs on **your own server or even on the phone itself**, not on a cloud AI service. Your chats aren't shipped off to a third-party AI to be read; they go to an agent you control, which reads your snippet in memory,
| turns it into ready-to-add calendar events, and discards it after replying. The run trace you can optionally share is redacted.
|
|
| **Hardware-aware.** On under-powered hardware the app shows an upgrade banner rather than hanging; the real model needs a small GPU.
|
|
| ## Build Small submission: the idea & the tech
|
|
| **The idea.** A busy parent's calendar lives in other people's messages: picture day in the
| class chat, the practice that moved, the party flyer. OffGridSchedula turns those into calendar
| events: paste the chat (or snap the flyer) from a phone browser, review the extracted events, the
| conflicts against your own `.ics`, and a drafted reply, then add to Apple/Google Calendar in a tap.
|
|
| **The tech.** Two small local models do the work. Extraction is [`gemma-cal` E4B](https://huggingface.co/build-small-hackathon/gemma-4-cal-gguf)
| (~4B effective params), our QLoRA fine-tune of Gemma-4 E4B that emits a single validated
| **ActionPlan** (events · conflicts · reply · clarifying question), served with **vision** through
| the official **llama.cpp** server inside this Docker Gradio Space, with no cloud AI APIs. The
| fine-tune + its 60-example task eval ran entirely on **Modal** serverless GPUs, behind an
| eval gate that rejected eight regressed models before this one shipped. Conflict math is
| deterministic Python, the UI is fully custom, the agent doubles as an **MCP tool server**, and
| redacted run traces are public on the [Hub](https://huggingface.co/datasets/ParetoOptimal/offgridschedula-traces).
| Click **Run the agents** and a local **OpenBMB MiniCPM** planner (a second local llama-server)
| drives this same Space's MCP tools as a multi-step agent (extract → check conflicts → render
| `.ics`), with every step visible. Still zero cloud AI; every model under 32B.
|
|
| **What's new.** Extraction now reads the *logistics*, not just the date (see below): arrival-aware
| start times, duration→end conversion, type-based reminders, and calendar-ready titles, each
| guaranteed by deterministic post-processing even when the model wobbles, and each shipped through
| a measured A/B eval ([full result tables](./training/data/ab_results.md): regex vs text-LLM vs
| **vision-LLM reading rendered screenshots only**). Calendar out got one-click too: a unified
| **Connect your calendar** block (Google OAuth, where the token lives in *your* browser, never on the
| server; Outlook/Apple need no sign-in) and per-event **Google · Outlook · iCal** links, with the
| Google push verified end-to-end (push → readback → delete, 11/11).
|
|
| **The UX.** One decision, **Offline or Online**, re-themes the whole workflow card and sets the
| path: off-grid `.ics` only, or a **one-click "Connect your calendar"** whose Google OAuth token
| lives *only in the browser* (server-verified each visit; the client secret never leaves the
| server). Results land in a single card: events, conflicts, the drafted reply, and per-event
| **Google · Outlook · iCal · .ics** quick-add links. **Activity → This week** tallies events
| captured, conflicts caught, and time saved; a per-device **Memory** (localStorage, one-click
| samples) feeds names and preferences back into extraction.
|
|
| **Submission links:** [requirement-by-requirement mapping](./docs/build-small-submission.md) ·
| [demo video](https://youtu.be/m-o0u9X3tI4) ·
| social posts [1](https://x.com/nate_mauer/status/2064920352845709419) ·
| [2](https://x.com/nate_mauer/status/2065661878441750916)
|
|
| ## Who this is for |
|
|
| A busy parent whose kid's school and activity events are buried in a noisy class group chat:
| picture day Thursday, the practice that moved to Tuesday, the birthday-party RSVP. They read it once,
| mean to add it later, and miss it. With this, they **paste the chat** (or a **screenshot** of a flyer
| or invite) from their phone's browser and get back: the events, a **conflict check** against their
| calendar, and a **ready-to-send reply**, all surfaced for review before anything is saved. Output is
| a local `.ics` they can add to any calendar, with optional Google Calendar push.
|
|
| No app to install and no account. It reads nothing automatically; the parent pastes only what they
| choose. Inference runs **in the Space** via `llama.cpp` (no cloud AI APIs), and works out of the box
| with no GPU (see *Accuracy upgrade* below).
|
|
| ## The model: `gemma-cal` E4B, one calendar-native LLM built for exactly this
|
|
| What makes this platform different isn't a prompt wrapped around a generic chatbot. It's
| **[`gemma-cal` E4B](https://huggingface.co/build-small-hackathon/gemma-4-cal-gguf), our own fine-tune of
| Gemma-4 E4B purpose-built for one job: turning messy human conversation into calendar-ready
| structure.** The model doesn't chat. It reads a thread (or a flyer photo) and emits a single
| validated **ActionPlan**: events with exact ISO datetimes, conflicts, proposed alternatives, a
| drafted reply, and a clarifying question when the plan is too vague to schedule. **It is the one
| and only extraction model the platform runs**, everywhere from the production Space to a laptop
| (the optional *Run the agents* planner is a separate MiniCPM model).
|
|
| - **Edge-sized by design.** ~5 GB at Q4, so it serves on a **~$0.40/hr 16 GB T4** (vs $4+/hr A100-class
| for big models), a gaming GPU, or an Apple-silicon laptop, with full **vision**
| (screenshots/flyers) via its mmproj. Local-first isn't a tagline; it's the parameter count.
| - **Schema-bulletproof.** The fine-tune holds **100% schema validity even with no system prompt**,
| with stronger no-event discipline (doesn't invent events from "thanks!") and a higher rate of
| *asking* when a date is TBD. These are the failure modes that actually burn users of generic models.
| - **Convention-trained.** It learns *this product's* date semantics ("next Tuesday" means next
| week's Tuesday; weekday-anchored relative dates) instead of whatever a base model absorbed
| from the internet.
| - **Eval-gated, never vibes-shipped.** Every retrain runs a 60-example task eval (start-exact
| datetime matching, F1, validity, clarification) and **cannot reach production unless it clears
| the gate**; the pipeline has rejected eight regressed models to date. The full, honest scorecard
| lives in [`docs/eval-roadmap.md`](./docs/eval-roadmap.md) and the
| [post-mortem write-up](./docs/blog-eval-gated-finetuning.md).
|
|
| **Hackathon size constraint (≤ 32B):** easily met; E4B is ~4B effective parameters. See the in-app
| **Submission** tab for the full compliance scorecard.
|
|
| ### Reads the logistics, not just the date |
|
|
| A confirmation like *"Time: 10:30 AM · Duration: approx. 30–45 min · (Please arrive 15 minutes
| early to complete intake forms) · 📍 112A West 72nd Street…"* becomes one correct event:
|
|
| - **Arrival-aware start**: the event starts at **10:15** (when you must show up), the official
| 10:30 is preserved in the notes, and the **end is anchored to the stated time + duration**
| (11:00), so the calendar block covers the forms *and* the visit.
| - **Type-based notifications**: an explicitly stated lead time always wins ("remind me 2 hours
| before" → 120); otherwise doctor/medical visits get 60 minutes, parties 30, carpools and school
| events 45.
| - **Real-world addresses**: multi-line and 📍-emoji locations join into one string;
| "(Upper West Side – 72nd & Columbus)" glosses and SMS footers ("Reply C to confirm… call us
| at 212-223-0349") don't confuse it.
| - **Calendar-ready titles**: an action+subject summary ("Pick up Priya – Terminal 4"), not a
| quote of the message.
|
|
| The model is *taught* these conventions (prompt + fine-tune data), but the load-bearing ones are
| also **guaranteed by deterministic post-processing** (`apply_text_rules` in
| [`server/agent.py`](./server/agent.py)). It's the same philosophy as the conflict engine: must-hold
| logistics are never left to model temperament. Every behavior above shipped through a measured
| A/B eval (regex baseline vs text-LLM vs **vision-LLM reading rendered chat screenshots only**),
| with the full tables in [`training/data/ab_results.md`](./training/data/ab_results.md)
| (headline: text-LLM event F1 0.96 structured / 0.89 unstructured vs regex 0.60/0.67; the
| screenshot-only vision arm lands within a point of text).
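
The arrival and reminder rules above reduce to a few lines of date arithmetic. The following is a minimal sketch, not the actual `apply_text_rules` code: the function names, the `event_type` labels, and the choice to return `None` for unlisted event types are assumptions made for illustration, and the 30-minute duration is simply the value that reproduces the 11:00 end time in the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default reminder lead times in minutes, taken from the list above.
# The dictionary keys are hypothetical labels, not the project's own.
DEFAULT_REMINDER_MIN = {"medical": 60, "party": 30, "carpool": 45, "school": 45}


def arrival_aware_times(stated: datetime, duration_min: int, arrive_early_min: int = 0):
    """Return (start, end): the start moves earlier by the arrival buffer,
    while the end stays anchored to the stated time plus the duration."""
    start = stated - timedelta(minutes=arrive_early_min)
    end = stated + timedelta(minutes=duration_min)
    return start, end


def reminder_minutes(event_type: str, explicit_min: Optional[int] = None) -> Optional[int]:
    """An explicitly stated lead time wins; otherwise fall back to the type default.
    Unlisted types return None (an assumption; the README does not specify)."""
    if explicit_min is not None:
        return explicit_min
    return DEFAULT_REMINDER_MIN.get(event_type)


# The 10:30 appointment from the example (the calendar date is made up).
stated = datetime(2025, 3, 6, 10, 30)
start, end = arrival_aware_times(stated, duration_min=30, arrive_early_min=15)
print(start.time(), end.time())  # 10:15:00 11:00:00
```

Keeping these rules as plain functions is what lets them hold regardless of what the model emits: the post-processing step can always recompute the start, end, and reminder from the extracted fields.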
|
|
| ## Try it in 30 seconds |
|
|
| Open the Space in your phone's browser → **Schedule** tab → tap **Try a sample** (or paste your own
| group chat, and optionally a screenshot or your `.ics`) → review the detected events → **Download
| .ics**. The **Activity → This week** panel then shows what you've captured and the time it saved.
|
|
| ## How it works |
|
|
| ``` |
| Paste a thread / screenshot ──▶ HF Space ──▶ llama.cpp ──▶ events + conflicts + reply
|       (phone browser)              │                          │
|                            custom Gradio UI ◀── review ◀──────┘
|                              ▼           ▼
|                       .ics download / optional Google Calendar
| ``` |
|
|
| The **primary path needs nothing but a browser**: paste text and/or attach a screenshot in the |
| Schedule tab. (Power users can also auto-feed messages from a Mac; see *Optional: Mac collector*.)
|
|
| For the full solution-architecture view (every workflow and which LLM, if any, it calls,
| plus the eval-gated fine-tuning loop), see **[docs/architecture.md](./docs/architecture.md)**.
|
|
| ## Can it process multiple invites at once? |
|
|
| **Yes. Multiple invites in one paste is the designed path** (on the live Space, where the real
| model runs). `ActionPlan.events` is a *list*, and the extraction prompt explicitly tells the model
| that one thread often holds several events: a drop-off AND a pickup, or two appointments, are
| separate events (`server/agent.py`). Everything downstream is built for N events: the results card
| shows "*N events found*" with one card per invite, the editable table gets one row each, the `.ics`
| contains one `VEVENT` per event, each event carries its own Google/Outlook/Apple quick-add links,
| and the conflict check runs across all of them. Screenshot input is multi-file too: attach several
| flyers and they're all read in one run.
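
To make "one `VEVENT` per event" concrete, here is a minimal sketch of rendering a list of events into a single calendar file. It is not the project's `calendar_out/ics.py`: the dict keys (`title`, `start`, `end`), the `PRODID`, and the `UID` scheme are assumptions, timezone-aware datetimes are required, and long-line folding is left out for brevity.

```python
from datetime import datetime, timezone


def _utc(dt: datetime) -> str:
    """Format an aware datetime as an iCalendar UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def _escape(text: str) -> str:
    """Escape the characters iCalendar reserves inside TEXT values."""
    return (text.replace("\\", "\\\\").replace(";", "\\;")
                .replace(",", "\\,").replace("\n", "\\n"))


def render_ics(events, now=None) -> str:
    """Render a list of event dicts as one calendar with one VEVENT each."""
    stamp = _utc(now or datetime.now(timezone.utc))
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//ics-sketch//EN"]
    for i, ev in enumerate(events):
        lines += [
            "BEGIN:VEVENT",
            f"UID:{i}-{_utc(ev['start'])}@example.invalid",  # hypothetical UID scheme
            f"DTSTAMP:{stamp}",
            f"DTSTART:{_utc(ev['start'])}",
            f"DTEND:{_utc(ev['end'])}",
            f"SUMMARY:{_escape(ev['title'])}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end with CRLF
```

Because the renderer loops over the list, a thread with a drop-off and a pickup yields two `VEVENT` blocks in the same file, which is what lets a single download carry every invite.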
|
|
| Two caveats: |
|
|
| - **Stub mode extracts only the first invite.** The local-dev heuristic (`_stub_plan` in
| `server/agent.py`, enabled by `USE_STUB_EXTRACTOR=1`) works with no model and no GPU, and it's
| now a decent parser in its own right (labeled times, explicit dates, multi-line/📍 locations,
| durations, arrival-early shifts, type-based reminders), but it still returns at most **one**
| event. If you paste a multi-invite thread locally and get one event back, that's the stub, not
| the product; the deployed Space uses the multi-event model path.
| - **Simultaneous *runs* are serialized, not parallel.** If two users (or two tabs) hit *Run the
| agents* at once, both complete, but inference executes one request at a time: `server/model.py`
| holds the llama.cpp instance behind a `threading.Lock`, and Gradio queues the events. On a
| single-GPU Space that's intentional (one model copy in memory); the second run simply waits its
| turn, then streams its own pipeline progress.
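
The serialization described in the second caveat can be illustrated with a lock around the model call. This is a sketch of the general pattern only: `SerializedModel`, `infer`, and `run_all` are hypothetical names, and the `time.sleep` stands in for the real llama.cpp call in `server/model.py`.

```python
import threading
import time


class SerializedModel:
    """Wraps a model call so that only one request runs at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of simultaneous calls observed

    def infer(self, prompt: str) -> str:
        with self._lock:  # a second caller blocks here until the first finishes
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
            time.sleep(0.01)  # stand-in for the real inference call
            self._active -= 1
            return f"plan for: {prompt}"


def run_all(model: SerializedModel, prompts):
    """Start one thread per prompt and collect results in input order."""
    results = [None] * len(prompts)

    def worker(i, prompt):
        results[i] = model.infer(prompt)

    threads = [threading.Thread(target=worker, args=(i, p)) for i, p in enumerate(prompts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every caller still gets its own result; the lock only guarantees that `peak_active` never exceeds one, which is the property a single in-memory model copy needs.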
|
|
| ## Repo layout |
|
|
| ``` |
| app.py # Gradio + FastAPI entrypoint (the Space) |
| server/ |
| agent.py # thread (+images) -> validated ActionPlan |
| orchestrator.py # Run the agents: MiniCPM planner driving our own MCP tools |
| schema.py # Event / Conflict / ActionPlan pydantic models |
| model.py # llama.cpp load: GGUF + vision mmproj, constrained JSON |
| imageutil.py # image -> base64 data URI |
| ui/blocks.py # custom Gradio Blocks (reasoning, events, conflicts, reply) |
| static/app.css # custom CSS (Off-Brand) |
| calendar_out/ |
| ics.py # .ics generation (off-grid default) |
| freebusy.py # parse existing .ics + deterministic conflict detection |
| gcal.py # optional Google Calendar push |
| collector/collector.py # Mac-side iMessage collector (text + image attachments) |
| training/ # dataset build + QLoRA fine-tune + GGUF/mmproj export |
| Dockerfile # dedicated-GPU Space: builds llama.cpp (0.3.28) WITH CUDA |
| requirements-docker.txt # runtime deps for the Docker image (llama.cpp built separately) |
| PLAN.md # full design + build plan |
| ``` |
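
The `freebusy.py` entry above mentions deterministic conflict detection. A common way to do that is a plain interval-overlap test, sketched below; this is an illustration rather than the project's code, the dict keys are hypothetical, and treating back-to-back events as non-conflicting (half-open intervals) is an assumption.

```python
from datetime import datetime


def overlaps(a_start: datetime, a_end: datetime, b_start: datetime, b_end: datetime) -> bool:
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def find_conflicts(new_events, busy):
    """Return (new_title, busy_title) pairs, sorted so repeated runs give the same output."""
    hits = [
        (n["title"], b["title"])
        for n in new_events
        for b in busy
        if overlaps(n["start"], n["end"], b["start"], b["end"])
    ]
    return sorted(hits)
```

No model is involved at any point, so the same inputs always produce the same conflict list.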
|
|
| ## Quick start (local dev): no GPU needed
|
|
| ```bash |
| pip install -r requirements.txt |
| |
| # Runs the whole app with the built-in heuristic agent (no model, no GPU):
| export USE_STUB_EXTRACTOR=1 INGEST_TOKEN="dev-secret" |
| python app.py # http://localhost:7860 |
| ``` |
|
|
| Open it, go to the **Schedule** tab, and tap **Try a sample**, or paste a thread, attach chat
| **screenshots**, and optionally upload your current calendar **`.ics`** for conflict checks.
| (Heads-up: the stub agent extracts only the **first** invite in a thread. Multi-invite extraction
| needs the real model; see *Can it process multiple invites at once?* above.) Tip for
| self-hosted installs: set `CAL_ICS_PATH=/path/to/calendar.ics` and conflict checks use that file
| automatically whenever no `.ics` is uploaded, so step 4 completes itself, fully offline. Review
| the detected events, conflicts, proposed times, and the suggested reply, then add any event with
| its **Add to: Google · Outlook · iCal · .ics** links (iCal and .ics both download the event's
| `.ics` file; with 2+ events an **iCal – all N events** link grabs everything at once).
| The **Activity → This week** panel shows what you've captured.
|
|
| ## This week (impact) |
|
|
| The Activity tab has a **This week** panel that persists across restarts: **events captured**, |
| **conflicts caught**, and **estimated time saved**. A "capture" is counted when a run surfaces |
| events for review (adding to a calendar happens through the per-event links, which the server |
| can't observe). |
|
|
| `minutes_saved` is a deliberately conservative, **configurable estimate, not a measurement**:
| `IMPACT_MIN_PER_EVENT` (default **8** min per captured event) + `IMPACT_MIN_PER_CONFLICT` (default |
| **15** min per conflict caught). Override either via env. State persists to `IMPACT_PATH` |
| (default `/tmp/impact_weeks.json`; point it at a persistent disk on a Space to survive rebuilds). |
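
As a worked example of the estimate, the sketch below applies the two defaults named above. The function signature and the integer parsing of the environment values are assumptions; only the variable names and default values come from this README.

```python
import os


def minutes_saved(events: int, conflicts: int, env=None) -> int:
    """Estimate minutes saved from per-event and per-conflict allowances."""
    env = os.environ if env is None else env
    per_event = int(env.get("IMPACT_MIN_PER_EVENT", "8"))
    per_conflict = int(env.get("IMPACT_MIN_PER_CONFLICT", "15"))
    return events * per_event + conflicts * per_conflict


# Three captured events and one caught conflict with the defaults:
print(minutes_saved(3, 1, env={}))  # 3*8 + 1*15 = 39
```

Passing the environment as a mapping keeps the function easy to test; in the app it would read the process environment.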
|
|
| ## Accuracy upgrade (optional): serve the real `gemma-cal` LLM
|
|
| The stub agent above makes the demo work with **no GPU**. The production Space serves our
| fine-tuned **`gemma-cal` E4B** through `llama-server`, with no cloud AI APIs either way. The same
| config works anywhere llama.cpp runs:
|
|
| ```bash |
| export USE_STUB_EXTRACTOR=0 |
| export MODEL_HF_REPO="build-small-hackathon/gemma-4-cal-gguf" |
| export MODEL_FILE="gemma-cal-e4b-Q4_K_M.gguf" # ~5 GB edge fine-tune (what the Space serves) |
| export MMPROJ_REPO="unsloth/gemma-4-E4B-it-GGUF" # the E4B's own vision projector |
| export MMPROJ_FILE="mmproj-F16.gguf" # enables screenshot/vision input |
| bash scripts/start_space.sh |
| ``` |
|
|
| This is the platform's **only** extraction model: the same ~5 GB GGUF serves the production Space (16 GB
| T4), a gaming GPU, or a laptop. (`MODEL_FILE` is explicit on purpose: the model repo also stores
| legacy training artifacts, so the `-hf repo:Q4_K_M` shorthand is ambiguous.)
|
|
| ## Optional: Mac collector (power users) |
|
|
| The phone-paste path above needs nothing installed. If you'd rather have new iMessages fed in |
| automatically, run the collector on a Mac where iMessages sync (iOS exposes no API for message |
| content, so a Mac is the only auto-feed source): |
|
|
| ```bash |
| cd collector && cp .env.example .env # edit SPACE_URL + INGEST_TOKEN |
| python collector.py |
| ``` |
|
|
| > ⚠️ The collector needs **Full Disk Access** (System Settings → Privacy & Security) to read `chat.db`.
|
|
| ## Autonomous & on a phone |
|
|
| There's a single backend endpoint, **`POST /agent`** (bearer `INGEST_TOKEN`), that takes a thread
| (or messages, + optional screenshot/`.ics`) and returns the extracted events, conflicts, and reply as
| JSON (optionally an `.ics` or a Google Calendar push). Every front-end calls it:
|
|
| - **Fully autonomous (Mac), set-and-forget:** `INGEST_TOKEN=… MODEL_GGUF=~/models/hermes.gguf
| scripts/setup_mac.sh` installs three launchd jobs (Hermes `llama-server` + autonomous backend +
| collector). New iMessages **you send or accept** become calendar events automatically, deduped per
| chat. Triggers on outgoing messages by default (`TRIGGER_ON=outgoing`; `any` to widen).
| - **Hermes "grows-with-you" brain:** point `INFERENCE_BASE_URL` at a Hermes `llama-server`; its
| personal **memory** (people→roles, "you decline Mondays") improves extraction over time and is shown
| in the dashboard **Memory** tab. See **[docs/hermes.md](./docs/hermes.md)**.
| - **iPhone, one tap:** an iOS **Shortcut** shares a thread/screenshot to `/agent` and adds the events
| to Apple Calendar natively, with no `.ics` import.
| - **Android, hands-off:** a Tasker/MacroDroid rule on a notification/SMS calls `/agent` and inserts
| events. See **[docs/android-tasker.md](./docs/android-tasker.md)**.
| - **On-device model:** set `INFERENCE_BASE_URL` to a local `llama-server` (e.g. Gemma **E4B** or a
| small Hermes in Termux) so inference runs *on the phone*: same agent, env-selected.
|
|
| > **iOS can't read iMessage in the background** (no message API), so fully-autonomous iMessage needs |
| > the Mac collector; the iPhone path is one-gesture. See **[docs/automations.md](./docs/automations.md)** |
| > and **[docs/on-device.md](./docs/on-device.md)**. |
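
For front-ends that call `POST /agent` directly, the request might be built as in the sketch below. Only the path and the use of `INGEST_TOKEN` as a bearer token come from this README; the JSON field name `thread`, the `Authorization: Bearer` header layout, and the helper name are assumptions, so check the server code for the actual request schema.

```python
import json
import urllib.request


def build_agent_request(base_url: str, token: str, thread: str) -> urllib.request.Request:
    """Build (but do not send) a POST /agent request carrying a pasted thread."""
    body = json.dumps({"thread": thread}).encode("utf-8")  # "thread" key is an assumption
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_agent_request("http://localhost:7860", "dev-secret", "Picture day is Thursday at 9am")
# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     plan = json.loads(resp.read())
```

Building the request separately from sending it makes the same helper usable from a Shortcut bridge, a Tasker script, or a test.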
|
|
| ## Build Small: prizes & quests
|
|
| **Track: 🏡 Backyard AI** (`track:backyard`): a practical app for a specific real person, a busy
| parent whose family calendar is buried in a noisy class group chat.
|
|
| ### Sponsor awards we compete for |
|
|
| | Award | Why this submission qualifies | |
| |---|---| |
| | 🟢 **Modal Awards** (best Modal-powered apps) | **Modal powered the development of the platform's model end-to-end** (required note, gladly given): [`training/modal_train.py`](./training/modal_train.py) (QLoRA fine-tune on serverless A100/H100s, Volumes caching weights), [`training/modal_eval.py`](./training/modal_eval.py) + [`modal_quant_eval.py`](./training/modal_quant_eval.py) (the task eval served on llama.cpp inside Modal, incl. an f16/Q8_0/Q4_K_M quantization study and the regex/text/vision A/B harness), and [`training/gated_retrain.py`](./training/gated_retrain.py) (train → staging → eval → promote *only past the gate*; eight regressed models rejected, every run a Modal job). | |
| | 🌱 **OpenBMB Awards** (standout MiniCPM builds, per track) | The **agent is planned by OpenBMB MiniCPM** (`openbmb/MiniCPM4.1-8B-GGUF`, Q4; the 1B variant is a config switch) on a second local llama-server, driving this Space's own MCP tools (`extract_events → check_conflicts → make_ics`) as a visible multi-step agent ([`server/orchestrator.py`](./server/orchestrator.py)). MiniCPM is the agent's brain, not a garnish. | |
|
|
| *(Not claimed: the OpenAI Track (no Codex-attributed commits) and the NVIDIA Nemotron Quest
| (different model family). We'd rather be honest than eligible.)*
|
|
| ### Special awards β our case |
|
|
| | Award | Our case | |
| |---|---| |
| | **Bonus Quest Champion** | All **six** collectable quests claimed with evidence: the full sash (table below). | |
| | 🎨 **Off-Brand Award** | Custom landing page, hero + carousel, grouped nav, bespoke results cards and Activity dashboard ([`ui/blocks.py`](./ui/blocks.py) + [`static/app.css`](./static/app.css)), far past the stock Gradio look. | |
| | **Tiny Titan** | The platform's one and only extraction model is **Gemma E4B, ~4B *effective* parameters** (~5 GB at Q4, serves on a 16 GB T4 or a laptop), and a 1B MiniCPM planner variant is a config switch. Honest framing: E4B is a MatFormer "effective-4B"; judges' call whether that's tiny enough. | |
| | 🎬 **Best Demo** | App + demo video + social post as one package; storyboard with every quest named on-camera in [`docs/demo-script.md`](./docs/demo-script.md). | |
| | 🤖 **Best Agent** | The MiniCPM-planned, MCP-tool-driven agent above: real multi-step tool use, every model under the 32B cap. | |
| | **Judges' Wildcard** | No entry needed, but if "eval-gated fine-tuning with a public failure post-mortem" fits no category, we know where to find you. | |
|
|
| ### Collectable quests: all six claimed
|
|
| | Quest | Evidence | |
| |---|---| |
| | **Off the Grid** (local-first, no cloud APIs) | All inference is llama.cpp inside the Space; the only optional outbound call is the user's own Google Calendar push. | |
| | 🎯 **Well-Tuned** (published fine-tune) | [`gemma-cal` E4B](https://huggingface.co/build-small-hackathon/gemma-4-cal-gguf): our QLoRA fine-tune **is the model production serves**, shipped through the eval gate with the [honest scorecard public](./docs/eval-roadmap.md). | |
| | 🎨 **Off-Brand** (custom UI) | See the Off-Brand Award case above. | |
| | 🦙 **Llama Champion** (llama.cpp runtime) | The official `ghcr.io/ggml-org/llama.cpp` server image runs the GGUF + vision mmproj ([`Dockerfile`](./Dockerfile), [`scripts/start_space.sh`](./scripts/start_space.sh)). | |
| | 📡 **Sharing is Caring** (open trace on the Hub) | Redacted agent traces published to [`ParetoOptimal/offgridschedula-traces`](https://huggingface.co/datasets/ParetoOptimal/offgridschedula-traces), one click from the Activity tab. | |
| | **Field Notes** (write-up) | [`FIELD_NOTES.md`](./FIELD_NOTES.md) + the [eval-gated fine-tuning post-mortem](./docs/blog-eval-gated-finetuning.md) + [project blog](https://huggingface.co/blog/build-small-hackathon/offgridschedula). | |
|
|
| ## Fine-tune on Modal (GPU) |
|
|
| `training/modal_train.py` runs the whole fine-tune on a serverless GPU and publishes the GGUF to
| HF, so no local GPU is needed. It's a thin wrapper that ships this repo to Modal and runs the existing
| pipeline (`make_dataset.py` → `train_qlora.py` → `export_gguf.sh`) on an A100/H100, then uploads the
| quantized GGUF + `mmproj` to your HF repo. This is all *offline* prep, so **Off the Grid** is
| untouched (the rule applies to the running app's inference, not dataset/training prep).
|
|
| ```bash |
| pip install modal |
| modal token new |
| modal secret create huggingface HF_TOKEN=hf_xxxxxxxx # your HF *write* token |
| |
| # Validate the full pipeline cheaply first (cheap edge model, ~a couple $): |
| modal run training/modal_train.py --base-model google/gemma-4-E4B-it |
| |
| # Then the real run (default A100-80GB; --gpu H100 for speed): |
| modal run training/modal_train.py |
| modal run training/modal_train.py --gpu H100 --num-epochs 3 |
| ``` |
|
|
| On finish it prints the `MODEL_REPO` / `MODEL_FILE` / `MMPROJ_FILE` to set on the Space. Two |
| persistent Modal Volumes cache the base-model download and the outputs across runs, so iterating on |
| `training/data/dataset.jsonl` only re-pays for the training itself. |
|
|
| > Cost (A100-80GB ≈ $2.5/hr, per-second billing): a few-hundred-to-2000-example QLoRA run is
| > ~1–3 hr ≈ $5–15, so ~$250 of credit ≈ 15–40 full iterations. Expand the dataset before the
| > first real 31B run; the seeds in `make_dataset.py` are a smoke test, not a training set.
| |
| ### Publish your fine-tune & point the Space at it |
| |
| The training run is the one step that spends **your** GPU/Modal credits; it's not done for you. |
| Once you've run it, the path is turnkey: |
| |
| 1. **Recommended:** `python training/gated_retrain.py` runs train → staging upload → 60-example eval → |
| **promote only if it beats the gate**. A regressed model cannot reach production. (Raw |
| `modal run training/modal_train.py` is the ungated equivalent for experiments.) |
| 2. Point the Space at *your* model via **Space variables** (`scripts/start_space.sh` reads them at |
| launch; set in *Settings → Variables* or with `HfApi().add_space_variable`): |
| ``` |
| MODEL_HF_REPO = <you>/gemma-cal-gguf |
| MODEL_FILE = gemma-cal-e4b-Q4_K_M.gguf # explicit file; repo may hold several quants/tiers |
| MMPROJ_REPO = unsloth/gemma-4-E4B-it-GGUF # projector repo, if different from the LLM's |
| MMPROJ_FILE = mmproj-F16.gguf # enables screenshot/vision input |
| ``` |
| The deploy workflow stays a plain git mirror: the model is pulled at runtime, never committed. |
| 3. Push to `main` → CI deploys → the Space now serves your fine-tune (**Well-Tuned**). |
|
|
| ## Share a trace (Sharing is Caring) |
|
|
| Want others to learn from a run? In the **Activity** tab, click **⬇ Download trace (JSON)**. The
| trace stays on your device, and the hosted Space holds **no Hub token**. Personal data is redacted by |
| default (the activity log only carries counts + status; the one chat-name field is stripped). Then |
| publish it from your own machine, with your own login: |
|
|
| ```bash |
| huggingface-cli login # or export HF_TOKEN=... |
| python training/share_trace.py trace.json --public # -> a HF dataset repo of traces |
| ``` |
|
|
| ## Field notes |
|
|
| [**FIELD_NOTES.md**](./FIELD_NOTES.md) is the build retrospective: the iOS→`chat.db` pivot, the
| `attributedBody` trap, why conflict math is deterministic, stub-first architecture, the |
| reframe-around-one-person lesson, and the Off-the-Grid trade-offs. |
|
|
| ## Remote automation (runs without an interactive session) |
|
|
| | Workflow | Trigger | What it does | Needs | |
| |---|---|---|---| |
| | `.github/workflows/ci.yml` → **test** | push / PR | compile + `pytest` (stub mode, no GPU) | nothing | |
| | `.github/workflows/ci.yml` → **deploy** | push to `main`, after tests pass | `huggingface-cli upload` the repo to the HF Space (Gradio SDK; model excluded, pulled at runtime) | secret `HF_TOKEN`, var `SPACE_ID` | |
| | `.github/workflows/maintenance.yml` | daily + manual | ping the Space `/health`, audit outdated deps → open/update a GitHub issue | var `SPACE_HEALTH_URL` | |
|
|
| One-time setup for deploy + monitoring: |
|
|
| ```bash |
| gh secret set HF_TOKEN # HF write token |
| gh variable set SPACE_ID -b "<owner>/<space>" |
| gh variable set SPACE_HEALTH_URL -b "https://<owner>-<space>.hf.space/health" |
| ``` |
|
|
| CI installs `requirements-ci.txt` (excludes `llama-cpp-python` and the Google libs; both are
| imported lazily and not needed for the stub-mode tests). A weekly Claude `/schedule` routine handles
| the judgment work (grow `training/data/dataset.jsonl` → PR, triage CI failures).
|
|