# Implementation Plan: "Cook With Me"

> Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.
>
> **Hackathon:** Small models / Big adventures, June 2026
> **Read first:** `plan.md` (the *what* and *why*) and `estrategia.md` (the *how* at a strategic level). This document is the *how* at a tactical level: turn this into code.

---

## 0. Locked decisions (do not re-discuss)

| Decision | Value | Reason |
|---|---|---|
| UI framework | **Gradio** | Hackathon requirement |
| Hosting | **Hugging Face Space** | Hackathon requirement |
| Inference runtime (text + vision) | **llama.cpp** via `llama-cpp-python` | Runs inside the Space CPU, no external APIs needed for now. Future: migrate to Modal |
| Image generation | **FLUX.2 Klein 9B** (`black-forest-labs/FLUX.2-klein-9B`) | Sponsor model; runs in the Space if a GPU Space is rented (or via `enable_model_cpu_offload()` as fallback). Plan to migrate this specific component to Modal post-hackathon |
| Recipe planner / reasoning | **`openbmb/MiniCPM-V-4`** (GGUF) | Provided requirement |
| Vision (ingredient ID + progress validator) | **`openbmb/MiniCPM-V-4.6`** (GGUF) | Provided requirement |
| Text-to-speech | **OpenBMB VoxCPM2** | Provided requirement |
| Recipe dataset | **`thedevastator/better-recipes-for-a-better-life`** (Kaggle), international cuisine | Provided requirement; not limited to Mexican food |
| App language | **English only** | Provided requirement |
| Final output | **Recipe + step images + voice + nutritional values** | Provided requirement |
| External API calls at runtime | **None** | "llama.cpp inside the Space" mandate |

---

## 1. Architecture (final, English-only, llama.cpp-first)

```
                       ┌──────────────────────────────────────┐
                       │     Hugging Face Space (Gradio)      │
                       │     (CPU + optional GPU upgrade)     │
                       ├──────────────────────────────────────┤
📸 Fridge photo ──────▶│  [Vision Agent]                      │
                       │  MiniCPM-V-4.6 GGUF (llama.cpp)      │
                       │  → list[ingredient]                  │
                       │           │                          │
                       │           ▼                          │
🥘 User picks dish ───▶│  [Recipe Planner]                    │
                       │  MiniCPM-V-4 GGUF (llama.cpp)        │
                       │  + retrieval over Kaggle dataset     │
                       │  → Recipe JSON (steps, nutrition)    │
                       │           │                          │
                       │           ▼                          │
                       │  [Step Illustrator]                  │
                       │  FLUX.2 Klein 9B (diffusers)         │
                       │  → PNG per step + final dish         │
                       │           │                          │
                       │           ▼                          │
                       │  [Narrator]                          │
                       │  VoxCPM2 → MP3 per step              │
                       │           │                          │
                       │           ▼                          │
📸 Progress photo ────▶│  [Progress Validator]                │
                       │  MiniCPM-V-4.6 (vision compare)      │
                       │  → "go / wait / fix" + tip           │
                       └──────────────────────────────────────┘
```

**Total parameter count (≤ 32B requirement):**

- MiniCPM-V-4 (reasoning) ≈ 4B
- MiniCPM-V-4.6 (vision) ≈ 4.6B
- FLUX.2 Klein ≈ 9B
- VoxCPM2 ≈ 1B (estimate)
- **Total ≈ 18.6B ✓**

---

## 2. Repository layout

```
cook-with-me/
├── app.py                      # Gradio entrypoint (Space looks for this)
├── requirements.txt
├── packages.txt                # apt packages (ffmpeg, libsndfile1)
├── README.md                   # Space card (HF requires YAML frontmatter)
├── .gitignore
├── src/
│   ├── __init__.py
│   ├── config.py               # paths, model IDs, constants
│   ├── models/
│   │   ├── __init__.py
│   │   ├── vision.py           # MiniCPM-V-4.6 wrapper (llama-cpp)
│   │   ├── planner.py          # MiniCPM-V-4 wrapper (llama-cpp)
│   │   ├── illustrator.py      # FLUX.2 Klein wrapper (diffusers)
│   │   ├── narrator.py         # VoxCPM2 wrapper
│   │   └── loader.py           # lazy singletons + GGUF download
│   ├── agents/
│   │   ├── mise_en_place.py    # ingredient identification
│   │   ├── recipe_planner.py   # builds Recipe object
│   │   ├── step_illustrator.py # per-step image gen
│   │   ├── narrator.py         # per-step TTS
│   │   └── progress_validator.py
│   ├── data/
│   │   ├── recipe_index.py     # loads Kaggle dataset, builds retrieval
│   │   └── nutrition.py        # USDA-style nutrition computation
│   ├── pipeline.py             # Recipe state machine, orchestration
│   ├── prompts/
│   │   ├── vision_prompt.txt
│   │   ├── planner_system.txt
│   │   └── validator_prompt.txt
│   └── ui/
│       ├── theme.py            # custom CSS (Off-Brand badge)
│       └── components.py       # reusable Gradio Blocks pieces
├── scripts/
│   ├── download_models.py      # pre-warms GGUF + Flux weights at build time
│   ├── build_recipe_index.py   # caches Kaggle dataset locally
│   └── smoke_test.py           # end-to-end validation before push
└── assets/
    ├── sample_fridge_1.jpg
    └── sample_progress_1.jpg
```

---

## 3. Phase-by-phase plan (10 days)

> Each phase has: **goal**, **tasks**, **deliverable**, **verification check**. Do not move to the next phase if verification fails.

---

### Phase 0, Day 0 (½ day): Account + tooling setup

**Goal:** every credential and CLI is ready before writing code.

**Tasks**

1. Create or confirm Hugging Face account; generate a **write token** (Settings → Access Tokens). Store as `HF_TOKEN` env var locally.
2. Install Hugging Face CLI: `pip install -U huggingface_hub` then `huggingface-cli login`.
3. Install Kaggle CLI: `pip install kaggle`. Place `kaggle.json` (Account → API → Create New Token) in `~/.kaggle/kaggle.json` with `chmod 600`.
4. Install OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
5. Install local Python 3.11 venv: `python -m venv .venv && source .venv/bin/activate`.
6. Create the repo locally: `git init cook-with-me && cd cook-with-me`.
7. Create an empty Hugging Face Space: huggingface.co → New Space → SDK = **Gradio**, Hardware = **CPU basic** (upgrade later if you need GPU for FLUX). Clone it and copy your repo skeleton into it.
8. Verify model availability: open in a browser and confirm pages exist:
   - `huggingface.co/openbmb/MiniCPM-V-4`
   - `huggingface.co/openbmb/MiniCPM-V-4-6`
   - `huggingface.co/openbmb/VoxCPM2` (or whatever the exact repo name is; search "VoxCPM" on HF)
   - `huggingface.co/black-forest-labs/FLUX.2-klein-9B`

**Deliverable:** empty Space deployed showing "Hello World" Gradio.

**Verify:** `https://huggingface.co/spaces/<your-username>/cook-with-me` loads.

---

### Phase 1, Day 1: Project skeleton + recipe dataset ingestion

**Goal:** the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.

**Tasks**

1. Write `requirements.txt` (initial version; packages will be added as phases progress):
   ```text
   gradio>=4.44
   huggingface_hub>=0.24
   llama-cpp-python>=0.3.2
   numpy
   pandas
   pyarrow  # parquet engine used by pandas to read data/recipes.parquet
   Pillow
   pydantic>=2
   sentence-transformers
   ```
2. Write `packages.txt`:
   ```text
   ffmpeg
   libsndfile1
   ```
3. Write `scripts/build_recipe_index.py` (install `kagglehub` locally for this script):
   - Use `kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path)`. Discover `file_path` by listing the dataset files first via `kagglehub.dataset_download`.
   - Normalize columns: `name`, `ingredients` (list[str]), `instructions` (list[str]), `cuisine` (str if present, else "international"), `prep_time`, `servings`.
   - Drop rows missing critical fields. Lowercase + strip ingredient strings.
   - Save to `data/recipes.parquet` (~5–50MB depending on dataset size).
   - Build sentence embeddings of the recipe **name + first 3 ingredients** using `sentence-transformers/all-MiniLM-L6-v2` and save to `data/recipes_emb.npy`.
   - This script runs **once locally**; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in `app.py`). If files exceed 100MB, push to a HF Dataset repo: `<your-username>/cook-with-me-recipes`.
4. Write `src/data/recipe_index.py`:
   - `class RecipeIndex` with `.search(ingredients: list[str], top_k=5) -> list[RecipeRow]`.
   - Build a query string from ingredients, embed it, cosine-similarity against the cached embeddings, return top-k.

**Deliverable:** `python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))"` prints 5 sensible recipes.

**Verify:** at least 3 of the top-5 results contain ≥2 of the input ingredients.
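
The retrieval step and the verification rule for Phase 1 can be sketched as follows. This is a minimal, dependency-free illustration, not the final `recipe_index.py`: the constructor arguments, the `RecipeRow` fields, and the `overlap_ok` helper are assumptions made for the example. Embeddings are plain lists so the sketch runs anywhere; in the app, the `embed` callable would be the `all-MiniLM-L6-v2` encoder and the vectors would come from `data/recipes_emb.npy`.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass
class RecipeRow:
    # Hypothetical row shape; the real one mirrors the parquet columns.
    name: str
    ingredients: list[str]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class RecipeIndex:
    """Top-k retrieval over precomputed recipe embeddings."""

    def __init__(self, rows: Sequence[RecipeRow], embeddings: Sequence[Vector],
                 embed: Callable[[str], Vector]):
        self.rows = list(rows)
        self.embeddings = list(embeddings)
        self.embed = embed  # assumption: injected so the index is testable offline

    def search(self, ingredients: list[str], top_k: int = 5) -> list[RecipeRow]:
        # Build one query string, embed it, rank every recipe by similarity.
        query = self.embed(", ".join(ingredients))
        ranked = sorted(
            zip(self.rows, self.embeddings),
            key=lambda pair: cosine(query, pair[1]),
            reverse=True,
        )
        return [row for row, _ in ranked[:top_k]]


def overlap_ok(results: list[RecipeRow], wanted: list[str],
               min_hits: int = 2, min_rows: int = 3) -> bool:
    """Phase 1 check: at least `min_rows` results share >= `min_hits` ingredients."""
    wanted_set = {w.lower().strip() for w in wanted}
    good = sum(1 for r in results if len(wanted_set & set(r.ingredients)) >= min_hits)
    return good >= min_rows
```

Injecting `embed` keeps the ranking logic separate from the model download, so the same class can be exercised in `scripts/smoke_test.py` with a toy embedder before the real encoder is wired in.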

---

### Phase 2, Day 2: Vision agent (Mise en Place), MiniCPM-V-4.6 via llama.cpp

**Goal:** given a fridge photo, return a clean list of English ingredient names.

**Background:** llama.cpp supports multimodal models through a vision projector (`mmproj-*.gguf`) plus the language model GGUF. MiniCPM-V family ships both files on the Hub.

**Tasks**

1. Find the GGUF release of MiniCPM-V-4.6. Search HF for `MiniCPM-V-4_6-gguf` or `openbmb/MiniCPM-V-4_6-gguf`. You need **two** files:
   - `Model-Q4_K_M.gguf` (or similar quant)
   - `mmproj-model-f16.gguf` (the vision projector)
2. Write `src/models/loader.py`:
   ```python
   from huggingface_hub import hf_hub_download
   from llama_cpp import Llama
   from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler

   _vision = None

   def get_vision_model():
       global _vision
       if _vision is None:
           model_path = hf_hub_download(
               repo_id="openbmb/MiniCPM-V-4_6-gguf",  # confirm exact repo
               filename="Model-Q4_K_M.gguf",
           )
           mmproj_path = hf_hub_download(
               repo_id="openbmb/MiniCPM-V-4_6-gguf",
               filename="mmproj-model-f16.gguf",
           )
           handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
           _vision = Llama(
               model_path=model_path,
               chat_handler=handler,
               n_ctx=4096,
               n_threads=4,
               verbose=False,
           )
       return _vision
   ```
3. Write `src/agents/mise_en_place.py`:
   ```python
   import base64, io, json
   from PIL import Image
   from src.models.loader import get_vision_model

   PROMPT = (
       "You are an ingredient detector. Look at the fridge/pantry photo and "
       "list every edible ingredient you can identify. Return strict JSON: "
       '{"ingredients": ["chicken", "onion", "tomato", ...]} '
       "Lowercase, English, no brand names, no containers."
   )

   def _img_to_data_url(img: Image.Image) -> str:
       buf = io.BytesIO()
       # JPEG cannot store alpha or palette modes, so normalize to RGB first.
       img.convert("RGB").save(buf, "JPEG", quality=85)
       b64 = base64.b64encode(buf.getvalue()).decode()
       return f"data:image/jpeg;base64,{b64}"

   def identify_ingredients(image: Image.Image) -> list[str]:
       llm = get_vision_model()
       out = llm.create_chat_completion(messages=[
           {"role": "user", "content": [
               {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
               {"type": "text", "text": PROMPT},
           ]}
       ], temperature=0.2, response_format={"type": "json_object"})
       data = json.loads(out["choices"][0]["message"]["content"])
       return [s.lower().strip() for s in data["ingredients"]]
   ```
4. Test locally with 5 sample fridge photos.

**Deliverable:** the function returns a non-empty English list with ≥80% precision on a clean fridge photo.

**Verify:** stash these 5 results in `tests/vision_smoke.json` for regression checks.

---

### Phase 3, Day 3: Recipe Planner, MiniCPM-V-4 via llama.cpp + retrieval

**Goal:** given a list of ingredients (and optionally a chosen dish), return a fully structured `Recipe` JSON including steps, durations, visual descriptions, and nutritional values.

**Tasks**

1. Find or convert MiniCPM-V-4 to GGUF. Likely repo: `openbmb/MiniCPM-V-4-gguf` or community quants. Pick `Q4_K_M`.
2. Add to `src/models/loader.py` a `get_planner_model()` (same pattern as vision but without `chat_handler`).
3. Write `src/agents/recipe_planner.py`:
   - **Step A, propose:** call planner with `I have: [ingredients]. Propose 3 dish options that fit. Reply JSON.`
   - **Step B, retrieve:** for the chosen dish name, call `RecipeIndex.search(...)` and pick the closest match. Use it as a *grounded reference*.
   - **Step C, restructure:** prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical `Recipe` JSON schema below. The retrieval grounds the model and reduces the risk of hallucinated steps.
   - **Step D, nutrition:** from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
4. Define the canonical schema in `src/pipeline.py` using Pydantic:
   ```python
   from pydantic import BaseModel
   from typing import Optional

   class Step(BaseModel):
       n: int
       instruction: str   # English, imperative
       duration: str      # "4 minutes"
       visual: str        # English visual description for FLUX prompt
       tip: Optional[str] = None

   class Nutrition(BaseModel):
       calories: int      # per serving
       protein_g: float
       carbs_g: float
       fat_g: float
       fiber_g: float

   class Recipe(BaseModel):
       name: str
       cuisine: str
       servings: int
       total_time_minutes: int
       options: list[dict] = []   # only populated on "propose" call, so it needs a default
       ingredients_have: list[str]
       ingredients_missing: list[str]
       substitutes: dict[str, list[str]]
       steps: list[Step]
       final_dish_visual: str
       nutrition_per_serving: Nutrition
   ```
5. Write the system prompt (`src/prompts/planner_system.txt`):
   - Persona: international chef
   - Hard rule: output JSON only, matching schema
   - Hard rule: prefer dishes feasible with available ingredients
   - Hard rule: 5–7 steps, each ≤ 25 words, each with a concrete `visual` field for image generation
   - Hard rule: include `nutrition_per_serving` (model is allowed to estimate; you'll override with `data/nutrition.py` for accuracy)
6. Use `response_format={"type": "json_object"}` in the chat completion call. Set `temperature=0.7, top_p=0.95` for the propose step (creative) and `temperature=0.4` for the structured-output step (more deterministic). The plan also calls for `enable_thinking=True` on the propose step; confirm how your runtime exposes it first, since it is a chat-template flag rather than a sampling argument and `create_chat_completion` may reject it.

**Deliverable:** for `["chicken","onion","tomato","tortilla","cheese"]` and chosen dish "chicken tinga", the function returns a valid `Recipe` Pydantic object with 5–7 steps.

**Verify:** the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.

---

### Phase 3.5, Day 3 (afternoon): Nutritional values

**Goal:** the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).

**Approach:** small, embedded reference table beats LLM math.

**Tasks**

1. Bundle `data/nutrition_table.csv`, a 200-row CSV mapping common English ingredient names to per-100g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
2. Write `src/data/nutrition.py`:
   - `parse_quantity(line: str) -> tuple[float, str]` returns `(grams, ingredient_name)` and handles "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex + a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The volume entries are water-equivalent approximations; a cup of flour, for example, weighs closer to 120 g, so expect some error on dry goods.
   - `compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition` sums per-100g values weighted by grams, then divides by servings.
   - If a line cannot be parsed, skip it and log; don't crash.
3. After the planner returns a recipe, **overwrite** `recipe.nutrition_per_serving` with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.

**Deliverable:** for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.

---

### Phase 4, Day 4: Step Illustrator, FLUX.2 Klein 9B

**Goal:** generate an appetizing image for the final dish + one image per step.

**Constraint:** FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:

- **Path A (recommended for the hackathon):** upgrade the Space to a GPU instance (T4 or A10G; paid, but $20 HF credits cover it for a week of development). Code stays unchanged.
- **Path B (fallback):** run FLUX in `enable_model_cpu_offload()` mode with `num_inference_steps=4` and accept ~3 min/image. This is only feasible for pre-rendered demo recipes, not live runs. Model CPU offload parks idle sub-models in CPU RAM but still executes each one on an accelerator, so this path assumes a GPU with limited VRAM rather than a CPU-only Space.

**Tasks**

1. Add to `requirements.txt`:
   ```text
   diffusers>=0.31
   transformers>=4.45
   accelerate
   torch
   safetensors
   ```
   `diffusers` 0.31 predates FLUX.2, so in practice install a release that ships `Flux2KleinPipeline` and raise the floor accordingly.
2. Write `src/models/illustrator.py`:
   ```python
   import torch
   from diffusers import Flux2KleinPipeline

   _pipe = None

   def get_flux():
       global _pipe
       if _pipe is None:
           dtype = torch.bfloat16
           _pipe = Flux2KleinPipeline.from_pretrained(
               "black-forest-labs/FLUX.2-klein-9B",
               torch_dtype=dtype,
           )
           if torch.cuda.is_available():
               # Offload hooks need an accelerator; on a CPU-only host the pipeline stays on the CPU.
               _pipe.enable_model_cpu_offload()
       return _pipe

   def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
       pipe = get_flux()
       device = "cuda" if torch.cuda.is_available() else "cpu"
       img = pipe(
           prompt=prompt,
           height=1024,
           width=1024,
           guidance_scale=1.0,
           num_inference_steps=4,
           generator=torch.Generator(device=device).manual_seed(seed),
       ).images[0]
       return img
   ```
3. Write `src/agents/step_illustrator.py`:
   - For each `Step.visual`, build a prompt like:
     > `f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."`
   - Generate the **final dish image first**, then the per-step images, all in **one Python loop** (no parallelism, since FLUX holds the GPU).
   - Cache results on disk keyed by the SHA-256 of the prompt to avoid re-renders on re-runs. Do not use Python's built-in `hash()`: string hashes are salted per process, so the keys would change between runs.
   - Emit Gradio progress updates so the UI doesn't appear frozen.
4. **Critical tuning:** keep `num_inference_steps=4` (Klein is distilled). Higher counts blow latency and offer minimal quality gain at this scale.

**Deliverable:** for a 5-step recipe, all 6 images (final + 5 steps) render in:

- < 1 minute on T4 GPU Space
- < 8 minutes on CPU offload (acceptable only for pre-cached demos)

**Verify:** show the 6 images to an unprompted human; ≥4 should be described as "appetizing".

---

### Phase 5, Day 5: Narrator, VoxCPM2

**Goal:** every step's instruction is rendered to an MP3 in a warm, clear English voice.

**Tasks**

1. Confirm the exact VoxCPM2 repo name on HF (`openbmb/VoxCPM2` or similar). Read its README for the inference snippet, because TTS APIs vary widely between models.
2. Add to `requirements.txt`: `soundfile`, `torchaudio`, `numpy`.
   If VoxCPM2 ships GGUF, use it via `llama-cpp-python` audio extension (if available); otherwise load via `transformers` directly.
3. Write `src/models/narrator.py`:
   ```python
   _tts = None

   def get_tts():
       global _tts
       if _tts is None:
           # Placeholder: replace with the exact VoxCPM2 loading code from its README.
           from transformers import AutoModel, AutoProcessor
           _tts = ...  # load on CPU; VoxCPM2 is small (~1B, estimate)
       return _tts

   def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
       """Returns MP3 bytes."""
       tts = get_tts()
       wav = tts.generate(text, voice=voice)  # API depends on VoxCPM2
       # Encode wav -> mp3 with soundfile + ffmpeg-python or pydub.
       mp3_bytes = _encode_mp3(wav)  # hypothetical helper, still to be written
       return mp3_bytes
   ```
4. Write `src/agents/narrator.py`:
   - For each step, synthesize `step.instruction`. If `step.tip` is set, synthesize a separate "tip" clip.
   - Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
5. Pre-render all step audio when the recipe is finalized. Never stream per-step in the demo (too much UI lag).

**Deliverable:** clicking "Play" on step 1 in the UI plays clear English narration.

**Verify:** on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.

---

### Phase 6, Day 6: Gradio UI (Off-Brand)

**Goal:** the Space looks like a recipe magazine, not stock Gradio.

**Tasks**

1. Write `src/ui/theme.py`:
   ```python
   import gradio as gr

   theme = gr.themes.Soft(
       primary_hue="orange",
       neutral_hue="stone",
       font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
       font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
   )

   CSS = """
   .gradio-container { background: #f5ecd9 !important; }
   .recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
   /* 'Lora' is not among the theme fonts above; load it separately or this falls back to serif. */
   .recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
   .step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
   .nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
   .nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
   """
   ```
2. Write `app.py` with three tabs:
   - **Tab 1: Cook**: fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
   - **Tab 2: Check Progress**: upload a progress photo + select active step → validator returns badge (`go/wait/fix`) + tip + audio.
   - **Tab 3: About / Tech**: README-style explanation, badges, model list.
3. Use `gr.Blocks` with `gr.State` to hold the current `Recipe` across UI events. Store it as a plain `dict`: return `recipe.model_dump()` as the `state` output of each callback and rebuild the object with `Recipe.model_validate(...)` when a callback receives it. Do not assign `state.value` inside a handler; that attribute only sets the initial value.
4. Wire callbacks:
   - `btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])`
   - `dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])`
   - `progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])`

**Deliverable:** end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition.
No Gradio default look anywhere.

---

### Phase 7, Day 7: Progress Validator (closed loop)

**Goal:** user uploads a progress photo, app says "go / wait / fix" with a voiced tip.

**Tasks**

1. Write `src/agents/progress_validator.py`:
   ```python
   import json
   from src.agents.mise_en_place import _img_to_data_url
   from src.models.loader import get_vision_model

   # Literal JSON braces are doubled because the prompt is filled with str.format().
   PROMPT = """Compare these two cooking photos.
   Photo 1 (target): how it should look after the step "{instruction}".
   Photo 2 (user's pan/plate): the user's current progress.
   Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
   - "go": looks right, move to next step
   - "wait": needs more time, do not change anything yet
   - "fix": something is off; suggest a concrete adjustment in one sentence
   """

   def validate(target_img, user_img, step_instruction):
       # Same call pattern as Phase 2, with both photos in one user message.
       llm = get_vision_model()
       out = llm.create_chat_completion(messages=[{"role": "user", "content": [
           {"type": "image_url", "image_url": {"url": _img_to_data_url(target_img)}},
           {"type": "image_url", "image_url": {"url": _img_to_data_url(user_img)}},
           {"type": "text", "text": PROMPT.format(instruction=step_instruction)},
       ]}], temperature=0.2, response_format={"type": "json_object"})
       return json.loads(out["choices"][0]["message"]["content"])
   ```
2. Use the same vision model singleton as Phase 2, so both calls share weights.
3. Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.

**Deliverable:** running the validator on 5 real progress photos returns the correct verdict on ≥3.

---

### Phase 8, Day 8: Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)

> **Important caveat:** The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still **requires GPU**, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does **not** violate the runtime constraint, because only the build pipeline touches a GPU.

**Goal:** publish a fine-tuned Planner GGUF to the Hub and load it from the Space.

**Tasks**

1. **Build SFT dataset** (`scripts/build_sft_dataset.py`):
   - Load Kaggle `better-recipes` dataset.
   - For each recipe, build a `(prompt, completion)` pair where `prompt` is `"Available ingredients: X, Y, Z. Propose recipe."` and `completion` is the full canonical `Recipe` JSON.
   - Generate ~1000 pairs, push to `<your-username>/cook-with-me-sft` HF Dataset.
2. **LoRA training** (`scripts/train_planner.py`, to be run on a GPU machine, not the Space):
   ```python
   # peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
   # r=16, alpha=32, lr=2e-4, epochs=2, batch=4
   # push_to_hub=True, hub_model_id="<your-username>/cook-with-me-planner-4b"
   ```
3. **Convert to GGUF** (Day 8 evening):
   - Merge the LoRA adapter into the base weights first (for example with PEFT's `merge_and_unload()`), then run `llama.cpp/convert_hf_to_gguf.py` and quantize to `Q4_K_M` with `llama-quantize` (named `quantize` in older llama.cpp builds).
   - Push GGUF to `<your-username>/cook-with-me-planner-4b-gguf`.
4. Update `src/models/loader.py` to point at your GGUF instead of the base model.

**Deliverable:** the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.

---

### Phase 9, Day 9: End-to-end test, performance pass, pre-warm cache

**Goal:** the Space loads in <60s and a full recipe (text + 5 images + 5 audios + nutrition) renders in <2 minutes on the chosen hardware.

**Tasks**

1. Write `scripts/smoke_test.py` that runs the full pipeline on 3 sample fridge photos and asserts:
   - Each ingredient list is non-empty
   - Each recipe has 5–7 steps
   - Each step has a non-empty image and audio path
   - Nutrition has all 5 macros set
2. Implement **on-disk caching** for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to `~/.cache/cook-with-me/flux/`.
3. Pre-render and commit **3 fully-prepared demo recipes** (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
4. Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
5. Add a "Loading models..." progress bar on first request, since the first cold start can take 90s.

**Deliverable:** smoke test passes on the live Space.

---

### Phase 10, Day 10: README, demo video, social post, submit

**Tasks**

1.
   Write `README.md` with the required HF Space frontmatter:
   ```yaml
   ---
   title: Cook With Me
   emoji: 🍲
   colorFrom: orange
   colorTo: yellow
   sdk: gradio
   sdk_version: 4.44.0
   app_file: app.py
   pinned: false
   license: apache-2.0
   ---
   ```
   Followed by:
   - One-paragraph pitch
   - 60-second demo video embed
   - Architecture diagram (export from `arquitectura.html` as PNG)
   - Section: "How closed-loop visual cooking guidance works"
   - Models used (with HF links + total parameter count)
   - Badges declared
   - Build / run instructions
2. Record a 60–90 second demo video: real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
3. Write the Field Notes blog post: one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8, here's why" or "Closed-loop validation needs the same vision model on both sides").
4. Social post on X / LinkedIn with the demo video.
5. Submit on the hackathon platform.

---

## 4. Tools usage matrix (when to reach for what)

| Phase | Primary tools | Why |
|---|---|---|
| 0: setup | HF CLI, Kaggle CLI, OpenAI Codex CLI | one-shot config |
| 1: data | `kagglehub`, `pandas`, `sentence-transformers` | offline dataset prep |
| 2: vision | `llama-cpp-python` + `MiniCPMv26ChatHandler` | runs inside Space, badge: Llama Champion |
| 3: planner | `llama-cpp-python` + retrieval over local parquet | grounded JSON output |
| 3.5: nutrition | local CSV + regex parser | reliable, no LLM math |
| 4: illustrator | `diffusers` + `Flux2KleinPipeline` | sponsor model showcase |
| 5: narrator | VoxCPM2 via `transformers` (or its native API) | local TTS |
| 6: UI | `gradio` + custom CSS theme | Off-Brand badge |
| 7: validator | same vision singleton as phase 2 | closed-loop innovation, Best Agent |
| 8: fine-tune | `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine | Well-Tuned badge |
| 9: test/cache | `pytest`, `hashlib`, on-disk FLUX cache | demo reliability |
| 10: submit | HF Spaces, video tool, social | shipping |

---

## 5. Performance budget on the HF Space

| Operation | Target latency | Hardware needed |
|---|---|---|
| Vision: ingredient ID | < 8 s | CPU 4-thread |
| Planner: propose 3 dishes | < 12 s | CPU 4-thread |
| Planner: build full recipe JSON | < 20 s | CPU 4-thread |
| Nutrition computation | < 0.1 s | CPU |
| FLUX: 1 image (4 steps) | < 12 s on T4 / < 90 s on CPU offload | GPU strongly recommended |
| FLUX: 6 images (final + 5 steps) | < 80 s on T4 | GPU |
| VoxCPM2: 1 step narration | < 5 s | CPU |
| Validator: 1 progress check | < 8 s | CPU |
| **Full recipe end-to-end** | **< 2 min on T4 Space** | n/a |

**Hardware decision:** rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.

---

## 6. Risks and mitigations (delta from `estrategia.md`)

| Risk | Mitigation |
|---|---|
| MiniCPM-V-4 has no public GGUF | Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 3. |
| llama-cpp-python's MiniCPM-V chat handler version mismatch | Require `llama-cpp-python>=0.3.2` (as in `requirements.txt`); test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. |
| FLUX.2 Klein 9B too slow on free CPU Space | Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. |
| VoxCPM2 docs sparse | Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. |
| Kaggle dataset has format quirks (HTML in instructions, missing fields) | The Phase 1 normalization step handles this; budget 2 hours. |
| Nutrition CSV missing exotic ingredients | Skip-and-log strategy already designed; demo-day recipes use common ingredients only. |
| Total params exceed 32B if VoxCPM2 is far larger than the 1B estimate (even at 7B the total would be ≈ 24.6B) | Check size in Phase 0; if too large, drop to a smaller TTS. |

---

## 7. "Day-1 hello world" checklist

Before writing any agent code, get this minimal end-to-end loop working, because it proves your stack:

1. ☐ Empty Gradio Space deployed, shows "Hello"
2. ☐ `huggingface-cli login` works locally
3. ☐ `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
4. ☐ `from llama_cpp import Llama` runs in your venv
5. ☐ Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
6. ☐ Push the round-trip to the Space; confirm it answers in the cloud

**Only after all 6 are checked, start Phase 1.**

---

## 8. Where this plan differs from `estrategia.md` (deltas to communicate)

| Topic | `estrategia.md` (Spanish, Mexican-cuisine focus) | This document (current requirements) |
|---|---|---|
| Language | Spanish-first | **English only** |
| Cuisine | Mexican | **International** (Kaggle dataset) |
| Voice models | OpenBMB voice + Cohere Labs | **VoxCPM2** only (single voice) |
| Vision model | MiniCPM-V 2.6 / 4 | **MiniCPM-V-4.6** |
| Reasoning model | MiniCPM-4 4B | **MiniCPM-V-4** |
| FLUX runtime | Modal endpoint | **Inside Space (llama.cpp principle)**; Modal kept as a future migration target only |
| External APIs at runtime | Allowed (Modal, OpenAI optional) | **None**: full local inference inside Space |
| Nutritional info | Not specified | **Required** at end of recipe |
| Fine-tune dataset | 200 synthetic Mexican recipes | **Kaggle better-recipes (international)** |

If anything in `plan.md` or `estrategia.md` conflicts with this document, **this document wins**, because it reflects the latest user requirements.

---

## 9. Definition of done

The implementation is complete when **all** of these are true:

- [ ] Public HF Space `https://huggingface.co/spaces/<your-username>/cook-with-me` loads
- [ ] App is fully in English
- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
- [ ] Progress validator returns sensible verdicts on 3+ test photos
- [ ] All inference (vision, planner, TTS) runs through llama.cpp / local diffusers, with **no external API calls at runtime**
- [ ] Total parameters declared in README ≤ 32B
- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
- [ ] Demo video (60–90s) recorded with a real person cooking
- [ ] Field Notes blog post published
- [ ] Submitted on the hackathon platform before deadline
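
The parameter-budget item in the checklist is simple arithmetic, so it can be checked mechanically before submission. The sketch below is one way to do that; the dictionary name, the helper names, and the idea of running it as a script are assumptions for illustration, and the sizes are the estimates from the architecture section (the VoxCPM2 figure in particular is unconfirmed).

```python
# Estimates in billions of parameters, copied from the architecture section.
MODEL_PARAMS_B = {
    "MiniCPM-V-4 (planner)": 4.0,
    "MiniCPM-V-4.6 (vision)": 4.6,
    "FLUX.2 Klein": 9.0,
    "VoxCPM2 (estimate)": 1.0,
}
PARAM_BUDGET_B = 32.0  # hackathon limit


def total_params_b(models: dict[str, float]) -> float:
    """Sum the per-model sizes (in billions)."""
    return sum(models.values())


def check_budget(models: dict[str, float], budget_b: float = PARAM_BUDGET_B) -> float:
    """Return the total, or raise if the stack no longer fits the budget."""
    total = total_params_b(models)
    if total > budget_b:
        raise ValueError(f"{total:.1f}B parameters exceeds the {budget_b:.0f}B budget")
    return total


if __name__ == "__main__":
    print(f"Total: {check_budget(MODEL_PARAMS_B):.1f}B of {PARAM_BUDGET_B:.0f}B")
```

Updating one entry (for example once the real VoxCPM2 size is known from Phase 0) and re-running the check gives the figure to declare in the README.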