# Implementation Plan: "Cook With Me"

> Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.
>
> **Hackathon:** Small models / Big adventures, June 2026
> **Read first:** `plan.md` (the *what* and *why*) and `estrategia.md` (the *how* at a strategic level). This document is the *how* at a tactical level: turn this into code.

---
## 0. Locked decisions (do not re-discuss)

| Decision | Value | Reason |
|---|---|---|
| UI framework | **Gradio** | Hackathon requirement |
| Hosting | **Hugging Face Space** | Hackathon requirement |
| Inference runtime (text + vision) | **llama.cpp** via `llama-cpp-python` | Runs on the Space CPU, no external APIs needed for now. Future: migrate to Modal |
| Image generation | **FLUX.2 Klein 9B** (`black-forest-labs/FLUX.2-klein-9B`) | Sponsor model; runs in the Space if a GPU Space is rented (or via `enable_model_cpu_offload()` as fallback). Plan to migrate this specific component to Modal post-hackathon |
| Recipe planner / reasoning | **`openbmb/MiniCPM-V-4`** (GGUF) | Provided requirement |
| Vision (ingredient ID + progress validator) | **`openbmb/MiniCPM-V-4.6`** (GGUF) | Provided requirement |
| Text-to-speech | **OpenBMB VoxCPM2** | Provided requirement |
| Recipe dataset | **`thedevastator/better-recipes-for-a-better-life`** (Kaggle), international cuisine | Provided requirement; not limited to Mexican food |
| App language | **English only** | Provided requirement |
| Final output | **Recipe + step images + voice + nutritional values** | Provided requirement |
| External API calls at runtime | **None** | "llama.cpp inside the Space" mandate |

---
## 1. Architecture (final, English-only, llama.cpp-first)

```
                        ┌──────────────────────────────────────┐
                        │ Hugging Face Space (Gradio)          │
                        │ (CPU + optional GPU upgrade)         │
                        ├──────────────────────────────────────┤
 📸 Fridge photo ─────▶ │ [Vision Agent]                       │
                        │ MiniCPM-V-4.6 GGUF (llama.cpp)       │
                        │ → list[ingredient]                   │
                        │         │                            │
                        │         ▼                            │
 🥘 User picks dish ──▶ │ [Recipe Planner]                     │
                        │ MiniCPM-V-4 GGUF (llama.cpp)         │
                        │ + retrieval over Kaggle dataset      │
                        │ → Recipe JSON (steps, nutrition)     │
                        │         │                            │
                        │         ▼                            │
                        │ [Step Illustrator]                   │
                        │ FLUX.2 Klein 9B (diffusers)          │
                        │ → PNG per step + final dish          │
                        │         │                            │
                        │         ▼                            │
                        │ [Narrator]                           │
                        │ VoxCPM2 → MP3 per step               │
                        │         │                            │
                        │         ▼                            │
 📸 Progress photo ───▶ │ [Progress Validator]                 │
                        │ MiniCPM-V-4.6 (vision compare)       │
                        │ → "go / wait / fix" + tip            │
                        └──────────────────────────────────────┘
```

**Total parameter count (≤ 32B requirement):**

- MiniCPM-V-4 (reasoning) ≈ 4B
- MiniCPM-V-4.6 (vision) ≈ 4.6B
- FLUX.2 Klein ≈ 9B
- VoxCPM2 ≈ 1B (estimate)
- **Total ≈ 18.6B ✓**

---
## 2. Repository layout

```
cook-with-me/
├── app.py                      # Gradio entrypoint (Space looks for this)
├── requirements.txt
├── packages.txt                # apt packages (ffmpeg, libsndfile1)
├── README.md                   # Space card (HF requires YAML frontmatter)
├── .gitignore
├── src/
│   ├── __init__.py
│   ├── config.py               # paths, model IDs, constants
│   ├── models/
│   │   ├── __init__.py
│   │   ├── vision.py           # MiniCPM-V-4.6 wrapper (llama-cpp)
│   │   ├── planner.py          # MiniCPM-V-4 wrapper (llama-cpp)
│   │   ├── illustrator.py      # FLUX.2 Klein wrapper (diffusers)
│   │   ├── narrator.py         # VoxCPM2 wrapper
│   │   └── loader.py           # lazy singletons + GGUF download
│   ├── agents/
│   │   ├── mise_en_place.py    # ingredient identification
│   │   ├── recipe_planner.py   # builds Recipe object
│   │   ├── step_illustrator.py # per-step image gen
│   │   ├── narrator.py         # per-step TTS
│   │   └── progress_validator.py
│   ├── data/
│   │   ├── recipe_index.py     # loads Kaggle dataset, builds retrieval
│   │   └── nutrition.py        # USDA-style nutrition computation
│   ├── pipeline.py             # Recipe state machine, orchestration
│   ├── prompts/
│   │   ├── vision_prompt.txt
│   │   ├── planner_system.txt
│   │   └── validator_prompt.txt
│   └── ui/
│       ├── theme.py            # custom CSS (Off-Brand badge)
│       └── components.py       # reusable Gradio Blocks pieces
├── scripts/
│   ├── download_models.py      # pre-warms GGUF + Flux weights at build time
│   ├── build_recipe_index.py   # caches Kaggle dataset locally
│   └── smoke_test.py           # end-to-end validation before push
└── assets/
    ├── sample_fridge_1.jpg
    └── sample_progress_1.jpg
```

---
## 3. Phase-by-phase plan (10 days)

> Each phase has: **goal**, **tasks**, **deliverable**, **verification check**. Do not move to the next phase if verification fails.

---

### Phase 0 - Day 0 (½ day): Account + tooling setup

**Goal:** every credential and CLI is ready before writing code.

**Tasks**

1. Create or confirm your Hugging Face account; generate a **write token** (Settings → Access Tokens). Store it as the `HF_TOKEN` env var locally.
2. Install the Hugging Face CLI: `pip install -U huggingface_hub`, then `huggingface-cli login`.
3. Install the Kaggle CLI: `pip install kaggle`. Place `kaggle.json` (Account → API → Create New Token) in `~/.kaggle/kaggle.json` with `chmod 600`.
4. Install the OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
5. Create a local Python 3.11 virtual environment: `python -m venv .venv && source .venv/bin/activate`.
6. Create the repo locally: `git init cook-with-me && cd cook-with-me`.
7. Create an empty Hugging Face Space: huggingface.co → New Space → SDK = **Gradio**, Hardware = **CPU basic** (upgrade later if you need a GPU for FLUX). Clone it and copy your repo skeleton into it.
8. Verify model availability: open each page in a browser and confirm it exists:
   - `huggingface.co/openbmb/MiniCPM-V-4`
   - `huggingface.co/openbmb/MiniCPM-V-4-6` (confirm the exact slug; this plan also writes it as `MiniCPM-V-4.6` and `MiniCPM-V-4_6`, and other OpenBMB repos such as `MiniCPM-V-2_6` use an underscore)
   - `huggingface.co/openbmb/VoxCPM2` (or whatever the exact repo name is; search "VoxCPM" on HF)
   - `huggingface.co/black-forest-labs/FLUX.2-klein-9B`

**Deliverable:** empty Space deployed showing a "Hello World" Gradio app.

**Verify:** `https://huggingface.co/spaces/<you>/cook-with-me` loads.

---
### Phase 1 - Day 1: Project skeleton + recipe dataset ingestion

**Goal:** the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.

**Tasks**

1. Write `requirements.txt` (initial version; packages will be added as phases progress):

```text
gradio>=4.44
huggingface_hub>=0.24
llama-cpp-python>=0.3.2
numpy
pandas
Pillow
pydantic>=2
sentence-transformers
```

2. Write `packages.txt`:

```text
ffmpeg
libsndfile1
```

3. Write `scripts/build_recipe_index.py`:
   - Use `kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path)` (newer `kagglehub` releases name this function `dataset_load`). Discover `file_path` by listing the dataset files first via `kagglehub.dataset_download`. `kagglehub` is only needed locally for this script, so it is not in `requirements.txt`.
   - Normalize columns: `name`, `ingredients` (list[str]), `instructions` (list[str]), `cuisine` (str if present, else "international"), `prep_time`, `servings`.
   - Drop rows missing critical fields. Lowercase + strip ingredient strings.
   - Save to `data/recipes.parquet` (~5–50 MB depending on dataset size).
   - Build sentence embeddings of the recipe **name + first 3 ingredients** using `sentence-transformers/all-MiniLM-L6-v2` and save to `data/recipes_emb.npy`.
   - This script runs **once locally**; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in `app.py`). If files exceed 100 MB, push to a HF Dataset repo: `<you>/cook-with-me-recipes`.
4. Write `src/data/recipe_index.py`:
   - `class RecipeIndex` with `.search(ingredients: list[str], top_k=5) -> list[RecipeRow]`.
   - Build a query string from the ingredients, embed it, compute cosine similarity against the cached embeddings, and return the top-k rows.

**Deliverable:** `python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))"` prints 5 sensible recipes.

**Verify:** at least 3 of the top-5 results contain ≥2 of the input ingredients.
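
The ranking and the verification rule above can be sketched without any model. This is a minimal illustration, assuming the embeddings are plain lists of floats; the helper names `cosine`, `top_k`, and `passes_overlap_check` and the toy vectors are hypothetical, and the real `RecipeIndex` would use the cached `all-MiniLM-L6-v2` embeddings with NumPy instead.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, recipe_vecs, k=5):
    """Return the indices of the k recipes most similar to the query, best first."""
    ranked = sorted(
        range(len(recipe_vecs)),
        key=lambda i: cosine(query_vec, recipe_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def passes_overlap_check(results, query, min_hits=3, min_shared=2):
    """Phase 1 verify rule: at least min_hits results share min_shared query ingredients."""
    wanted = {q.lower().strip() for q in query}
    hits = sum(
        1
        for ingredients in results
        if len(wanted & {i.lower().strip() for i in ingredients}) >= min_shared
    )
    return hits >= min_hits


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best = top_k([1.0, 0.05, 0.0], vecs, k=2)
```

Running `passes_overlap_check` on the ingredient lists of the five returned recipes turns the manual verify step into a repeatable check.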

---

### Phase 2 - Day 2: Vision agent (Mise en Place), MiniCPM-V-4.6 via llama.cpp

**Goal:** given a fridge photo, return a clean list of English ingredient names.

**Background:** llama.cpp supports multimodal models through a vision projector (`mmproj-*.gguf`) plus the language model GGUF. The MiniCPM-V family ships both files on the Hub.

**Tasks**

1. Find the GGUF release of MiniCPM-V-4.6. Search HF for `MiniCPM-V-4_6-gguf` or `openbmb/MiniCPM-V-4_6-gguf`. You need **two** files:
   - `Model-Q4_K_M.gguf` (or similar quant)
   - `mmproj-model-f16.gguf` (the vision projector)
2. Write `src/models/loader.py`:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler

_vision = None  # lazy singleton: loaded on first use, then reused


def get_vision_model():
    global _vision
    if _vision is None:
        model_path = hf_hub_download(
            repo_id="openbmb/MiniCPM-V-4_6-gguf",  # confirm exact repo
            filename="Model-Q4_K_M.gguf",
        )
        mmproj_path = hf_hub_download(
            repo_id="openbmb/MiniCPM-V-4_6-gguf",
            filename="mmproj-model-f16.gguf",
        )
        handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
        _vision = Llama(
            model_path=model_path,
            chat_handler=handler,
            n_ctx=4096,
            n_threads=4,
            verbose=False,
        )
    return _vision
```

3. Write `src/agents/mise_en_place.py`:

```python
import base64
import io
import json

from PIL import Image

from src.models.loader import get_vision_model

PROMPT = (
    "You are an ingredient detector. Look at the fridge/pantry photo and "
    "list every edible ingredient you can identify. Return strict JSON: "
    '{"ingredients": ["chicken", "onion", "tomato", ...]} '
    "Lowercase, English, no brand names, no containers."
)


def _img_to_data_url(img: Image.Image) -> str:
    buf = io.BytesIO()
    # JPEG cannot store alpha or palette modes, so convert to RGB first.
    img.convert("RGB").save(buf, "JPEG", quality=85)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/jpeg;base64,{b64}"


def identify_ingredients(image: Image.Image) -> list[str]:
    llm = get_vision_model()
    out = llm.create_chat_completion(messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
            {"type": "text", "text": PROMPT},
        ]}
    ], temperature=0.2, response_format={"type": "json_object"})
    data = json.loads(out["choices"][0]["message"]["content"])
    # Tolerate a missing key instead of raising KeyError.
    return [s.lower().strip() for s in data.get("ingredients", [])]
```

4. Test locally with 5 sample fridge photos.

**Deliverable:** the function returns a non-empty English list with ≥80% precision on a clean fridge photo.

**Verify:** stash these 5 results in `tests/vision_smoke.json` for regression checks.
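
The ≥80% precision target can be measured with a few lines once each sample photo has a hand-labelled ingredient list. A minimal sketch, assuming the labels are plain lists of strings; the function name `precision` and the sample data are hypothetical.

```python
def precision(predicted, expected):
    """Fraction of predicted ingredients that appear in the hand-labelled list."""
    pred = {p.lower().strip() for p in predicted}
    truth = {e.lower().strip() for e in expected}
    if not pred:
        return 0.0  # an empty prediction fails the "non-empty list" deliverable
    return len(pred & truth) / len(pred)


# Hypothetical example: four of the five predictions are correct.
score = precision(
    ["chicken", "onion", "milk", "egg", "ketchup"],
    ["chicken", "onion", "milk", "egg", "butter"],
)
```

Storing the score next to each entry in `tests/vision_smoke.json` would make a later regression easy to spot.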

---

### Phase 3 - Day 3: Recipe Planner, MiniCPM-V-4 via llama.cpp + retrieval

**Goal:** given a list of ingredients (and optionally a chosen dish), return a fully structured `Recipe` JSON including steps, durations, visual descriptions, and nutritional values.

**Tasks**

1. Find or convert MiniCPM-V-4 to GGUF. Likely repo: `openbmb/MiniCPM-V-4-gguf` or community quants. Pick `Q4_K_M`.
2. Add a `get_planner_model()` to `src/models/loader.py` (same pattern as vision but without `chat_handler`).
3. Write `src/agents/recipe_planner.py`:
   - **Step A, propose:** call the planner with `I have: [ingredients]. Propose 3 dish options that fit. Reply JSON.`
   - **Step B, retrieve:** for the chosen dish name, call `RecipeIndex.search(...)` and pick the closest match. Use it as a *grounded reference*.
   - **Step C, restructure:** prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical `Recipe` JSON schema below. The retrieved recipe grounds the model and reduces hallucinated steps.
   - **Step D, nutrition:** from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
4. Define the canonical schema in `src/pipeline.py` using Pydantic:

```python
from typing import Optional

from pydantic import BaseModel, Field


class Step(BaseModel):
    n: int
    instruction: str  # English, imperative
    duration: str  # "4 minutes"
    visual: str  # English visual description for FLUX prompt
    tip: Optional[str] = None


class Nutrition(BaseModel):
    calories: int  # per serving
    protein_g: float
    carbs_g: float
    fat_g: float
    fiber_g: float


class Recipe(BaseModel):
    name: str
    cuisine: str
    servings: int
    total_time_minutes: int
    # Only populated on the "propose" call, so it defaults to an empty list.
    options: list[dict] = Field(default_factory=list)
    ingredients_have: list[str]
    ingredients_missing: list[str]
    substitutes: dict[str, list[str]]
    steps: list[Step]
    final_dish_visual: str
    nutrition_per_serving: Nutrition
```

5. Write the system prompt (`src/prompts/planner_system.txt`):
   - Persona: international chef
   - Hard rule: output JSON only, matching schema
   - Hard rule: prefer dishes feasible with available ingredients
   - Hard rule: 5–7 steps, each ≤ 25 words, each with a concrete `visual` field for image generation
   - Hard rule: include `nutrition_per_serving` (the model is allowed to estimate; you'll override it with `data/nutrition.py` for accuracy)
6. Use `response_format={"type": "json_object"}` in the chat completion call. Set `temperature=0.7, top_p=0.95` with thinking enabled for the propose step (creative); use `temperature=0.4` for the structured-output step (more deterministic). Note that `enable_thinking=True` is typically a chat-template option rather than a documented `create_chat_completion` argument, so confirm how the chosen GGUF exposes it before relying on it.

**Deliverable:** for `["chicken","onion","tomato","tortilla","cheese"]` and chosen dish "chicken tinga", the function returns a valid `Recipe` Pydantic object with 5–7 steps.

**Verify:** the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.
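
Pydantic validates field types, but the prompt's hard rules (5–7 steps, at most 25 words per step, a non-empty `visual`) need their own check. A minimal sketch using only the standard library; the function name `check_recipe` and the messages are hypothetical, and a planner wrapper could retry the model call whenever the returned list is non-empty.

```python
import json


def check_recipe(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    steps = data.get("steps", [])
    if not 5 <= len(steps) <= 7:
        problems.append(f"expected 5-7 steps, got {len(steps)}")
    for step in steps:
        n = step.get("n", "?")
        if len(str(step.get("instruction", "")).split()) > 25:
            problems.append(f"step {n}: instruction longer than 25 words")
        if not str(step.get("visual", "")).strip():
            problems.append(f"step {n}: missing visual description")
    if "nutrition_per_serving" not in data:
        problems.append("missing nutrition_per_serving")
    return problems
```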

---

### Phase 3.5 - Day 3 (afternoon): Nutritional values

**Goal:** the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).

**Approach:** a small, embedded reference table beats LLM math.

**Tasks**

1. Bundle `data/nutrition_table.csv`, a 200-row CSV mapping common English ingredient names to per-100 g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
2. Write `src/data/nutrition.py`:
   - `parse_quantity(line: str) -> (grams, ingredient_name)`: handle "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex + a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The volume entries treat 1 ml as 1 g, which is only an approximation for ingredients lighter or denser than water.
   - `compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition`: sum per-100 g values weighted by grams, then divide by servings.
   - If a line cannot be parsed, skip it and log; don't crash.
3. After the planner returns a recipe, **overwrite** `recipe.nutrition_per_serving` with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.

**Deliverable:** for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.
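
The parser and the weighted sum described in this phase could look roughly like the following. This is a sketch under stated assumptions: the two-row `TABLE` holds made-up round numbers rather than USDA values, the regex only covers the "number unit name" pattern, and the function returns a plain dict instead of the `Nutrition` model so it runs without Pydantic.

```python
import re

# Unit-to-grams table from the plan; volume units assume water density.
UNIT_GRAMS = {"g": 1.0, "kg": 1000.0, "cup": 240.0, "tbsp": 15.0, "tsp": 5.0, "oz": 28.35}

# Per-100 g macros: (kcal, protein, carbs, fat, fiber). Illustrative numbers only.
TABLE = {
    "flour": (360.0, 10.0, 76.0, 1.0, 3.0),
    "olive oil": (880.0, 0.0, 0.0, 100.0, 0.0),
}

LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s+(.+?)\s*$")


def parse_quantity(line):
    """Return (grams, ingredient_name), or None when the line cannot be parsed."""
    match = LINE_RE.match(line)
    if not match:
        return None
    amount, unit, name = float(match.group(1)), match.group(2).lower(), match.group(3)
    if unit not in UNIT_GRAMS and unit.endswith("s"):
        unit = unit[:-1]  # "cups" -> "cup"
    if unit not in UNIT_GRAMS:
        return None
    return amount * UNIT_GRAMS[unit], name.lower()


def compute_nutrition(ingredient_lines, servings):
    """Sum per-100 g macros weighted by grams, then divide by servings."""
    totals = [0.0] * 5
    for line in ingredient_lines:
        parsed = parse_quantity(line)
        if parsed is None or parsed[1] not in TABLE:
            continue  # skip unparseable or unknown lines instead of crashing
        grams, name = parsed
        for i, per_100g in enumerate(TABLE[name]):
            totals[i] += per_100g * grams / 100.0
    keys = ["calories", "protein_g", "carbs_g", "fat_g", "fiber_g"]
    per_serving = {k: t / servings for k, t in zip(keys, totals)}
    per_serving["calories"] = int(round(per_serving["calories"]))
    return per_serving
```

Lines such as "3 eggs" or "a pinch of salt" return `None` here, which is where the skip-and-log behaviour from task 2 would hook in.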

---

### Phase 4 - Day 4: Step Illustrator, FLUX.2 Klein 9B

**Goal:** generate an appetizing image for the final dish + one image per step.

**Constraint:** FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:

- **Path A (recommended for the hackathon):** upgrade the Space to a GPU instance (T4 or A10G; paid, but the $20 HF credits cover it for a week of development). Code stays unchanged.
- **Path B (fallback):** run FLUX in `enable_model_cpu_offload()` mode with `num_inference_steps=4` and accept ~3 min/image. This is only feasible for pre-rendered demo recipes, not live runs.

**Tasks**

1. Add to `requirements.txt` (confirm that the installed `diffusers` release exports `Flux2KleinPipeline`, and raise the lower bound if it does not):

```text
diffusers>=0.31
transformers>=4.45
accelerate
torch
safetensors
```

2. Write `src/models/illustrator.py`:

```python
import torch
from diffusers import Flux2KleinPipeline

_pipe = None  # lazy singleton


def get_flux():
    global _pipe
    if _pipe is None:
        dtype = torch.bfloat16
        _pipe = Flux2KleinPipeline.from_pretrained(
            "black-forest-labs/FLUX.2-klein-9B",
            torch_dtype=dtype,
        )
        # Offloading moves submodules to the GPU on demand, so it needs one.
        if torch.cuda.is_available():
            _pipe.enable_model_cpu_offload()
    return _pipe


def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
    pipe = get_flux()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    img = pipe(
        prompt=prompt,
        height=1024, width=1024,
        guidance_scale=1.0,
        num_inference_steps=4,
        generator=torch.Generator(device=device).manual_seed(seed),
    ).images[0]
    return img
```

3. Write `src/agents/step_illustrator.py`:
   - For each `Step.visual`, build a prompt like:
     > `f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."`
   - Generate the **final dish image first**, then the per-step images, all in **one Python loop** (no parallelism, because FLUX holds the GPU).
   - Cache results on disk keyed by the SHA-256 of the prompt to avoid re-renders on re-runs. Python's built-in `hash()` is salted per process for strings, so it cannot serve as a stable cache key.
   - Emit Gradio progress updates so the UI doesn't appear frozen.
4. **Critical tuning:** keep `num_inference_steps=4` (Klein is distilled). Higher counts increase latency and offer minimal quality gain at this scale.

**Deliverable:** for a 5-step recipe, all 6 images (final + 5 steps) render in:

- < 80 seconds on a T4 GPU Space (the section 5 budget)
- < 9 minutes on CPU offload (6 × 90 s; acceptable only for pre-cached demos)

**Verify:** show the 6 images to an unprompted human; ≥4 should be described as "appetizing".
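
The on-disk cache from task 3 can be sketched independently of FLUX. A minimal sketch, assuming the renderer is any callable that returns PNG bytes; `cached_render` and `fake_render` are hypothetical names, and in the app the callable would wrap `render()` and encode the PIL image.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_render(prompt, render_fn, cache_dir):
    """Return the path of the cached image for prompt, rendering only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # SHA-256 is stable across processes, unlike the built-in hash().
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.png"
    if not path.exists():
        path.write_bytes(render_fn(prompt))
    return path


# Stand-in renderer that records how often it is called.
calls = []


def fake_render(prompt):
    calls.append(prompt)
    return b"fake-png-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_render("a plate of tacos", fake_render, tmp)
    second = cached_render("a plate of tacos", fake_render, tmp)
    same_path = first == second
```

If the seed or image size ever varies between calls, it would need to become part of the hashed key as well.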

---

### Phase 5 - Day 5: Narrator, VoxCPM2

**Goal:** every step's instruction is rendered to an MP3 in a warm, clear English voice.

**Tasks**

1. Confirm the exact VoxCPM2 repo name on HF (`openbmb/VoxCPM2` or similar). Read its README for the inference snippet, since TTS APIs vary widely between models.
2. Add to `requirements.txt`: `soundfile`, `torchaudio` (`numpy` is already listed). If VoxCPM2 ships GGUF, use it via a `llama-cpp-python` audio extension (if available); otherwise load it via `transformers` directly.
3. Write `src/models/narrator.py`:

```python
import subprocess

_tts = None  # lazy singleton


def get_tts():
    global _tts
    if _tts is None:
        # Placeholder: replace with the exact VoxCPM2 loading code from its
        # README (for example via transformers AutoModel / AutoProcessor).
        # Load on CPU; VoxCPM2 is small (~1B).
        raise NotImplementedError("VoxCPM2 loading code goes here")
    return _tts


def _wav_to_mp3(wav_bytes: bytes) -> bytes:
    # Encode with the ffmpeg binary installed through packages.txt.
    proc = subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i", "pipe:0", "-f", "mp3", "pipe:1"],
        input=wav_bytes, stdout=subprocess.PIPE, check=True,
    )
    return proc.stdout


def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
    """Returns MP3 bytes."""
    tts = get_tts()
    # Hypothetical call: the real API (and whether it returns WAV bytes)
    # depends on VoxCPM2.
    wav_bytes = tts.generate(text, voice=voice)
    return _wav_to_mp3(wav_bytes)
```

4. Write `src/agents/narrator.py`:
   - For each step, synthesize `step.instruction`. If `step.tip` is set, synthesize a separate "tip" clip.
   - Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
5. Pre-render all step audio when the recipe is finalized. Never synthesize per step on demand in the demo (too much UI lag).

**Deliverable:** clicking "Play" on step 1 in the UI plays clear English narration.

**Verify:** on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.

---
### Phase 6 - Day 6: Gradio UI (Off-Brand)

**Goal:** the Space looks like a recipe magazine, not stock Gradio.

**Tasks**

1. Write `src/ui/theme.py`:

```python
import gradio as gr

theme = gr.themes.Soft(
    primary_hue="orange",
    neutral_hue="stone",
    font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
    font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
)

CSS = """
.gradio-container { background: #f5ecd9 !important; }
.recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
.recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
.step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
.nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
.nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
"""
```

2. Write `app.py` with three tabs:
   - **Tab 1, Cook**: fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
   - **Tab 2, Check Progress**: upload a progress photo + select active step → validator returns badge (`go/wait/fix`) + tip + audio.
   - **Tab 3, About / Tech**: README-style explanation, badges, model list.
3. Use `gr.Blocks` with `gr.State` to hold the current recipe across UI events. Store it as a plain `dict`: return `recipe.model_dump()` from the callback that writes the state output, and rebuild the object with `Recipe.model_validate(state_dict)` in callbacks that read it.
4. Wire callbacks:
   - `btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])`
   - `dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])`
   - `progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])`

**Deliverable:** end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition. No Gradio default look anywhere.

---
### Phase 7 - Day 7: Progress Validator (closed loop)

**Goal:** user uploads a progress photo, app says "go / wait / fix" with a voiced tip.

**Tasks**

1. Write `src/agents/progress_validator.py`:

```python
import json

from src.agents.mise_en_place import _img_to_data_url
from src.models.loader import get_vision_model

# Literal JSON braces are doubled so that str.format() only fills {instruction}.
PROMPT = """Compare these two cooking photos.
Photo 1 (target): how it should look after the step "{instruction}".
Photo 2 (user's pan/plate): the user's current progress.
Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
- "go": looks right, move to next step
- "wait": needs more time, do not change anything yet
- "fix": something is off; suggest a concrete adjustment in one sentence
"""


def validate(target_img, user_img, step_instruction):
    llm = get_vision_model()
    out = llm.create_chat_completion(messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": _img_to_data_url(target_img)}},
        {"type": "image_url", "image_url": {"url": _img_to_data_url(user_img)}},
        {"type": "text", "text": PROMPT.format(instruction=step_instruction)},
    ]}], temperature=0.2, response_format={"type": "json_object"})
    return json.loads(out["choices"][0]["message"]["content"])
```

2. Use the same vision model singleton as Phase 2, so both calls share weights.
3. Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.

**Deliverable:** running the validator on 5 real progress photos returns the correct verdict on ≥3.
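
Small vision models do not always return clean JSON, so the UI layer benefits from normalizing the verdict before choosing a badge color. A minimal sketch; `parse_verdict` is a hypothetical helper, and falling back to "wait" on malformed output is an assumption of this example, not something the plan specifies.

```python
import json

# Badge colors from task 3: green / amber / red.
BADGE_COLORS = {"go": "green", "wait": "amber", "fix": "red"}


def parse_verdict(raw):
    """Normalize the validator's raw reply into verdict, color, feedback and tip."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    verdict = str(data.get("verdict", "")).lower().strip()
    if verdict not in BADGE_COLORS:
        verdict = "wait"  # assumed safe default: do not advance on unclear output
    return {
        "verdict": verdict,
        "color": BADGE_COLORS[verdict],
        "feedback": str(data.get("feedback", "")),
        "tip": str(data.get("tip", "")),
    }
```

The callback wired to `progress_image.upload` could pass the raw model text through this helper before rendering the badge.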

---

### Phase 8 - Day 8: Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)

> **Important caveat:** The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still **requires a GPU**, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does **not** violate the runtime constraint, because only the build pipeline touches a GPU.

**Goal:** publish a fine-tuned Planner GGUF to the Hub and load it from the Space.

**Tasks**

1. **Build SFT dataset** (`scripts/build_sft_dataset.py`):
   - Load the Kaggle `better-recipes` dataset.
   - For each recipe, build a `(prompt, completion)` pair where `prompt` is `"Available ingredients: X, Y, Z. Propose recipe."` and `completion` is the full canonical `Recipe` JSON.
   - Generate ~1000 pairs, push to the `<you>/cook-with-me-sft` HF Dataset.
2. **LoRA training** (`scripts/train_planner.py`, to be run on a GPU machine, not the Space):

```python
# peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
# r=16, alpha=32, lr=2e-4, epochs=2, batch=4
# push_to_hub=True, hub_model_id="<you>/cook-with-me-planner-4b"
```

3. **Convert to GGUF** (Day 8 evening):
   - Merge the LoRA adapter into the base weights, then run `llama.cpp/convert_hf_to_gguf.py` and quantize to `Q4_K_M` with `llama-quantize` (named `quantize` in older llama.cpp builds).
   - Push the GGUF to `<you>/cook-with-me-planner-4b-gguf`.
4. Update `src/models/loader.py` to point at your GGUF instead of the base model.

**Deliverable:** the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.
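
Building one `(prompt, completion)` pair from a normalized dataset row could look like this. It is a sketch under assumptions: the row is a dict with the Phase 1 column names, and only the fields the dataset actually provides are filled. Fields such as `visual`, `duration`, and `nutrition_per_serving` would still have to be generated or computed before the completion matches the full `Recipe` schema.

```python
import json


def build_pair(row):
    """Turn one normalized recipe row into a (prompt, completion) training pair."""
    prompt = "Available ingredients: " + ", ".join(row["ingredients"]) + ". Propose recipe."
    completion = json.dumps({
        "name": row["name"],
        "cuisine": row.get("cuisine", "international"),
        "servings": row.get("servings", 1),
        "ingredients_have": row["ingredients"],
        "steps": [
            {"n": i, "instruction": text}
            for i, text in enumerate(row["instructions"], start=1)
        ],
    })
    return prompt, completion


# Hypothetical row, shaped like the output of the Phase 1 normalization.
example_row = {
    "name": "tomato omelette",
    "ingredients": ["egg", "tomato", "onion"],
    "instructions": ["Beat the eggs.", "Fry the onion and tomato.", "Add the eggs and cook."],
    "servings": 2,
}
example_prompt, example_completion = build_pair(example_row)
```

Writing the pairs out as JSON Lines keeps the file easy to push to the `<you>/cook-with-me-sft` dataset repo.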

---

### Phase 9 - Day 9: End-to-end test, performance pass, pre-warm cache

**Goal:** the Space loads in <60s and a full recipe (text + 5 images + 5 audios + nutrition) renders in <2 minutes on the chosen hardware.

**Tasks**

1. Write `scripts/smoke_test.py` that runs the full pipeline on 3 sample fridge photos and asserts:
   - Each ingredient list is non-empty
   - Each recipe has 5–7 steps
   - Each step has a non-empty image and audio path
   - Nutrition has all 5 macros set
2. Implement **on-disk caching** for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to `~/.cache/cook-with-me/flux/`.
3. Pre-render and commit **3 fully-prepared demo recipes** (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
4. Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
5. Add a "Loading models..." progress bar on first request, since the first cold start can take 90s.

**Deliverable:** smoke test passes on the live Space.

---
### Phase 10 - Day 10: README, demo video, social post, submit

**Tasks**

1. Write `README.md` with the required HF Space frontmatter:

```yaml
---
title: Cook With Me
emoji: 🍲
colorFrom: orange
colorTo: yellow
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---
```

   Followed by:
   - One-paragraph pitch
   - 60-second demo video embed
   - Architecture diagram (export from `arquitectura.html` as PNG)
   - Section: "How closed-loop visual cooking guidance works"
   - Models used (with HF links + total parameter count)
   - Badges declared
   - Build / run instructions
2. Record a 60–90 second demo video: a real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
3. Write the Field Notes blog post about one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8, and here's why" or "Closed-loop validation needs the same vision model on both sides").
4. Social post on X / LinkedIn with the demo video.
5. Submit on the hackathon platform.

---
## 4. Tools usage matrix (when to reach for what)

| Phase | Primary tools | Why |
|---|---|---|
| 0 - setup | HF CLI, Kaggle CLI, OpenAI Codex CLI | one-shot config |
| 1 - data | `kagglehub`, `pandas`, `sentence-transformers` | offline dataset prep |
| 2 - vision | `llama-cpp-python` + `MiniCPMv26ChatHandler` | runs inside Space, badge: Llama Champion |
| 3 - planner | `llama-cpp-python` + retrieval over local parquet | grounded JSON output |
| 3.5 - nutrition | local CSV + regex parser | reliable, no LLM math |
| 4 - illustrator | `diffusers` + `Flux2KleinPipeline` | sponsor model showcase |
| 5 - narrator | VoxCPM2 via `transformers` (or its native API) | local TTS |
| 6 - UI | `gradio` + custom CSS theme | Off-Brand badge |
| 7 - validator | same vision singleton as phase 2 | closed-loop innovation, Best Agent |
| 8 - fine-tune | `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine | Well-Tuned badge |
| 9 - test/cache | `pytest`, `hashlib`, on-disk FLUX cache | demo reliability |
| 10 - submit | HF Spaces, video tool, social | shipping |

---
## 5. Performance budget on the HF Space

| Operation | Target latency | Hardware needed |
|---|---|---|
| Vision: ingredient ID | < 8 s | CPU 4-thread |
| Planner: propose 3 dishes | < 12 s | CPU 4-thread |
| Planner: build full recipe JSON | < 20 s | CPU 4-thread |
| Nutrition computation | < 0.1 s | CPU |
| FLUX: 1 image (4 steps) | < 12 s on T4 / < 90 s on CPU offload | GPU strongly recommended |
| FLUX: 6 images (final + 5 steps) | < 80 s on T4 | GPU |
| VoxCPM2: 1 step narration | < 5 s | CPU |
| Validator: 1 progress check | < 8 s | CPU |
| **Full recipe end-to-end** | **< 2 min on T4 Space** | T4 GPU |

**Hardware decision:** rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.
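
A quick arithmetic check of the table is worthwhile, because the per-operation ceilings add up to more than the 2-minute end-to-end target when every stage runs back to back. The sketch below only sums the numbers already listed; running narration on the CPU while FLUX occupies the GPU is a possible way to claw time back, not something the plan prescribes.

```python
# Upper bounds from the table above, in seconds (5 narrated steps assumed).
BUDGET_S = {
    "vision": 8,
    "propose": 12,
    "recipe_json": 20,
    "nutrition": 0.1,
    "flux_6_images": 80,
    "tts_5_steps": 5 * 5,
}

sequential_s = sum(BUDGET_S.values())                # every stage back to back
overlapped_s = sequential_s - BUDGET_S["tts_5_steps"]  # narration hidden behind FLUX
target_s = 2 * 60

credit_hours = 20 / 0.40  # $20 of credits at ~$0.40 per T4 hour

print(f"sequential: {sequential_s:.1f} s, overlapped: {overlapped_s:.1f} s, target: {target_s} s")
print(f"T4 hours covered by credits: {credit_hours:.0f}")
```

Since each row is an upper bound, real runs may still fit the target, but measured latencies should replace these ceilings before the demo.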

---

## 6. Risks and mitigations (delta from `estrategia.md`)

| Risk | Mitigation |
|---|---|
| MiniCPM-V-4 has no public GGUF | Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 3, where the planner GGUF is first needed. |
| llama-cpp-python's MiniCPM-V chat handler version mismatch | Require `llama-cpp-python>=0.3.2`; test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. |
| FLUX.2 Klein 9B too slow on free CPU Space | Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. |
| VoxCPM2 docs sparse | Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. |
| Kaggle dataset has format quirks (HTML in instructions, missing fields) | The Phase 1 normalization step handles this; budget 2 hours. |
| Nutrition CSV missing exotic ingredients | Skip-and-log strategy already designed; demo-day recipes use common ingredients only. |
| Parameter total drifts toward the 32B cap if VoxCPM2 is larger than the 1B estimate (at 7B the total would be ≈ 24.6B, still under the cap) | Check size in Phase 0; if too large, drop to a smaller TTS. |

---
## 7. "Day-1 hello world" checklist

Before writing any agent code, get this minimal end-to-end loop working, since it proves your stack:

1. ☐ Empty Gradio Space deployed, shows "Hello"
2. ☐ `huggingface-cli login` works locally
3. ☐ `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
4. ☐ `from llama_cpp import Llama` runs in your venv
5. ☐ Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
6. ☐ Push the round-trip to the Space; confirm it answers in the cloud

**Only after all 6 are checked, start Phase 1.**

---
## 8. Where this plan differs from `estrategia.md` (deltas to communicate)

| Topic | `estrategia.md` (Spanish, Mexican-cuisine focus) | This document (current requirements) |
|---|---|---|
| Language | Spanish-first | **English only** |
| Cuisine | Mexican | **International** (Kaggle dataset) |
| Voice models | OpenBMB voice + Cohere Labs | **VoxCPM2** only (single voice) |
| Vision model | MiniCPM-V 2.6 / 4 | **MiniCPM-V-4.6** |
| Reasoning model | MiniCPM-4 4B | **MiniCPM-V-4** |
| FLUX runtime | Modal endpoint | **Inside Space (llama.cpp principle)**; Modal kept as a future migration target only |
| External APIs at runtime | Allowed (Modal, OpenAI optional) | **None**: full local inference inside Space |
| Nutritional info | Not specified | **Required** at end of recipe |
| Fine-tune dataset | 200 synthetic Mexican recipes | **Kaggle better-recipes (international)** |

If anything in `plan.md` or `estrategia.md` conflicts with this document, **this document wins**, because it reflects the latest user requirements.

---
## 9. Definition of done

The implementation is complete when **all** of these are true:

- [ ] Public HF Space `https://huggingface.co/spaces/<you>/cook-with-me` loads
- [ ] App is fully in English
- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
- [ ] Progress validator returns sensible verdicts on 3+ test photos
- [ ] All inference (vision, planner, TTS) runs through llama.cpp / local diffusers, with **no external API calls at runtime**
- [ ] Total parameters declared in README ≤ 32B
- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
- [ ] Demo video (60–90s) recorded with a real person cooking
- [ ] Field Notes blog post published
- [ ] Submitted on the hackathon platform before deadline