# Implementation Plan — "Cook With Me"

> Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.
>
> **Hackathon:** Small models / Big adventures — June 2026
> **Read first:** `plan.md` (the *what* and *why*) and `estrategia.md` (the *how* at a strategic level). This document is the *how* at a tactical level — turn this into code.

---

## 0. Locked decisions (do not re-discuss)

| Decision | Value | Reason |
|---|---|---|
| UI framework | **Gradio** | Hackathon requirement |
| Hosting | **Hugging Face Space** | Hackathon requirement |
| Inference runtime (text + vision) | **llama.cpp** via `llama-cpp-python` | Runs on the Space CPU with no external APIs for now. Future: migrate to Modal |
| Image generation | **FLUX.2 Klein 9B** (`black-forest-labs/FLUX.2-klein-9B`) | Sponsor model; runs in the Space if a GPU Space is rented (or via `enable_model_cpu_offload()` as fallback). Plan to migrate this specific component to Modal post-hackathon |
| Recipe planner / reasoning | **`openbmb/MiniCPM-V-4`** (GGUF) | Provided requirement |
| Vision (ingredient ID + progress validator) | **`openbmb/MiniCPM-V-4.6`** (GGUF) | Provided requirement |
| Text-to-speech | **OpenBMB VoxCPM2** | Provided requirement |
| Recipe dataset | **`thedevastator/better-recipes-for-a-better-life`** (Kaggle) — international cuisine | Provided requirement; not limited to Mexican food |
| App language | **English only** | Provided requirement |
| Final output | **Recipe + step images + voice + nutritional values** | Provided requirement |
| External API calls at runtime | **None** | "llama.cpp inside the Space" mandate |

---

## 1. Architecture (final, English-only, llama.cpp-first)

```
                          ┌──────────────────────────────────────┐
                          │     Hugging Face Space (Gradio)      │
                          │   (CPU + optional GPU upgrade)       │
                          ├──────────────────────────────────────┤
   📸 Fridge photo  ─────▶│  [Vision Agent]                      │
                          │   MiniCPM-V-4.6 GGUF (llama.cpp)     │
                          │   → list[ingredient]                  │
                          │              │                        │
                          │              ▼                        │
   🥘 User picks dish ───▶│  [Recipe Planner]                    │
                          │   MiniCPM-V-4 GGUF (llama.cpp)       │
                          │   + retrieval over Kaggle dataset    │
                          │   → Recipe JSON (steps, nutrition)   │
                          │              │                        │
                          │              ▼                        │
                          │  [Step Illustrator]                   │
                          │   FLUX.2 Klein 9B (diffusers)        │
                          │   → PNG per step + final dish        │
                          │              │                        │
                          │              ▼                        │
                          │  [Narrator]                           │
                          │   VoxCPM2 → MP3 per step             │
                          │              │                        │
                          │              ▼                        │
   📸 Progress photo ────▶│  [Progress Validator]                │
                          │   MiniCPM-V-4.6 (vision compare)     │
                          │   → "go / wait / fix" + tip          │
                          └──────────────────────────────────────┘
```

**Total parameter count (≤ 32B requirement):**
- MiniCPM-V-4 (reasoning) ≈ 4B
- MiniCPM-V-4.6 (vision) ≈ 4.6B
- FLUX.2 Klein ≈ 9B
- VoxCPM2 ≈ 1B (estimate)
- **Total ≈ 18.6B ✓**
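
The arithmetic above can be re-checked with a few lines of Python whenever a model is swapped. This is a minimal sketch: the figures are copied from the list above (the VoxCPM2 size is still an estimate), and the names `PARAMS_B`, `total_params_b`, and `headroom_b` are illustrative, not part of the planned codebase.

```python
# Parameter estimates copied from the list above, in billions.
# The VoxCPM2 figure is an estimate and should be confirmed in Phase 0.
PARAMS_B = {
    "MiniCPM-V-4 (reasoning)": 4.0,
    "MiniCPM-V-4.6 (vision)": 4.6,
    "FLUX.2 Klein": 9.0,
    "VoxCPM2 (estimate)": 1.0,
}
LIMIT_B = 32.0  # hackathon ceiling


def total_params_b(params: dict[str, float]) -> float:
    """Sum the per-model estimates, rounded to one decimal."""
    return round(sum(params.values()), 1)


def headroom_b(params: dict[str, float], limit: float = LIMIT_B) -> float:
    """How many billions of parameters remain under the limit."""
    return round(limit - total_params_b(params), 1)


if __name__ == "__main__":
    print(f"total = {total_params_b(PARAMS_B)}B, headroom = {headroom_b(PARAMS_B)}B")
```

With these estimates the stack leaves roughly 13B of headroom, so a moderately larger TTS model would still fit.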

---

## 2. Repository layout

```
cook-with-me/
├── app.py                      # Gradio entrypoint (Space looks for this)
├── requirements.txt
├── packages.txt                # apt packages (ffmpeg, libsndfile1)
├── README.md                   # Space card (HF requires YAML frontmatter)
├── .gitignore
├── src/
│   ├── __init__.py
│   ├── config.py               # paths, model IDs, constants
│   ├── models/
│   │   ├── __init__.py
│   │   ├── vision.py           # MiniCPM-V-4.6 wrapper (llama-cpp)
│   │   ├── planner.py          # MiniCPM-V-4 wrapper (llama-cpp)
│   │   ├── illustrator.py      # FLUX.2 Klein wrapper (diffusers)
│   │   ├── narrator.py         # VoxCPM2 wrapper
│   │   └── loader.py           # lazy singletons + GGUF download
│   ├── agents/
│   │   ├── mise_en_place.py    # ingredient identification
│   │   ├── recipe_planner.py   # builds Recipe object
│   │   ├── step_illustrator.py # per-step image gen
│   │   ├── narrator.py         # per-step TTS
│   │   └── progress_validator.py
│   ├── data/
│   │   ├── recipe_index.py     # loads Kaggle dataset, builds retrieval
│   │   └── nutrition.py        # USDA-style nutrition computation
│   ├── pipeline.py             # Recipe state machine, orchestration
│   ├── prompts/
│   │   ├── vision_prompt.txt
│   │   ├── planner_system.txt
│   │   └── validator_prompt.txt
│   └── ui/
│       ├── theme.py            # custom CSS (Off-Brand badge)
│       └── components.py       # reusable Gradio Blocks pieces
├── scripts/
│   ├── download_models.py      # pre-warms GGUF + Flux weights at build time
│   ├── build_recipe_index.py   # caches Kaggle dataset locally
│   └── smoke_test.py           # end-to-end validation before push
└── assets/
    ├── sample_fridge_1.jpg
    └── sample_progress_1.jpg
```

---

## 3. Phase-by-phase plan (10 days)

> Each phase has: **goal**, **tasks**, **deliverable**, **verification check**. Do not move to the next phase if verification fails.

---

### Phase 0 — Day 0 (½ day): Account + tooling setup

**Goal:** every credential and CLI is ready before writing code.

**Tasks**
1. Create or confirm Hugging Face account; generate a **write token** (Settings → Access Tokens). Store as `HF_TOKEN` env var locally.
2. Install Hugging Face CLI: `pip install -U huggingface_hub` then `huggingface-cli login`.
3. Install Kaggle CLI: `pip install kaggle`. Place `kaggle.json` (Account → API → Create New Token) in `~/.kaggle/kaggle.json` with `chmod 600`.
4. Install OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
5. Create a local Python 3.11 virtual environment: `python -m venv .venv && source .venv/bin/activate`.
6. Create the repo locally: `git init cook-with-me && cd cook-with-me`.
7. Create an empty Hugging Face Space: huggingface.co → New Space → SDK = **Gradio**, Hardware = **CPU basic** (upgrade later if you need GPU for FLUX). Clone it and copy your repo skeleton into it.
8. Verify model availability: open in a browser and confirm pages exist:
   - `huggingface.co/openbmb/MiniCPM-V-4`
   - `huggingface.co/openbmb/MiniCPM-V-4.6` (confirm the exact slug; the separator may be `.`, `-`, or `_`)
   - `huggingface.co/openbmb/VoxCPM2` (or whatever the exact repo name is — search "VoxCPM" on HF)
   - `huggingface.co/black-forest-labs/FLUX.2-klein-9B`

**Deliverable:** empty Space deployed showing "Hello World" Gradio.

**Verify:** `https://huggingface.co/spaces/<you>/cook-with-me` loads.

---

### Phase 1 — Day 1: Project skeleton + recipe dataset ingestion

**Goal:** the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.

**Tasks**
1. Write `requirements.txt` (initial version — packages will be added as phases progress):
   ```text
   gradio>=4.44
   huggingface_hub>=0.24
   llama-cpp-python>=0.3.2
   numpy
   pandas
   Pillow
   pydantic>=2
   sentence-transformers
   ```
2. Write `packages.txt`:
   ```text
   ffmpeg
   libsndfile1
   ```
3. Write `scripts/build_recipe_index.py`:
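   - Use `kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path)`. Discover `file_path` by first listing the dataset files via `kagglehub.dataset_download`. `kagglehub` is not in `requirements.txt`; install it locally (`pip install kagglehub`), since this script never runs in the Space.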
   - Normalize columns: `name`, `ingredients` (list[str]), `instructions` (list[str]), `cuisine` (str if present, else "international"), `prep_time`, `servings`.
   - Drop rows missing critical fields. Lowercase + strip ingredient strings.
   - Save to `data/recipes.parquet` (~5–50MB depending on dataset size).
   - Build sentence embeddings of the recipe **name + first 3 ingredients** using `sentence-transformers/all-MiniLM-L6-v2` and save to `data/recipes_emb.npy`.
   - This script runs **once locally**; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in `app.py`). If files exceed 100MB, push to a HF Dataset repo: `<you>/cook-with-me-recipes`.
4. Write `src/data/recipe_index.py`:
   - `class RecipeIndex` with `.search(ingredients: list[str], top_k=5) -> list[RecipeRow]`.
   - Build a query string from ingredients, embed it, cosine-similarity against the cached embeddings, return top-k.
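
The ranking step inside `.search()` can be sketched as follows. This is a pure-Python illustration of cosine-similarity top-k over toy vectors. The planned implementation would embed the query with `sentence-transformers` and compare it against `data/recipes_emb.npy` using NumPy. The helper names `build_query`, `cosine`, and `top_k` are assumptions for illustration only.

```python
import math


def build_query(ingredients: list[str]) -> str:
    """Join cleaned ingredient names into one query string for the embedder."""
    return ", ".join(s.lower().strip() for s in ingredients if s.strip())


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], recipe_vecs: list[list[float]], k: int = 5):
    """Return (row_index, score) pairs for the k most similar recipes."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(recipe_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # stand-in vectors
    print(build_query([" Chicken ", "Onion"]))
    print(top_k([1.0, 0.0], toy_embeddings, k=2))
```

The returned row indices would then be used to look up the matching rows in `data/recipes.parquet`.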

**Deliverable:** `python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))"` prints 5 sensible recipes.

**Verify:** at least 3 of the top-5 results contain ≥2 of the input ingredients.

---

### Phase 2 — Day 2: Vision agent (Mise en Place) — MiniCPM-V-4.6 via llama.cpp

**Goal:** given a fridge photo, return a clean list of English ingredient names.

**Background:** llama.cpp supports multimodal models through a vision projector (`mmproj-*.gguf`) loaded alongside the language-model GGUF. MiniCPM-V GGUF releases typically ship both files on the Hub; confirm that the 4.6 release does too.

**Tasks**
1. Find the GGUF release of MiniCPM-V-4.6. Search HF for `MiniCPM-V-4_6-gguf` or `openbmb/MiniCPM-V-4_6-gguf`. You need **two** files:
   - `Model-Q4_K_M.gguf` (or similar quant)
   - `mmproj-model-f16.gguf` (the vision projector)
2. Write `src/models/loader.py`:
   ```python
   from huggingface_hub import hf_hub_download
   from llama_cpp import Llama
   from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler

   _vision = None

   def get_vision_model():
       global _vision
       if _vision is None:
           model_path = hf_hub_download(
               repo_id="openbmb/MiniCPM-V-4_6-gguf",  # confirm exact repo
               filename="Model-Q4_K_M.gguf",
           )
           mmproj_path = hf_hub_download(
               repo_id="openbmb/MiniCPM-V-4_6-gguf",
               filename="mmproj-model-f16.gguf",
           )
           handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
           _vision = Llama(
               model_path=model_path,
               chat_handler=handler,
               n_ctx=4096,
               n_threads=4,
               verbose=False,
           )
       return _vision
   ```
3. Write `src/agents/mise_en_place.py`:
   ```python
   import base64, io, json
   from PIL import Image
   from src.models.loader import get_vision_model

   PROMPT = (
     "You are an ingredient detector. Look at the fridge/pantry photo and "
     "list every edible ingredient you can identify. Return strict JSON: "
     '{"ingredients": ["chicken", "onion", "tomato", ...]} '
     "Lowercase, English, no brand names, no containers."
   )

   def _img_to_data_url(img: Image.Image) -> str:
       buf = io.BytesIO()
       # JPEG has no alpha channel, so normalize RGBA/palette uploads to RGB first
       img.convert("RGB").save(buf, "JPEG", quality=85)
       b64 = base64.b64encode(buf.getvalue()).decode()
       return f"data:image/jpeg;base64,{b64}"

   def identify_ingredients(image: Image.Image) -> list[str]:
       llm = get_vision_model()
       out = llm.create_chat_completion(messages=[
           {"role": "user", "content": [
               {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
               {"type": "text", "text": PROMPT},
           ]}
       ], temperature=0.2, response_format={"type": "json_object"})
       data = json.loads(out["choices"][0]["message"]["content"])
       # Tolerate a reply that omits the key instead of raising KeyError
       return [s.lower().strip() for s in data.get("ingredients", [])]
   ```
4. Test locally with 5 sample fridge photos.
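
Small models do not always honor "strict JSON", so it may help to parse the reply defensively before it reaches the planner. The sketch below is one possible shape. The function name `parse_ingredients` is hypothetical, and returning an empty list on bad output (rather than raising) is a design assumption.

```python
import json


def parse_ingredients(raw: str) -> list[str]:
    """Extract a clean, de-duplicated ingredient list from the model reply.

    Returns an empty list when the reply is not the expected JSON object.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = data.get("ingredients", []) if isinstance(data, dict) else []
    if not isinstance(items, list):
        return []
    seen: set[str] = set()
    cleaned: list[str] = []
    for item in items:
        if not isinstance(item, str):
            continue  # ignore numbers, nulls, nested objects
        name = item.lower().strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


if __name__ == "__main__":
    print(parse_ingredients('{"ingredients": ["Chicken ", "onion", "chicken"]}'))
```

An empty result is a natural trigger for asking the user to retake the photo.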

**Deliverable:** the function returns a non-empty English list with ≥80% precision on a clean fridge photo.

**Verify:** stash these 5 results in `tests/vision_smoke.json` for regression checks.

---

### Phase 3 — Day 3: Recipe Planner — MiniCPM-V-4 via llama.cpp + retrieval

**Goal:** given a list of ingredients (and optionally a chosen dish), return a fully structured `Recipe` JSON including steps, durations, visual descriptions, and nutritional values.

**Tasks**
1. Find or convert MiniCPM-V-4 to GGUF. Likely repo: `openbmb/MiniCPM-V-4-gguf` or community quants. Pick `Q4_K_M`.
2. Add to `src/models/loader.py` a `get_planner_model()` (same pattern as vision but without `chat_handler`).
3. Write `src/agents/recipe_planner.py`:
   - **Step A (propose):** call the planner with `I have: [ingredients]. Propose 3 dish options that fit. Reply JSON.`
   - **Step B — retrieve:** for the chosen dish name, call `RecipeIndex.search(...)` and pick the closest match. Use it as a *grounded reference*.
   - **Step C — restructure:** prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical `Recipe` JSON schema below. The retrieval grounds the model and prevents hallucinated steps.
   - **Step D — nutrition:** from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
4. Define the canonical schema in `src/pipeline.py` using Pydantic:
   ```python
   from pydantic import BaseModel
   from typing import Optional

   class Step(BaseModel):
       n: int
       instruction: str       # English, imperative
       duration: str          # "4 minutes"
       visual: str            # English visual description for FLUX prompt
       tip: Optional[str] = None

   class Nutrition(BaseModel):
       calories: int          # per serving
       protein_g: float
       carbs_g: float
       fat_g: float
       fiber_g: float

   class Recipe(BaseModel):
       name: str
       cuisine: str
       servings: int
       total_time_minutes: int
       options: list[dict] = []  # only populated on the "propose" call; empty otherwise
       ingredients_have: list[str]
       ingredients_missing: list[str]
       substitutes: dict[str, list[str]]
       steps: list[Step]
       final_dish_visual: str
       nutrition_per_serving: Nutrition
   ```
5. Write the system prompt (`src/prompts/planner_system.txt`):
   - Persona: international chef
   - Hard rule: output JSON only, matching schema
   - Hard rule: prefer dishes feasible with available ingredients
   - Hard rule: 5–7 steps, each ≤ 25 words, each with a concrete `visual` field for image generation
   - Hard rule: include `nutrition_per_serving` (model is allowed to estimate; you'll override with `data/nutrition.py` for accuracy)
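6. Use `response_format={"type": "json_object"}` in the chat completion call. Set `temperature=0.7, top_p=0.95` for the propose step (creative) and `temperature=0.4` for the structured-output step (more conservative, though still not deterministic). `enable_thinking` is not a `create_chat_completion` argument in `llama-cpp-python`; if the model exposes a thinking mode, toggle it through the chat template or prompt instead.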
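
The hard rules in the system prompt (5–7 steps, each at most 25 words, each with a `visual`) can be checked mechanically before the recipe moves on to image generation. The following is a minimal sketch that works on the plain dict form of a recipe. `check_step_rules` is a hypothetical helper, not something the schema provides.

```python
def check_step_rules(
    recipe: dict, min_steps: int = 5, max_steps: int = 7, max_words: int = 25
) -> list[str]:
    """Return human-readable rule violations; an empty list means the recipe passes."""
    problems: list[str] = []
    steps = recipe.get("steps", [])
    if not (min_steps <= len(steps) <= max_steps):
        problems.append(f"expected {min_steps}-{max_steps} steps, got {len(steps)}")
    for step in steps:
        n = step.get("n", "?")
        words = len(str(step.get("instruction", "")).split())
        if words == 0:
            problems.append(f"step {n}: empty instruction")
        elif words > max_words:
            problems.append(f"step {n}: instruction has {words} words (limit {max_words})")
        if not str(step.get("visual", "")).strip():
            problems.append(f"step {n}: missing visual description")
    return problems


if __name__ == "__main__":
    demo = {"steps": [{"n": 1, "instruction": "Chop the onion", "visual": ""}]}
    for problem in check_step_rules(demo):
        print(problem)
```

A non-empty result could drive one retry of the structured-output call with the violations appended to the prompt.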

**Deliverable:** for `["chicken","onion","tomato","tortilla","cheese"]` and chosen dish "chicken tinga", the function returns a valid `Recipe` Pydantic object with 5–7 steps.

**Verify:** the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.

---

### Phase 3.5 — Day 3 (afternoon): Nutritional values

**Goal:** the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).

**Approach:** small, embedded reference table beats LLM math.

**Tasks**
1. Bundle `data/nutrition_table.csv` — a 200-row CSV mapping common English ingredient names to per-100g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
2. Write `src/data/nutrition.py`:
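   - `parse_quantity(line: str) -> tuple[float, str]` returns `(grams, ingredient_name)` and handles "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex plus a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The volume entries are water-equivalent approximations (1 ml ≈ 1 g), so light or dense ingredients such as flour or honey will be off.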
   - `compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition` — sum per-100g values weighted by grams, divide by servings.
   - If a line cannot be parsed, skip it and log; don't crash.
3. After the planner returns a recipe, **overwrite** `recipe.nutrition_per_serving` with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.
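
The parser and the weighted sum described above could look roughly like this. It is a sketch under stated assumptions: the table is passed in as a dict keyed by ingredient name with per-100 g values, the function returns a plain dict rather than the `Nutrition` model, and `parse_quantity` returns `None` for lines it cannot read. The numbers in the usage example are round placeholders, not USDA figures.

```python
import logging
import re

# Unit table from the task list; volume units assume 1 ml is about 1 g.
UNIT_TO_GRAMS = {"g": 1.0, "cup": 240.0, "cups": 240.0, "tbsp": 15.0, "tsp": 5.0, "oz": 28.35}
MACROS = ("kcal", "protein", "carbs", "fat", "fiber")
_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s+(.+?)\s*$")


def parse_quantity(line: str):
    """Return (grams, ingredient_name), or None when the line cannot be parsed."""
    match = _LINE.match(line)
    if match is None:
        return None
    quantity, unit, name = float(match.group(1)), match.group(2).lower(), match.group(3).lower()
    if unit not in UNIT_TO_GRAMS:
        return None
    return quantity * UNIT_TO_GRAMS[unit], name


def compute_nutrition(lines: list[str], servings: int, table: dict) -> dict:
    """Sum per-100 g macros weighted by grams, then divide by servings."""
    totals = {key: 0.0 for key in MACROS}
    for line in lines:
        parsed = parse_quantity(line)
        if parsed is None or parsed[1] not in table:
            logging.info("skipping ingredient line: %r", line)  # skip and log, never crash
            continue
        grams, name = parsed
        for key in MACROS:
            totals[key] += table[name][key] * grams / 100.0
    servings = max(servings, 1)
    return {
        "calories": round(totals["kcal"] / servings),
        "protein_g": round(totals["protein"] / servings, 1),
        "carbs_g": round(totals["carbs"] / servings, 1),
        "fat_g": round(totals["fat"] / servings, 1),
        "fiber_g": round(totals["fiber"] / servings, 1),
    }


if __name__ == "__main__":
    # Placeholder per-100 g values for illustration only (NOT real USDA data).
    placeholder_table = {"rice": {"kcal": 100.0, "protein": 2.0, "carbs": 20.0, "fat": 1.0, "fiber": 0.5}}
    print(compute_nutrition(["200 g rice", "salt to taste"], 2, placeholder_table))
```

The returned keys match the `Nutrition` schema, so the dict can be passed to it directly when overwriting the planner's estimate.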

**Deliverable:** for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.

---

### Phase 4 — Day 4: Step Illustrator — FLUX.2 Klein 9B

**Goal:** generate an appetizing image for the final dish + one image per step.

**Constraint:** FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:
- **Path A (recommended for the hackathon):** upgrade the Space to a GPU instance (T4 or A10G — paid, but $20 HF credits cover it for a week of development). Code stays unchanged.
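- **Path B (fallback):** run FLUX with `enable_model_cpu_offload()` and `num_inference_steps=4` and accept ~3 min/image. This mode still needs a GPU: it parks idle submodules in CPU RAM and moves each one to the GPU only while it runs. Only feasible for pre-rendered demo recipes, not live runs.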

**Tasks**
1. Add to `requirements.txt`:
   ```text
   diffusers>=0.31  # likely too old for Flux2KleinPipeline; raise to the release named on the model card
   transformers>=4.45
   accelerate
   torch
   safetensors
   ```
2. Write `src/models/illustrator.py`:
   ```python
   import torch
   from diffusers import Flux2KleinPipeline

   _pipe = None

   def get_flux():
       global _pipe
       if _pipe is None:
           dtype = torch.bfloat16
           _pipe = Flux2KleinPipeline.from_pretrained(
               "black-forest-labs/FLUX.2-klein-9B",
               torch_dtype=dtype,
           )
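           # Offloading needs an accelerator; on a CPU-only Space the pipeline stays on CPU
           if torch.cuda.is_available():
               _pipe.enable_model_cpu_offload()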
       return _pipe

   def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
       pipe = get_flux()
       device = "cuda" if torch.cuda.is_available() else "cpu"
       img = pipe(
           prompt=prompt,
           height=1024, width=1024,
           guidance_scale=1.0,
           num_inference_steps=4,
           generator=torch.Generator(device=device).manual_seed(seed),
       ).images[0]
       return img
   ```
3. Write `src/agents/step_illustrator.py`:
   - For each `Step.visual`, build a prompt like:
     > `f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."`
   - Generate the **final dish image first**, then the per-step images, all in **one Python loop** (no parallelism — FLUX holds the GPU).
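   - Cache results on disk keyed by a SHA-256 digest of the prompt to avoid re-renders on re-runs. Do not use Python's built-in `hash()`: string hashes are salted per process, so the keys would change on every restart.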
   - Emit Gradio progress updates so the UI doesn't appear frozen.
4. **Critical tuning:** keep `num_inference_steps=4` (Klein is distilled). Higher counts blow latency and offer minimal quality gain at this scale.
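
The on-disk cache from task 3 can be sketched as below. It assumes a hypothetical `render_png(prompt, seed)` callable that returns PNG bytes (for example, a thin wrapper that calls `render()` and saves the image into a `BytesIO`). Including the seed in the key is a design choice of this sketch, not something the plan specifies.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(prompt: str, seed: int, cache_dir: Path) -> Path:
    """Stable file name: SHA-256 of seed + prompt, identical across processes."""
    key = hashlib.sha256(f"{seed}:{prompt}".encode("utf-8")).hexdigest()
    return cache_dir / f"{key}.png"


def render_cached(
    prompt: str,
    seed: int,
    cache_dir: Path,
    render_png: Callable[[str, int], bytes],  # hypothetical renderer returning PNG bytes
) -> Path:
    """Render only on a cache miss; always return the path to the PNG file."""
    path = cache_path(prompt, seed, cache_dir)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(render_png(prompt, seed))
    return path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        fake = lambda prompt, seed: b"not-a-real-png"  # stand-in for FLUX
        first = render_cached("tacos on a plate", 0, Path(tmp), fake)
        second = render_cached("tacos on a plate", 0, Path(tmp), fake)
        print(first == second)
```

Returning a file path rather than an image object keeps the result easy to hand to a Gradio image component.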

**Deliverable:** for a 5-step recipe, all 6 images (final + 5 steps) render in:
- < 80 seconds on a T4 GPU Space (the 6-image target in the performance budget, section 5)
- < 8 minutes on CPU offload (acceptable only for pre-cached demos)

**Verify:** show the 6 images to a person who has not seen the prompts; ≥4 should be described as "appetizing".

---

### Phase 5 — Day 5: Narrator — VoxCPM2

**Goal:** every step's instruction is rendered to an MP3 in a warm, clear English voice.

**Tasks**
1. Confirm the exact VoxCPM2 repo name on HF (`openbmb/VoxCPM2` or similar). Read its README for the inference snippet — TTS APIs vary widely between models.
2. Add to `requirements.txt`: `soundfile`, `torchaudio`, `numpy`. If VoxCPM2 ships GGUF, use it via `llama-cpp-python` audio extension (if available); otherwise load via `transformers` directly.
3. Write `src/models/narrator.py`:
   ```python
   _tts = None

   def get_tts():
       global _tts
       if _tts is None:
           # Placeholder: replace with the exact VoxCPM2 loading code from its README.
           # Load on CPU; VoxCPM2 is assumed to be small (~1B).
           raise NotImplementedError("wire in the VoxCPM2 loader from its README")
       return _tts

   def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
       """Returns MP3 bytes."""
       tts = get_tts()
       wav = tts.generate(text, voice=voice)  # hypothetical call; the real API depends on VoxCPM2
       # TODO: encode wav -> MP3 (soundfile + ffmpeg-python or pydub) and return the bytes
       raise NotImplementedError("MP3 encoding is not wired in yet")
   ```
4. Write `src/agents/narrator.py`:
   - For each step, synthesize `step.instruction`. If `step.tip` is set, synthesize a separate "tip" clip.
   - Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
5. Pre-render all step audio when the recipe is finalized — never stream per-step in the demo (too much UI lag).
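
Whatever VoxCPM2 returns, the WAV-to-MP3 step can be kept independent of the model. The sketch below assumes the model yields mono float samples in [-1, 1] and that 16 kHz is the sample rate; both are assumptions to confirm against the VoxCPM2 README. It packs the samples with the standard-library `wave` module and pipes them through the `ffmpeg` CLI that `packages.txt` installs. The function names are illustrative.

```python
import io
import struct
import subprocess
import wave


def floats_to_wav_bytes(samples: list[float], sample_rate: int = 16000) -> bytes:
    """Pack mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes = 16-bit samples
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()


def wav_to_mp3_bytes(wav_bytes: bytes) -> bytes:
    """Transcode WAV to MP3 via the ffmpeg CLI; needs ffmpeg on PATH."""
    proc = subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-f", "wav", "-i", "pipe:0", "-f", "mp3", "pipe:1"],
        input=wav_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout


if __name__ == "__main__":
    wav = floats_to_wav_bytes([0.0, 0.5, -0.5, 0.0])
    print(len(wav), "bytes of WAV")
```

Clamping before scaling avoids integer overflow if the model occasionally emits samples slightly outside [-1, 1].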

**Deliverable:** clicking "Play" on step 1 in the UI plays clear English narration.

**Verify:** on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.

---

### Phase 6 — Day 6: Gradio UI (Off-Brand)

**Goal:** the Space looks like a recipe magazine, not stock Gradio.

**Tasks**
1. Write `src/ui/theme.py`:
   ```python
   import gradio as gr

   theme = gr.themes.Soft(
       primary_hue="orange",
       neutral_hue="stone",
       font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
       font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
   )

   CSS = """
   .gradio-container { background: #f5ecd9 !important; }
   .recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
   .recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
   .step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
   .nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
   .nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
   """
   ```
2. Write `app.py` with three tabs:
   - **Tab 1 — Cook**: fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
   - **Tab 2 — Check Progress**: upload a progress photo + select active step → validator returns badge (`go/wait/fix`) + tip + audio.
   - **Tab 3 — About / Tech**: README-style explanation, badges, model list.
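3. Use `gr.Blocks` with `gr.State` to hold the current recipe across UI events. Store it as a plain `dict` rather than a Pydantic object: return `recipe.model_dump()` as the `gr.State` output of a callback and rebuild it with `Recipe.model_validate(state_dict)` in the next one. Assigning `state.value` inside a callback does not update the per-session state.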
4. Wire callbacks:
   - `btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])`
   - `dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])`
   - `progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])`
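
The `nutrition_grid` output above can be fed by a small HTML builder that reuses the `.nutri-grid` and `.nutri-cell` classes from `theme.py`. This is a sketch only: it assumes the grid is a `gr.HTML` component and that nutrition arrives as the dict form of `Nutrition`. `nutrition_grid_html` is a hypothetical helper, and the values in the usage example are made up.

```python
from html import escape

# (schema field, display label, unit) in the order the five cells appear.
NUTRI_FIELDS = [
    ("calories", "Calories", "kcal"),
    ("protein_g", "Protein", "g"),
    ("carbs_g", "Carbs", "g"),
    ("fat_g", "Fat", "g"),
    ("fiber_g", "Fiber", "g"),
]


def nutrition_grid_html(nutrition: dict) -> str:
    """Render the five macros as cells styled by the CSS in theme.py."""
    cells = []
    for key, label, unit in NUTRI_FIELDS:
        value = escape(str(nutrition.get(key, "n/a")))  # escape anything model-derived
        cells.append(
            f'<div class="nutri-cell"><strong>{value} {unit}</strong><br>{label}</div>'
        )
    return '<div class="nutri-grid">' + "".join(cells) + "</div>"


if __name__ == "__main__":
    # Made-up values for illustration.
    print(nutrition_grid_html({"calories": 520, "protein_g": 31.5, "carbs_g": 48.0,
                               "fat_g": 19.2, "fiber_g": 6.1}))
```

Five cells match the `repeat(5,1fr)` grid defined in the CSS.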

**Deliverable:** end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition. No Gradio default look anywhere.

---

### Phase 7 — Day 7: Progress Validator (closed loop)

**Goal:** user uploads a progress photo, app says "go / wait / fix" with a voiced tip.

**Tasks**
1. Write `src/agents/progress_validator.py`:
   ```python
   # Literal JSON braces are doubled so PROMPT.format(instruction=...) does not choke on them
   PROMPT = """Compare these two cooking photos.
   Photo 1 (target): how it should look after the step "{instruction}".
   Photo 2 (user's pan/plate): the user's current progress.
   Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
   - "go": looks right, move to next step
   - "wait": needs more time, do not change anything yet
   - "fix": something is off; suggest a concrete adjustment in one sentence
   """
   def validate(target_img, user_img, step_instruction): ...
   ```
2. Use the same vision model singleton as Phase 2 — both calls share weights.
3. Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.
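
Mapping the validator's reply to a badge is easier to test when it is separated from the model call. A minimal sketch follows. `parse_verdict` and `BADGE_COLORS` are hypothetical names, and falling back to "wait" when the reply is unreadable is an assumption chosen as the least risky default for someone at the stove.

```python
import json

# Badge colors from the task list: green / amber / red.
BADGE_COLORS = {"go": "green", "wait": "amber", "fix": "red"}


def parse_verdict(raw: str) -> dict:
    """Normalize the model reply; fall back to a neutral 'wait' on bad output."""
    fallback = {
        "verdict": "wait",
        "feedback": "Could not read the model reply.",
        "tip": "",
        "color": BADGE_COLORS["wait"],
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = str(data.get("verdict", "")).lower().strip()
    if verdict not in BADGE_COLORS:
        return fallback  # e.g. the model echoed "go|wait|fix" verbatim
    return {
        "verdict": verdict,
        "feedback": str(data.get("feedback", "")),
        "tip": str(data.get("tip", "")),
        "color": BADGE_COLORS[verdict],
    }


if __name__ == "__main__":
    print(parse_verdict('{"verdict": "fix", "feedback": "Too pale", "tip": "Raise the heat."}'))
```

The `tip` string is what would be passed to the narrator for the voiced hint.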

**Deliverable:** running the validator on 5 real progress photos returns the correct verdict on ≥3.

---

### Phase 8 — Day 8: Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)

> **Important caveat:** The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still **requires GPU**, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does **not** violate the runtime constraint — only the build pipeline touches a GPU.

**Goal:** publish a fine-tuned Planner GGUF to the Hub and load it from the Space.

**Tasks**
1. **Build SFT dataset** (`scripts/build_sft_dataset.py`):
   - Load Kaggle `better-recipes` dataset.
   - For each recipe, build a `(prompt, completion)` pair where `prompt` is `"Available ingredients: X, Y, Z. Propose recipe."` and `completion` is the full canonical `Recipe` JSON.
   - Generate ~1000 pairs, push to `<you>/cook-with-me-sft` HF Dataset.
2. **LoRA training** (`scripts/train_planner.py` — to be run on a GPU machine, not the Space):
   ```python
   # peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
   # r=16, alpha=32, lr=2e-4, epochs=2, batch=4
   # push_to_hub=True, hub_model_id="<you>/cook-with-me-planner-4b"
   ```
3. **Convert to GGUF** (Day 8 evening):
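   - Merge the LoRA adapter into the base weights (for example with `peft`'s `merge_and_unload()`), then run `llama.cpp/convert_hf_to_gguf.py` and quantize to `Q4_K_M` with `llama-quantize` (named `quantize` in older llama.cpp builds).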
   - Push GGUF to `<you>/cook-with-me-planner-4b-gguf`.
4. Update `src/models/loader.py` to point at your GGUF instead of the base model.
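
The `(prompt, completion)` pairs from task 1 can be produced with plain JSON handling. The sketch below assumes each recipe is already in the canonical dict form with an `ingredients_have` list. `build_sft_pair` and `to_jsonl` are illustrative names, and JSONL is an assumed interchange format rather than one the plan specifies.

```python
import json


def build_sft_pair(recipe: dict) -> dict:
    """One training example: ingredient prompt in, canonical Recipe JSON out."""
    ingredients = ", ".join(recipe["ingredients_have"])
    prompt = f"Available ingredients: {ingredients}. Propose recipe."
    # sort_keys keeps the target format stable across examples
    completion = json.dumps(recipe, ensure_ascii=False, sort_keys=True)
    return {"prompt": prompt, "completion": completion}


def to_jsonl(recipes: list[dict]) -> str:
    """Serialize all pairs, one JSON object per line."""
    return "\n".join(
        json.dumps(build_sft_pair(recipe), ensure_ascii=False) for recipe in recipes
    )


if __name__ == "__main__":
    demo = {"name": "demo dish", "ingredients_have": ["chicken", "onion"], "steps": []}
    print(to_jsonl([demo]))
```

A stable key order in the completions gives the model one consistent layout to imitate.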

**Deliverable:** the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.

---

### Phase 9 — Day 9: End-to-end test, performance pass, pre-warm cache

**Goal:** the Space loads in <60s and a full recipe (text + 5 images + 5 audio clips + nutrition) renders in <2 minutes on the chosen hardware.

**Tasks**
1. Write `scripts/smoke_test.py` that runs the full pipeline on 3 sample fridge photos and asserts:
   - Each ingredient list is non-empty
   - Each recipe has 5–7 steps
   - Each step has a non-empty image and audio path
   - Nutrition has all 5 macros set
2. Implement **on-disk caching** for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to `~/.cache/cook-with-me/flux/`.
3. Pre-render and commit **3 fully-prepared demo recipes** (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
4. Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
5. Add a "Loading models..." progress bar on first request — first cold start can take 90s.
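
For task 4, a decorator is one way to keep stack traces out of the UI while still logging them. This is a sketch, not the required design: `UserFacingError` is a stand-in so the example runs without Gradio, and in the app the wrapper would raise `gr.Error(message)` instead. The `on_propose` body here is a dummy.

```python
import functools
import logging


class UserFacingError(Exception):
    """Stand-in for gr.Error so this sketch runs without Gradio installed."""


def friendly_errors(message: str):
    """Decorator: log the real exception, surface only a short message."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                logging.exception("UI callback %s failed", fn.__name__)
                # 'from None' hides the original traceback from the user-facing error
                raise UserFacingError(message) from None
        return wrapper
    return decorator


@friendly_errors("The vision model could not read that photo. Please try another one.")
def on_propose(photo):
    """Dummy callback body for illustration."""
    if photo is None:
        raise ValueError("no image supplied")
    return ["chicken", "onion"]


if __name__ == "__main__":
    print(on_propose("fridge.jpg"))
```

Applying the same decorator to every callback keeps the error wording consistent across tabs.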

**Deliverable:** smoke test passes on the live Space.

---

### Phase 10 — Day 10: README, demo video, social post, submit

**Tasks**
1. Write `README.md` with the required HF Space frontmatter:
   ```yaml
   ---
   title: Cook With Me
   emoji: 🍲
   colorFrom: red  # HF accepts only red, yellow, green, blue, indigo, purple, pink, gray
   colorTo: yellow
   sdk: gradio
   sdk_version: 4.44.0
   app_file: app.py
   pinned: false
   license: apache-2.0
   ---
   ```
   Followed by:
   - One-paragraph pitch
   - 60-second demo video embed
   - Architecture diagram (export from `arquitectura.html` as PNG)
   - Section: "How closed-loop visual cooking guidance works"
   - Models used (with HF links + total parameter count)
   - Badges declared
   - Build / run instructions
2. Record a 60–90 second demo video: real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
3. Write the Field Notes blog post: one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8 — here's why" or "Closed-loop validation needs the same vision model on both sides").
4. Social post on X / LinkedIn with the demo video.
5. Submit on the hackathon platform.

---

## 4. Tools usage matrix (when to reach for what)

| Phase | Primary tools | Why |
|---|---|---|
| 0 — setup | HF CLI, Kaggle CLI, OpenAI Codex CLI | one-shot config |
| 1 — data | `kagglehub`, `pandas`, `sentence-transformers` | offline dataset prep |
| 2 — vision | `llama-cpp-python` + `MiniCPMv26ChatHandler` | runs inside Space, badge: Llama Champion |
| 3 — planner | `llama-cpp-python` + retrieval over local parquet | grounded JSON output |
| 3.5 — nutrition | local CSV + regex parser | reliable, no LLM math |
| 4 — illustrator | `diffusers` + `Flux2KleinPipeline` | sponsor model showcase |
| 5 — narrator | VoxCPM2 via `transformers` (or its native API) | local TTS |
| 6 — UI | `gradio` + custom CSS theme | Off-Brand badge |
| 7 — validator | same vision singleton as phase 2 | closed-loop innovation, Best Agent |
| 8 — fine-tune | `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine | Well-Tuned badge |
| 9 — test/cache | `pytest`, `hashlib`, on-disk FLUX cache | demo reliability |
| 10 — submit | HF Spaces, video tool, social | shipping |

---

## 5. Performance budget on the HF Space

| Operation | Target latency | Hardware needed |
|---|---|---|
| Vision: ingredient ID | < 8 s | CPU 4-thread |
| Planner: propose 3 dishes | < 12 s | CPU 4-thread |
| Planner: build full recipe JSON | < 20 s | CPU 4-thread |
| Nutrition computation | < 0.1 s | CPU |
| FLUX: 1 image (4 steps) | < 12 s on T4 / < 90 s on CPU offload | GPU strongly recommended |
| FLUX: 6 images (final + 5 steps) | < 80 s on T4 | GPU |
| VoxCPM2: 1 step narration | < 5 s | CPU |
| Validator: 1 progress check | < 8 s | CPU |
| **Full recipe end-to-end** | **< 2 min on T4 Space** | — |

**Hardware decision:** rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.
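
A quick way to sanity-check the table is to add up the per-stage ceilings and the credit arithmetic. The sketch below copies the T4 targets from the table and assumes a 5-step recipe with stages running strictly one after another. The dictionary keys and function names are illustrative.

```python
# Per-stage ceilings copied from the table above, in seconds (T4 Space).
STAGE_CEILINGS_S = {
    "vision_ingredient_id": 8.0,
    "planner_propose": 12.0,
    "planner_full_recipe": 20.0,
    "nutrition": 0.1,
    "flux_six_images": 80.0,
    "tts_five_steps": 5 * 5.0,  # assumes 5 steps at < 5 s each
}


def sequential_total_s(stages: dict[str, float]) -> float:
    """Worst case if every stage hits its ceiling and nothing overlaps."""
    return round(sum(stages.values()), 1)


def credit_hours(credits_usd: float, rate_usd_per_hour: float) -> float:
    """Hours of GPU time the credits buy at the given hourly rate."""
    return round(credits_usd / rate_usd_per_hour, 1)


if __name__ == "__main__":
    print("sequential worst case:", sequential_total_s(STAGE_CEILINGS_S), "s")
    print("T4 hours on $20:", credit_hours(20, 0.40))
```

Summed this way, the ceilings come to about 145 s, which is above the 2-minute end-to-end target. Meeting it therefore depends on stages finishing under their ceilings or on overlapping work, for example narrating while images render.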

---

## 6. Risks and mitigations (delta from `estrategia.md`)

| Risk | Mitigation |
|---|---|
| MiniCPM-V-4 has no public GGUF | Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 3, where the planner is loaded. |
| llama-cpp-python's MiniCPM-V chat handler version mismatch | Require `llama-cpp-python>=0.3.2` (matching `requirements.txt`); test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. |
| FLUX.2 Klein 9B too slow on free CPU Space | Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. |
| VoxCPM2 docs sparse | Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. |
| Kaggle dataset has format quirks (HTML in instructions, missing fields) | The Phase 1 normalization step handles this; budget 2 hours. |
| Nutrition CSV missing exotic ingredients | Skip-and-log strategy already designed; demo-day recipes use common ingredients only. |
| Total params exceed 32B if VoxCPM2 is far larger than the ~1B estimate (the other three models sum to ≈17.6B, leaving ≈14.4B of headroom) | Check size in Phase 0; if too large, drop to a smaller TTS. |

---

## 7. "Day-1 hello world" checklist

Before writing any agent code, get this minimal end-to-end loop working — it proves your stack:

1. ☐ Empty Gradio Space deployed, shows "Hello"
2. ☐ `huggingface-cli login` works locally
3. ☐ `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
4. ☐ `from llama_cpp import Llama` runs in your venv
5. ☐ Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
6. ☐ Push the round-trip to the Space; confirm it answers in the cloud

**Only after all 6 are checked, start Phase 1.**

---

## 8. Where this plan differs from `estrategia.md` (deltas to communicate)

| Topic | `estrategia.md` (Spanish, Mexican-cuisine focus) | This document (current requirements) |
|---|---|---|
| Language | Spanish-first | **English only** |
| Cuisine | Mexican | **International** (Kaggle dataset) |
| Voice models | OpenBMB voice + Cohere Labs | **VoxCPM2** only (single voice) |
| Vision model | MiniCPM-V 2.6 / 4 | **MiniCPM-V-4.6** |
| Reasoning model | MiniCPM-4 4B | **MiniCPM-V-4** |
| FLUX runtime | Modal endpoint | **Inside Space (same local-inference principle as the llama.cpp models)**; Modal kept as a future migration target only |
| External APIs at runtime | Allowed (Modal, OpenAI optional) | **None** — full local inference inside Space |
| Nutritional info | Not specified | **Required** at end of recipe |
| Fine-tune dataset | 200 synthetic Mexican recipes | **Kaggle better-recipes (international)** |

If anything in `plan.md` or `estrategia.md` conflicts with this document, **this document wins** — it reflects the latest user requirements.

---

## 9. Definition of done

The implementation is complete when **all** of these are true:

- [ ] Public HF Space `https://huggingface.co/spaces/<you>/cook-with-me` loads
- [ ] App is fully in English
- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
- [ ] Progress validator returns sensible verdicts on 3+ test photos
- [ ] All inference (vision, planner, image generation, TTS) runs locally inside the Space via llama.cpp, diffusers, and VoxCPM2, with **no external API calls at runtime**
- [ ] Total parameters declared in README ≤ 32B
- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
- [ ] Demo video (60–90s) recorded with a real person cooking
- [ ] Field Notes blog post published
- [ ] Submitted on the hackathon platform before deadline