
FredinVázquez

add strategy plan

bad5d84 4 days ago


	# Implementation Plan — "Cook With Me"

	> Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.
	>
	> Hackathon: Small models / Big adventures — June 2026
	> Read first: `plan.md` (the what and why) and `estrategia.md` (the how at a strategic level). This document is the how at a tactical level — turn this into code.

	---

	## 0. Locked decisions (do not re-discuss)

	\| Decision \| Value \| Reason \|
	\|---\|---\|---\|
	\| UI framework \| Gradio \| Hackathon requirement \|
	\| Hosting \| Hugging Face Space \| Hackathon requirement \|
	\| Inference runtime (text + vision) \| llama.cpp via `llama-cpp-python` \| Runs inside the Space CPU, no external APIs needed for now. Future: migrate to Modal \|
	\| Image generation \| FLUX.2 Klein 9B (`black-forest-labs/FLUX.2-klein-9B`) \| Sponsor model; runs in the Space on a rented GPU Space, with `enable_model_cpu_offload()` to cut VRAM use if memory is tight (the offload still needs a GPU). Plan to migrate this specific component to Modal post-hackathon \|
	\| Recipe planner / reasoning \| `openbmb/MiniCPM-V-4` (GGUF) \| Provided requirement \|
	\| Vision (ingredient ID + progress validator) \| `openbmb/MiniCPM-V-4.6` (GGUF) \| Provided requirement \|
	\| Text-to-speech \| OpenBMB VoxCPM2 \| Provided requirement \|
	\| Recipe dataset \| `thedevastator/better-recipes-for-a-better-life` (Kaggle) — international cuisine \| Provided requirement; not limited to Mexican food \|
	\| App language \| English only \| Provided requirement \|
	\| Final output \| Recipe + step images + voice + nutritional values \| Provided requirement \|
	\| External API calls at runtime \| None \| "llama.cpp inside the Space" mandate \|

	---

	## 1. Architecture (final, English-only, llama.cpp-first)

	```
	                        ┌──────────────────────────────────────┐
	                        │     Hugging Face Space (Gradio)      │
	                        │     (CPU + optional GPU upgrade)     │
	                        ├──────────────────────────────────────┤
	📸 Fridge photo ───────▶│ [Vision Agent]                       │
	                        │ MiniCPM-V-4.6 GGUF (llama.cpp)       │
	                        │ → list[ingredient]                   │
	                        │                  │                   │
	                        │                  ▼                   │
	🥘 User picks dish ────▶│ [Recipe Planner]                     │
	                        │ MiniCPM-V-4 GGUF (llama.cpp)         │
	                        │ + retrieval over Kaggle dataset      │
	                        │ → Recipe JSON (steps, nutrition)     │
	                        │                  │                   │
	                        │                  ▼                   │
	                        │ [Step Illustrator]                   │
	                        │ FLUX.2 Klein 9B (diffusers)          │
	                        │ → PNG per step + final dish          │
	                        │                  │                   │
	                        │                  ▼                   │
	                        │ [Narrator]                           │
	                        │ VoxCPM2 → MP3 per step               │
	                        │                  │                   │
	                        │                  ▼                   │
	📸 Progress photo ─────▶│ [Progress Validator]                 │
	                        │ MiniCPM-V-4.6 (vision compare)       │
	                        │ → "go / wait / fix" + tip            │
	                        └──────────────────────────────────────┘
	```

	Total parameter count (≤ 32B requirement):
	- MiniCPM-V-4 (reasoning) ≈ 4B
	- MiniCPM-V-4.6 (vision) ≈ 4.6B
	- FLUX.2 Klein ≈ 9B
	- VoxCPM2 ≈ 1B (estimate)
	- Total ≈ 18.6B ✓
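
The budget check is plain arithmetic. A minimal sketch, using only the estimates listed above (the VoxCPM2 figure is itself an estimate, and the names `PARAMS_B` and `LIMIT_B` are illustrative):

```python
# Parameter estimates in billions, copied from the list above.
PARAMS_B = {
    "MiniCPM-V-4 (reasoning)": 4.0,
    "MiniCPM-V-4.6 (vision)": 4.6,
    "FLUX.2 Klein": 9.0,
    "VoxCPM2 (estimate)": 1.0,
}
LIMIT_B = 32.0  # hackathon ceiling

total_b = sum(PARAMS_B.values())
headroom_b = LIMIT_B - total_b

# Fail loudly if a model swap pushes the stack over the limit.
assert total_b <= LIMIT_B, f"over budget by {-headroom_b:.1f}B"
print(f"total = {total_b:.1f}B, headroom = {headroom_b:.1f}B")
```

Re-running this after any model swap (for example, a larger TTS model) keeps the README's declared total honest.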

	---

	## 2. Repository layout

	```
	cook-with-me/
	├── app.py                        # Gradio entrypoint (Space looks for this)
	├── requirements.txt
	├── packages.txt                  # apt packages (ffmpeg, libsndfile1)
	├── README.md                     # Space card (HF requires YAML frontmatter)
	├── .gitignore
	├── src/
	│   ├── __init__.py
	│   ├── config.py                 # paths, model IDs, constants
	│   ├── models/
	│   │   ├── __init__.py
	│   │   ├── vision.py             # MiniCPM-V-4.6 wrapper (llama-cpp)
	│   │   ├── planner.py            # MiniCPM-V-4 wrapper (llama-cpp)
	│   │   ├── illustrator.py        # FLUX.2 Klein wrapper (diffusers)
	│   │   ├── narrator.py           # VoxCPM2 wrapper
	│   │   └── loader.py             # lazy singletons + GGUF download
	│   ├── agents/
	│   │   ├── mise_en_place.py      # ingredient identification
	│   │   ├── recipe_planner.py     # builds Recipe object
	│   │   ├── step_illustrator.py   # per-step image gen
	│   │   ├── narrator.py           # per-step TTS
	│   │   └── progress_validator.py
	│   ├── data/
	│   │   ├── recipe_index.py       # loads Kaggle dataset, builds retrieval
	│   │   └── nutrition.py          # USDA-style nutrition computation
	│   ├── pipeline.py               # Recipe state machine, orchestration
	│   ├── prompts/
	│   │   ├── vision_prompt.txt
	│   │   ├── planner_system.txt
	│   │   └── validator_prompt.txt
	│   └── ui/
	│       ├── theme.py              # custom CSS (Off-Brand badge)
	│       └── components.py         # reusable Gradio Blocks pieces
	├── scripts/
	│   ├── download_models.py        # pre-warms GGUF + Flux weights at build time
	│   ├── build_recipe_index.py     # caches Kaggle dataset locally
	│   └── smoke_test.py             # end-to-end validation before push
	└── assets/
	    ├── sample_fridge_1.jpg
	    └── sample_progress_1.jpg
	```

	---

	## 3. Phase-by-phase plan (10 days)

	> Each phase has: goal, tasks, deliverable, verification check. Do not move to the next phase if verification fails.

	---

	### Phase 0 — Day 0 (½ day): Account + tooling setup

	Goal: every credential and CLI is ready before writing code.

	Tasks
	1. Create or confirm Hugging Face account; generate a write token (Settings → Access Tokens). Store as `HF_TOKEN` env var locally.
	2. Install Hugging Face CLI: `pip install -U huggingface_hub` then `huggingface-cli login`.
	3. Install Kaggle CLI: `pip install kaggle`. Place `kaggle.json` (Account → API → Create New Token) in `~/.kaggle/kaggle.json` with `chmod 600`.
	4. Install OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
	5. Create and activate a local Python 3.11 venv: `python -m venv .venv && source .venv/bin/activate`.
	6. Create the repo locally: `git init cook-with-me && cd cook-with-me`.
	7. Create an empty Hugging Face Space: huggingface.co → New Space → SDK = Gradio, Hardware = CPU basic (upgrade later if you need GPU for FLUX). Clone it and copy your repo skeleton into it.
	8. Verify model availability: open in a browser and confirm pages exist:
	- `huggingface.co/openbmb/MiniCPM-V-4`
	- `huggingface.co/openbmb/MiniCPM-V-4_6` (the GGUF references in Phase 2 use this underscore form; confirm the exact repo name)
	- `huggingface.co/openbmb/VoxCPM2` (or whatever the exact repo name is — search "VoxCPM" on HF)
	- `huggingface.co/black-forest-labs/FLUX.2-klein-9B`

	Deliverable: empty Space deployed showing "Hello World" Gradio.

	Verify: `https://huggingface.co/spaces/<you>/cook-with-me` loads.

	---

	### Phase 1 — Day 1: Project skeleton + recipe dataset ingestion

	Goal: the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.

	Tasks
	1. Write `requirements.txt` (initial version — packages will be added as phases progress):
	```text
	gradio>=4.44
	huggingface_hub>=0.24
	llama-cpp-python>=0.3.2
	numpy
	pandas
	Pillow
	pydantic>=2
	sentence-transformers
	kagglehub
	pyarrow
	```
	2. Write `packages.txt`:
	```text
	ffmpeg
	libsndfile1
	```
	3. Write `scripts/build_recipe_index.py`:
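	- Call `kagglehub.dataset_download("thedevastator/better-recipes-for-a-better-life")` first; it returns the local download directory, so list that directory to discover `file_path`. Then load the file with `kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path)`.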
	- Normalize columns: `name`, `ingredients` (list[str]), `instructions` (list[str]), `cuisine` (str if present, else "international"), `prep_time`, `servings`.
	- Drop rows missing critical fields. Lowercase + strip ingredient strings.
	- Save to `data/recipes.parquet` (~5–50MB depending on dataset size).
	- Build sentence embeddings of the recipe name + first 3 ingredients using `sentence-transformers/all-MiniLM-L6-v2` and save to `data/recipes_emb.npy`.
	- This script runs once locally; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in `app.py`). If files exceed 100MB, push to a HF Dataset repo: `<you>/cook-with-me-recipes`.
	4. Write `src/data/recipe_index.py`:
	- `class RecipeIndex` with `.search(ingredients: list[str], top_k=5) -> list[RecipeRow]`.
	- Build a query string from ingredients, embed it, cosine-similarity against the cached embeddings, return top-k.
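
The search step above can be sketched in pure Python. This is an illustration of cosine-similarity top-k under stated assumptions, not the plan's `RecipeIndex`: the function names are hypothetical, and the toy 3-dimensional vectors stand in for real `all-MiniLM-L6-v2` embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], recipe_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Indices of the k recipe vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(recipe_vecs)),
        key=lambda i: cosine(query_vec, recipe_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings": rows 0 and 1 point roughly the same way as the query.
recipes = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best = top_k([1.0, 0.05, 0.0], recipes, k=2)
print(best)
```

With the cached `.npy` matrix, the same ranking is usually done as one NumPy matrix-vector product over pre-normalized rows, which is much faster than this loop.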

	Deliverable: `python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))"` prints 5 sensible recipes.

	Verify: at least 3 of the top-5 results contain ≥2 of the input ingredients.
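
The verification rule can be automated. A minimal sketch, assuming each search result exposes its ingredient list as `list[str]`; the function name and the substring-matching choice are assumptions, not part of the plan:

```python
def passes_overlap_check(
    results: list[list[str]],
    query: list[str],
    min_shared: int = 2,
    min_hits: int = 3,
) -> bool:
    """True when at least `min_hits` results share `min_shared` query ingredients."""
    wanted = {q.lower().strip() for q in query}

    def shared(ingredients: list[str]) -> int:
        # Substring match so "yellow onion" counts as a hit for "onion".
        return sum(any(w in ing.lower() for ing in ingredients) for w in wanted)

    hits = sum(shared(ingredients) >= min_shared for ingredients in results)
    return hits >= min_hits


# Hypothetical top-5 ingredient lists for the query below.
top5 = [
    ["chicken breast", "yellow onion", "rice"],
    ["tomato", "onion"],
    ["chicken", "tomato paste"],
    ["beef", "carrot"],
    ["egg"],
]
print(passes_overlap_check(top5, ["chicken", "onion", "tomato"]))
```

Substring matching is deliberately lenient; it will also count "onion powder" as "onion", so tighten it if false positives show up.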

	---

	### Phase 2 — Day 2: Vision agent (Mise en Place) — MiniCPM-V-4.6 via llama.cpp

	Goal: given a fridge photo, return a clean list of English ingredient names.

	Background: llama.cpp supports multimodal models through a vision projector (`mmproj-*.gguf`) plus the language model GGUF. MiniCPM-V family ships both files on the Hub.

	Tasks
	1. Find the GGUF release of MiniCPM-V-4.6. Search HF for `MiniCPM-V-4_6-gguf` or `openbmb/MiniCPM-V-4_6-gguf`. You need two files:
	- `Model-Q4_K_M.gguf` (or similar quant)
	- `mmproj-model-f16.gguf` (the vision projector)
	2. Write `src/models/loader.py`:
	```python
	from huggingface_hub import hf_hub_download
	from llama_cpp import Llama
	from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler

	_vision = None

	def get_vision_model():
	    global _vision
	    if _vision is None:
	        model_path = hf_hub_download(
	            repo_id="openbmb/MiniCPM-V-4_6-gguf",  # confirm exact repo
	            filename="Model-Q4_K_M.gguf",
	        )
	        mmproj_path = hf_hub_download(
	            repo_id="openbmb/MiniCPM-V-4_6-gguf",
	            filename="mmproj-model-f16.gguf",
	        )
	        handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
	        _vision = Llama(
	            model_path=model_path,
	            chat_handler=handler,
	            n_ctx=4096,
	            n_threads=4,
	            verbose=False,
	        )
	    return _vision
	```
	3. Write `src/agents/mise_en_place.py`:
	```python
	import base64, io, json
	from PIL import Image
	from src.models.loader import get_vision_model

	PROMPT = (
	    "You are an ingredient detector. Look at the fridge/pantry photo and "
	    "list every edible ingredient you can identify. Return strict JSON: "
	    '{"ingredients": ["chicken", "onion", "tomato", ...]} '
	    "Lowercase, English, no brand names, no containers."
	)

	def _img_to_data_url(img: Image.Image) -> str:
	    buf = io.BytesIO()
	    # JPEG cannot store an alpha channel, so convert RGBA/P uploads first.
	    img.convert("RGB").save(buf, "JPEG", quality=85)
	    b64 = base64.b64encode(buf.getvalue()).decode()
	    return f"data:image/jpeg;base64,{b64}"

	def identify_ingredients(image: Image.Image) -> list[str]:
	    llm = get_vision_model()
	    out = llm.create_chat_completion(
	        messages=[
	            {"role": "user", "content": [
	                {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
	                {"type": "text", "text": PROMPT},
	            ]}
	        ],
	        temperature=0.2,
	        response_format={"type": "json_object"},
	    )
	    data = json.loads(out["choices"][0]["message"]["content"])
	    # Tolerate a missing key and drop empty strings.
	    return [s.lower().strip() for s in data.get("ingredients", []) if s.strip()]
	```
	4. Test locally with 5 sample fridge photos.

	Deliverable: the function returns a non-empty English list with ≥80% precision on a clean fridge photo.

	Verify: stash these 5 results in `tests/vision_smoke.json` for regression checks.
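
The ≥80% precision target needs a hand-labelled ingredient list per test photo. A minimal sketch of the metric; the function name and the sample lists are illustrative, not measured results:

```python
def precision(predicted: list[str], expected: list[str]) -> float:
    """Share of predicted ingredients that really are in the photo."""
    pred = {p.lower().strip() for p in predicted}
    truth = {e.lower().strip() for e in expected}
    if not pred:
        return 0.0  # an empty prediction never passes the check
    return len(pred & truth) / len(pred)


# Hypothetical model output vs. a hand-labelled photo.
model_output = ["chicken", "onion", "tomato", "milk", "butter"]
hand_labels = ["chicken", "onion", "tomato", "milk", "eggs"]
score = precision(model_output, hand_labels)
print(f"precision = {score:.0%}")
```

Storing both lists per photo in `tests/vision_smoke.json` lets the same function serve as the regression check later.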

	---

	### Phase 3 — Day 3: Recipe Planner — MiniCPM-V-4 via llama.cpp + retrieval

	Goal: given a list of ingredients (and optionally a chosen dish), return a fully structured `Recipe` JSON including steps, durations, visual descriptions, and nutritional values.

	Tasks
	1. Find or convert MiniCPM-V-4 to GGUF. Likely repo: `openbmb/MiniCPM-V-4-gguf` or community quants. Pick `Q4_K_M`.
	2. Add to `src/models/loader.py` a `get_planner_model()` (same pattern as vision but without `chat_handler`).
	3. Write `src/agents/recipe_planner.py`:
	- Step A — propose: call planner with `I have: [ingredients]. Propose 3 dish options that fit. Reply JSON.`
	- Step B — retrieve: for the chosen dish name, call `RecipeIndex.search(...)` and pick the closest match. Use it as a grounded reference.
	- Step C — restructure: prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical `Recipe` JSON schema below. The retrieval grounds the model and prevents hallucinated steps.
	- Step D — nutrition: from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
	4. Define the canonical schema in `src/pipeline.py` using Pydantic:
	```python
	from pydantic import BaseModel
	from typing import Optional

	class Step(BaseModel):
	    n: int
	    instruction: str  # English, imperative
	    duration: str  # "4 minutes"
	    visual: str  # English visual description for FLUX prompt
	    tip: Optional[str] = None

	class Nutrition(BaseModel):
	    calories: int  # per serving
	    protein_g: float
	    carbs_g: float
	    fat_g: float
	    fiber_g: float

	class Recipe(BaseModel):
	    name: str
	    cuisine: str
	    servings: int
	    total_time_minutes: int
	    options: list[dict] = []  # only populated on "propose" call, so default to empty
	    ingredients_have: list[str]
	    ingredients_missing: list[str]
	    substitutes: dict[str, list[str]]
	    steps: list[Step]
	    final_dish_visual: str
	    nutrition_per_serving: Nutrition
	```
	5. Write the system prompt (`src/prompts/planner_system.txt`):
	- Persona: international chef
	- Hard rule: output JSON only, matching schema
	- Hard rule: prefer dishes feasible with available ingredients
	- Hard rule: 5–7 steps, each ≤ 25 words, each with a concrete `visual` field for image generation
	- Hard rule: include `nutrition_per_serving` (model is allowed to estimate; you'll override with `data/nutrition.py` for accuracy)
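	6. Use `response_format={"type": "json_object"}` in the chat completion call. Set `temperature=0.7, top_p=0.95` for the propose step (creative) and `temperature=0.4` for the structured-output step (more deterministic). `enable_thinking=True` is a chat-template switch, not a `create_chat_completion` argument in llama-cpp-python 0.3.x, so turn it on for the propose step only if the model's chat template exposes it.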

	Deliverable: for `["chicken","onion","tomato","tortilla","cheese"]` and chosen dish "chicken tinga", the function returns a valid `Recipe` Pydantic object with 5–7 steps.

	Verify: the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.
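
The structural part of this check can run on the raw dict before Pydantic validation. A minimal sketch, assuming the step fields from the schema above; the function name `recipe_problems` is hypothetical:

```python
REQUIRED_STEP_FIELDS = ("n", "instruction", "duration", "visual")


def recipe_problems(recipe: dict, max_words: int = 25) -> list[str]:
    """Return human-readable problems; an empty list means the recipe passes."""
    problems = []
    steps = recipe.get("steps", [])
    if not 5 <= len(steps) <= 7:
        problems.append(f"expected 5-7 steps, got {len(steps)}")
    for i, step in enumerate(steps, start=1):
        for field in REQUIRED_STEP_FIELDS:
            if step.get(field) in (None, ""):
                problems.append(f"step {i}: missing {field}")
        # The system prompt caps each instruction at 25 words.
        if len(str(step.get("instruction", "")).split()) > max_words:
            problems.append(f"step {i}: instruction longer than {max_words} words")
    return problems


good = {
    "steps": [
        {"n": i, "instruction": "Stir the sauce.", "duration": "2 minutes",
         "visual": "red sauce simmering in a pan"}
        for i in range(1, 6)
    ]
}
print(recipe_problems(good))
```

Returning a list of problems instead of raising makes it easy to feed the failures back to the planner in a retry prompt.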

	---

	### Phase 3.5 — Day 3 (afternoon): Nutritional values

	Goal: the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).

	Approach: small, embedded reference table beats LLM math.

	Tasks
	1. Bundle `data/nutrition_table.csv` — a 200-row CSV mapping common English ingredient names to per-100g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
	2. Write `src/data/nutrition.py`:
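	- `parse_quantity(line: str) -> tuple[float, str]` (grams, ingredient name): handle "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex + a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The cup/tbsp/tsp values are millilitres treated as grams, which is only accurate for water-like liquids (a cup of flour weighs roughly half of 240 g), so add per-ingredient densities if the ±25% target is missed.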
	- `compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition` — sum per-100g values weighted by grams, divide by servings.
	- If a line cannot be parsed, skip it and log; don't crash.
	3. After the planner returns a recipe, overwrite `recipe.nutrition_per_serving` with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.
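
The parser and the weighted sum above can be sketched as follows. This is a minimal illustration under stated assumptions: the `table` argument stands in for `data/nutrition_table.csv`, the macro values in the usage example are illustrative placeholders rather than USDA figures, and the volume units use the water-density approximation from the unit table.

```python
import re
from typing import Optional

# Unit -> grams. Volume units assume water density, so they are rough for dry goods.
UNIT_GRAMS = {"g": 1.0, "kg": 1000.0, "oz": 28.35, "cup": 240.0, "tbsp": 15.0, "tsp": 5.0}
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s+(.+?)\s*$")
MACROS = ("calories", "protein_g", "carbs_g", "fat_g", "fiber_g")


def parse_quantity(line: str) -> Optional[tuple[float, str]]:
    """Parse '2 cups flour' into (480.0, 'flour'); None if the line is not understood."""
    match = LINE_RE.match(line)
    if not match:
        return None
    amount, unit, name = match.groups()
    unit = unit.lower()
    # Accept simple plurals such as "cups".
    grams_per_unit = UNIT_GRAMS.get(unit) or UNIT_GRAMS.get(unit.rstrip("s"))
    if grams_per_unit is None:
        return None
    return float(amount) * grams_per_unit, name.lower()


def compute_nutrition(lines: list[str], servings: int, table: dict) -> dict:
    """Sum per-100 g macros weighted by grams, then divide by servings."""
    totals = dict.fromkeys(MACROS, 0.0)
    for line in lines:
        parsed = parse_quantity(line)
        if parsed is None or parsed[1] not in table:
            print(f"skipped: {line!r}")  # skip and log, never crash
            continue
        grams, name = parsed
        for macro in MACROS:
            totals[macro] += table[name][macro] * grams / 100.0
    return {macro: round(value / servings, 1) for macro, value in totals.items()}


# Illustrative per-100 g values only; load the real ones from the bundled CSV.
demo_table = {"chicken": {"calories": 165, "protein_g": 31, "carbs_g": 0, "fat_g": 3.6, "fiber_g": 0}}
per_serving = compute_nutrition(["200 g chicken", "a pinch of salt"], 2, demo_table)
print(per_serving)
```

The unparseable "a pinch of salt" line is logged and skipped, which matches the skip-and-log rule in task 2.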

	Deliverable: for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.

	---

	### Phase 4 — Day 4: Step Illustrator — FLUX.2 Klein 9B

	Goal: generate an appetizing image for the final dish + one image per step.

	Constraint: FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:
	- Path A (recommended for the hackathon): upgrade the Space to a GPU instance (T4 or A10G — paid, but $20 HF credits cover it for a week of development). Code stays unchanged.
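	- Path B (fallback): call `enable_model_cpu_offload()` with `num_inference_steps=4` and accept ~90 s/image; only feasible for pre-rendered demo recipes, not live runs. Note that the offload keeps weights in CPU RAM and moves each component to the GPU only while it runs, so it lowers VRAM use but still requires a GPU. On a CPU-only Space, skip the call and expect the ~10 minutes per image quoted above.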

	Tasks
	1. Add to `requirements.txt`:
	```text
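	diffusers  # needs a release that ships Flux2KleinPipeline (0.31 predates FLUX.2); pin after checking the model card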
	transformers>=4.45
	accelerate
	torch
	safetensors
	```
	2. Write `src/models/illustrator.py`:
	```python
	import torch
	from diffusers import Flux2KleinPipeline

	_pipe = None

	def get_flux():
	    global _pipe
	    if _pipe is None:
	        dtype = torch.bfloat16
	        _pipe = Flux2KleinPipeline.from_pretrained(
	            "black-forest-labs/FLUX.2-klein-9B",
	            torch_dtype=dtype,
	        )
	        if torch.cuda.is_available():
	            # Offload parks idle components in CPU RAM to save VRAM; it still needs a GPU.
	            _pipe.enable_model_cpu_offload()
	    return _pipe

	def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
	    pipe = get_flux()
	    device = "cuda" if torch.cuda.is_available() else "cpu"
	    img = pipe(
	        prompt=prompt,
	        height=1024, width=1024,
	        guidance_scale=1.0,
	        num_inference_steps=4,  # Klein is distilled, so 4 steps is enough
	        generator=torch.Generator(device=device).manual_seed(seed),  # fixed seed = reproducible image
	    ).images[0]
	    return img
	```
	3. Write `src/agents/step_illustrator.py`:
	- For each `Step.visual`, build a prompt like:
	> `f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."`
	- Generate the final dish image first, then the per-step images, all in one Python loop (no parallelism — FLUX holds the GPU).
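	- Cache results on disk keyed by a SHA-256 digest of the prompt (`hashlib.sha256`) to avoid re-renders on re-runs. Do not use the built-in `hash()`: string hashes are salted per Python process, so the keys would not survive a restart.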
	- Emit Gradio progress updates so the UI doesn't appear frozen.
	4. Critical tuning: keep `num_inference_steps=4` (Klein is distilled). Higher counts blow latency and offer minimal quality gain at this scale.
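
The prompt template and the on-disk cache from task 3 can be sketched together. This is a minimal illustration, not the plan's `step_illustrator.py`: the function names are hypothetical, and `render` is assumed to be any callable that returns encoded PNG bytes (the real FLUX wrapper returns a PIL image, which would need encoding first).

```python
import hashlib
from pathlib import Path
from typing import Callable


def step_prompt(visual: str, cuisine: str) -> str:
    """Fill the per-step prompt template with one step's visual description."""
    return (
        f"Top-down photo of a kitchen pan or plate showing {visual}. "
        f"{cuisine} home cooking, warm natural lighting, recipe magazine style, "
        "photorealistic, appetizing."
    )


def cache_key(prompt: str, seed: int = 0) -> str:
    """SHA-256 hex digest; stable across processes, unlike built-in hash()."""
    return hashlib.sha256(f"{seed}:{prompt}".encode("utf-8")).hexdigest()


def cached_render(
    prompt: str,
    render: Callable[[str], bytes],
    cache_dir: Path,
    seed: int = 0,
) -> bytes:
    """Return cached PNG bytes if present; otherwise render once and store."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(prompt, seed)}.png"
    if path.exists():
        return path.read_bytes()
    data = render(prompt)
    path.write_bytes(data)
    return data
```

Including the seed in the key is a design choice made here so that two renders of the same prompt with different seeds do not overwrite each other.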

	Deliverable: for a 5-step recipe, all 6 images (final + 5 steps) render in:
	- < 80 seconds on a T4 GPU Space (6 × 12 s, matching the performance budget in section 5)
	- < 9 minutes with CPU offload (6 × 90 s; acceptable only for pre-cached demos)

	Verify: show the 6 images to an unprompted human; ≥4 should be described as "appetizing".

	---

	### Phase 5 — Day 5: Narrator — VoxCPM2

	Goal: every step's instruction is rendered to an MP3 in a warm, clear English voice.

	Tasks
	1. Confirm the exact VoxCPM2 repo name on HF (`openbmb/VoxCPM2` or similar). Read its README for the inference snippet — TTS APIs vary widely between models.
	2. Add to `requirements.txt`: `soundfile`, `torchaudio`, `numpy`. If VoxCPM2 ships GGUF, use it via `llama-cpp-python` audio extension (if available); otherwise load via `transformers` directly.
	3. Write `src/models/narrator.py`:
	```python
	_tts = None

	def get_tts():
	    global _tts
	    if _tts is None:
	        # placeholder — replace with the exact VoxCPM2 loading code from its README
	        from transformers import AutoModel, AutoProcessor
	        _tts = ...  # load on CPU; VoxCPM2 is small (~1B)
	    return _tts

	def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
	    """Returns MP3 bytes."""
	    tts = get_tts()
	    wav = tts.generate(text, voice=voice)  # API depends on VoxCPM2
	    # encode wav -> mp3 with soundfile + ffmpeg-python or pydub
	    mp3_bytes = ...  # placeholder: output of the wav -> mp3 encoding step
	    return mp3_bytes
	```
	4. Write `src/agents/narrator.py`:
	- For each step, synthesize `step.instruction`. If `step.tip` is set, synthesize a separate "tip" clip.
	- Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
	5. Pre-render all step audio when the recipe is finalized — never stream per-step in the demo (too much UI lag).
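
The pre-rendering loop from tasks 4 and 5 can be sketched independently of the TTS model. A minimal illustration, assuming `synthesize` is any callable that maps text to MP3 bytes; the function name, file naming scheme, and return shape are assumptions:

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def narrate_steps(
    steps: list[dict],
    synthesize: Callable[[str], bytes],
    out_dir: Optional[Path] = None,
) -> list[dict]:
    """Write one MP3 per instruction (and per tip) and return the file paths."""
    # A fresh temp directory per recipe keeps clips from different recipes apart.
    out_dir = out_dir or Path(tempfile.mkdtemp(prefix="recipe_audio_"))
    results = []
    for step in steps:
        entry = {"n": step["n"], "instruction": None, "tip": None}
        for kind in ("instruction", "tip"):
            text = step.get(kind)
            if not text:
                continue  # tips are optional
            path = out_dir / f"step_{step['n']:02d}_{kind}.mp3"
            path.write_bytes(synthesize(text))
            entry[kind] = str(path)
        results.append(entry)
    return results
```

Running this once when the recipe is finalized means every "Play" button in the UI only has to serve an existing file.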

	Deliverable: clicking "Play" on step 1 in the UI plays clear English narration.

	Verify: on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.

	---

	### Phase 6 — Day 6: Gradio UI (Off-Brand)

	Goal: the Space looks like a recipe magazine, not stock Gradio.

	Tasks
	1. Write `src/ui/theme.py`:
	```python
	import gradio as gr

	theme = gr.themes.Soft(
	    primary_hue="orange",
	    neutral_hue="stone",
	    font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
	    font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
	)

	CSS = """
	.gradio-container { background: #f5ecd9 !important; }
	.recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
	.recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
	.step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
	.nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
	.nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
	"""
	```
	2. Write `app.py` with three tabs:
	- Tab 1 — Cook: fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
	- Tab 2 — Check Progress: upload a progress photo + select active step → validator returns badge (`go/wait/fix`) + tip + audio.
	- Tab 3 — About / Tech: README-style explanation, badges, model list.
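	3. Use `gr.Blocks` with `gr.State` to hold the current recipe across UI events. Store it as a plain `dict`: return `recipe.model_dump()` from the callback as the value of the `state` output, and rebuild the object with `Recipe.model_validate(state)` at the start of the next callback. Do not assign `state.value` inside a callback; that attribute is the component's initial value, not the per-session one.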
	4. Wire callbacks:
	- `btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])`
	- `dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])`
	- `progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])`

	Deliverable: end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition. No Gradio default look anywhere.

	---

	### Phase 7 — Day 7: Progress Validator (closed loop)

	Goal: user uploads a progress photo, app says "go / wait / fix" with a voiced tip.

	Tasks
	1. Write `src/agents/progress_validator.py`:
	```python
	# Assumes the template is filled with PROMPT.format(instruction=...),
	# so the literal JSON braces are doubled.
	PROMPT = """Compare these two cooking photos.
	Photo 1 (target): how it should look after the step "{instruction}".
	Photo 2 (user's pan/plate): the user's current progress.
	Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
	- "go": looks right, move to next step
	- "wait": needs more time, do not change anything yet
	- "fix": something is off; suggest a concrete adjustment in one sentence
	"""

	def validate(target_img, user_img, step_instruction): ...
	```
	2. Use the same vision model singleton as Phase 2 — both calls share weights.
	3. Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.
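
Turning the model's reply into a badge needs defensive parsing, since a small vision model will sometimes return malformed JSON. A minimal sketch; the function name is hypothetical, and falling back to "wait" on unreadable output is a design assumption made here, not something the plan specifies:

```python
import json

BADGE_COLORS = {"go": "green", "wait": "amber", "fix": "red"}


def parse_verdict(raw: str) -> dict:
    """Parse the validator's JSON reply; fall back to 'wait' on anything malformed."""
    fallback = {
        "verdict": "wait",
        "feedback": "Could not read the photo check.",
        "tip": "",
        "color": BADGE_COLORS["wait"],
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = str(data.get("verdict", "")).lower().strip()
    if verdict not in BADGE_COLORS:
        return fallback
    return {
        "verdict": verdict,
        "feedback": str(data.get("feedback", "")),
        "tip": str(data.get("tip", "")),
        "color": BADGE_COLORS[verdict],
    }


print(parse_verdict('{"verdict": "fix", "feedback": "Too pale", "tip": "Raise the heat."}'))
```

"wait" is the conservative default because it never tells the cook to move on or change something based on a reply the app could not read.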

	Deliverable: running the validator on 5 real progress photos returns the correct verdict on ≥3.

	---

	### Phase 8 — Day 8: Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)

	> Important caveat: The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still requires GPU, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does not violate the runtime constraint — only the build pipeline touches a GPU.

	Goal: publish a fine-tuned Planner GGUF to the Hub and load it from the Space.

	Tasks
	1. Build SFT dataset (`scripts/build_sft_dataset.py`):
	- Load Kaggle `better-recipes` dataset.
	- For each recipe, build a `(prompt, completion)` pair where `prompt` is `"Available ingredients: X, Y, Z. Propose recipe."` and `completion` is the full canonical `Recipe` JSON.
	- Generate ~1000 pairs, push to `<you>/cook-with-me-sft` HF Dataset.
	2. LoRA training (`scripts/train_planner.py` — to be run on a GPU machine, not the Space):
	```python
	# peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
	# r=16, alpha=32, lr=2e-4, epochs=2, batch=4
	# push_to_hub=True, hub_model_id="<you>/cook-with-me-planner-4b"
	```
	3. Convert to GGUF (Day 8 evening):
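	- Merge the LoRA adapter into the base weights first (e.g. `merge_and_unload()` in `peft`), then run `llama.cpp/convert_hf_to_gguf.py` and quantize to `Q4_K_M` with `llama-quantize` (named `quantize` in older llama.cpp builds).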
	- Push GGUF to `<you>/cook-with-me-planner-4b-gguf`.
	4. Update `src/models/loader.py` to point at your GGUF instead of the base model.
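
The `(prompt, completion)` pairs from task 1 can be sketched with the standard library. A minimal illustration, assuming each normalized row already carries an `ingredients` list and a `recipe` dict in the canonical schema; both field names and the function names are assumptions:

```python
import json
from pathlib import Path


def to_sft_pair(row: dict) -> dict:
    """Build one training pair in the prompt format described in task 1."""
    prompt = f"Available ingredients: {', '.join(row['ingredients'])}. Propose recipe."
    completion = json.dumps(row["recipe"], ensure_ascii=False)
    return {"prompt": prompt, "completion": completion}


def write_jsonl(rows: list[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of pairs written."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(to_sft_pair(row), ensure_ascii=False) + "\n")
    return len(rows)


# Hypothetical normalized row.
sample = {"ingredients": ["chicken", "onion"], "recipe": {"name": "chicken tinga", "servings": 4}}
pair = to_sft_pair(sample)
print(pair["prompt"])
```

JSONL is a convenient interchange format here because the `datasets` library can load it directly before the push to the Hub.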

	Deliverable: the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.

	---

	### Phase 9 — Day 9: End-to-end test, performance pass, pre-warm cache

	Goal: the Space loads in <60s and a full recipe (text + 5 images + 5 audios + nutrition) renders in <2 minutes on the chosen hardware.

	Tasks
	1. Write `scripts/smoke_test.py` that runs the full pipeline on 3 sample fridge photos and asserts:
	- Each ingredient list is non-empty
	- Each recipe has 5–7 steps
	- Each step has a non-empty image and audio path
	- Nutrition has all 5 macros set
	2. Implement on-disk caching for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to `~/.cache/cook-with-me/flux/`.
	3. Pre-render and commit 3 fully-prepared demo recipes (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
	4. Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
	5. Add a "Loading models..." progress bar on first request — first cold start can take 90s.
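
The error-handling rule in task 4 can be sketched as a decorator. This is a minimal illustration with hypothetical names (`ui_safe`, `FRIENDLY_ERROR`); it assumes a callback with a single text output, so callbacks with several outputs would need a matching return shape.

```python
import functools
import logging

logger = logging.getLogger("cook_with_me")
FRIENDLY_ERROR = "Something went wrong in the kitchen. Please try again."


def ui_safe(fn):
    """Wrap a UI callback so failures are logged and a friendly message is returned."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Full traceback goes to the Space logs, never to the user.
            logger.exception("callback %s failed", fn.__name__)
            return FRIENDLY_ERROR

    return wrapper


@ui_safe
def describe_step(step: dict) -> str:
    return f"Step {step['n']}: {step['instruction']}"


print(describe_step({"n": 1, "instruction": "Chop the onion."}))
print(describe_step({}))  # missing keys trigger the friendly message
```

In a Gradio app, raising `gr.Error("...")` from the `except` branch is another option; it shows the message in a pop-up instead of in an output component.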

	Deliverable: smoke test passes on the live Space.

	---

	### Phase 10 — Day 10: README, demo video, social post, submit

	Tasks
	1. Write `README.md` with the required HF Space frontmatter:
	```yaml
	---
	title: Cook With Me
	emoji: 🍲
	colorFrom: orange
	colorTo: yellow
	sdk: gradio
	sdk_version: 4.44.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	---
	```
	Followed by:
	- One-paragraph pitch
	- 60-second demo video embed
	- Architecture diagram (export from `arquitectura.html` as PNG)
	- Section: "How closed-loop visual cooking guidance works"
	- Models used (with HF links + total parameter count)
	- Badges declared
	- Build / run instructions
	2. Record a 60–90 second demo video: real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
	3. Write the Field Notes blog post about one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8 — here's why" or "Closed-loop validation needs the same vision model on both sides").
	4. Social post on X / LinkedIn with the demo video.
	5. Submit on the hackathon platform.

	---

	## 4. Tools usage matrix (when to reach for what)

	\| Phase \| Primary tools \| Why \|
	\|---\|---\|---\|
	\| 0 — setup \| HF CLI, Kaggle CLI, OpenAI Codex CLI \| one-shot config \|
	\| 1 — data \| `kagglehub`, `pandas`, `sentence-transformers` \| offline dataset prep \|
	\| 2 — vision \| `llama-cpp-python` + `MiniCPMv26ChatHandler` \| runs inside Space, badge: Llama Champion \|
	\| 3 — planner \| `llama-cpp-python` + retrieval over local parquet \| grounded JSON output \|
	\| 3.5 — nutrition \| local CSV + regex parser \| reliable, no LLM math \|
	\| 4 — illustrator \| `diffusers` + `Flux2KleinPipeline` \| sponsor model showcase \|
	\| 5 — narrator \| VoxCPM2 via `transformers` (or its native API) \| local TTS \|
	\| 6 — UI \| `gradio` + custom CSS theme \| Off-Brand badge \|
	\| 7 — validator \| same vision singleton as phase 2 \| closed-loop innovation, Best Agent \|
	\| 8 — fine-tune \| `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine \| Well-Tuned badge \|
	\| 9 — test/cache \| `pytest`, `hashlib`, on-disk FLUX cache \| demo reliability \|
	\| 10 — submit \| HF Spaces, video tool, social \| shipping \|

	---

	## 5. Performance budget on the HF Space

	\| Operation \| Target latency \| Hardware needed \|
	\|---\|---\|---\|
	\| Vision: ingredient ID \| < 8 s \| CPU 4-thread \|
	\| Planner: propose 3 dishes \| < 12 s \| CPU 4-thread \|
	\| Planner: build full recipe JSON \| < 20 s \| CPU 4-thread \|
	\| Nutrition computation \| < 0.1 s \| CPU \|
	\| FLUX: 1 image (4 steps) \| < 12 s on T4 / < 90 s on CPU offload \| GPU strongly recommended \|
	\| FLUX: 6 images (final + 5 steps) \| < 80 s on T4 \| GPU \|
	\| VoxCPM2: 1 step narration \| < 5 s \| CPU \|
	\| Validator: 1 progress check \| < 8 s \| CPU \|
	\| Full recipe end-to-end \| < 2 min on T4 Space \| — \|
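
Adding up the per-stage ceilings is a quick sanity check on the end-to-end target. A minimal sketch using only the numbers in the table above, for a 5-step recipe on a T4 Space (the dictionary keys are illustrative labels):

```python
# Upper bounds in seconds, copied from the table above.
STAGES_S = {
    "vision: ingredient ID": 8,
    "planner: propose 3 dishes": 12,
    "planner: full recipe JSON": 20,
    "nutrition": 0.1,
    "FLUX: 6 images": 80,
    "VoxCPM2: 5 narrations": 5 * 5,
}
TARGET_S = 2 * 60

# Everything run back to back, from photo upload to finished recipe.
total_s = sum(STAGES_S.values())

# Only the work that happens after the user has picked a dish.
before_pick = ("vision: ingredient ID", "planner: propose 3 dishes")
after_pick_s = sum(v for k, v in STAGES_S.items() if k not in before_pick)

print(f"serial total: {total_s:.1f} s, after dish pick: {after_pick_s:.1f} s, target: {TARGET_S} s")
```

Run strictly in sequence, the ceilings add up to more than the 2-minute target, so hitting it depends on stages finishing under their ceilings. Overlapping CPU-side narration with GPU-side image generation would also close the gap, though that overlap is a suggestion here and not something the plan specifies.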

	Hardware decision: rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.

	---

	## 6. Risks and mitigations (delta from `estrategia.md`)

	\| Risk \| Mitigation \|
	\|---\|---\|
	\| MiniCPM-V-4 has no public GGUF \| Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 3, where the planner is first loaded. \|
	\| llama-cpp-python's MiniCPM-V chat handler version mismatch \| Require `llama-cpp-python>=0.3.2` (as in `requirements.txt`); test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. \|
	\| FLUX.2 Klein 9B too slow on free CPU Space \| Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. \|
	\| VoxCPM2 docs sparse \| Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. \|
	\| Kaggle dataset has format quirks (HTML in instructions, missing fields) \| The Phase 1 normalization step handles this; budget 2 hours. \|
	\| Nutrition CSV missing exotic ingredients \| Skip-and-log strategy already designed; demo-day recipes use common ingredients only. \|
	\| Total params >32B if VoxCPM2 turns out to be 7B \| Check size in Phase 0; if too large, drop to a smaller TTS. \|

	---

	## 7. "Day-1 hello world" checklist

	Before writing any agent code, get this minimal end-to-end loop working — it proves your stack:

	1. ☐ Empty Gradio Space deployed, shows "Hello"
	2. ☐ `huggingface-cli login` works locally
	3. ☐ `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
	4. ☐ `from llama_cpp import Llama` runs in your venv
	5. ☐ Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
	6. ☐ Push the round-trip to the Space; confirm it answers in the cloud

	Only after all 6 are checked, start Phase 1.

	---

	## 8. Where this plan differs from `estrategia.md` (deltas to communicate)

	\| Topic \| `estrategia.md` (Spanish, Mexican-cuisine focus) \| This document (current requirements) \|
	\|---\|---\|---\|
	\| Language \| Spanish-first \| English only \|
	\| Cuisine \| Mexican \| International (Kaggle dataset) \|
	\| Voice models \| OpenBMB voice + Cohere Labs \| VoxCPM2 only (single voice) \|
	\| Vision model \| MiniCPM-V 2.6 / 4 \| MiniCPM-V-4.6 \|
	\| Reasoning model \| MiniCPM-4 4B \| MiniCPM-V-4 \|
	\| FLUX runtime \| Modal endpoint \| Inside Space (same local-inference principle as llama.cpp); Modal kept as a future migration target only \|
	\| External APIs at runtime \| Allowed (Modal, OpenAI optional) \| None — full local inference inside Space \|
	\| Nutritional info \| Not specified \| Required at end of recipe \|
	\| Fine-tune dataset \| 200 synthetic Mexican recipes \| Kaggle better-recipes (international) \|

	If anything in `plan.md` or `estrategia.md` conflicts with this document, this document wins — it reflects the latest user requirements.

	---

	## 9. Definition of done

	The implementation is complete when all of these are true:

	- [ ] Public HF Space `https://huggingface.co/spaces/<you>/cook-with-me` loads
	- [ ] App is fully in English
	- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
	- [ ] Progress validator returns sensible verdicts on 3+ test photos
	- [ ] All inference runs locally inside the Space (vision and planner through llama.cpp, images through diffusers, TTS through VoxCPM2); no external API calls at runtime
	- [ ] Total parameters declared in README ≤ 32B
	- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
	- [ ] Demo video (60–90s) recorded with a real person cooking
	- [ ] Field Notes blog post published
	- [ ] Submitted on the hackathon platform before deadline