# Implementation Plan: "Cook With Me"
> Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.
>
> **Hackathon:** Small models / Big adventures, June 2026
> **Read first:** `plan.md` (the *what* and *why*) and `estrategia.md` (the *how* at a strategic level). This document is the *how* at a tactical level: turn this into code.
---
## 0. Locked decisions (do not re-discuss)
| Decision | Value | Reason |
|---|---|---|
| UI framework | **Gradio** | Hackathon requirement |
| Hosting | **Hugging Face Space** | Hackathon requirement |
| Inference runtime (text + vision) | **llama.cpp** via `llama-cpp-python` | Runs inside the Space CPU, no external APIs needed for now. Future: migrate to Modal |
| Image generation | **FLUX.2 Klein 9B** (`black-forest-labs/FLUX.2-klein-9B`) | Sponsor model; runs in the Space if a GPU Space is rented (or via `enable_model_cpu_offload()` as fallback). Plan to migrate this specific component to Modal post-hackathon |
| Recipe planner / reasoning | **`openbmb/MiniCPM-V-4`** (GGUF) | Provided requirement |
| Vision (ingredient ID + progress validator) | **`openbmb/MiniCPM-V-4.6`** (GGUF) | Provided requirement |
| Text-to-speech | **OpenBMB VoxCPM2** | Provided requirement |
| Recipe dataset | **`thedevastator/better-recipes-for-a-better-life`** (Kaggle), international cuisine | Provided requirement; not limited to Mexican food |
| App language | **English only** | Provided requirement |
| Final output | **Recipe + step images + voice + nutritional values** | Provided requirement |
| External API calls at runtime | **None** | "llama.cpp inside the Space" mandate |
---
## 1. Architecture (final, English-only, llama.cpp-first)
```
                         ┌──────────────────────────────────────┐
                         │     Hugging Face Space (Gradio)      │
                         │     (CPU + optional GPU upgrade)     │
                         ├──────────────────────────────────────┤
📸 Fridge photo ───────▶ │ [Vision Agent]                       │
                         │ MiniCPM-V-4.6 GGUF (llama.cpp)       │
                         │ → list[ingredient]                   │
                         │            │                         │
                         │            ▼                         │
🥘 User picks dish ────▶ │ [Recipe Planner]                     │
                         │ MiniCPM-V-4 GGUF (llama.cpp)         │
                         │ + retrieval over Kaggle dataset      │
                         │ → Recipe JSON (steps, nutrition)     │
                         │            │                         │
                         │            ▼                         │
                         │ [Step Illustrator]                   │
                         │ FLUX.2 Klein 9B (diffusers)          │
                         │ → PNG per step + final dish          │
                         │            │                         │
                         │            ▼                         │
                         │ [Narrator]                           │
                         │ VoxCPM2 → MP3 per step               │
                         │            │                         │
                         │            ▼                         │
📸 Progress photo ─────▶ │ [Progress Validator]                 │
                         │ MiniCPM-V-4.6 (vision compare)       │
                         │ → "go / wait / fix" + tip            │
                         └──────────────────────────────────────┘
```
**Total parameter count (≤ 32B requirement):**
- MiniCPM-V-4 (reasoning) ≈ 4B
- MiniCPM-V-4.6 (vision) ≈ 4.6B
- FLUX.2 Klein ≈ 9B
- VoxCPM2 ≈ 1B (estimate)
- **Total ≈ 18.6B ✓**
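The budget above is simple addition, and a tiny check like the following sketch could live in `scripts/` so the total is recomputed whenever a model is swapped. The dictionary keys and function names are illustrative assumptions; the numbers are the approximate figures from the list, and the VoxCPM2 value is still the plan's own estimate.

```python
# Approximate parameter counts in billions, copied from the list above.
# The VoxCPM2 figure is the plan's estimate, not a confirmed number.
PARAMS_B = {
    "MiniCPM-V-4 (reasoning)": 4.0,
    "MiniCPM-V-4.6 (vision)": 4.6,
    "FLUX.2 Klein": 9.0,
    "VoxCPM2 (estimate)": 1.0,
}
LIMIT_B = 32.0  # hackathon ceiling


def total_params_b(params: dict[str, float]) -> float:
    """Sum the per-model parameter counts (in billions)."""
    return round(sum(params.values()), 1)


def within_limit(params: dict[str, float], limit_b: float = LIMIT_B) -> bool:
    """True when the combined size stays at or under the limit."""
    return total_params_b(params) <= limit_b


if __name__ == "__main__":
    print(f"total = {total_params_b(PARAMS_B)}B, within limit = {within_limit(PARAMS_B)}")
```

With these figures the other three models total about 17.6B, so the TTS model has roughly 14.4B of headroom before the limit is reached.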
---
## 2. Repository layout
```
cook-with-me/
├── app.py                        # Gradio entrypoint (Space looks for this)
├── requirements.txt
├── packages.txt                  # apt packages (ffmpeg, libsndfile1)
├── README.md                     # Space card (HF requires YAML frontmatter)
├── .gitignore
├── src/
│   ├── __init__.py
│   ├── config.py                 # paths, model IDs, constants
│   ├── models/
│   │   ├── __init__.py
│   │   ├── vision.py             # MiniCPM-V-4.6 wrapper (llama-cpp)
│   │   ├── planner.py            # MiniCPM-V-4 wrapper (llama-cpp)
│   │   ├── illustrator.py        # FLUX.2 Klein wrapper (diffusers)
│   │   ├── narrator.py           # VoxCPM2 wrapper
│   │   └── loader.py             # lazy singletons + GGUF download
│   ├── agents/
│   │   ├── mise_en_place.py      # ingredient identification
│   │   ├── recipe_planner.py     # builds Recipe object
│   │   ├── step_illustrator.py   # per-step image gen
│   │   ├── narrator.py           # per-step TTS
│   │   └── progress_validator.py
│   ├── data/
│   │   ├── recipe_index.py       # loads Kaggle dataset, builds retrieval
│   │   └── nutrition.py          # USDA-style nutrition computation
│   ├── pipeline.py               # Recipe state machine, orchestration
│   ├── prompts/
│   │   ├── vision_prompt.txt
│   │   ├── planner_system.txt
│   │   └── validator_prompt.txt
│   └── ui/
│       ├── theme.py              # custom CSS (Off-Brand badge)
│       └── components.py         # reusable Gradio Blocks pieces
├── scripts/
│   ├── download_models.py        # pre-warms GGUF + Flux weights at build time
│   ├── build_recipe_index.py     # caches Kaggle dataset locally
│   └── smoke_test.py             # end-to-end validation before push
└── assets/
    ├── sample_fridge_1.jpg
    └── sample_progress_1.jpg
```
---
## 3. Phase-by-phase plan (10 days)
> Each phase has: **goal**, **tasks**, **deliverable**, **verification check**. Do not move to the next phase if verification fails.
---
### Phase 0 - Day 0 (½ day): Account + tooling setup
**Goal:** every credential and CLI is ready before writing code.
**Tasks**
1. Create or confirm Hugging Face account; generate a **write token** (Settings → Access Tokens). Store as `HF_TOKEN` env var locally.
2. Install Hugging Face CLI: `pip install -U huggingface_hub` then `huggingface-cli login`.
3. Install Kaggle CLI: `pip install kaggle`. Place `kaggle.json` (Account → API → Create New Token) in `~/.kaggle/kaggle.json` with `chmod 600`.
4. Install OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
5. Install local Python 3.11 venv: `python -m venv .venv && source .venv/bin/activate`.
6. Create the repo locally: `git init cook-with-me && cd cook-with-me`.
7. Create an empty Hugging Face Space: huggingface.co → New Space → SDK = **Gradio**, Hardware = **CPU basic** (upgrade later if you need GPU for FLUX). Clone it and copy your repo skeleton into it.
8. Verify model availability: open in a browser and confirm pages exist:
- `huggingface.co/openbmb/MiniCPM-V-4`
- `huggingface.co/openbmb/MiniCPM-V-4-6`
- `huggingface.co/openbmb/VoxCPM2` (or whatever the exact repo name is; search "VoxCPM" on HF)
- `huggingface.co/black-forest-labs/FLUX.2-klein-9B`
**Deliverable:** empty Space deployed showing "Hello World" Gradio.
**Verify:** `https://huggingface.co/spaces/<you>/cook-with-me` loads.
---
### Phase 1 - Day 1: Project skeleton + recipe dataset ingestion
**Goal:** the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.
**Tasks**
1. Write `requirements.txt` (initial version; packages will be added as phases progress):
```text
gradio>=4.44
huggingface_hub>=0.24
llama-cpp-python>=0.3.2
numpy
pandas
Pillow
pydantic>=2
sentence-transformers
```
2. Write `packages.txt`:
```text
ffmpeg
libsndfile1
```
3. Write `scripts/build_recipe_index.py`:
- Use `kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path)`; discover `file_path` by listing the dataset files first via `kagglehub.dataset_download`. Install `kagglehub` in your local venv for this; it is not in the `requirements.txt` above because the script never runs inside the Space.
- Normalize columns: `name`, `ingredients` (list[str]), `instructions` (list[str]), `cuisine` (str if present, else "international"), `prep_time`, `servings`.
- Drop rows missing critical fields. Lowercase + strip ingredient strings.
- Save to `data/recipes.parquet` (~5–50 MB depending on dataset size).
- Build sentence embeddings of the recipe **name + first 3 ingredients** using `sentence-transformers/all-MiniLM-L6-v2` and save to `data/recipes_emb.npy`.
- This script runs **once locally**; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in `app.py`). If files exceed 100 MB, push to a HF Dataset repo: `<you>/cook-with-me-recipes`.
4. Write `src/data/recipe_index.py`:
- `class RecipeIndex` with `.search(ingredients: list[str], top_k=5) -> list[RecipeRow]`.
- Build a query string from ingredients, embed it, cosine-similarity against the cached embeddings, return top-k.
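The ranking step inside `.search(...)` can be sketched independently of the embedding model. The version below is a dependency-free illustration using plain Python lists; the function names `cosine` and `top_k` are assumptions, and a real `RecipeIndex` would more likely run the same logic with NumPy over the cached `recipes_emb.npy` matrix.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], rows: Sequence[Sequence[float]], k: int = 5) -> list[int]:
    """Indices of the k rows most similar to the query, best match first."""
    ranked = sorted(range(len(rows)), key=lambda i: cosine(query, rows[i]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for the cached recipe vectors.
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.1], rows, k=2))
```

The returned indices would then be used to look up the matching rows in `recipes.parquet`.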
**Deliverable:** `python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))"` prints 5 sensible recipes.
**Verify:** at least 3 of the top-5 results contain ≥2 of the input ingredients.
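The verification rule can be automated rather than checked by eye. The sketch below assumes each search result is reduced to its list of ingredient strings and uses simple substring matching; both the helper names and the matching strategy are illustrative assumptions, not part of the plan.

```python
def overlap_count(recipe_ingredients: list[str], wanted: list[str]) -> int:
    """Count how many wanted ingredients appear (as substrings) in the recipe's ingredient lines."""
    lines = [line.lower() for line in recipe_ingredients]
    return sum(any(w.lower() in line for line in lines) for w in wanted)


def passes_retrieval_check(
    results: list[list[str]],
    wanted: list[str],
    min_hits: int = 3,
    min_overlap: int = 2,
) -> bool:
    """True when at least `min_hits` results contain at least `min_overlap` wanted ingredients."""
    hits = sum(overlap_count(r, wanted) >= min_overlap for r in results)
    return hits >= min_hits
```

Substring matching treats "red onion" as a hit for "onion", which is usually what you want here, but it will also match "tomato" inside "tomato paste".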
---
### Phase 2 - Day 2: Vision agent (Mise en Place) - MiniCPM-V-4.6 via llama.cpp
**Goal:** given a fridge photo, return a clean list of English ingredient names.
**Background:** llama.cpp supports multimodal models through a vision projector (`mmproj-*.gguf`) plus the language model GGUF. MiniCPM-V family ships both files on the Hub.
**Tasks**
1. Find the GGUF release of MiniCPM-V-4.6. Search HF for `MiniCPM-V-4_6-gguf` or `openbmb/MiniCPM-V-4_6-gguf`. You need **two** files:
- `Model-Q4_K_M.gguf` (or similar quant)
- `mmproj-model-f16.gguf` (the vision projector)
2. Write `src/models/loader.py`:
```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler

_vision = None


def get_vision_model():
    global _vision
    if _vision is None:
        model_path = hf_hub_download(
            repo_id="openbmb/MiniCPM-V-4_6-gguf",  # confirm exact repo
            filename="Model-Q4_K_M.gguf",
        )
        mmproj_path = hf_hub_download(
            repo_id="openbmb/MiniCPM-V-4_6-gguf",
            filename="mmproj-model-f16.gguf",
        )
        handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
        _vision = Llama(
            model_path=model_path,
            chat_handler=handler,
            n_ctx=4096,
            n_threads=4,
            verbose=False,
        )
    return _vision
```
3. Write `src/agents/mise_en_place.py`:
```python
import base64
import io
import json

from PIL import Image

from src.models.loader import get_vision_model

PROMPT = (
    "You are an ingredient detector. Look at the fridge/pantry photo and "
    "list every edible ingredient you can identify. Return strict JSON: "
    '{"ingredients": ["chicken", "onion", "tomato", ...]} '
    "Lowercase, English, no brand names, no containers."
)


def _img_to_data_url(img: Image.Image) -> str:
    """Encode a PIL image as a base64 JPEG data URL."""
    buf = io.BytesIO()
    # JPEG has no alpha channel, so convert RGBA or palette images first.
    img.convert("RGB").save(buf, "JPEG", quality=85)
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/jpeg;base64,{b64}"


def identify_ingredients(image: Image.Image) -> list[str]:
    llm = get_vision_model()
    out = llm.create_chat_completion(
        messages=[
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
                {"type": "text", "text": PROMPT},
            ]}
        ],
        temperature=0.2,
        response_format={"type": "json_object"},
    )
    data = json.loads(out["choices"][0]["message"]["content"])
    return [s.lower().strip() for s in data["ingredients"]]
```
4. Test locally with 5 sample fridge photos.
**Deliverable:** the function returns a non-empty English list with ≥80% precision on a clean fridge photo.
**Verify:** stash these 5 results in `tests/vision_smoke.json` for regression checks.
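Small models do not always return clean JSON, even with `response_format` set, so the final parsing step may deserve a defensive wrapper. The following is a minimal sketch under that assumption; `parse_ingredients` is a hypothetical helper name, and returning an empty list on malformed output is one possible policy, not something the plan prescribes.

```python
import json
import re


def parse_ingredients(raw: str) -> list[str]:
    """Extract a clean, de-duplicated ingredient list from a model reply."""
    # Take the span from the first "{" to the last "}" in case the model
    # wrapped the JSON in prose or code fences.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    items = data.get("ingredients") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []
    seen: set[str] = set()
    cleaned: list[str] = []
    for item in items:
        if not isinstance(item, str):
            continue  # ignore numbers, nulls, nested objects
        name = item.lower().strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned
```

An empty result can then be surfaced in the UI as "no ingredients detected, try another photo" instead of a stack trace.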
---
### Phase 3 - Day 3: Recipe Planner - MiniCPM-V-4 via llama.cpp + retrieval
**Goal:** given a list of ingredients (and optionally a chosen dish), return a fully structured `Recipe` JSON including steps, durations, visual descriptions, and nutritional values.
**Tasks**
1. Find or convert MiniCPM-V-4 to GGUF. Likely repo: `openbmb/MiniCPM-V-4-gguf` or community quants. Pick `Q4_K_M`.
2. Add to `src/models/loader.py` a `get_planner_model()` (same pattern as vision but without `chat_handler`).
3. Write `src/agents/recipe_planner.py`:
- **Step A (propose):** call planner with `I have: [ingredients]. Propose 3 dish options that fit. Reply JSON.`
- **Step B (retrieve):** for the chosen dish name, call `RecipeIndex.search(...)` and pick the closest match. Use it as a *grounded reference*.
- **Step C (restructure):** prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical `Recipe` JSON schema below. The retrieval grounds the model and reduces the risk of hallucinated steps.
- **Step D (nutrition):** from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
4. Define the canonical schema in `src/pipeline.py` using Pydantic:
```python
from typing import Optional

from pydantic import BaseModel, Field


class Step(BaseModel):
    n: int
    instruction: str  # English, imperative
    duration: str  # "4 minutes"
    visual: str  # English visual description for FLUX prompt
    tip: Optional[str] = None


class Nutrition(BaseModel):
    calories: int  # per serving
    protein_g: float
    carbs_g: float
    fat_g: float
    fiber_g: float


class Recipe(BaseModel):
    name: str
    cuisine: str
    servings: int
    total_time_minutes: int
    # Only populated on the "propose" call, so it defaults to an empty list.
    options: list[dict] = Field(default_factory=list)
    ingredients_have: list[str]
    ingredients_missing: list[str]
    substitutes: dict[str, list[str]]
    steps: list[Step]
    final_dish_visual: str
    nutrition_per_serving: Nutrition
```
5. Write the system prompt (`src/prompts/planner_system.txt`):
- Persona: international chef
- Hard rule: output JSON only, matching schema
- Hard rule: prefer dishes feasible with available ingredients
- Hard rule: 5–7 steps, each ≤ 25 words, each with a concrete `visual` field for image generation
- Hard rule: include `nutrition_per_serving` (model is allowed to estimate; you'll override with `data/nutrition.py` for accuracy)
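6. Use `response_format={"type": "json_object"}` in the chat completion call. Set `temperature=0.7, top_p=0.95` for the propose step (creative) and `temperature=0.4` for the structured-output step (more deterministic). If the model supports a thinking mode, enable it for the propose step through its chat template or prompt; `enable_thinking` is not an argument of `create_chat_completion` in `llama-cpp-python`.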
**Deliverable:** for `["chicken","onion","tomato","tortilla","cheese"]` and chosen dish "chicken tinga", the function returns a valid `Recipe` Pydantic object with 5–7 steps.
**Verify:** the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.
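The hard rules on step count and instruction length are not enforced by the Pydantic schema itself, so a separate check on the parsed steps may be useful. This sketch works on plain dicts (for example the output of `model_dump()`); `check_steps` is a hypothetical helper, and returning a list of problems rather than raising is just one possible design.

```python
def check_steps(
    steps: list[dict],
    min_steps: int = 5,
    max_steps: int = 7,
    max_words: int = 25,
) -> list[str]:
    """Return human-readable problems; an empty list means the steps pass."""
    problems: list[str] = []
    if not min_steps <= len(steps) <= max_steps:
        problems.append(f"expected {min_steps}-{max_steps} steps, got {len(steps)}")
    for i, step in enumerate(steps, start=1):
        # Every step needs the fields the illustrator and narrator rely on.
        for field in ("instruction", "duration", "visual"):
            if not str(step.get(field, "")).strip():
                problems.append(f"step {i}: missing {field}")
        words = len(str(step.get("instruction", "")).split())
        if words > max_words:
            problems.append(f"step {i}: instruction has {words} words (max {max_words})")
    return problems
```

A non-empty result could trigger one retry of the structured-output call before falling back to the retrieved reference recipe.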
---
### Phase 3.5 - Day 3 (afternoon): Nutritional values
**Goal:** the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).
**Approach:** small, embedded reference table beats LLM math.
**Tasks**
1. Bundle `data/nutrition_table.csv`: a 200-row CSV mapping common English ingredient names to per-100g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
2. Write `src/data/nutrition.py`:
- `parse_quantity(line: str) -> (grams, ingredient_name)`: handle "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex + a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The cup, tbsp, and tsp factors are water-equivalent approximations, so results for much lighter or denser ingredients will be rough.
- `compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition`: sum per-100g values weighted by grams, divide by servings.
- If a line cannot be parsed, skip it and log; don't crash.
3. After the planner returns a recipe, **overwrite** `recipe.nutrition_per_serving` with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.
**Deliverable:** for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.
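The two functions described in the tasks could look roughly like the sketch below. It is a simplified illustration: the regex handles only "number unit name" lines, the result is a plain dict with two macros instead of the full `Nutrition` model, and the `TABLE` values are placeholders invented for the example, not USDA data.

```python
import re

# Unit-to-grams factors from the task list (volume units treated as water-equivalent).
UNIT_GRAMS = {"g": 1.0, "cup": 240.0, "cups": 240.0, "tbsp": 15.0, "tsp": 5.0, "oz": 28.35}

# PLACEHOLDER per-100 g values for illustration only; the real numbers
# come from the bundled nutrition_table.csv.
TABLE = {
    "flour": {"kcal": 100.0, "protein_g": 10.0},
    "chicken": {"kcal": 200.0, "protein_g": 20.0},
}

LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s+(.+?)\s*$")


def parse_quantity(line: str):
    """Return (grams, ingredient_name), or None when the line cannot be parsed."""
    m = LINE_RE.match(line)
    if not m or m.group(2).lower() not in UNIT_GRAMS:
        return None
    amount, unit, name = float(m.group(1)), m.group(2).lower(), m.group(3).lower()
    return amount * UNIT_GRAMS[unit], name


def compute_nutrition(lines: list[str], servings: int) -> dict[str, float]:
    """Sum per-100 g values weighted by grams, then divide by servings."""
    totals = {"kcal": 0.0, "protein_g": 0.0}
    for line in lines:
        parsed = parse_quantity(line)
        if parsed is None or parsed[1] not in TABLE:
            continue  # the real module would also log the skipped line
        grams, name = parsed
        for key in totals:
            totals[key] += TABLE[name][key] * grams / 100.0
    return {key: round(value / servings, 1) for key, value in totals.items()}
```

Lines such as "salt to taste" or "2 large eggs" fall through to the skip branch, which matches the skip-and-log behaviour described above.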
---
### Phase 4 - Day 4: Step Illustrator - FLUX.2 Klein 9B
**Goal:** generate an appetizing image for the final dish + one image per step.
**Constraint:** FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:
- **Path A (recommended for the hackathon):** upgrade the Space to a GPU instance (T4 or A10G; paid, but $20 HF credits cover it for a week of development). Code stays unchanged.
- **Path B (fallback):** keep `num_inference_steps=4` and accept ~3 min/image, which is only feasible for pre-rendered demo recipes, not live runs. Note that `enable_model_cpu_offload()` reduces GPU memory use by parking idle components in CPU RAM; it still needs an accelerator, so on CPU-only hardware load the pipeline without it.
**Tasks**
1. Add to `requirements.txt`:
```text
diffusers>=0.31
transformers>=4.45
accelerate
torch
safetensors
```
2. Write `src/models/illustrator.py`:
```python
import torch
from diffusers import Flux2KleinPipeline  # needs a diffusers release that ships this class

_pipe = None


def get_flux():
    global _pipe
    if _pipe is None:
        dtype = torch.bfloat16
        _pipe = Flux2KleinPipeline.from_pretrained(
            "black-forest-labs/FLUX.2-klein-9B",
            torch_dtype=dtype,
        )
        # Model CPU offload needs an accelerator; skip it on CPU-only hardware.
        if torch.cuda.is_available():
            _pipe.enable_model_cpu_offload()
    return _pipe


def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
    pipe = get_flux()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    img = pipe(
        prompt=prompt,
        height=1024, width=1024,
        guidance_scale=1.0,
        num_inference_steps=4,
        generator=torch.Generator(device=device).manual_seed(seed),
    ).images[0]
    return img
```
3. Write `src/agents/step_illustrator.py`:
- For each `Step.visual`, build a prompt like:
> `f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."`
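- Generate the **final dish image first**, then the per-step images, all in **one Python loop** (no parallelism, because FLUX holds the GPU).
- Cache results on disk keyed by a SHA-256 digest of the prompt to avoid re-renders on re-runs. Python's built-in `hash()` is salted per process for strings, so it does not give stable keys across runs.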
- Emit Gradio progress updates so the UI doesn't appear frozen.
4. **Critical tuning:** keep `num_inference_steps=4` (Klein is distilled). Higher counts blow latency and offer minimal quality gain at this scale.
**Deliverable:** for a 5-step recipe, all 6 images (final + 5 steps) render in:
- < 1 minute on T4 GPU Space
- < 8 minutes on CPU offload (acceptable only for pre-cached demos)
**Verify:** show the 6 images to someone who has not been told what to expect; ≥4 should be described as "appetizing".
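The prompt template and the on-disk cache from the task list can be sketched without touching FLUX at all. In the version below the renderer is passed in as a function that returns PNG bytes; that signature, the helper names, and the `.png` file layout are assumptions made for the example. With the real pipeline you would save the returned PIL image instead.

```python
import hashlib
from pathlib import Path
from typing import Callable


def build_prompt(visual: str, cuisine: str) -> str:
    """Fill the step-image prompt template from the task list."""
    return (
        f"Top-down photo of a kitchen pan or plate showing {visual}. "
        f"{cuisine} home cooking, warm natural lighting, recipe magazine style, "
        "photorealistic, appetizing."
    )


def cache_path(prompt: str, cache_dir: Path) -> Path:
    """Stable file name derived from a SHA-256 digest of the prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.png"


def render_cached(prompt: str, render_fn: Callable[[str], bytes], cache_dir: Path) -> Path:
    """Render once per distinct prompt; later calls reuse the file on disk."""
    path = cache_path(prompt, cache_dir)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(render_fn(prompt))
    return path
```

Because the key depends only on the prompt text, changing the seed or resolution would need to be folded into the hashed string to avoid stale hits.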
---
### Phase 5 - Day 5: Narrator - VoxCPM2
**Goal:** every step's instruction is rendered to an MP3 in a warm, clear English voice.
**Tasks**
1. Confirm the exact VoxCPM2 repo name on HF (`openbmb/VoxCPM2` or similar). Read its README for the inference snippet, because TTS APIs vary widely between models.
2. Add to `requirements.txt`: `soundfile`, `torchaudio`, `numpy`. If VoxCPM2 ships GGUF, use it via `llama-cpp-python` audio extension (if available); otherwise load via `transformers` directly.
3. Write `src/models/narrator.py`:
```python
_tts = None


def get_tts():
    global _tts
    if _tts is None:
        # Placeholder: replace with the exact VoxCPM2 loading code from its README.
        from transformers import AutoModel, AutoProcessor
        _tts = ...  # load on CPU; VoxCPM2 is small (~1B)
    return _tts


def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
    """Returns MP3 bytes."""
    tts = get_tts()
    wav = tts.generate(text, voice=voice)  # API depends on VoxCPM2
    # Encode wav -> mp3 with soundfile + ffmpeg-python or pydub.
    mp3_bytes = encode_mp3(wav)  # hypothetical helper, still to be written
    return mp3_bytes
```
4. Write `src/agents/narrator.py`:
- For each step, synthesize `step.instruction`. If `step.tip` is set, synthesize a separate "tip" clip.
- Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
5. Pre-render all step audio when the recipe is finalized; never stream per-step in the demo (too much UI lag).
**Deliverable:** clicking "Play" on step 1 in the UI plays clear English narration.
**Verify:** on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.
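The per-step bookkeeping in `src/agents/narrator.py` does not depend on the TTS model, so it can be sketched with the synthesizer injected as a function. The file naming scheme and the returned dict shape below are assumptions for illustration; `synth` stands in for whatever `synthesize` ends up being.

```python
from pathlib import Path
from typing import Callable


def narrate_steps(
    steps: list[dict],
    synth: Callable[[str], bytes],
    out_dir: Path,
) -> list[dict]:
    """Write one MP3 per instruction (plus one per tip) and return the file paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    clips = []
    for step in steps:
        entry = {"n": step["n"], "instruction": None, "tip": None}
        for kind in ("instruction", "tip"):
            text = step.get(kind)
            if not text:
                continue  # tips are optional
            path = out_dir / f"step_{step['n']:02d}_{kind}.mp3"
            path.write_bytes(synth(text))
            entry[kind] = str(path)
        clips.append(entry)
    return clips
```

Returning string paths keeps the result directly usable as the `value` of Gradio audio components.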
---
### Phase 6 - Day 6: Gradio UI (Off-Brand)
**Goal:** the Space looks like a recipe magazine, not stock Gradio.
**Tasks**
1. Write `src/ui/theme.py`:
```python
import gradio as gr

theme = gr.themes.Soft(
    primary_hue="orange",
    neutral_hue="stone",
    font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
    font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
)

CSS = """
.gradio-container { background: #f5ecd9 !important; }
.recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
.recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
.step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
.nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
.nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
"""
```
2. Write `app.py` with three tabs:
- **Tab 1 (Cook)**: fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
- **Tab 2 (Check Progress)**: upload a progress photo + select active step → validator returns badge (`go/wait/fix`) + tip + audio.
- **Tab 3 (About / Tech)**: README-style explanation, badges, model list.
3. Use `gr.Blocks` with `gr.State` to hold the current `Recipe` across UI events. Store it as a plain `dict` rather than a Pydantic object: each callback returns `recipe.model_dump()` as its `gr.State` output, and the next callback rebuilds the object with `Recipe.model_validate(state)`.
4. Wire callbacks:
- `btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])`
- `dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])`
- `progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])`
**Deliverable:** end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition. No Gradio default look anywhere.
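The nutrition grid at the bottom of the recipe card could be produced as an HTML string for a `gr.HTML` component, reusing the `.nutri-grid` and `.nutri-cell` classes from `CSS`. The sketch below assumes the nutrition data arrives as a dict with the five `Nutrition` fields; the function name and the cell markup are illustrative choices.

```python
from html import escape


def nutrition_grid_html(nutrition: dict) -> str:
    """Render the five per-serving macros as a CSS grid of cells."""
    cells = [
        ("Calories", f"{nutrition['calories']} kcal"),
        ("Protein", f"{nutrition['protein_g']:g} g"),
        ("Carbs", f"{nutrition['carbs_g']:g} g"),
        ("Fat", f"{nutrition['fat_g']:g} g"),
        ("Fiber", f"{nutrition['fiber_g']:g} g"),
    ]
    inner = "".join(
        f'<div class="nutri-cell"><strong>{escape(value)}</strong><br>{escape(label)}</div>'
        for label, value in cells
    )
    return f'<div class="nutri-grid">{inner}</div>'


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo = {"calories": 250, "protein_g": 12.5, "carbs_g": 30.0, "fat_g": 8.0, "fiber_g": 4.0}
    print(nutrition_grid_html(demo))
```

The five cells line up with the `repeat(5,1fr)` column template in the `.nutri-grid` rule.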
---
### Phase 7 - Day 7: Progress Validator (closed loop)
**Goal:** user uploads a progress photo, app says "go / wait / fix" with a voiced tip.
**Tasks**
1. Write `src/agents/progress_validator.py`:
```python
# Fill with PROMPT.format(instruction=...); the literal JSON braces are doubled
# so that str.format does not treat them as placeholders.
PROMPT = """Compare these two cooking photos.
Photo 1 (target): how it should look after the step "{instruction}".
Photo 2 (user's pan/plate): the user's current progress.
Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
- "go": looks right, move to next step
- "wait": needs more time, do not change anything yet
- "fix": something is off; suggest a concrete adjustment in one sentence
"""


def validate(target_img, user_img, step_instruction): ...
```
2. Use the same vision model singleton as Phase 2 β€” both calls share weights.
3. Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.
**Deliverable:** running the validator on 5 real progress photos returns the correct verdict on ≥3.
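Turning the validator's reply into a badge is a small, testable step on its own. The sketch below maps the three verdicts to the green/amber/red colors mentioned in the tasks; the function name, the returned dict shape, and the choice to fall back to a cautious "wait" on malformed output are assumptions, not decisions made in this plan.

```python
import json

BADGE_COLORS = {"go": "green", "wait": "amber", "fix": "red"}


def parse_verdict(raw: str) -> dict:
    """Parse the validator reply; fall back to 'wait' when the output is malformed."""
    fallback = {"verdict": "wait", "feedback": "", "tip": "", "color": BADGE_COLORS["wait"]}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = str(data.get("verdict", "")).lower().strip()
    if verdict not in BADGE_COLORS:
        return fallback
    return {
        "verdict": verdict,
        "feedback": str(data.get("feedback", "")),
        "tip": str(data.get("tip", "")),
        "color": BADGE_COLORS[verdict],
    }
```

The `tip` string is what would be handed to the narrator for the voiced feedback.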
---
### Phase 8 - Day 8: Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)
> **Important caveat:** The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still **requires GPU**, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does **not** violate the runtime constraint: only the build pipeline touches a GPU.
**Goal:** publish a fine-tuned Planner GGUF to the Hub and load it from the Space.
**Tasks**
1. **Build SFT dataset** (`scripts/build_sft_dataset.py`):
- Load Kaggle `better-recipes` dataset.
- For each recipe, build a `(prompt, completion)` pair where `prompt` is `"Available ingredients: X, Y, Z. Propose recipe."` and `completion` is the full canonical `Recipe` JSON.
- Generate ~1000 pairs, push to `<you>/cook-with-me-sft` HF Dataset.
2. **LoRA training** (`scripts/train_planner.py`, to be run on a GPU machine, not the Space):
```python
# peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
# r=16, alpha=32, lr=2e-4, epochs=2, batch=4
# push_to_hub=True, hub_model_id="<you>/cook-with-me-planner-4b"
```
3. **Convert to GGUF** (Day 8 evening):
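- Merge the LoRA adapter into the base weights first (for example with `peft`'s `merge_and_unload()`), then run `llama.cpp/convert_hf_to_gguf.py` and quantize to `Q4_K_M` with `llama-quantize`.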
- Push GGUF to `<you>/cook-with-me-planner-4b-gguf`.
4. Update `src/models/loader.py` to point at your GGUF instead of the base model.
**Deliverable:** the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.
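The `(prompt, completion)` pairs from task 1 can be sketched as follows. This assumes each recipe has already been converted to a dict that follows the canonical `Recipe` schema and has an `ingredients` list to draw the prompt from; the function names and the JSONL output format are illustrative assumptions.

```python
import json
from pathlib import Path


def build_sft_pair(recipe: dict) -> dict:
    """Turn one recipe dict into a prompt/completion training pair."""
    ingredients = ", ".join(recipe["ingredients"])
    prompt = f"Available ingredients: {ingredients}. Propose recipe."
    # sort_keys keeps the target JSON layout identical across examples.
    completion = json.dumps(recipe, ensure_ascii=False, sort_keys=True)
    return {"prompt": prompt, "completion": completion}


def write_jsonl(pairs: list[dict], path: Path) -> None:
    """Write one JSON object per line, the usual layout for SFT datasets."""
    with path.open("w", encoding="utf-8") as fh:
        for pair in pairs:
            fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Keeping the completion format byte-for-byte consistent is what gives the fine-tuned model a stable JSON layout to imitate.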
---
### Phase 9 - Day 9: End-to-end test, performance pass, pre-warm cache
**Goal:** the Space loads in <60s and a full recipe (text + 5 images + 5 audios + nutrition) renders in <2 minutes on the chosen hardware.
**Tasks**
1. Write `scripts/smoke_test.py` that runs the full pipeline on 3 sample fridge photos and asserts:
- Each ingredient list is non-empty
- Each recipe has 5–7 steps
- Each step has a non-empty image and audio path
- Nutrition has all 5 macros set
2. Implement **on-disk caching** for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to `~/.cache/cook-with-me/flux/`.
3. Pre-render and commit **3 fully-prepared demo recipes** (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
4. Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
5. Add a "Loading models..." progress bar on first request β€” first cold start can take 90s.
**Deliverable:** smoke test passes on the live Space.
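The four assertions for `scripts/smoke_test.py` can be expressed as one checker over the pipeline's output. The sketch assumes the pipeline returns a dict with `ingredients`, `steps`, and `nutrition_per_serving`, and that each step carries `image_path` and `audio_path` keys; those key names are hypothetical and should be aligned with the real pipeline output.

```python
MACROS = ("calories", "protein_g", "carbs_g", "fat_g", "fiber_g")


def smoke_check(result: dict) -> list[str]:
    """Return a list of failed checks; an empty list means the run passed."""
    failures: list[str] = []
    if not result.get("ingredients"):
        failures.append("ingredient list is empty")
    steps = result.get("steps", [])
    if not 5 <= len(steps) <= 7:
        failures.append(f"expected 5-7 steps, got {len(steps)}")
    for i, step in enumerate(steps, start=1):
        if not step.get("image_path") or not step.get("audio_path"):
            failures.append(f"step {i}: missing image or audio path")
    nutrition = result.get("nutrition_per_serving") or {}
    missing = [m for m in MACROS if nutrition.get(m) is None]
    if missing:
        failures.append("nutrition missing: " + ", ".join(missing))
    return failures
```

The script can then loop over the three sample photos and exit non-zero when any run returns a non-empty list.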
---
### Phase 10 - Day 10: README, demo video, social post, submit
**Tasks**
1. Write `README.md` with the required HF Space frontmatter:
```yaml
---
title: Cook With Me
emoji: 🍲
colorFrom: orange
colorTo: yellow
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---
```
Followed by:
- One-paragraph pitch
- 60-second demo video embed
- Architecture diagram (export from `arquitectura.html` as PNG)
- Section: "How closed-loop visual cooking guidance works"
- Models used (with HF links + total parameter count)
- Badges declared
- Build / run instructions
2. Record a 60–90 second demo video: real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
3. Write the Field Notes blog post: one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8, and here's why" or "Closed-loop validation needs the same vision model on both sides").
4. Social post on X / LinkedIn with the demo video.
5. Submit on the hackathon platform.
---
## 4. Tools usage matrix (when to reach for what)
| Phase | Primary tools | Why |
|---|---|---|
| 0 - setup | HF CLI, Kaggle CLI, OpenAI Codex CLI | one-shot config |
| 1 - data | `kagglehub`, `pandas`, `sentence-transformers` | offline dataset prep |
| 2 - vision | `llama-cpp-python` + `MiniCPMv26ChatHandler` | runs inside Space, badge: Llama Champion |
| 3 - planner | `llama-cpp-python` + retrieval over local parquet | grounded JSON output |
| 3.5 - nutrition | local CSV + regex parser | reliable, no LLM math |
| 4 - illustrator | `diffusers` + `Flux2KleinPipeline` | sponsor model showcase |
| 5 - narrator | VoxCPM2 via `transformers` (or its native API) | local TTS |
| 6 - UI | `gradio` + custom CSS theme | Off-Brand badge |
| 7 - validator | same vision singleton as phase 2 | closed-loop innovation, Best Agent |
| 8 - fine-tune | `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine | Well-Tuned badge |
| 9 - test/cache | `pytest`, `hashlib`, on-disk FLUX cache | demo reliability |
| 10 - submit | HF Spaces, video tool, social | shipping |
---
## 5. Performance budget on the HF Space
| Operation | Target latency | Hardware needed |
|---|---|---|
| Vision: ingredient ID | < 8 s | CPU 4-thread |
| Planner: propose 3 dishes | < 12 s | CPU 4-thread |
| Planner: build full recipe JSON | < 20 s | CPU 4-thread |
| Nutrition computation | < 0.1 s | CPU |
| FLUX: 1 image (4 steps) | < 12 s on T4 / < 90 s on CPU offload | GPU strongly recommended |
| FLUX: 6 images (final + 5 steps) | < 80 s on T4 | GPU |
| VoxCPM2: 1 step narration | < 5 s | CPU |
| Validator: 1 progress check | < 8 s | CPU |
| **Full recipe end-to-end** | **< 2 min on T4 Space** | - |
**Hardware decision:** rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.
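Both the credit estimate and the end-to-end target can be sanity-checked with a few lines of arithmetic. The sketch below copies the per-operation targets from the table and assumes a 5-step recipe whose stages run strictly one after another; the dictionary keys and function names are illustrative.

```python
# Per-operation latency targets in seconds on a T4 Space, copied from the table above.
TARGETS_S = {
    "vision": 8,
    "propose": 12,
    "build_recipe": 20,
    "nutrition": 0.1,
    "flux_6_images": 80,
}
TTS_PER_STEP_S = 5


def sequential_total_s(targets: dict, n_steps: int = 5, tts_per_step: float = TTS_PER_STEP_S) -> float:
    """Worst-case total if every stage hits its target and nothing overlaps."""
    return round(sum(targets.values()) + n_steps * tts_per_step, 1)


def credit_hours(credits_usd: float, rate_per_hour: float) -> float:
    """How many GPU hours a credit balance buys at a given hourly rate."""
    return credits_usd / rate_per_hour


if __name__ == "__main__":
    print(f"sequential worst case: {sequential_total_s(TARGETS_S)} s")
    print(f"GPU hours from $20 at $0.40/hr: {credit_hours(20, 0.40):.0f}")
```

Read this way, the individual targets add up to roughly 145 s, so meeting the 2-minute end-to-end goal would presumably rely on stages finishing well under their targets or on the FLUX cache. The credit figure matches the ~50 hours stated above.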
---
## 6. Risks and mitigations (delta from `estrategia.md`)
| Risk | Mitigation |
|---|---|
| MiniCPM-V-4 has no public GGUF | Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 3, where the planner GGUF is first needed. |
| llama-cpp-python's MiniCPM-V chat handler version mismatch | Require `llama-cpp-python>=0.3.2` (as in `requirements.txt`); test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. |
| FLUX.2 Klein 9B too slow on free CPU Space | Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. |
| VoxCPM2 docs sparse | Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. |
| Kaggle dataset has format quirks (HTML in instructions, missing fields) | The Phase 1 normalization step handles this; budget 2 hours. |
| Nutrition CSV missing exotic ingredients | Skip-and-log strategy already designed; demo-day recipes use common ingredients only. |
| Total params >32B if VoxCPM2 turns out far larger than the ~1B estimate (the other three models total ≈ 17.6B, leaving ≈ 14.4B of headroom) | Check size in Phase 0; if too large, drop to a smaller TTS. |
---
## 7. "Day-1 hello world" checklist
Before writing any agent code, get this minimal end-to-end loop working β€” it proves your stack:
1. ☐ Empty Gradio Space deployed, shows "Hello"
2. ☐ `huggingface-cli login` works locally
3. ☐ `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
4. ☐ `from llama_cpp import Llama` runs in your venv
5. ☐ Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
6. ☐ Push the round-trip to the Space; confirm it answers in the cloud
**Only after all 6 are checked, start Phase 1.**
---
## 8. Where this plan differs from `estrategia.md` (deltas to communicate)
| Topic | `estrategia.md` (Spanish, Mexican-cuisine focus) | This document (current requirements) |
|---|---|---|
| Language | Spanish-first | **English only** |
| Cuisine | Mexican | **International** (Kaggle dataset) |
| Voice models | OpenBMB voice + Cohere Labs | **VoxCPM2** only (single voice) |
| Vision model | MiniCPM-V 2.6 / 4 | **MiniCPM-V-4.6** |
| Reasoning model | MiniCPM-4 4B | **MiniCPM-V-4** |
| FLUX runtime | Modal endpoint | **Inside Space (llama.cpp principle)**; Modal kept as a future migration target only |
| External APIs at runtime | Allowed (Modal, OpenAI optional) | **None**: full local inference inside Space |
| Nutritional info | Not specified | **Required** at end of recipe |
| Fine-tune dataset | 200 synthetic Mexican recipes | **Kaggle better-recipes (international)** |
If anything in `plan.md` or `estrategia.md` conflicts with this document, **this document wins** because it reflects the latest user requirements.
---
## 9. Definition of done
The implementation is complete when **all** of these are true:
- [ ] Public HF Space `https://huggingface.co/spaces/<you>/cook-with-me` loads
- [ ] App is fully in English
- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
- [ ] Progress validator returns sensible verdicts on 3+ test photos
- [ ] All inference (vision, planner, TTS) runs through llama.cpp / local diffusers, with **no external API calls at runtime**
- [ ] Total parameters declared in README ≤ 32B
- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
- [ ] Demo video (60–90s) recorded with a real person cooking
- [ ] Field Notes blog post published
- [ ] Submitted on the hackathon platform before deadline