FredinVázquez

Implementation Plan: "Cook With Me"

Step-by-step implementation guide for developers building the multimodal cooking sous-chef Gradio app for Hugging Face Spaces.

Hackathon: Small models / Big adventures, June 2026.

Read first: plan.md (the what and why) and estrategia.md (the how at a strategic level). This document is the how at a tactical level: turn this into code.


0. Locked decisions (do not re-discuss)

| Decision | Value | Reason |
|---|---|---|
| UI framework | Gradio | Hackathon requirement |
| Hosting | Hugging Face Space | Hackathon requirement |
| Inference runtime (text + vision) | llama.cpp via llama-cpp-python | Runs inside the Space CPU, no external APIs needed for now. Future: migrate to Modal |
| Image generation | FLUX.2 Klein 9B (black-forest-labs/FLUX.2-klein-9B) | Sponsor model; runs in the Space if a GPU Space is rented (or via enable_model_cpu_offload() as fallback). Plan to migrate this specific component to Modal post-hackathon |
| Recipe planner / reasoning | openbmb/MiniCPM-V-4 (GGUF) | Provided requirement |
| Vision (ingredient ID + progress validator) | openbmb/MiniCPM-V-4.6 (GGUF) | Provided requirement |
| Text-to-speech | OpenBMB VoxCPM2 | Provided requirement |
| Recipe dataset | thedevastator/better-recipes-for-a-better-life (Kaggle), international cuisine | Provided requirement; not limited to Mexican food |
| App language | English only | Provided requirement |
| Final output | Recipe + step images + voice + nutritional values | Provided requirement |
| External API calls at runtime | None | "llama.cpp inside the Space" mandate |

1. Architecture (final, English-only, llama.cpp-first)

                          ┌──────────────────────────────────────┐
                          │     Hugging Face Space (Gradio)      │
                          │   (CPU + optional GPU upgrade)       │
                          ├──────────────────────────────────────┤
   📸 Fridge photo  ─────▶│  [Vision Agent]                      │
                          │   MiniCPM-V-4.6 GGUF (llama.cpp)     │
                          │   → list[ingredient]                 │
                          │              │                       │
                          │              ▼                       │
   🥘 User picks dish ───▶│  [Recipe Planner]                    │
                          │   MiniCPM-V-4 GGUF (llama.cpp)       │
                          │   + retrieval over Kaggle dataset    │
                          │   → Recipe JSON (steps, nutrition)   │
                          │              │                       │
                          │              ▼                       │
                          │  [Step Illustrator]                  │
                          │   FLUX.2 Klein 9B (diffusers)        │
                          │   → PNG per step + final dish        │
                          │              │                       │
                          │              ▼                       │
                          │  [Narrator]                          │
                          │   VoxCPM2 → MP3 per step             │
                          │              │                       │
                          │              ▼                       │
   📸 Progress photo ────▶│  [Progress Validator]                │
                          │   MiniCPM-V-4.6 (vision compare)     │
                          │   → "go / wait / fix" + tip          │
                          └──────────────────────────────────────┘

Total parameter count (≤ 32B requirement):

  • MiniCPM-V-4 (reasoning) ≈ 4B
  • MiniCPM-V-4.6 (vision) ≈ 4.6B
  • FLUX.2 Klein ≈ 9B
  • VoxCPM2 ≈ 1B (estimate)
  • Total ≈ 18.6B ✓
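
The budget above is simple enough to keep as a runnable check, for example inside the smoke test. This is a minimal sketch: the dictionary keys and the function names are illustrative, and the numbers are the estimates from the list, not measured sizes.

```python
# Parameter estimates in billions, copied from the list above (not measured).
PARAMS_B = {
    "MiniCPM-V-4 (reasoning)": 4.0,
    "MiniCPM-V-4.6 (vision)": 4.6,
    "FLUX.2 Klein": 9.0,
    "VoxCPM2 (estimate)": 1.0,
}
LIMIT_B = 32.0  # hackathon cap


def total_params_b(params: dict[str, float]) -> float:
    """Sum the per-model estimates, rounded to one decimal place."""
    return round(sum(params.values()), 1)


def within_budget(params: dict[str, float], limit: float = LIMIT_B) -> bool:
    """True when the summed estimate stays at or under the cap."""
    return total_params_b(params) <= limit
```

Updating one entry after checking the real model cards in Phase 0 is enough to re-validate the total.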

2. Repository layout

cook-with-me/
├── app.py                      # Gradio entrypoint (Space looks for this)
├── requirements.txt
├── packages.txt                # apt packages (ffmpeg, libsndfile1)
├── README.md                   # Space card (HF requires YAML frontmatter)
├── .gitignore
├── src/
│   ├── __init__.py
│   ├── config.py               # paths, model IDs, constants
│   ├── models/
│   │   ├── __init__.py
│   │   ├── vision.py           # MiniCPM-V-4.6 wrapper (llama-cpp)
│   │   ├── planner.py          # MiniCPM-V-4 wrapper (llama-cpp)
│   │   ├── illustrator.py      # FLUX.2 Klein wrapper (diffusers)
│   │   ├── narrator.py         # VoxCPM2 wrapper
│   │   └── loader.py           # lazy singletons + GGUF download
│   ├── agents/
│   │   ├── mise_en_place.py    # ingredient identification
│   │   ├── recipe_planner.py   # builds Recipe object
│   │   ├── step_illustrator.py # per-step image gen
│   │   ├── narrator.py         # per-step TTS
│   │   └── progress_validator.py
│   ├── data/
│   │   ├── recipe_index.py     # loads Kaggle dataset, builds retrieval
│   │   └── nutrition.py        # USDA-style nutrition computation
│   ├── pipeline.py             # Recipe state machine, orchestration
│   ├── prompts/
│   │   ├── vision_prompt.txt
│   │   ├── planner_system.txt
│   │   └── validator_prompt.txt
│   └── ui/
│       ├── theme.py            # custom CSS (Off-Brand badge)
│       └── components.py       # reusable Gradio Blocks pieces
├── scripts/
│   ├── download_models.py      # pre-warms GGUF + Flux weights at build time
│   ├── build_recipe_index.py   # caches Kaggle dataset locally
│   └── smoke_test.py           # end-to-end validation before push
└── assets/
    ├── sample_fridge_1.jpg
    └── sample_progress_1.jpg

3. Phase-by-phase plan (10 days)

Each phase has: goal, tasks, deliverable, verification check. Do not move to the next phase if verification fails.


Phase 0 (Day 0, ½ day): Account + tooling setup

Goal: every credential and CLI is ready before writing code.

Tasks

  1. Create or confirm Hugging Face account; generate a write token (Settings → Access Tokens). Store as HF_TOKEN env var locally.
  2. Install Hugging Face CLI: pip install -U huggingface_hub then huggingface-cli login.
  3. Install Kaggle CLI: pip install kaggle. Place kaggle.json (Account → API → Create New Token) in ~/.kaggle/kaggle.json with chmod 600.
  4. Install OpenAI Codex CLI (pair-programmer) and verify your $100 credit is active.
  5. Create a local Python 3.11 venv: python -m venv .venv && source .venv/bin/activate.
  6. Create the repo locally: git init cook-with-me && cd cook-with-me.
  7. Create an empty Hugging Face Space: huggingface.co → New Space → SDK = Gradio, Hardware = CPU basic (upgrade later if you need GPU for FLUX). Clone it and copy your repo skeleton into it.
  8. Verify model availability: open in a browser and confirm pages exist:
    • huggingface.co/openbmb/MiniCPM-V-4
    • huggingface.co/openbmb/MiniCPM-V-4-6
    • huggingface.co/openbmb/VoxCPM2 (or whatever the exact repo name is; search "VoxCPM" on HF)
    • huggingface.co/black-forest-labs/FLUX.2-klein-9B

Deliverable: empty Space deployed showing "Hello World" Gradio.

Verify: https://huggingface.co/spaces/<you>/cook-with-me loads.


Phase 1 (Day 1): Project skeleton + recipe dataset ingestion

Goal: the Kaggle dataset is downloaded, parsed, and cached as a local artifact ready for retrieval.

Tasks

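  1. Write requirements.txt (initial version; packages will be added as phases progress):
    gradio>=4.44
    huggingface_hub>=0.24
    kagglehub            # used by scripts/build_recipe_index.py
    llama-cpp-python>=0.3.2
    numpy
    pandas
    Pillow
    pyarrow              # parquet engine pandas needs for data/recipes.parquet
    pydantic>=2
    sentence-transformers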
    
  2. Write packages.txt:
    ffmpeg
    libsndfile1
    
  3. Write scripts/build_recipe_index.py:
    • Use kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, "thedevastator/better-recipes-for-a-better-life", file_path). Discover file_path by listing the dataset files first via kagglehub.dataset_download.
    • Normalize columns: name, ingredients (list[str]), instructions (list[str]), cuisine (str if present, else "international"), prep_time, servings.
    • Drop rows missing critical fields. Lowercase + strip ingredient strings.
    • Save to data/recipes.parquet (~5–50MB depending on dataset size).
    • Build sentence embeddings of the recipe name + first 3 ingredients using sentence-transformers/all-MiniLM-L6-v2 and save to data/recipes_emb.npy.
    • This script runs once locally; commit the parquet + npy files to the repo (or to a private HF Dataset, then download in app.py). If files exceed 100MB, push to a HF Dataset repo: <you>/cook-with-me-recipes.
  4. Write src/data/recipe_index.py:
    • class RecipeIndex with .search(ingredients: list[str], top_k=5) -> list[RecipeRow].
    • Build a query string from ingredients, embed it, cosine-similarity against the cached embeddings, return top-k.
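
The retrieval step in RecipeIndex.search can be sketched as a cosine-similarity ranking. This is an illustrative pure-Python version so the logic is easy to follow; in the app the vectors would come from all-MiniLM-L6-v2 and the scoring would more likely be a NumPy matrix product over recipes_emb.npy. The function names are assumptions, not part of the plan.

```python
import math


def build_query(ingredients: list[str]) -> str:
    """Join normalized ingredient names into the text that gets embedded."""
    return ", ".join(s.lower().strip() for s in ingredients)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], recipe_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the row indices of the k recipes most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(recipe_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```

The returned indices map back to rows of data/recipes.parquet, which is why the parquet file and the embedding file must be written in the same row order.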

Deliverable: python -c "from src.data.recipe_index import RecipeIndex; r=RecipeIndex(); print(r.search(['chicken','onion','tomato']))" prints 5 sensible recipes.

Verify: at least 3 of the top-5 results contain ≥2 of the input ingredients.
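
The verification rule can be automated rather than eyeballed. The sketch below assumes each search result is available as a list of ingredient strings and uses a plain substring match, which is a simplification (for example, "onion" also matches "green onion"). Function names and thresholds as parameters are illustrative.

```python
def overlap_count(query: list[str], recipe_ingredients: list[str]) -> int:
    """Count query ingredients that appear inside at least one recipe ingredient line."""
    lines = [line.lower() for line in recipe_ingredients]
    return sum(any(q.lower().strip() in line for line in lines) for q in query)


def passes_phase1_check(
    query: list[str],
    results: list[list[str]],
    min_recipes: int = 3,
    min_overlap: int = 2,
) -> bool:
    """True when enough of the returned recipes share enough ingredients with the query."""
    hits = sum(overlap_count(query, recipe) >= min_overlap for recipe in results)
    return hits >= min_recipes
```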


Phase 2 (Day 2): Vision agent (Mise en Place), MiniCPM-V-4.6 via llama.cpp

Goal: given a fridge photo, return a clean list of English ingredient names.

Background: llama.cpp supports multimodal models through a vision projector (mmproj-*.gguf) plus the language model GGUF. MiniCPM-V family ships both files on the Hub.

Tasks

  1. Find the GGUF release of MiniCPM-V-4.6. Search HF for MiniCPM-V-4_6-gguf or openbmb/MiniCPM-V-4_6-gguf. You need two files:
    • Model-Q4_K_M.gguf (or similar quant)
    • mmproj-model-f16.gguf (the vision projector)
  2. Write src/models/loader.py:
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # or matching handler
    
    _vision = None
    
    def get_vision_model():
        global _vision
        if _vision is None:
            model_path = hf_hub_download(
                repo_id="openbmb/MiniCPM-V-4_6-gguf",  # confirm exact repo
                filename="Model-Q4_K_M.gguf",
            )
            mmproj_path = hf_hub_download(
                repo_id="openbmb/MiniCPM-V-4_6-gguf",
                filename="mmproj-model-f16.gguf",
            )
            handler = MiniCPMv26ChatHandler(clip_model_path=mmproj_path)
            _vision = Llama(
                model_path=model_path,
                chat_handler=handler,
                n_ctx=4096,
                n_threads=4,
                verbose=False,
            )
        return _vision
    
  3. Write src/agents/mise_en_place.py:
    import base64, io, json
    from PIL import Image
    from src.models.loader import get_vision_model
    
    PROMPT = (
      "You are an ingredient detector. Look at the fridge/pantry photo and "
      "list every edible ingredient you can identify. Return strict JSON: "
      '{"ingredients": ["chicken", "onion", "tomato", ...]} '
      "Lowercase, English, no brand names, no containers."
    )
    
    def _img_to_data_url(img: Image.Image) -> str:
        buf = io.BytesIO()
        # JPEG has no alpha channel, so convert RGBA/P uploads to RGB first
        img.convert("RGB").save(buf, "JPEG", quality=85)
        b64 = base64.b64encode(buf.getvalue()).decode()
        return f"data:image/jpeg;base64,{b64}"
    
    def identify_ingredients(image: Image.Image) -> list[str]:
        llm = get_vision_model()
        out = llm.create_chat_completion(messages=[
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": _img_to_data_url(image)}},
                {"type": "text", "text": PROMPT},
            ]}
        ], temperature=0.2, response_format={"type": "json_object"})
        data = json.loads(out["choices"][0]["message"]["content"])
        return [s.lower().strip() for s in data["ingredients"]]
    
  4. Test locally with 5 sample fridge photos.
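
Even with a JSON response format, small vision models sometimes wrap the object in a code fence or return duplicates. A defensive parser along the following lines keeps identify_ingredients from crashing during the 5-photo test. It is a sketch: parse_ingredients is a hypothetical helper, and returning an empty list on bad output is an assumed policy, not something the plan specifies.

```python
import json
import re


def parse_ingredients(raw: str) -> list[str]:
    """Extract a clean, de-duplicated ingredient list from raw model output."""
    # Grab the outermost {...} span so surrounding code fences or chatter are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    items = data.get("ingredients", []) if isinstance(data, dict) else []
    if not isinstance(items, list):
        return []
    seen: set[str] = set()
    cleaned: list[str] = []
    for item in items:
        if not isinstance(item, str):
            continue  # skip numbers, nulls, nested objects
        name = item.lower().strip()
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned
```

An empty result can then be surfaced in the UI as "no ingredients detected, try another photo" instead of a stack trace.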

Deliverable: the function returns a non-empty English list with ≥80% precision on a clean fridge photo.

Verify: stash these 5 results in tests/vision_smoke.json for regression checks.


Phase 3 (Day 3): Recipe Planner, MiniCPM-V-4 via llama.cpp + retrieval

Goal: given a list of ingredients (and optionally a chosen dish), return a fully structured Recipe JSON including steps, durations, visual descriptions, and nutritional values.

Tasks

  1. Find or convert MiniCPM-V-4 to GGUF. Likely repo: openbmb/MiniCPM-V-4-gguf or community quants. Pick Q4_K_M.
  2. Add to src/models/loader.py a get_planner_model() (same pattern as vision but without chat_handler).
  3. Write src/agents/recipe_planner.py:
    • Step A (propose): call the planner with "I have: [ingredients]. Propose 3 dish options that fit. Reply JSON."
    • Step B (retrieve): for the chosen dish name, call RecipeIndex.search(...) and pick the closest match. Use it as a grounded reference.
    • Step C (restructure): prompt the planner with both the user's available ingredients and the retrieved reference recipe, asking it to output the canonical Recipe JSON schema below. The retrieval grounds the model and reduces hallucinated steps.
    • Step D (nutrition): from the recipe ingredients, compute approximate nutritional values per serving. See Phase 3.5.
  4. Define the canonical schema in src/pipeline.py using Pydantic:
    from pydantic import BaseModel
    from typing import Optional
    
    class Step(BaseModel):
        n: int
        instruction: str       # English, imperative
        duration: str          # "4 minutes"
        visual: str            # English visual description for FLUX prompt
        tip: Optional[str] = None
    
    class Nutrition(BaseModel):
        calories: int          # per serving
        protein_g: float
        carbs_g: float
        fat_g: float
        fiber_g: float
    
    class Recipe(BaseModel):
        name: str
        cuisine: str
        servings: int
        total_time_minutes: int
        options: list[dict] = []   # only populated on "propose" call, so it defaults to empty
        ingredients_have: list[str]
        ingredients_missing: list[str]
        substitutes: dict[str, list[str]]
        steps: list[Step]
        final_dish_visual: str
        nutrition_per_serving: Nutrition
    
  5. Write the system prompt (src/prompts/planner_system.txt):
    • Persona: international chef
    • Hard rule: output JSON only, matching schema
    • Hard rule: prefer dishes feasible with available ingredients
    • Hard rule: 5–7 steps, each ≤ 25 words, each with a concrete visual field for image generation
    • Hard rule: include nutrition_per_serving (model is allowed to estimate; you'll override with src/data/nutrition.py for accuracy)
  6. Use response_format={"type": "json_object"} in the chat completion call. Set temperature=0.7 and top_p=0.95 for the propose step (creative) and temperature=0.4 for the structured-output step (more deterministic). enable_thinking=True is a chat-template switch rather than a create_chat_completion argument in llama-cpp-python, so confirm how the chosen GGUF exposes it before relying on it for the propose step.
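
The prompt's hard rules are cheap to re-check in code after the planner responds, which gives a clear signal for when to retry. The sketch below works on the parsed JSON as a plain dict; check_hard_rules and its return shape (a list of problem strings) are assumptions for illustration.

```python
def check_hard_rules(
    recipe: dict,
    min_steps: int = 5,
    max_steps: int = 7,
    max_words: int = 25,
) -> list[str]:
    """Return a list of rule violations; an empty list means the recipe passes."""
    problems: list[str] = []
    steps = recipe.get("steps", [])
    if not (min_steps <= len(steps) <= max_steps):
        problems.append(f"expected {min_steps}-{max_steps} steps, got {len(steps)}")
    for step in steps:
        n = step.get("n", "?")
        word_count = len(str(step.get("instruction", "")).split())
        if word_count == 0:
            problems.append(f"step {n}: empty instruction")
        elif word_count > max_words:
            problems.append(f"step {n}: {word_count} words exceeds {max_words}")
        if not str(step.get("visual", "")).strip():
            problems.append(f"step {n}: missing visual description")
    return problems
```

One option is to feed the returned problem strings back to the planner in a single retry prompt before giving up and showing an error.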

Deliverable: for ["chicken","onion","tomato","tortilla","cheese"] and chosen dish "chicken tinga", the function returns a valid Recipe Pydantic object with 5–7 steps.

Verify: the JSON parses, each step has all required fields, and total inference time on Space CPU < 60 seconds.


Phase 3.5 (Day 3, afternoon): Nutritional values

Goal: the recipe ends with reliable per-serving nutrition (not hallucinated by the LLM).

Approach: small, embedded reference table beats LLM math.

Tasks

  1. Bundle data/nutrition_table.csv: a 200-row CSV mapping common English ingredient names to per-100g macros (kcal, protein, carbs, fat, fiber). Source: USDA FoodData Central CSV download (free, public domain). Trim columns; commit to repo.
  2. Write src/data/nutrition.py:
    • parse_quantity(line: str) -> (grams, ingredient_name): handle "2 cups flour", "200 g chicken", "1 tbsp olive oil". Use a small regex + a unit-to-grams table (cup=240, tbsp=15, tsp=5, oz=28.35). The volume units assume water-like density, so light dry goods such as flour will be overestimated unless you add per-ingredient densities.
    • compute_nutrition(ingredient_lines: list[str], servings: int) -> Nutrition: sum per-100g values weighted by grams, divide by servings.
    • If a line cannot be parsed, skip it and log; don't crash.
  3. After the planner returns a recipe, overwrite recipe.nutrition_per_serving with the computed value. Keep the LLM's value only as a fallback when the parser yields zero.

Deliverable: for a known recipe (e.g., spaghetti with tomato sauce, 4 servings), computed calories per serving is within ±25% of online references.


Phase 4 (Day 4): Step Illustrator, FLUX.2 Klein 9B

Goal: generate an appetizing image for the final dish + one image per step.

Constraint: FLUX.2 Klein on CPU is impractical; on a free Space CPU it would take ~10 minutes per image. Two paths:

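  • Path A (recommended for the hackathon): upgrade the Space to a GPU instance (T4 or A10G; paid, but the $20 HF credits cover it for a week of development). Code stays unchanged.
  • Path B (fallback): run FLUX in enable_model_cpu_offload() mode with num_inference_steps=4 and accept ~3 min/image; only feasible for pre-rendered demo recipes, not live runs. Note that enable_model_cpu_offload() shuttles sub-models between CPU RAM and an accelerator, so it lowers VRAM needs rather than enabling CPU-only inference; on a CPU-only Space, load the pipeline without it.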

Tasks

  1. Add to requirements.txt:
    diffusers>=0.31
    transformers>=4.45
    accelerate
    torch
    safetensors
    
  2. Write src/models/illustrator.py:
    import torch
    from diffusers import Flux2KleinPipeline
    
    _pipe = None
    
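    def get_flux():
        global _pipe
        if _pipe is None:
            dtype = torch.bfloat16
            _pipe = Flux2KleinPipeline.from_pretrained(
                "black-forest-labs/FLUX.2-klein-9B",
                torch_dtype=dtype,
            )
            # Model offload moves sub-models onto the GPU on demand, so it
            # only applies when an accelerator is present.
            if torch.cuda.is_available():
                _pipe.enable_model_cpu_offload()
        return _pipe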
    
    def render(prompt: str, seed: int = 0) -> "PIL.Image.Image":
        pipe = get_flux()
        device = "cuda" if torch.cuda.is_available() else "cpu"
        img = pipe(
            prompt=prompt,
            height=1024, width=1024,
            guidance_scale=1.0,
            num_inference_steps=4,
            generator=torch.Generator(device=device).manual_seed(seed),
        ).images[0]
        return img
    
  3. Write src/agents/step_illustrator.py:
    • For each Step.visual, build a prompt like:

      f"Top-down photo of a kitchen pan or plate showing {visual}. {cuisine} home cooking, warm natural lighting, recipe magazine style, photorealistic, appetizing."

    • Generate the final dish image first, then the per-step images, all in one Python loop (no parallelism, because FLUX holds the GPU).
    • Cache results on disk keyed by a SHA-256 digest of the prompt to avoid re-renders on re-runs. Python's built-in hash() is randomized per process for strings, so it cannot serve as a stable cache key.
    • Emit Gradio progress updates so the UI doesn't appear frozen.
  4. Critical tuning: keep num_inference_steps=4 (Klein is distilled). Higher counts increase latency and offer minimal quality gain at this scale.
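
The on-disk cache for rendered images can be sketched as follows. The prompt is hashed with SHA-256 so keys are stable across processes. The render_png callback stands in for whatever wraps the FLUX pipeline and returns PNG bytes; that callback, the function names, and the file layout are assumptions for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(prompt: str, cache_dir: Path) -> Path:
    """Map a prompt to a deterministic file name inside cache_dir."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.png"


def render_cached(
    prompt: str,
    render_png: Callable[[str], bytes],
    cache_dir: Path,
) -> Path:
    """Return the cached image path, calling render_png only on a cache miss."""
    path = cache_path(prompt, cache_dir)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(render_png(prompt))
    return path
```

If the seed or image size can vary between runs, they would need to be folded into the hashed string as well, otherwise different settings would collide on the same file.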

Deliverable: for a 5-step recipe, all 6 images (final + 5 steps) render in:

  • < 1 minute on T4 GPU Space
  • < 8 minutes on CPU offload (acceptable only for pre-cached demos)

Verify: show the 6 images to an unprompted human; ≥4 should be described as "appetizing".


Phase 5 (Day 5): Narrator, VoxCPM2

Goal: every step's instruction is rendered to an MP3 in a warm, clear English voice.

Tasks

  1. Confirm the exact VoxCPM2 repo name on HF (openbmb/VoxCPM2 or similar). Read its README for the inference snippet; TTS APIs vary widely between models.
  2. Add to requirements.txt: soundfile, torchaudio, numpy. If VoxCPM2 ships GGUF, use it via llama-cpp-python audio extension (if available); otherwise load via transformers directly.
  3. Write src/models/narrator.py:
    _tts = None
    
    def get_tts():
        global _tts
        if _tts is None:
            # placeholder: replace with the exact VoxCPM2 loading code from its README
            from transformers import AutoModel, AutoProcessor
            _tts = ... # load on CPU; VoxCPM2 is small (~1B)
        return _tts
    
    def synthesize(text: str, voice: str = "warm_female_en") -> bytes:
        """Returns MP3 bytes."""
        tts = get_tts()
        wav = tts.generate(text, voice=voice)  # API depends on VoxCPM2
        # encode wav -> mp3 with soundfile + ffmpeg-python or pydub
        return mp3_bytes
    
  4. Write src/agents/narrator.py:
    • For each step, synthesize step.instruction. If step.tip is set, synthesize a separate "tip" clip.
    • Save MP3 files in a per-recipe temp directory; return file paths to Gradio.
  5. Pre-render all step audio when the recipe is finalized; never stream per-step in the demo (too much UI lag).
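
The WAV-to-MP3 step left open in synthesize can be handled independently of the TTS model. The sketch below assumes the model yields mono float samples in [-1, 1] at a known sample rate (both assumptions until the VoxCPM2 README is checked) and shells out to the ffmpeg binary that packages.txt installs. The helper names are hypothetical.

```python
import io
import shutil
import struct
import subprocess
import wave


def pcm_to_wav_bytes(samples: list[float], sample_rate: int = 16000) -> bytes:
    """Pack mono float samples in [-1, 1] into 16-bit PCM WAV bytes."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes = 16-bit samples
        writer.setframerate(sample_rate)
        writer.writeframes(frames)
    return buf.getvalue()


def wav_to_mp3_bytes(wav_bytes: bytes) -> bytes:
    """Encode WAV bytes to MP3 by piping them through the ffmpeg CLI."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    proc = subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i", "pipe:0", "-f", "mp3", "pipe:1"],
        input=wav_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout
```

With these two helpers, synthesize would reduce to running the model, converting its output to a list of floats, and returning wav_to_mp3_bytes(pcm_to_wav_bytes(...)).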

Deliverable: clicking "Play" on step 1 in the UI plays clear English narration.

Verify: on a 5-step recipe, total TTS rendering time < 30 seconds on CPU.


Phase 6 (Day 6): Gradio UI (Off-Brand)

Goal: the Space looks like a recipe magazine, not stock Gradio.

Tasks

  1. Write src/ui/theme.py:
    import gradio as gr
    
    theme = gr.themes.Soft(
        primary_hue="orange",
        neutral_hue="stone",
        font=[gr.themes.GoogleFont("Inter"), "sans-serif"],
        font_mono=[gr.themes.GoogleFont("JetBrains Mono"), "monospace"],
    )
    
    CSS = """
    .gradio-container { background: #f5ecd9 !important; }
    .recipe-hero { background:#fffbf0; border-radius:14px; padding:28px; }
    .recipe-hero h1 { font-family:'Lora',serif!important; font-size:36px!important; color:#6b4a2a!important; }
    .step-card { background:#fffbf0; border-left:4px solid #a85c2a; border-radius:8px; padding:18px 22px; margin:12px 0; }
    .nutri-grid { display:grid; grid-template-columns:repeat(5,1fr); gap:12px; margin-top:24px; }
    .nutri-cell { background:#fffbf0; border:1px solid #d8c9ad; border-radius:10px; padding:12px; text-align:center; }
    """
    
  2. Write app.py with three tabs:
    • Tab 1 (Cook): fridge photo input → ingredient chips → 3 dish options → selected recipe card with hero image, steps (image + text + audio play button each), nutrition grid at the bottom.
    • Tab 2 (Check Progress): upload a progress photo + select active step → validator returns badge (go/wait/fix) + tip + audio.
    • Tab 3 (About / Tech): README-style explanation, badges, model list.
  3. Use gr.Blocks with gr.State to hold the current recipe across UI events. Store it as a plain dict: return recipe.model_dump() as the state output of each callback and rebuild it with Recipe.model_validate(state) when reading. Do not assign state.value inside a callback, since that changes the component's initial value rather than the per-session state.
  4. Wire callbacks:
    • btn_propose.click(fn=on_propose, inputs=[fridge_photo], outputs=[ingredient_chips, dish_options, state])
    • dish_options.select(fn=on_pick_dish, inputs=[state, picked_dish], outputs=[recipe_card, hero_img, steps_column, nutrition_grid, state])
    • progress_image.upload(fn=on_validate, inputs=[state, current_step_idx, progress_image], outputs=[verdict_md, tip_audio])

Deliverable: end-to-end run from a sample fridge photo to a fully rendered recipe card with audio and nutrition. No Gradio default look anywhere.


Phase 7 (Day 7): Progress Validator (closed loop)

Goal: user uploads a progress photo, app says "go / wait / fix" with a voiced tip.

Tasks

  1. Write src/agents/progress_validator.py:
    # Fill with PROMPT.format(instruction=...); the literal JSON braces are
    # doubled so that str.format leaves them intact.
    PROMPT = """Compare these two cooking photos.
    Photo 1 (target): how it should look after the step "{instruction}".
    Photo 2 (user's pan/plate): the user's current progress.
    Reply strict JSON: {{"verdict": "go|wait|fix", "feedback": "...", "tip": "..."}}
    - "go": looks right, move to next step
    - "wait": needs more time, do not change anything yet
    - "fix": something is off; suggest a concrete adjustment in one sentence
    """
    def validate(target_img, user_img, step_instruction): ...
    
  2. Use the same vision model singleton as Phase 2 β€” both calls share weights.
  3. Render the verdict as a colored badge (green/amber/red) and play the tip via VoxCPM2.
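
Turning the validator's reply into a badge is mostly a matter of parsing defensively. The sketch below assumes the colors map in order (go to green, wait to amber, fix to red) and falls back to "wait" when the reply is not valid; both the mapping order and the fallback policy are assumptions, as is the parse_verdict name.

```python
import json

# Assumed mapping of verdicts to the green/amber/red badge colors.
BADGE_COLORS = {"go": "green", "wait": "amber", "fix": "red"}


def parse_verdict(raw: str) -> dict[str, str]:
    """Parse the validator reply; fall back to a cautious 'wait' on bad output."""
    fallback = {"verdict": "wait", "feedback": "", "tip": "",
                "color": BADGE_COLORS["wait"]}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = str(data.get("verdict", "")).lower().strip()
    if verdict not in BADGE_COLORS:
        return fallback  # e.g. the model echoed "go|wait|fix" verbatim
    return {
        "verdict": verdict,
        "feedback": str(data.get("feedback", "")),
        "tip": str(data.get("tip", "")),
        "color": BADGE_COLORS[verdict],
    }
```

Defaulting to "wait" is the conservative choice for a cooking assistant: an unparseable reply never tells the user to move on.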

Deliverable: running the validator on 5 real progress photos returns the correct verdict on ≥3.


Phase 8 (Day 8): Fine-tune the Planner on the Kaggle dataset (Well-Tuned badge)

Important caveat: The user instruction says "for now keep inference on llama.cpp inside HF Space, future migration to Modal." Fine-tuning still requires GPU, so training itself happens on Modal (one-shot, offline) or on a rented Colab/Lambda GPU. Inference of the resulting model stays on llama.cpp inside the Space (as GGUF). This does not violate the runtime constraint: only the build pipeline touches a GPU.

Goal: publish a fine-tuned Planner GGUF to the Hub and load it from the Space.

Tasks

  1. Build SFT dataset (scripts/build_sft_dataset.py):
    • Load Kaggle better-recipes dataset.
    • For each recipe, build a (prompt, completion) pair where prompt is "Available ingredients: X, Y, Z. Propose recipe." and completion is the full canonical Recipe JSON.
    • Generate ~1000 pairs, push to <you>/cook-with-me-sft HF Dataset.
  2. LoRA training (scripts/train_planner.py β€” to be run on a GPU machine, not the Space):
    # peft + trl SFTTrainer, base = openbmb/MiniCPM-V-4
    # r=16, alpha=32, lr=2e-4, epochs=2, batch=4
    # push_to_hub=True, hub_model_id="<you>/cook-with-me-planner-4b"
    
  3. Convert to GGUF (Day 8 evening):
    • Use llama.cpp/convert_hf_to_gguf.py then quantize to Q4_K_M.
    • Push GGUF to <you>/cook-with-me-planner-4b-gguf.
  4. Update src/models/loader.py to point at your GGUF instead of the base model.
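
Building the (prompt, completion) pairs from task 1 can be sketched as below. It assumes each Kaggle row has already been converted to a dict in the canonical Recipe shape with an ingredients_have list; the helper names and the JSONL output format are illustrative choices.

```python
import json
from pathlib import Path


def build_sft_pair(recipe: dict) -> dict[str, str]:
    """Turn one canonical recipe dict into a prompt/completion training pair."""
    ingredients = ", ".join(recipe["ingredients_have"])
    return {
        "prompt": f"Available ingredients: {ingredients}. Propose recipe.",
        "completion": json.dumps(recipe, ensure_ascii=False),
    }


def write_jsonl(pairs: list[dict[str, str]], path: Path) -> None:
    """Write one JSON object per line, the usual layout for SFT datasets."""
    with path.open("w", encoding="utf-8") as handle:
        for pair in pairs:
            handle.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Serializing the completion with the same schema the Space validates against keeps training targets and runtime expectations aligned.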

Deliverable: the Space loads your fine-tuned Planner GGUF and produces JSON recipes that are noticeably better-formatted than the base model on a held-out test set.


Phase 9 (Day 9): End-to-end test, performance pass, pre-warm cache

Goal: the Space loads in <60s and a full recipe (text + 5 images + 5 audios + nutrition) renders in <2 minutes on the chosen hardware.

Tasks

  1. Write scripts/smoke_test.py that runs the full pipeline on 3 sample fridge photos and asserts:
    • Each ingredient list is non-empty
    • Each recipe has 5–7 steps
    • Each step has a non-empty image and audio path
    • Nutrition has all 5 macros set
  2. Implement on-disk caching for FLUX outputs (key = SHA256 of prompt) so re-runs of the same recipe are instant. Save to ~/.cache/cook-with-me/flux/.
  3. Pre-render and commit 3 fully-prepared demo recipes (chicken tinga, pasta carbonara, chicken tikka) so judges see results in <5s on first click.
  4. Add error handling at every UI boundary: a model failure should display a friendly message, not a stack trace.
  5. Add a "Loading models..." progress bar on first request; first cold start can take 90s.
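
The assertions listed for scripts/smoke_test.py could be grouped into one checker that runs per sample photo. The shape of the result dict (ingredients, recipe.steps with image_path and audio_path, recipe.nutrition_per_serving) is an assumption about what the pipeline returns, chosen only to make the sketch concrete.

```python
from pathlib import Path

MACROS = ("calories", "protein_g", "carbs_g", "fat_g", "fiber_g")


def check_pipeline_result(result: dict) -> None:
    """Raise AssertionError if a pipeline result violates the smoke-test rules."""
    assert result["ingredients"], "ingredient list is empty"

    steps = result["recipe"]["steps"]
    assert 5 <= len(steps) <= 7, f"expected 5-7 steps, got {len(steps)}"

    for index, step in enumerate(steps, start=1):
        for key in ("image_path", "audio_path"):
            path = step.get(key)
            # Each asset must exist on disk and contain data.
            assert path and Path(path).is_file(), f"step {index}: missing {key}"
            assert Path(path).stat().st_size > 0, f"step {index}: empty {key}"

    nutrition = result["recipe"]["nutrition_per_serving"]
    missing = [macro for macro in MACROS if nutrition.get(macro) is None]
    assert not missing, f"nutrition is missing: {missing}"
```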

Deliverable: smoke test passes on the live Space.


Phase 10 (Day 10): README, demo video, social post, submit

Tasks

1. Write README.md with the required HF Space frontmatter:

```yaml
---
title: Cook With Me
emoji: 🍲
colorFrom: orange
colorTo: yellow
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---
```

Followed by:
- One-paragraph pitch
- 60-second demo video embed
- Architecture diagram (export from `arquitectura.html` as PNG)
- Section: "How closed-loop visual cooking guidance works"
- Models used (with HF links + total parameter count)
- Badges declared
- Build / run instructions
2. Record a 60–90 second demo video: real person cooks a recipe end-to-end with the app guiding via voice, ending with the cooked plate on camera.
3. Write the Field Notes blog post: one of the engineering surprises (e.g., "FLUX.2 step images at 4 steps look better than 8: here's why" or "Closed-loop validation needs the same vision model on both sides").
4. Social post on X / LinkedIn with the demo video.
5. Submit on the hackathon platform.

---

## 4. Tools usage matrix (when to reach for what)

| Phase | Primary tools | Why |
|---|---|---|
| 0: setup | HF CLI, Kaggle CLI, OpenAI Codex CLI | one-shot config |
| 1: data | `kagglehub`, `pandas`, `sentence-transformers` | offline dataset prep |
| 2: vision | `llama-cpp-python` + `MiniCPMv26ChatHandler` | runs inside Space, badge: Llama Champion |
| 3: planner | `llama-cpp-python` + retrieval over local parquet | grounded JSON output |
| 3.5: nutrition | local CSV + regex parser | reliable, no LLM math |
| 4: illustrator | `diffusers` + `Flux2KleinPipeline` | sponsor model showcase |
| 5: narrator | VoxCPM2 via `transformers` (or its native API) | local TTS |
| 6: UI | `gradio` + custom CSS theme | Off-Brand badge |
| 7: validator | same vision singleton as phase 2 | closed-loop innovation, Best Agent |
| 8: fine-tune | `peft`, `trl`, `llama.cpp` convert/quantize, on a GPU machine | Well-Tuned badge |
| 9: test/cache | `pytest`, `hashlib`, on-disk FLUX cache | demo reliability |
| 10: submit | HF Spaces, video tool, social | shipping |

---

## 5. Performance budget on the HF Space

| Operation | Target latency | Hardware needed |
|---|---|---|
| Vision: ingredient ID | < 8 s | CPU 4-thread |
| Planner: propose 3 dishes | < 12 s | CPU 4-thread |
| Planner: build full recipe JSON | < 20 s | CPU 4-thread |
| Nutrition computation | < 0.1 s | CPU |
| FLUX: 1 image (4 steps) | < 12 s on T4 / < 90 s on CPU offload | GPU strongly recommended |
| FLUX: 6 images (final + 5 steps) | < 80 s on T4 | GPU |
| VoxCPM2: 1 step narration | < 5 s | CPU |
| Validator: 1 progress check | < 8 s | CPU |
| **Full recipe end-to-end** | **< 2 min on T4 Space** | n/a |

**Hardware decision:** rent a T4 Space (~$0.40/hr) for the demo week. The $20 HF credits cover ~50 hours.
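
Summing the per-stage ceilings from the table is a quick sanity check on the end-to-end target and on the credit estimate. The sketch below copies the T4 figures from the table; the dictionary keys and function names are illustrative, and it assumes the stages run sequentially with one narration clip per step.

```python
# Upper-bound targets in seconds, copied from the table above (T4 column for FLUX).
BUDGET_S = {
    "vision_ingredient_id": 8.0,
    "planner_propose": 12.0,
    "planner_full_recipe": 20.0,
    "nutrition": 0.1,
    "flux_6_images": 80.0,
    "tts_per_step": 5.0,
}


def worst_case_seconds(budget: dict[str, float], n_steps: int = 5) -> float:
    """Sequential worst case: every stage at its ceiling, one TTS clip per step."""
    fixed = sum(value for key, value in budget.items() if key != "tts_per_step")
    return round(fixed + budget["tts_per_step"] * n_steps, 1)


def gpu_hours(credits_usd: float = 20.0, rate_usd_per_hour: float = 0.40) -> float:
    """Hours of GPU Space time that the credits cover at the quoted rate."""
    return credits_usd / rate_usd_per_hour
```

With these figures the sequential worst case comes to about 145 s, which is above the 2-minute target, so the stages need to finish under their individual ceilings (or overlap, for example narrating while images render) for the end-to-end goal to hold.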

---

## 6. Risks and mitigations (delta from `estrategia.md`)

| Risk | Mitigation |
|---|---|
| MiniCPM-V-4 has no public GGUF | Convert yourself with `llama.cpp/convert_hf_to_gguf.py`. Allow a half-day buffer in Phase 2. |
| llama-cpp-python's MiniCPM-V chat handler version mismatch | Require `llama-cpp-python>=0.3.2`; test the handler import on Day 2. If it fails, fall back to MiniCPM-V-2.6 GGUF (well-supported) for vision and document the swap. |
| FLUX.2 Klein 9B too slow on free CPU Space | Upgrade to a paid GPU Space (~$10 for the demo week). Document this in the README so judges expect it. |
| VoxCPM2 docs sparse | Drop to Kokoro-82M or Piper TTS as a backup. Lose the OpenBMB voice angle but keep the audio. |
| Kaggle dataset has format quirks (HTML in instructions, missing fields) | The Phase 1 normalization step handles this; budget 2 hours. |
| Nutrition CSV missing exotic ingredients | Skip-and-log strategy already designed; demo-day recipes use common ingredients only. |
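| Total params approach the 32B cap if a model is larger than estimated (VoxCPM2 at 7B would give about 24.6B, still under the cap) | Check sizes in Phase 0; if the total would exceed 32B, drop to a smaller TTS. |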

---

## 7. "Day-1 hello world" checklist

Before writing any agent code, get this minimal end-to-end loop working β€” it proves your stack:

1. ☐ Empty Gradio Space deployed, shows "Hello"
2. ☐ `huggingface-cli login` works locally
3. ☐ `kaggle datasets download thedevastator/better-recipes-for-a-better-life` succeeds
4. ☐ `from llama_cpp import Llama` runs in your venv
5. ☐ Download one tiny GGUF (e.g., TinyLlama Q4) and call it from a Gradio textbox round-trip
6. ☐ Push the round-trip to the Space; confirm it answers in the cloud

**Only after all 6 are checked, start Phase 1.**

---

## 8. Where this plan differs from `estrategia.md` (deltas to communicate)

| Topic | `estrategia.md` (Spanish, Mexican-cuisine focus) | This document (current requirements) |
|---|---|---|
| Language | Spanish-first | **English only** |
| Cuisine | Mexican | **International** (Kaggle dataset) |
| Voice models | OpenBMB voice + Cohere Labs | **VoxCPM2** only (single voice) |
| Vision model | MiniCPM-V 2.6 / 4 | **MiniCPM-V-4.6** |
| Reasoning model | MiniCPM-4 4B | **MiniCPM-V-4** |
| FLUX runtime | Modal endpoint | **Inside Space (llama.cpp principle)**; Modal kept as a future migration target only |
| External APIs at runtime | Allowed (Modal, OpenAI optional) | **None**: full local inference inside Space |
| Nutritional info | Not specified | **Required** at end of recipe |
| Fine-tune dataset | 200 synthetic Mexican recipes | **Kaggle better-recipes (international)** |

If anything in `plan.md` or `estrategia.md` conflicts with this document, **this document wins**, because it reflects the latest user requirements.

---

## 9. Definition of done

The implementation is complete when **all** of these are true:

- [ ] Public HF Space `https://huggingface.co/spaces/<you>/cook-with-me` loads
- [ ] App is fully in English
- [ ] Fridge photo → ingredient list → 3 dish options → full recipe with images, audio, and nutrition works end-to-end
- [ ] Progress validator returns sensible verdicts on 3+ test photos
- [ ] All inference (vision, planner, TTS) runs through llama.cpp / local diffusers, with **no external API calls at runtime**
- [ ] Total parameters declared in README ≤ 32B
- [ ] Fine-tuned Planner GGUF published to HF Hub (Well-Tuned badge)
- [ ] Demo video (60–90s) recorded with a real person cooking
- [ ] Field Notes blog post published
- [ ] Submitted on the hackathon platform before deadline